Client Background

Client: A leading Real Estate firm in the EU

Industry Type: Real Estate

Services: Real Estate

Organization Size: 1000+

Project Objective

The objective of this project is to build a data warehouse of property listings collected from a real estate website according to given search and filter criteria.

Project Description

The project collects property listing data from a real estate website according to the search and filter criteria described below.

Data Brief:

  1. Crawl all the information for the property adverts once a week and store it in a database.
  2. Data language: English

Filters:

  1. Federal States

The crawl covers all the federal states of Germany:

https://en.wikipedia.org/wiki/States_of_Germany

  2. Categories to Crawl

  • Mieten Wohnung (apartments for rent)
  • Kaufen Wohnung (apartments for sale)
  • Kaufen Anlageobjekte (investment properties for sale)
  • Kaufen Grundstück (plots of land for sale)

Our Solution

We have developed a Python tool that crawls and scrapes all the property listings for every federal state in Germany under each of the four categories: Mieten Wohnungen, Kaufen Wohnungen, Kaufen Anlageobjekte, and Kaufen Grundstück. The Scrapy library is used for both the crawling and the scraping. Beautiful Soup could also have been used for the scraping, but for the sake of consistency Scrapy handles both tasks.

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archiving.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
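To make the framework concrete, below is a minimal Scrapy spider sketch. The start URL, CSS selectors, and field names are hypothetical placeholders, not the ones used in the delivered tool.

```python
import scrapy


class ListingSpider(scrapy.Spider):
    """Minimal spider sketch; the URL and selectors are placeholders."""

    name = "listing_example"
    start_urls = ["https://example.com/listings"]  # hypothetical listing page

    def parse(self, response):
        # Yield one structured item per advert card on the page.
        for advert in response.css("div.listing"):
            yield {
                "title": advert.css("h2::text").get(),
                "price": advert.css("span.price::text").get(),
                "location": advert.css("span.location::text").get(),
            }

        # Follow pagination until no "next" link remains.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running such a spider with `scrapy runspider` and a `-o listings.json` feed argument writes the scraped items straight to a JSON file, which mirrors the per-spider JSON files described below.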

Four spiders have been created, one for each category. Each spider crawls all the federal states in Germany and scrapes the key fields from every property listing, writing its results to a separate JSON file. This data is then converted to CSV using another Python script called "conversion" (a sketch of this step follows below).
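As a rough illustration, the JSON-to-CSV conversion could look like the sketch below. The category slugs, file names, and helper function are assumptions for the example, not taken from the actual "conversion" script.

```python
import csv
import json


def json_to_csv(json_path: str, csv_path: str) -> None:
    """Convert a spider's JSON output (a list of objects) to a CSV file."""
    with open(json_path, encoding="utf-8") as f:
        listings = json.load(f)

    if not listings:
        return

    # Use the union of all keys so rows with missing fields still fit.
    fieldnames = sorted({key for listing in listings for key in listing})

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(listings)


if __name__ == "__main__":
    # Hypothetical file names, one per category.
    for category in ("mieten_wohnungen", "kaufen_wohnungen",
                     "kaufen_anlageobjekte", "kaufen_grundstueck"):
        json_to_csv(f"{category}.json", f"{category}.csv")
```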

The Python tool is fully automated and only needs the "Controller" script to be run. The script can also be scheduled to run automatically every two weeks.
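A controller along the following lines could tie the steps together. This is a minimal sketch, assuming the spiders live in a Scrapy project and that the conversion script is invoked as conversion.py; the spider names and the sleep-loop scheduler are illustrative (a cron job would serve the same purpose).

```python
import subprocess
import time

# Hypothetical spider names, one per category.
SPIDERS = [
    "mieten_wohnungen",
    "kaufen_wohnungen",
    "kaufen_anlageobjekte",
    "kaufen_grundstueck",
]

TWO_WEEKS = 14 * 24 * 60 * 60  # seconds


def run_pipeline() -> None:
    # Run each category spider; -O overwrites its JSON feed on each
    # run (Scrapy 2.x).
    for spider in SPIDERS:
        subprocess.run(
            ["scrapy", "crawl", spider, "-O", f"{spider}.json"],
            check=True,
        )
    # Convert the JSON feeds to the deliverable CSV files.
    subprocess.run(["python", "conversion.py"], check=True)


if __name__ == "__main__":
    while True:
        run_pipeline()
        time.sleep(TWO_WEEKS)  # repeat on the two-week cadence
```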

Project Deliverables

Four CSV files (one for each category):

Mieten Wohnungen.csv

Kaufen Wohnungen.csv

Kaufen Anlageobjekte.csv

Kaufen Grundstuck.csv

Language/techniques used

  • Python
  • Web Crawling & Scraping

Skills used

  • Data Scraping
  • Data Crawling
  • Advanced Python programming

Project Snapshots