Building a scalable web scraper for a large number of different websites

The goal of the project is to build a scalable web scraper that initially scrapes data from more than a dozen different websites. Later on, it should be possible to scale the scraper up to a few thousand websites.

These websites are known and will be added to the scraper iteratively. Each website has a different structure, which is why the development and maintenance costs per site need to stay as low as possible. The aim is to scrape the websites weekly at first. Later on, the scraping interval should be shortened to daily or even less. The scraped data needs to be stored in a useful and efficient way in a cloud database. Furthermore, the scraper must be tolerant of changes in the design of the websites, and it must avoid being blocked.
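
One possible way to keep the scraper robust against design changes is to try several extraction rules per field, in order of preference, so that a single layout change does not break a site. The following is only a minimal sketch under our own assumptions: the price field, the regular expressions, and the helper names `by_regex` and `first_match` are hypothetical and not part of the existing scraper.

```python
import re
from typing import Callable, Optional, Sequence

# An extractor takes raw HTML and returns a value, or None if nothing matched.
Extractor = Callable[[str], Optional[str]]


def by_regex(pattern: str) -> Extractor:
    """Build an extractor that returns the first capture group of `pattern`."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


def first_match(html: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        value = extract(html)
        if value:
            return value
    return None


# Hypothetical rules for one field: the current layout first, then a fallback.
PRICE_EXTRACTORS = [
    by_regex(r'<span class="price">(.*?)</span>'),
    by_regex(r'itemprop="price" content="(.*?)"'),
]
```

If all rules for a field return nothing, that is a useful signal for monitoring that the site has probably changed and needs attention.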

Currently, a simple Python scraper exists that can scrape a few websites using the Selenium library. However, it does not need to be continued at all costs.

The following tasks are part of your engagement for the project:

o Developing a modular and scalable software architecture for the web scraping project (preferably with Python)

o Containerizing the program in Docker

o Deploying and managing the containers in the cloud, probably with AWS and Kafka

o Implementing different measures to prevent blacklisting and being blocked

o Setting up a SQL database, probably PostgreSQL with AWS
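
To illustrate what we mean by a modular architecture, here is a rough sketch of one possible design, in which each website contributes only a small parser function that is registered under the site's name. All names here (`Record`, `REGISTRY`, `register`, `run`, and the site `example-shop`) are hypothetical, and fetching is passed in as a function so that Selenium or any other client could be plugged in. It is not a fixed specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    """One scraped item, ready to be stored in the database."""
    site: str
    url: str
    data: dict


# A parser turns (url, html) into records; one parser per website.
Parser = Callable[[str, str], List[Record]]
REGISTRY: Dict[str, Parser] = {}


def register(site: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a site-specific parser to the registry."""
    def decorator(fn: Parser) -> Parser:
        REGISTRY[site] = fn
        return fn
    return decorator


@register("example-shop")  # hypothetical site name
def parse_example_shop(url: str, html: str) -> List[Record]:
    # Placeholder logic: a real parser would extract the actual fields.
    return [Record(site="example-shop", url=url, data={"length": len(html)})]


def run(site: str, url: str, fetch: Callable[[str], str]) -> List[Record]:
    """Fetch a page with the supplied fetch function and parse it."""
    parser = REGISTRY[site]
    return parser(url, fetch(url))
```

With a layout like this, adding a new website means adding one parser function, while fetching, queuing, and storage stay shared across all sites.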

The following tasks might be part of a further engagement:

o Implementing the web scrapers for a large number of different websites

o Maintaining and monitoring the scrapers for the websites

o Adding a web crawler to find additional websites

o Parsing the stored data and processing it into a more useful format

Your qualifications:

o Web Scraping (Importance: 9/10)

o Python (Importance: 7/10)

o Docker (Importance: 8/10)

o AWS (Importance: 5/10)

o Kafka or other Pipelining/Queuing Tools (Importance: 8/10)

o Cloud Databases (Importance: 6/10)

o PostgreSQL (Importance: 10/10)

You are expected to work closely together with our developer in Germany. The tasks above need to be coordinated and done in cooperation with him. Therefore, a willingness to work between 10 AM and 10 PM Central European Time is required.

We wish to get to know you first by working together in a limited project scope. If you are a fit for our team, we are willing to intensify our cooperation with you and hire you for future projects.

