Web Crawler

Introduction
A web crawler (a.k.a. spider or spiderbot) is a type of bot used to systematically browse the Web. Common use cases of web crawlers include:
- Indexing for search engines
- Data mining
- Archiving
Requirements
- Functional Requirements
  - Given seed URLs, crawl related pages
  - Ignore duplicate pages
- Non-Functional Requirements
  - Ability to prioritise URLs
  - Politeness (i.e. not overloading a website with requests)
Key Questions
1. What type of content will we need to store?
   - HTML, but the system should be extendable for other forms of content.
2. How many pages will we need to crawl each month (total storage requirements)?
   - Average pages crawled per month: 1 billion (10⁹)
   - Average page size ≈ 2.5 MB (2.5 * 10⁶ bytes)
   - Monthly storage: 10⁹ pages * (2.5 * 10⁶ bytes/page) = 2.5 * 10¹⁵ bytes
   - Yearly storage: 12 months * (2.5 * 10¹⁵ bytes) = 3 * 10¹⁶ bytes
   - Assume content is retained for 5 years
   - Total storage: 5 years * (3 * 10¹⁶ bytes) = 1.5 * 10¹⁷ bytes
   - Petabyte (PB): 1 PB = 1024⁵ bytes
   - (1.5 * 10¹⁷ bytes) / (1024⁵ bytes) ≈ 133 PB
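These estimates can be sanity-checked with a few lines of Python:

```python
# Back-of-envelope storage estimate, using the numbers above.
pages_per_month = 1_000_000_000      # 10^9 pages crawled per month
avg_page_size = 2.5e6                # ~2.5 MB per page, in bytes

monthly = pages_per_month * avg_page_size   # 2.5e15 bytes
yearly = 12 * monthly                       # 3.0e16 bytes
total = 5 * yearly                          # 1.5e17 bytes over 5 years

PB = 1024 ** 5                              # binary PB, as defined above
print(f"total ≈ {total / PB:.0f} PB")       # total ≈ 133 PB
```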
Architecture Overview
Seed URLs
- Seed URLs are the starting points from which the crawler begins browsing.
- Which seed URLs to choose depends on the use case of the web crawler. For example:
  - High-level directories like Wikipedia, government websites, etc.
  - News aggregators like RSS feeds, social media platforms, etc.
URL Frontier
- The URL Frontier is one of the most important parts of the web crawler, as it is used for URL prioritisation and to ensure politeness.
- A deeper dive follows in the URL Frontier section below.
HTML Fetcher & Renderer Threads
- Horizontally scaled, with consistent hashing used to distribute URLs across fetcher instances so the system can handle the large volume of pages to be crawled (a minimal sketch follows this list).
- Reaches out to a custom DNS resolver, which converts domain names into IP addresses. A custom resolver is used to avoid the latency of the standard DNS resolution process.
- DNS results should be cached here, especially when the same domain is visited multiple times.
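A minimal sketch of consistent hashing for spreading hosts across fetcher instances (node names are illustrative; a production ring would use more virtual nodes and handle node churn):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps hosts onto fetcher nodes; adding or removing a node
    only remaps a small fraction of hosts."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, host: str) -> str:
        # Walk clockwise to the first virtual node at or after the host's hash.
        idx = bisect.bisect(self._ring, (self._hash(host), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["fetcher-1", "fetcher-2", "fetcher-3"])
print(ring.node_for("en.wikipedia.org"))  # same host always maps to same node
```

Keying the ring by host (rather than full URL) keeps all of a host's URLs on one fetcher, which also helps with DNS caching and politeness.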
HTML Parser
- The HTML parser parses fetched pages and checks whether they are malformed.
- Instead of simply discarding malformed content, the system could include formatting logic that attempts to resolve common HTML issues such as missing closing tags.
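As an illustration, a lenient parser such as the third-party BeautifulSoup library (one possible choice, not prescribed by this design) builds a tree from broken markup and closes missing tags on serialisation:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# A malformed fragment: the <p> and <li> tags are never closed.
broken = "<html><body><p>Hello<ul><li>one<li>two</body></html>"

# The parser tolerates the errors and emits well-formed HTML.
soup = BeautifulSoup(broken, "html.parser")
print(soup.prettify())
```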
Duplicate Detection
- There’s a lot of duplicate information on the web and so the system needs some way of detecting this.
- One approach is to use the MD5 hash function to create a unique "fingerprint" of each page.
- It takes input of arbitrary length and produces a fixed-length output (128 bits, represented as 32 hexadecimal characters) that serves as a compressed representation of the original data.
- Pros: Fast, efficient, and easily scalable.
- Cons: Any minor change produces a completely different hash, so near-duplicates are missed; collisions are possible (although less likely with stronger hashes).
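A minimal sketch of MD5 fingerprinting for exact-duplicate detection; the in-memory set stands in for whatever distributed store a real deployment would use:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """128-bit MD5 digest, rendered as 32 hex characters."""
    return hashlib.md5(content).hexdigest()

seen: set[str] = set()

def is_duplicate(content: bytes) -> bool:
    fp = fingerprint(content)
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(is_duplicate(b"<html>hello</html>"))  # False: first time seen
print(is_duplicate(b"<html>hello</html>"))  # True: exact duplicate
```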
Caching
- To keep the system performant, a caching mechanism that can quickly check whether content is a duplicate is important.
- A bloom filter could be used for this purpose. A bloom filter is a probabilistic data structure that is very space-efficient and could be implemented using tools like Redis.
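A toy in-process bloom filter to show the mechanics (in production this would more likely live in Redis, e.g. via the RedisBloom module):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted hashes.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("fingerprint-abc")
print(bf.might_contain("fingerprint-abc"))  # True
print(bf.might_contain("fingerprint-xyz"))  # False (almost certainly)
```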
Content Storage
- The system could use a NoSQL database like Cassandra, which supports indexing and filtering and handles frequent updates well. However, it isn't optimal for storing billions of large, mostly static HTML blobs.
- Instead this system could use a distributed file system:
- Pros: Highly scalable, cost-effective at large scale, good for static data.
- Cons: May not be as efficient for frequent queries, requires expertise in distributed systems.
- Examples: HDFS, Google Cloud Storage, Amazon S3.
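For example, persisting a page to Amazon S3 via the boto3 SDK might look like the following (the bucket name and key scheme are assumptions for illustration):

```python
import hashlib
import boto3  # third-party AWS SDK: pip install boto3

s3 = boto3.client("s3")

def store_page(url: str, html: bytes) -> str:
    # Key objects by a hash of the URL so each page maps to a stable key.
    key = f"pages/{hashlib.md5(url.encode()).hexdigest()}.html"
    s3.put_object(
        Bucket="crawler-content",   # hypothetical bucket name
        Key=key,
        Body=html,
        ContentType="text/html",
    )
    return key
```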
URL Extractor
- Extract links from the page so that the crawler can continue to crawl and discover new content.
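A small extractor sketch using only the Python standard library; it resolves relative links against the page's own URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/blog/")
extractor.feed('<a href="/about">About</a> <a href="post-1">Post</a>')
print(extractor.links)
# ['https://example.com/about', 'https://example.com/blog/post-1']
```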
Kafka
- Rather than wiring the URL Extractor directly into the pipeline, the system could publish extracted data to Kafka. Additional services (e.g. analytics, image downloading) can then consume the same stream, making the system extendable for future enhancements.
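A sketch of publishing extracted links to a Kafka topic with the third-party kafka-python client (the broker address and topic name are assumptions):

```python
import json
from kafka import KafkaProducer  # third-party: pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_links(source_url: str, links: list[str]) -> None:
    # Downstream consumers (URL filter, analytics, image downloader)
    # each read from this topic independently.
    producer.send("extracted-urls", {"source": source_url, "links": links})

publish_links("https://example.com", ["https://example.com/about"])
producer.flush()  # block until buffered messages are sent
```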
URL Filter
- Filters URLs based on predefined rules.
- For example, certain sites (e.g. adult or malicious sites) can be excluded, as can malformed URLs.
- This improves the overall efficiency and effectiveness of the system by reducing the amount of irrelevant content crawled.
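A minimal filter sketch; the blocklist entries are hypothetical placeholders:

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"malicious.example", "spam.example"}  # hypothetical blocklist
ALLOWED_SCHEMES = {"http", "https"}

def is_allowed(url: str) -> bool:
    try:
        parsed = urlparse(url)
    except ValueError:
        return False  # malformed URL
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
        return False
    host = parsed.hostname or ""
    # Reject the domain itself and any of its subdomains.
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

print(is_allowed("https://example.com/page"))  # True
print(is_allowed("ftp://example.com/file"))    # False: disallowed scheme
print(is_allowed("https://spam.example/x"))    # False: blocklisted domain
```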
URL Seen Detector
- Check that the URL has not been seen before adding it to the URL Frontier. This prevents infinite loops where the same pages are repeatedly being crawled.
- Bloom filters or hash tables could be used here to achieve this.
- The URLs that have not been previously seen are then sent to the URL Frontier for crawling.
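Normalising URLs first stops trivially different forms (uppercase host, trailing slash, fragment) from slipping past the check. A sketch with an in-memory set standing in for the bloom filter or hash table:

```python
from urllib.parse import urlsplit, urlunsplit

def normalise(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",          # drop the fragment: it never changes the fetched page
    ))

seen_urls: set[str] = set()

def mark_if_unseen(url: str) -> bool:
    """Returns True (and records the URL) only the first time it is seen."""
    key = normalise(url)
    if key in seen_urls:
        return False
    seen_urls.add(key)
    return True

print(mark_if_unseen("https://Example.com/a/"))     # True: first sighting
print(mark_if_unseen("https://example.com/a#top"))  # False: same page
```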
URL Storage
- Stores URLs that have previously been visited.
- A NoSQL database like Cassandra could be used here: it is highly scalable and often more cost-effective for large datasets than relational databases.
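Recording a visited URL with the third-party cassandra-driver package might look like this (the cluster address, keyspace, and table are assumptions):

```python
from cassandra.cluster import Cluster  # third-party: pip install cassandra-driver

# Assumes a reachable cluster and an existing "crawler" keyspace.
session = Cluster(["127.0.0.1"]).connect("crawler")

session.execute("""
    CREATE TABLE IF NOT EXISTS visited_urls (
        url text PRIMARY KEY,
        first_seen timestamp
    )
""")

session.execute(
    "INSERT INTO visited_urls (url, first_seen) VALUES (%s, toTimestamp(now()))",
    ("https://example.com/page",),
)
```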
URL Frontier
- The URL Frontier has two jobs: deciding which URL should be crawled next (prioritisation) and ensuring that no single host receives too many requests in a short period (politeness).
- Prioritisation: incoming URLs can be assigned a priority based on signals such as page importance or how frequently the content changes, and placed into priority-ordered queues. A selector draws from higher-priority queues more often.
- Politeness: URLs are then partitioned into per-host queues, so that all URLs for a given host sit in the same queue, and a minimum delay is enforced between consecutive requests to the same host.
- Keeping a host's URLs in one queue (and, via consistent hashing, on one fetcher) also means DNS results and connections for that host can be reused.
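A single-threaded toy frontier illustrating both ideas: a heap ordered across hosts plus a per-host minimum delay (in this simplified version, the politeness delay takes precedence over priority):

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class URLFrontier:
    """Toy frontier: priority ordering across hosts, plus a per-host
    minimum delay so no single site is hammered."""

    def __init__(self, per_host_delay=1.0):
        self.per_host_delay = per_host_delay
        self.host_queues = defaultdict(deque)  # politeness: one queue per host
        self.ready = []                        # heap of (next_allowed_time, priority, host)
        self.next_ok = defaultdict(float)      # earliest next fetch time per host

    def add(self, url: str, priority: int = 10) -> None:
        host = urlsplit(url).netloc
        if not self.host_queues[host]:         # host not yet scheduled
            heapq.heappush(self.ready, (self.next_ok[host], priority, host))
        self.host_queues[host].append(url)

    def next_url(self):
        while self.ready:
            not_before, priority, host = heapq.heappop(self.ready)
            wait = not_before - time.monotonic()
            if wait > 0:
                time.sleep(wait)               # enforce the politeness delay
            url = self.host_queues[host].popleft()
            self.next_ok[host] = time.monotonic() + self.per_host_delay
            if self.host_queues[host]:         # host still has work: requeue it
                heapq.heappush(self.ready, (self.next_ok[host], priority, host))
            return url
        return None                            # frontier is empty
```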
System Flow
1. Seed URLs are added to the URL Frontier.
2. Fetcher threads take URLs from the frontier, resolve the domain via the DNS resolver, and download the pages.
3. The HTML Parser validates the content, attempting to fix common issues in malformed pages.
4. Duplicate detection (content hashing backed by the bloom-filter cache) discards pages whose content has already been stored.
5. New content is written to content storage.
6. The URL Extractor pulls links out of the page (publishing through Kafka so other consumers can also process it).
7. The URL Filter drops disallowed and malformed URLs.
8. The URL Seen Detector checks URL storage for previously visited URLs; unseen URLs are added back to the URL Frontier, and the cycle repeats.
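A compact sketch tying the steps together; every component interface used here (fetcher.fetch, parser.is_valid, and so on) is hypothetical glue rather than a prescribed API:

```python
def crawl(seed_urls, frontier, fetcher, parser, dedup, storage,
          extractor, url_filter, seen):
    """End-to-end crawl loop over the components sketched above."""
    for url in seed_urls:                       # 1. seed the frontier
        frontier.add(url)
    while (url := frontier.next_url()) is not None:
        html = fetcher.fetch(url)               # 2. download the page
        if html is None or not parser.is_valid(html):
            continue                            # 3. drop unfetchable/unfixable pages
        if dedup.is_duplicate(html):
            continue                            # 4. skip already-seen content
        storage.store(url, html)                # 5. persist the page
        for link in extractor.extract(url, html):   # 6. discover new URLs
            if url_filter.is_allowed(link) and seen.mark_if_unseen(link):
                frontier.add(link)              # 7-8. filter, dedupe, requeue
```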
Additional Discussion Points