
Building a Simple Search Engine: Part 1 - Mercator-based Web Crawler in Python

A simple search engine with Python, covering web crawling, data indexing, and ranking algorithms. Learn how search engines work and build your own!

Introduction

Have you ever wanted to build your own search engine? Well, it's not as complex as it might seem. In this three-part series, we'll explore how to build a simple search engine from scratch. This first part focuses on the crawling process: a parallel web crawler built in Python using the Mercator scheme.

Mercator Scheme

The Mercator scheme is a URL-frontier architecture designed for scalability and politeness. It manages URLs with several cooperating queues:

  • Front (frontier) queues: These hold discovered URLs that are waiting to be crawled, ordered by priority.
  • Back queues: Each back queue holds the pending URLs of a single host, so that requests to the same domain can be spaced out.
  • Back selector: A priority queue, keyed by the earliest time each host may be contacted again, that decides which back queue to draw the next URL from.

This scheme balances workload across multiple workers and avoids overloading individual domains.
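
To make that layout concrete, here is a minimal sketch of the frontier's data structures. It is illustrative rather than the full implementation: it uses a single front queue for simplicity (Mercator keeps one per priority level), and the names back_queues, back_selector, and domain_map match the snippet shown later in this post.

from queue import Queue, PriorityQueue
from time import time

class Frontier:
    # Minimal sketch of a Mercator-style URL frontier (illustrative only).

    def __init__(self, num_back_queues=3):
        # Front queue: discovered URLs waiting to be assigned to a domain.
        # (Mercator keeps several front queues, one per priority level.)
        self.front_queue = Queue()
        # Back queues: one per active domain, so requests to a single host
        # can be spaced out for politeness.
        self.back_queues = [Queue() for _ in range(num_back_queues)]
        # Maps each domain to the index of the back queue that owns it.
        self.domain_map = {}
        # Back selector: priority queue of (timestamp, back_queue_index);
        # the entry with the earliest timestamp is served next.
        self.back_selector = PriorityQueue()
        for idx in range(num_back_queues):
            self.back_selector.put((time(), idx))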

Implementation

Our crawler implementation in Python utilizes libraries like requests, beautifulsoup4, and urllib. Here's a breakdown of the key components:

1. Initialization:

  • A Frontier class manages the queues and prioritizes URLs.
  • Seed URLs are provided to initialize the queues.
  • Crawler parameters like thread count and crawl limit are defined.
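
As a rough sketch of this step, assuming the Frontier from the sketch above and illustrative names for the seeds and parameters (NUM_THREADS, CRAWL_LIMIT, SEED_URLS; WAITTIME is the politeness delay used by get_URL later on):

NUM_THREADS = 8      # number of crawler worker threads (illustrative)
CRAWL_LIMIT = 1000   # stop after this many pages have been fetched (illustrative)
WAITTIME = 2         # minimum delay in seconds between requests to the same domain

SEED_URLS = [
    "https://example.com/",
    "https://www.python.org/",
]

frontier = Frontier()
for url in SEED_URLS:
    frontier.front_queue.put(url)  # seed URLs enter through the front queue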

2. URL Fetching:

  • A fetch_data function retrieves HTML content for a given URL.
  • A parse_data function extracts links from the downloaded HTML.
  • A filter_URLs function checks robots.txt and removes duplicates.
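
A possible sketch of these three helpers using requests, BeautifulSoup, and urllib is shown below; the exact signatures in the real crawler may differ, and a production version would cache one robots.txt parser per domain instead of re-reading it for every URL.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def fetch_data(url):
    # Download the HTML for a URL, returning None on any request failure.
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

def parse_data(html, base_url):
    # Extract absolute links from a downloaded HTML page.
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def filter_URLs(urls, crawled):
    # Drop URLs that were already crawled or that robots.txt disallows.
    allowed = []
    for url in urls:
        if url in crawled:
            continue
        parts = urlparse(url)
        robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            robots.read()
        except OSError:
            continue
        if robots.can_fetch("*", url):
            allowed.append(url)
    return allowed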

3. Mercator Crawling:

  • Threads call a crawler_thread_task function to crawl URLs.
  • Each thread acquires a lock before accessing shared resources.
  • A get_URL function retrieves a URL from the frontier, considering back queue priorities.
  • Crawled URLs are stored in a global set to prevent duplication.
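
Putting those pieces together, a worker loop might look roughly like this. The shared crawled set, the lock, and the CRAWL_LIMIT stopping condition are assumptions about how the real code coordinates its threads; fetch_data, parse_data, and filter_URLs refer to the helpers sketched earlier.

import threading
from time import sleep

crawled = set()          # global set of URLs that have already been handed out
lock = threading.Lock()  # guards the frontier and the crawled set

def crawler_thread_task(frontier):
    # Worker loop: take a URL from the frontier, fetch it, and enqueue new links.
    while True:
        with lock:
            if len(crawled) >= CRAWL_LIMIT:
                return
            next_url, sleep_time = frontier.get_URL()
            crawled.add(next_url)
        sleep(sleep_time)               # politeness delay for this URL's domain
        html = fetch_data(next_url)
        if html is None:
            continue
        new_urls = filter_URLs(parse_data(html, next_url), crawled)
        with lock:
            for url in new_urls:
                frontier.front_queue.put(url)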

4. Data Management:

  • Crawled HTML content can be saved for further analysis.
  • Extracted links are filtered for robots.txt restrictions and duplicates.
  • New URLs are added to the frontier for further crawling.
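
For instance, each downloaded page could be written to disk under a hash of its URL so the indexing stage in Part 2 can read it back; the directory name and hashing choice here are only illustrative.

import hashlib
from pathlib import Path

def save_page(url, html, out_dir="crawled_pages"):
    # Persist downloaded HTML so the indexing stage can read it back later.
    Path(out_dir).mkdir(exist_ok=True)
    # A hash of the URL gives a stable, filesystem-safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = Path(out_dir) / f"{name}.html"
    path.write_text(html, encoding="utf-8")
    return path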

Code Snippet:

Here's an example of how the get_URL function selects the next URL, using the back selector's timestamps to enforce a per-domain politeness delay:

def get_URL(self):
    # Pick the back queue whose recorded timestamp is earliest, i.e. the
    # domain that has waited the longest since its last scheduled fetch.
    timestamp, idx = self.back_selector.get()
    # Enforce politeness: if that domain was scheduled too recently, the
    # caller must wait out the remainder of WAITTIME before fetching.
    time_elapsed = time() - timestamp
    sleep_time = 0
    if time_elapsed < WAITTIME:
        sleep_time = int(WAITTIME - time_elapsed)
    next_url = self.back_queues[idx].get()
    # If this back queue is now empty, unmap its domain and refill the slot
    # with URLs from another domain.
    if self.back_queues[idx].empty():
        self.domain_map = {k: v for k, v in self.domain_map.items() if v != idx}
        self.add_to_backqueue(idx)
    # Record when this URL is expected to be fetched so future selections
    # keep requests to the same domain spaced apart.
    self.back_selector.put((time() + sleep_time, idx))
    return next_url, sleep_time

This code shows how URLs are handed out to worker threads: the back selector picks the back queue whose domain has waited the longest, and the returned sleep_time tells the worker how long to pause before fetching, keeping the crawler polite.
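
As a closing illustration, and still under the assumptions of the earlier sketches (Frontier, crawler_thread_task, NUM_THREADS), the whole crawl could be launched by starting the worker threads on the seeded frontier:

workers = [threading.Thread(target=crawler_thread_task, args=(frontier,))
           for _ in range(NUM_THREADS)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(f"Crawled {len(crawled)} pages")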

Conclusion

This first part of the series established the foundation for building a simple search engine using the Mercator scheme. We've implemented a parallel web crawler in Python and walked through its key components and code snippets. Stay tuned for the next part, where we'll explore creating an inverted index from the crawled data!