An introduction to web crawlers

April 23, 2017 | tags: web crawlers

Web crawlers are programs that systematically browse the web, downloading pages and following the links they contain. They are the foundation of search engines and other services that index online content, and in that sense they shape the way we find and experience the internet.

An overview

It's not unusual for a page to have hundreds or thousands of references to external sources. In principle, the idea of hypertext is to link pieces of information inside a given text so the reader can follow them and explore the concepts further. For instance, let's take an article on Wikipedia <http://wikipedia.org>. Most articles cite external sources to lend reliability to the information presented. Each citation defines a connection between the Wikipedia page and the referenced source, which in turn may have connections of its own, increasing the number of pages one can reach by following the links exposed in these pages.

With this in mind, web crawling is the process of gathering pages from the internet, extracting the links they contain, and building an index to support a given objective, usually a search engine. Crawling aims to quickly find and store as many useful pages as possible by following the links that reference them.

A crawler achieves this goal by starting from an initial set of URLs known as seeds. It picks one URL from the set and fetches the referenced page. The downloaded page is then parsed to extract its content and the links it references: the content is added to a text indexer, and the links are added to the URL frontier, the set of pages that have not been visited yet (initially the frontier is just the seeds, but as the incremental crawling proceeds, the set of URLs to be fetched grows).
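To make the fetch/parse/frontier cycle concrete, here is a minimal sketch of that loop in Python. It assumes the `requests` and `beautifulsoup4` packages are available, and the seed URL and page limit are arbitrary choices for illustration, not part of any real crawler.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # URL frontier: pages not visited yet
    seen = set(seeds)                # URLs already added to the frontier
    index = {}                       # url -> extracted text (stand-in for a real indexer)

    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                 # skip pages that fail to download

        soup = BeautifulSoup(response.text, "html.parser")
        index[url] = soup.get_text() # "indexing": here we just keep the raw text

        # Extract links and push unseen ones onto the frontier
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

    return index

# Example usage with a hypothetical seed:
# pages = crawl(["https://en.wikipedia.org/wiki/Web_crawler"])
```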

Despite the simplicity of the crawler's job, some criteria must be met along the way: there are features a crawler must have and others it should have. Robustness and politeness are features every crawler must have to perform well. There are plenty of spider traps on the web, generators of web pages that mislead a crawler into downloading an infinite number of pages from a particular domain, so a crawler must be robust enough to avoid these situations. Regarding how a crawler visits a page, there are rules about how often and how many times a page or domain should be downloaded. If a crawler is too greedy, it might be mistaken for a DDoS attack and be blocked by the server. There are also rules about which areas of a site the crawler may visit, defined by the robots exclusion protocol <https://en.wikipedia.org/wiki/Robots_exclusion_standard>, which sets the standard for how politely a crawler should behave while downloading pages.
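As a rough illustration of politeness, the sketch below checks robots.txt with Python's standard `urllib.robotparser` and spaces out requests to the same host. The one-second delay and the user agent name are assumptions made for the example, not values mandated by the protocol.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}   # host -> parsed robots.txt (or None if unreachable)
last_fetch = {}     # host -> timestamp of the last request to that host

def allowed(url, user_agent="my-crawler"):
    """Check the robots exclusion rules before fetching a URL."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        parser = RobotFileParser(f"https://{host}/robots.txt")
        try:
            parser.read()
        except OSError:
            parser = None            # no robots.txt reachable; policy choice: allow
        robots_cache[host] = parser
    parser = robots_cache[host]
    return parser is None or parser.can_fetch(user_agent, url)

def wait_politely(url, delay=1.0):
    """Sleep so that requests to the same host are at least `delay` seconds apart."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_fetch[host] = time.time()
```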

Since the crawler deals with the ocean of pages and information that is the web, some features are desirable to produce good results. Being distributable, scalable, efficient, and extensible are among them, along with result quality and freshness.

Brief history of web crawlers

Spring 1993–1996 - World Wide Web Wanderer - Matthew Gray - Perl, one machine only
  • Implemented to gather statistics on the growth of the web
  • The pages crawled were indexed in the Wandex, giving rise to the first search engine

The Internet Archive crawling system brought to light the problems related to web crawling, such as scalability, throttling, DNS blocklist, robot exclusion and politeness concerns.

PageRank and the first-generation Google crawler were more relevant for the search algorithm than for the crawler's system architecture. They also marked the rise of the social concerns around web crawling (webmasters' complaints).

Mercator and Polybot served as blueprint designs for crawlers: Mercator defined a highly scalable and extensible design, while Polybot addressed scalable data structures and polite crawling techniques.

IBM's WebFountain introduced URL prioritization and incremental crawling.

A prototypical crawler architecture

Bots might deal with the multitude of problems that web crawling presents in different ways, but today most of the crawling process is well defined. The Mercator crawler set the basis for a number of research and commercial crawlers. Basically, the process starts with the URL frontier data structure, which sends URLs to the HTTP fetcher to download the page of the given resource. Before fetching, however, the fetcher calls the DNS resolver to get the resource's IP address and checks the robots exclusion policies. After these verifications, the page is fetched. Once the download succeeds, the page is passed to the link extractor, which parses the hyperlinks contained in it and sends them to the URL distributor, which decides which crawling process each URL will be assigned to. The URLs are then passed through filters that remove duplicates and resources not relevant to the crawler's objective, such as URLs with unwanted file extensions or black-listed sites. Finally, each URL is passed to the URL prioritizer, which defines its position in the frontier, and the process restarts for these newly added resources.
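The sketch below breaks that pipeline into small Python functions, one per stage described above. The function names, the blacklist, the skipped extensions and the priority heuristic are assumptions made for illustration; they are not the actual Mercator components.

```python
import socket
from urllib.parse import urlparse

BLACKLIST = {"spam.example.com"}                 # hypothetical black-listed hosts
SKIP_EXTENSIONS = (".jpg", ".png", ".zip", ".pdf")

def resolve_dns(url):
    """DNS resolver: map the URL's host to an IP address before fetching."""
    return socket.gethostbyname(urlparse(url).netloc)

def url_filter(url, seen):
    """Filters: drop duplicates and resources not relevant to the crawl."""
    if url in seen:
        return False
    if urlparse(url).netloc in BLACKLIST:
        return False
    if url.lower().endswith(SKIP_EXTENSIONS):
        return False
    return True

def distribute(url, num_processes):
    """URL distributor: assign each host to one crawling process."""
    return hash(urlparse(url).netloc) % num_processes

def prioritize(url):
    """URL prioritizer: a trivial heuristic, shallower paths come first."""
    return urlparse(url).path.count("/")
```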

Design issues

Pages are downloaded recursively, so the volume of information grows quickly; keeping track of the pages already downloaded and of those still to be fetched prevents the crawler from visiting and downloading the same page more than once.
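One simple way to keep that bookkeeping cheap is a "URL-seen" test that stores a fixed-size fingerprint of each URL instead of the full string. This is only a minimal sketch; real crawlers typically use more compact or probabilistic structures such as Bloom filters.

```python
import hashlib

class UrlSeenTest:
    def __init__(self):
        self._fingerprints = set()

    def _fingerprint(self, url):
        # 8-byte SHA-1 prefix as a compact stand-in for the full URL string
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add(self, url):
        """Return True if the URL was new, False if it was already seen."""
        fp = self._fingerprint(url)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        return True

# Usage: only enqueue a URL in the frontier if seen.add(url) returns True.
```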