Vertical Crawler Architecture Design

Source: Internet
Author: User

I have been working on web crawler development for nearly two years. Today a friend asked me about crawler architecture design, and I have long wanted to summarize the development process and the architecture problems I ran into along the way. These are my own notes, offered for reference only.

1. Crawler classification:

For me, crawlers fall into two categories:

Crawlers that need to load configuration files and crawlers that do not.

These correspond to what are usually called vertical crawlers and breadth-first (or depth-first) traversal crawlers.

2. Crawler architecture:

1. Breadth-first crawler

SEO practitioners probably know that if you want Baidu to crawl a website as soon as possible, optimizing keywords and improving page weight are the relatively simple steps. A crawler-friendly site also takes the crawler into account when it is built, for example in its tag settings and its structural design.

A breadth-first crawler crawls a site level by level, so a crawler-friendly site is designed to be wide rather than deep. This is hard to explain precisely, but the rough idea is to keep the depth small: from the homepage you can see every category, clicking a category shows its list, and clicking a list item shows the content. This reduces the number of interactions needed and lets the crawler find the important content directly.

The major search engines crawl data breadth-first. To see the effect, you can build a website with more than 10 levels of depth (home, category, subcategory, and so on, with the content visible only after more than 10 levels) and observe how well crawlers cover it.
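The effect of depth on a breadth-first crawl can be illustrated with a minimal sketch. The in-memory SITE dictionary, its paths, and the max_depth parameter are assumptions made for the example and stand in for real downloads. It is not how any particular search engine works.

```python
from collections import deque

# Hypothetical in-memory site: each page maps to the pages it links to.
SITE = {
    "/": ["/news", "/music"],
    "/news": ["/news/1", "/"],
    "/music": ["/music/1"],
    "/news/1": ["/news/1/comments"],
    "/music/1": [],
    "/news/1/comments": [],
}


def crawl_breadth_first(start, max_depth):
    """Visit pages level by level, stopping max_depth clicks from start."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append((page, depth))
        if depth == max_depth:
            continue  # do not follow links found on the deepest allowed level
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


visited = crawl_breadth_first("/", max_depth=2)
```

With max_depth=2 the page three clicks from the homepage is never reached, which is why content buried many levels deep tends to be crawled late or not at all.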

Breadth-first crawlers are far more common. Nutch and Heritrix are typical examples, and Nutch is the more widely used of the two.

2. Vertical crawler

As I see it, the vertical crawler has only been proposed in recent years. A vertical crawler focuses on content, accuracy, and efficiency. Common applications include public opinion monitoring projects and financial projects. It grabs only the useful data, and it can do simple processing of the content as it crawls, such as extracting the title, body, and publication time.
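The kind of simple processing described above could look roughly like the following sketch, which uses only Python's standard html.parser module. The markup conventions (title in a title tag, time in a time tag with a datetime attribute, body in p tags) and the SAMPLE page are assumptions for illustration. Real sites usually need per-site extraction rules.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect the title, publication time, and paragraph text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.time = ""
        self.paragraphs = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = tag
        elif tag == "p":
            self._current = tag
            self.paragraphs.append("")
        elif tag == "time":
            # Assumes the time is carried in the datetime attribute.
            self.time = dict(attrs).get("datetime", "")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_article(html):
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"title": parser.title.strip(), "time": parser.time, "content": body}


# Made-up sample page, used only to exercise the extractor.
SAMPLE = """<html><head><title> Market closes higher </title></head>
<body><time datetime="2015-03-01T09:30">March 1</time>
<p>Shares rose on Monday.</p><p>Volume was light.</p></body></html>"""

article = extract_article(SAMPLE)
```

Doing this at crawl time means only the three extracted fields need to be stored, rather than the whole page.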

Vertical crawlers emphasize efficiency in relative terms. Suppose a website has 100,000 links. A breadth-first crawler has to download all 100,000 links into your database, while a vertical crawler only downloads the data that is useful to you, perhaps a few thousand or 10,000 pages. If a vertical crawler that fetches so much less data is still slower than the breadth-first crawler, the project has failed on this point.

Another thing that sets the vertical crawler apart from the breadth-first crawler is that the vertical crawler is customizable and can be tailored to a specific website. For example, crawling QQ Space or Sina Weibo requires logging in, and if a site only displays its content after JavaScript runs, a JavaScript engine is needed to ensure the quality of the crawl. Plugin-based customization is another defining characteristic of the vertical crawler.
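One possible way to organize such per-site customization is a small plugin registry keyed by host name. The sketch below is an assumption about how this might be structured, not a description of any existing crawler. The host names, handler names, and strategy labels are hypothetical, and the handlers only report which strategy they would use instead of logging in or rendering JavaScript.

```python
from urllib.parse import urlparse

# Registry mapping a host name to the handler customized for that site.
PLUGINS = {}


def plugin(host):
    """Decorator that registers a site-specific handler for the given host."""
    def register(func):
        PLUGINS[host] = func
        return func
    return register


@plugin("weibo.com")
def needs_login(url):
    # Placeholder: a real plugin would log in and reuse the session here.
    return {"url": url, "strategy": "login"}


@plugin("example.com")
def needs_javascript(url):
    # Placeholder: a real plugin would render the page in a JavaScript engine.
    return {"url": url, "strategy": "render-javascript"}


def default_handler(url):
    return {"url": url, "strategy": "plain-download"}


def plan_fetch(url):
    """Pick the handler registered for the URL's host, else the default."""
    host = urlparse(url).hostname or ""
    handler = PLUGINS.get(host, default_handler)
    return handler(url)
```

For example, plan_fetch("http://weibo.com/u/1") is routed to the login handler, while a URL on an unregistered host falls back to a plain download.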


3. Crawler Design

A crawler can seem complicated, but it is actually simple: discover a site's links, process them, download them, and then continue discovering links.

Crawling a site starts from a web address that you provide, for example http://www.baidu.com. The crawler downloads that page and finds several links in it, such as http://news.baidu.com and http://music.baidu.com. It then opens each of those links and collects new links from them, and so on.
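The link discovery step can be sketched with the standard library alone. The PAGE snippet and the /more path are invented for the example. A real crawler would pass in the HTML it has just downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkFinder(HTMLParser):
    """Collect the href of every anchor tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def discover_links(base_url, html):
    finder = LinkFinder(base_url)
    finder.feed(html)
    return finder.links


# Hypothetical page content standing in for a downloaded homepage.
PAGE = (
    '<a href="http://news.baidu.com">News</a> '
    '<a href="/more">More</a> '
    "<a>no href</a>"
)
links = discover_links("http://www.baidu.com", PAGE)
```

Resolving relative links with urljoin matters because the later stages compare and download absolute URLs.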

If that were all, the crawler would "die" on this site. For example, other pages will also link back to http://www.baidu.com, so the crawler would enter it again and loop forever. So links need to be managed effectively and deduplicated. Once a link has been crawled, it can be dropped when it is discovered again, or skipped at download time. The set of links left to crawl then shrinks, and the crawl of the whole site eventually comes to an end.
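A minimal sketch of this kind of link management follows. The LinkManager class and the normalization rules are assumptions for illustration. Production crawlers usually apply more normalization rules and keep the seen set in persistent storage.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Drop the fragment: it does not change which page is downloaded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class LinkManager:
    """Hand out each link at most once."""

    def __init__(self):
        self._seen = set()
        self._pending = deque()

    def add(self, url):
        """Queue the URL unless an equivalent one was already added."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def next(self):
        """Return the next URL to download, or None when the crawl is done."""
        return self._pending.popleft() if self._pending else None


manager = LinkManager()
results = [
    manager.add("http://www.baidu.com"),
    manager.add("http://news.baidu.com"),
    manager.add("HTTP://WWW.BAIDU.COM/#top"),  # same page as the first link
]
```

Because a link is recorded as seen when it is added rather than when it is downloaded, a page that links back to the homepage cannot put it in the queue a second time.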

I am not great at explaining this, so here is the short version: whatever type of crawler you write, it needs these stages: link discovery, link management, link download, and then retrieval of the content.

4. Why is a vertical crawler called vertical?

When I first started working on crawlers I wondered why it is called a vertical crawler. Even after all this time I am not entirely sure. My guess is that vertical means going straight down to the target: the crawler has a goal and goes directly for what you want.

A vertical crawler differs from an ordinary crawler in that it is purposeful, and its purpose is much more clearly defined than that of other crawlers. If I only want the articles and nothing else, a vertical crawler can do exactly that.

The design of a vertical crawler is not very different from that of a breadth-first crawler. It has the same stages: link discovery, link management, and link download. However, most vertical crawlers add an information extraction stage. Extracting the content is the biggest feature of a vertical crawler and also its most important function. So the crawler's focus should be reflected in every stage: link discovery, link management, link download, and extraction.
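The sketch below shows one way the extra extraction stage and the "purposeful" link handling might fit into the same loop. Everything in it is hypothetical: the SITE dictionary stands in for real downloads, and the CONFIG keys (start, follow, extract) are invented names for the kind of per-site configuration a vertical crawler would load.

```python
import re
from collections import deque

# Hypothetical site: URL -> (page title, outgoing links).
SITE = {
    "/": ("Home", ["/news", "/about"]),
    "/news": ("News list", ["/news/101", "/news/102", "/"]),
    "/about": ("About us", ["/"]),
    "/news/101": ("Rates hold steady", ["/news"]),
    "/news/102": ("Exports rise", ["/news", "/news/101"]),
}

# Hypothetical per-site configuration loaded by the vertical crawler.
CONFIG = {
    "start": "/",
    "follow": re.compile(r"^/(news)?$"),    # pages worth expanding for links
    "extract": re.compile(r"^/news/\d+$"),  # pages that hold wanted content
}


def vertical_crawl(config):
    seen = {config["start"]}
    queue = deque([config["start"]])
    downloaded, records = [], []
    while queue:
        url = queue.popleft()
        title, links = SITE[url]  # link download (simulated)
        downloaded.append(url)
        if config["extract"].match(url):  # information extraction
            records.append({"url": url, "title": title})
            continue
        for link in links:  # link discovery
            wanted = config["follow"].match(link) or config["extract"].match(link)
            if wanted and link not in seen:  # link management
                seen.add(link)
                queue.append(link)
    return downloaded, records


downloaded, records = vertical_crawl(CONFIG)
```

In this run the /about page is never downloaded, because it matches neither pattern, and only the two article pages produce records.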

I think vertical crawlers are better suited to enterprise use, because an enterprise's information needs are clearly defined. A vertical crawler is also smaller, easier to manage, and allows its functions to be customized.
