Design and Analysis of a Web Crawler in a Search Engine

Source: Internet
Author: User
Keywords: search engine, SEO


Simply put, a web crawler is similar to an offline reading tool. Despite the word "offline," it still has to connect to the network; otherwise, how could it fetch anything? So where are the differences?

1) A web crawler is highly configurable.

2) A web crawler can parse the links in the pages it captures.

3) A web crawler has a simple storage configuration.

4) A web crawler can intelligently analyze how often web pages are updated.

5) A web crawler is highly efficient.

Given these characteristics, how should a crawler actually be designed? What steps deserve attention?

1) URL traversal and recording

Larbin does this very well. In fact, URL traversal is quite simple; for example:

# split the page on double quotes, then keep the lines that start with http://
cat [what you got] | tr \" \\n | gawk '{print $1}' | pcregrep ^http://

and you get a list of URLs.
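The "recording" half of this step matters just as much as the traversal: every extracted URL should be checked against the set of URLs already seen before it joins the queue. A minimal sketch in Python (the normalization rules here are my own illustration, not from the original):

    from urllib.parse import urldefrag, urlparse

    class URLRecorder:
        """Keeps the crawl frontier free of duplicates by remembering every URL seen."""

        def __init__(self):
            self.seen = set()
            self.frontier = []

        def add(self, url: str) -> bool:
            url, _ = urldefrag(url)             # drop #fragment: same document
            url = url.rstrip("/")               # treat .../dir and .../dir/ alike
            if urlparse(url).scheme != "http":  # the pipeline above only keeps http://
                return False
            if url in self.seen:
                return False
            self.seen.add(url)
            self.frontier.append(url)
            return True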

2) Multi-process vs. multithreading

Each has its merits. These days an ordinary PC, such as the one running booso.com, can easily crawl about 5 GB of data a day, roughly 200,000 pages.
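For a rough sense of what either model looks like in practice, here is a sketch of a thread-pool fetcher in Python (the worker count and timeout are illustrative guesses, not figures from the article):

    import concurrent.futures
    import urllib.request

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()

    def crawl_batch(urls):
        """Fetch a batch of URLs with a small pool of worker threads."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
            futures = {pool.submit(fetch, u): u for u in urls}
            for fut in concurrent.futures.as_completed(futures):
                url = futures[fut]
                try:
                    yield url, fut.result()
                except Exception as exc:  # network errors are routine in a crawler
                    print(f"failed {url}: {exc}")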

3) Update timing control

The dumbest approach is to ignore update timing entirely: crawl everything in one pass, then turn around and crawl it all again.

A typical approach is to compare each newly crawled copy of a page with the copy from the previous crawl. If the page is unchanged for 5 consecutive crawls, the crawl interval for that page is doubled.

If the page has changed on 5 consecutive crawls, the crawl interval is halved.
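A minimal sketch of this adaptive schedule, assuming pages are compared by content hash (the class name, intervals, and bounds are my own, not from the original):

    import hashlib
    import time

    class RecrawlScheduler:
        """Double the interval after 5 unchanged fetches, halve it after 5 changed ones."""

        def __init__(self, base_interval=3600, min_interval=600, max_interval=30 * 86400):
            self.interval = base_interval          # seconds between crawls of this page
            self.min_interval = min_interval
            self.max_interval = max_interval
            self.last_hash = None
            self.unchanged_streak = 0
            self.changed_streak = 0
            self.next_due = time.time()

        def record_fetch(self, body: bytes) -> None:
            digest = hashlib.sha1(body).hexdigest()
            if digest == self.last_hash:
                self.unchanged_streak += 1
                self.changed_streak = 0
                if self.unchanged_streak >= 5:     # unchanged 5 times in a row
                    self.interval = min(self.interval * 2, self.max_interval)
                    self.unchanged_streak = 0
            else:
                self.changed_streak += 1
                self.unchanged_streak = 0
                if self.changed_streak >= 5:       # changed 5 times in a row
                    self.interval = max(self.interval // 2, self.min_interval)
                    self.changed_streak = 0
            self.last_hash = digest
            self.next_due = time.time() + self.interval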

Note that efficiency is one of the keys to winning.

4) How deep should the crawl go?

It depends. If you are a heavyweight with tens of thousands of servers running your crawler, feel free to skip this point.

If you have only one server for your crawler, then you should know this statistic:

Page depth : page count : page importance

0 : 1 : 10
1 : 20 : 8
2 : 600 : 5
3 : 2,000 : 2
4 and deeper : 6,000 : generally incalculable

So crawling to level three is about right. Going one level deeper, first, expands the data volume three to four times; second, the importance of those pages drops considerably. This is called "plant dragon seeds, harvest fleas."
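A depth-limited breadth-first traversal expresses this rule directly. A sketch, assuming fetch and extract_links helpers are supplied by the caller:

    from collections import deque

    def crawl(seed: str, fetch, extract_links, max_depth: int = 3):
        """Breadth-first crawl that never descends past max_depth."""
        seen = {seed}
        queue = deque([(seed, 0)])
        while queue:
            url, depth = queue.popleft()
            page = fetch(url)
            yield url, depth, page
            if depth == max_depth:         # level three: stop descending
                continue
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))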

5) A crawler generally does not fetch the other side's pages directly; it usually goes out through a proxy. The proxy relieves pressure, because when the remote page has not been updated, it is enough to fetch just the header tags; there is no need to transmit the whole body again, which saves a great deal of network bandwidth.

In the Apache web server's logs, such cache hits show up as 304 (Not Modified) records.
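What the author describes matches the standard HTTP conditional GET. A minimal sketch with Python's standard library (the function name and timeout are my own):

    import urllib.error
    import urllib.request

    def fetch_if_changed(url, etag=None, last_modified=None):
        """Conditional GET: the server answers 304 (headers only) if the page
        is unchanged, so the body never crosses the wire."""
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status, resp.read(), resp.headers
        except urllib.error.HTTPError as err:
            if err.code == 304:            # Not Modified: nothing to download
                return 304, b"", err.headers
            raise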

6) Please respect robots.txt.
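Python's standard library ships a robots.txt parser, so honoring it costs only a few lines; a quick illustration with placeholder URLs:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()                              # fetch and parse the file once per host

    if rp.can_fetch("mycrawler", "http://example.com/private/page.html"):
        pass  # allowed: go ahead and fetch
    else:
        pass  # disallowed: skip this URL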

7) Storage structure

This is a matter of taste. Google uses its GFS system; if you have 7 or 8 servers, I would advise NFS; if you have 70 or 80 servers, I would suggest AFS; and if you have only one server, then anything goes.

Here is a snippet of code from the news search engine I wrote, showing how pages are stored:

# percent-encode every character that is not a word char, '-', '.', or '@',
# turning the URL into a filesystem-safe file name
NAME=`echo $URL | perl -p -e 's/([^\w\-\.\@])/$1 eq "\n" ? "\n" : sprintf("%%%2.2x", ord($1))/eg'`

mkdir -p $AUTHOR

newscrawl.pl $URL --user-agent="news.booso.com+(+http://booso.com)" -outfile=$AUTHOR/$NAME

Pay particular attention to the following sentences:

1. Typically, each newly crawled copy of a page is compared with the previous one; if there is no change for 5 consecutive crawls, the crawl interval for that page is doubled, and if the page has changed on 5 consecutive crawls, the crawl interval is halved.

A page's update frequency seriously affects how often search engine spiders crawl the site: more crawls mean a higher probability of the page being indexed, and getting pages indexed is the most basic link in SEO.

2. Crawling to level three is about right. Going one level deeper, first, expands the data volume three to four times; second, the importance of the pages really does drop considerably. This is called "plant dragon seeds, harvest fleas."

Try to keep the site within three directory levels. Deep pages put great pressure on search engines; of course, I think Google has enough servers to bear that pressure, but looked at from the other side, pages beyond the third directory level are crawled and updated far less frequently. As I said before, find ways to align the site's physical structure with its logical structure, which shows up in good URL design. Now go check how many directory levels your generated static pages actually sit in, and consider whether that can be optimized.
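To check that last point programmatically, a small helper like this (my own illustration) counts the directory levels in a URL's path:

    from urllib.parse import urlparse

    def directory_depth(url: str) -> int:
        """Count directory levels in a URL path; e.g. /a/b/page.html -> 2."""
        segments = [s for s in urlparse(url).path.split("/") if s]
        # The last segment is usually the page itself, not a directory.
        return max(len(segments) - 1, 0)

    assert directory_depth("http://booso.com/a/b/page.html") == 2

Anything deeper than three levels is a candidate for flattening.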
