How a Search Engine Spider Crawls Web Pages

Source: Internet
Author: User
Keywords: Search


What is a spider? A spider, also known as a crawler, is simply a program. It follows the URLs of your website layer by layer, reads the information it finds, does some simple processing, and then feeds the results back to a back-end server for centralized processing. To optimize a site well, we need to understand what the spider prefers. Below we walk through how the spider works.

First, the spider's difficulties

Do spiders have troubles too? Yes. Just as people have their difficulties, spiders have theirs. Handling dynamic web pages has always been a problem for web spiders. A dynamic web page is a page generated automatically by a program. As development languages multiply, so do the kinds of dynamic pages, such as ASP, JSP, and PHP. These pages are generated on the server, and what the spider receives is ordinary markup of the kind a browser such as IE interprets, so a web spider may find them somewhat easier to handle. What web spiders really struggle with are pages whose content is generated by scripting languages such as VBScript and JavaScript. This is why, in site optimization, we repeatedly stress using as little JS code as possible: to process such pages fully, a web spider needs its own script interpreter.

A spider system is generally built from plug-ins: a plug-in management service program picks a different plug-in for each form of web page it encounters. Loading and running the scripts on a page undoubtedly adds to the spider program's running time. In other words, calling these plug-ins wastes the spider's precious time. So one job of an SEOer is to optimize the site and cut unnecessary script code, which makes the spider's crawl easier.
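To make the plug-in idea concrete, here is a minimal sketch of a plug-in manager that picks a handler by page type. Everything in it is an assumption for illustration: the names `PLUGINS`, `plugin`, and `process`, the page-type labels, and the stub handlers are hypothetical, and no real search engine's design is implied.

```python
from typing import Callable, Dict

# Hypothetical plug-in registry: maps a page type to a handler function.
PLUGINS: Dict[str, Callable[[str], str]] = {}


def plugin(page_type: str):
    """Register a handler for one page type (labels are illustrative)."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[page_type] = func
        return func
    return register


@plugin("html")
def handle_html(body: str) -> str:
    # Plain markup can be parsed directly; this stub just passes it through.
    return body


@plugin("javascript")
def handle_javascript(body: str) -> str:
    # A real spider would need a script interpreter here, which is the
    # expensive step. This stub only marks that the cost would be paid.
    return "[needs script execution] " + body


def process(page_type: str, body: str) -> str:
    """Dispatch a fetched page to the plug-in registered for its type."""
    handler = PLUGINS.get(page_type)
    if handler is None:
        raise ValueError(f"no plug-in for page type: {page_type}")
    return handler(body)
```

In this sketch the cost of a script-heavy page shows up as an extra plug-in call, which is the time the spider would rather not spend.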

Second, the spider's update cycle

The world is always changing, and so is the content of a website: if the content is not being updated, the template is being changed. An intelligent crawler therefore has to keep refreshing the pages it has crawled, which is also called updating the page snapshot. The spider's developers set an update cycle for the crawler (the cycle may itself be determined by a dynamic algorithm, which is what we usually call an algorithm update), so that it scans the site at specified times to see which pages need updating, for example: whether the home page title has changed, which pages are new, and which pages are expired dead links. A strong search engine keeps tuning its update cycle, because the cycle has a great impact on the engine's recall. If the cycle is too long, the engine's accuracy and completeness suffer, and some newly created pages cannot be found by search. If the cycle is too short, it is technically harder to implement and wastes bandwidth and server resources. A flexible update cycle is therefore important. The update cycle is a perennial topic for search engines, and one that programmers and SEO practitioners keep studying.
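One simple way to picture a dynamically determined update cycle is shown below. This is only a sketch under our own assumptions: the halve-or-double rule, the one-hour and thirty-day bounds, and the function names are hypothetical, and real search engines do not publish their scheduling rules.

```python
import hashlib

MIN_INTERVAL_HOURS = 1.0          # assumed lower bound on the cycle
MAX_INTERVAL_HOURS = 24.0 * 30    # assumed upper bound (about 30 days)


def fingerprint(content: str) -> str:
    """Return a hash of the page content, used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def next_interval(current_hours: float, old_fp: str, new_content: str) -> float:
    """Halve the interval if the page changed, double it if not.

    The result is clamped to [MIN_INTERVAL_HOURS, MAX_INTERVAL_HOURS].
    """
    changed = fingerprint(new_content) != old_fp
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, proposed))
```

A page that changes at every visit drifts toward the shortest cycle, while a page that never changes drifts toward the longest one, which is the trade-off between freshness and wasted bandwidth described above.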

Third, the spider's crawling strategy

Above we covered what the spider struggles with and its update cycle. Now we come to the key topic: crawling strategy.

1. Layer-by-layer crawl strategy

A search engine collects web pages through a web crawler. This process is an algorithm that draws on two data structures, the graph and the tree. A site has only one home page, and that is where the spider starts crawling. It fetches the home page, extracts all the links in it (the internal links), fetches the new pages those links point to, extracts the links in those pages, and repeats the process until it reaches the leaf nodes of the whole site (the content pages under each sub-column). That is how the crawler collects pages. Because many sites hold a very large amount of information, crawling this way can take a very long time, so a site is generally crawled only in broad strokes, for example by applying the layer-by-layer strategy to just two layers. This keeps the crawler from getting "trapped" deep inside a site during extraction, which would make it inefficient. For this reason, the traversal algorithms a web crawler mainly uses are the breadth-first and best-first algorithms from graph theory. The depth-first algorithm is used less because it gets trapped easily.
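The layer-by-layer idea can be sketched as a breadth-first traversal with a depth limit. To keep the example runnable without network access, the site is modeled as an in-memory dictionary of links. The `SITE` data, the function name `crawl_by_layer`, and the `max_depth` parameter are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List


def crawl_by_layer(links: Dict[str, List[str]], home: str, max_depth: int) -> List[str]:
    """Visit pages breadth-first from the home page, at most max_depth layers deep."""
    visited = {home}
    order: List[str] = []
    queue = deque([(home, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)            # "collect" the page
        if depth == max_depth:
            continue                 # do not follow links below the limit
        for nxt in links.get(url, []):
            if nxt not in visited:   # skip pages already queued or collected
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical site: each page maps to the internal links it contains.
SITE = {
    "/": ["/news", "/products"],
    "/news": ["/news/1", "/news/2"],
    "/products": ["/products/a", "/"],
    "/news/1": ["/news/1/comments"],
}
```

With this data, `crawl_by_layer(SITE, "/", 2)` returns `['/', '/news', '/products', '/news/1', '/news/2', '/products/a']`: the page `/news/1/comments` sits on the third layer and is never fetched, so the crawler cannot sink arbitrarily deep into one branch.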

2. No-repeat crawl strategy

The number of pages on the World Wide Web is enormous, so crawling them is a huge undertaking that consumes a great deal of bandwidth, hardware resources, and time. Crawling the same page repeatedly not only greatly lowers the system's efficiency but also hurts the accuracy of the collected information. Search engine systems therefore commonly adopt a no-repeat crawl strategy, which ensures that the same page is crawled only once within a given period of time.
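The rule "crawl the same page only once in a certain period" can be sketched with a record of when each URL was last crawled. The class name `CrawlHistory`, the `window_seconds` parameter, and passing the clock in as `now` are assumptions made for illustration. A production system would also need to normalize URLs and persist this record.

```python
from typing import Dict


class CrawlHistory:
    """Remember when each URL was last crawled and refuse repeats inside a window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._last_crawled: Dict[str, float] = {}

    def should_crawl(self, url: str, now: float) -> bool:
        """Return True and record the visit unless url was crawled within the window."""
        last = self._last_crawled.get(url)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_crawled[url] = now
        return True
```

For example, with `history = CrawlHistory(3600)`, a URL accepted at time 0 is rejected at time 100 and accepted again at time 3600. Passing `now` explicitly (rather than reading the system clock inside the method) keeps the sketch easy to test.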

The B-tree, formally a balanced multiway search tree, is widely used in operating system algorithms. B-tree search can also be used to design the URL matching (that is, comparison) algorithm that lets a search engine avoid repeated crawls.
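As a rough illustration of B-tree based URL matching, the sketch below implements a small in-memory B-tree that supports only insertion and lookup. It is our own minimal version under stated assumptions, not any search engine's actual structure: the class names, the minimum degree `t`, and the `example.com` URLs in the usage note are all hypothetical.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []        # sorted keys stored in this node
        self.children = []    # child nodes; empty when the node is a leaf
        self.leaf = leaf


class BTree:
    """Minimal B-tree of minimum degree t (each node holds at most 2t - 1 keys)."""

    def __init__(self, t=2):
        self.t = t
        self.root = BTreeNode()

    def __contains__(self, key):
        node = self.root
        while True:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True
            if node.leaf:
                return False
            node = node.children[i]

    def _split_child(self, parent, i):
        """Split the full child parent.children[i] around its median key."""
        t = self.t
        full = parent.children[i]
        right = BTreeNode(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]
        full.keys = full.keys[:t - 1]
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def add(self, key):
        """Insert key; return False if it was already present (a repeat URL)."""
        if key in self:
            return False
        if len(self.root.keys) == 2 * self.t - 1:
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        node = self.root
        while not node.leaf:
            i = bisect_left(node.keys, key)
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key > node.keys[i]:   # the median moved up; pick a side
                    i += 1
            node = node.children[i]
        node.keys.insert(bisect_left(node.keys, key), key)
        return True
```

A crawler could call `seen.add(url)` on a `BTree` named `seen` before fetching and skip the page whenever the call returns `False`, for instance when `"http://example.com/a"` is added a second time. In practice the point of a B-tree is that its wide nodes suit disk-based storage, which this in-memory sketch does not show.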

The text, process, and methods above come from Guangzhou SEO Center (official website: http://www.seoxoyo.com). When reprinting, please credit the source or retain this paragraph.
