Analysis of the web crawling rules of search engine spiders


Search engines face trillions of web pages on the Internet. How can they efficiently fetch so many pages and store local copies of them? That is the job of the web crawler, also known as a web spider. As webmasters, we are in close contact with it every day.

I. Crawler framework

A simple web crawler framework works roughly like this: starting from seed URLs, the spider works step by step until the downloaded webpages are stored in the database. Of course, a hardworking spider may need to do more work along the way, such as de-duplicating webpages and detecting web spam.
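
As a rough illustration of this framework, here is a minimal sketch in Python. The queue-based loop, the fetch_page and extract_links helpers, and the pages dictionary standing in for the database are all assumptions made for this example, not the implementation of any particular search engine.

    # Minimal crawler-framework sketch: seed URLs -> frontier -> download -> store.
    # fetch_page() and extract_links() are hypothetical helpers standing in for a
    # real HTTP client and HTML link parser; `pages` stands in for the page database.
    from collections import deque

    def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
        frontier = deque(seed_urls)      # URLs waiting to be downloaded
        seen = set(seed_urls)            # simple URL de-duplication
        pages = {}                       # stands in for the page database

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()     # take the next URL from the frontier
            html = fetch_page(url)       # download the page
            if html is None:             # skip pages that failed to download
                continue
            pages[url] = html            # store the local copy
            for link in extract_links(html, url):
                if link not in seen:     # only enqueue newly discovered URLs
                    seen.add(link)
                    frontier.append(link)
        return pages

The order in which URLs are taken out of the frontier is exactly the crawling policy discussed later in this article.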

We can think of webpages as the spider's meals, which fall into the following categories (a small code sketch after this list labels these five states):

Downloaded webpages. Content the spider has already crawled and, so to speak, put in its belly.

Expired webpages. The spider crawls a great many pages at a time, and some of the stored copies have already expired by the time they are used.

Webpages to be downloaded. The spider has seen the food and will capture it next.

Knowable webpages. Not yet downloaded or even discovered, but the spider can sense their existence, for example through links on known pages, and will capture them sooner or later.

Unknown webpages. The Internet is so large that many pages may never be found by the spider, and they account for a high proportion of the whole.
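
Purely as an illustration of this five-way division, here is a small sketch; the enum names are labels chosen for this example, not standard search engine terminology.

    # Hypothetical labels for the five page states described above.
    from enum import Enum, auto

    class PageState(Enum):
        DOWNLOADED = auto()   # already crawled and stored
        EXPIRED = auto()      # stored copy no longer matches the live page
        TO_DOWNLOAD = auto()  # discovered and queued for crawling
        KNOWABLE = auto()     # not yet discovered, but reachable via known links
        UNKNOWN = auto()      # may never be discovered by the spider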

Through the above division, we can clearly understand the work a search engine spider does and the challenges it faces. Most spiders crawl based on such a framework, but not all of them; there are always exceptions, and spider systems differ somewhat depending on their function.

II. Crawler types

1. Batch spider

This type of spider has a clear crawling range and target; once the target and task are completed, it stops crawling. What is the target? It may be the number of webpages crawled, the total size of the data, or the crawling time.
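
As a sketch of such stop conditions, the check below combines the three kinds of targets just mentioned; the BatchLimits name and the specific threshold values are assumptions made for this example.

    # Hypothetical stop-condition check for a batch spider.
    from dataclasses import dataclass
    import time

    @dataclass
    class BatchLimits:
        max_pages: int = 10_000          # target number of pages
        max_bytes: int = 500_000_000     # target total download size
        max_seconds: float = 3600.0      # target crawl time

    def should_stop(limits, pages_crawled, bytes_downloaded, start_time):
        # The batch spider stops as soon as any one of its targets is reached.
        return (pages_crawled >= limits.max_pages
                or bytes_downloaded >= limits.max_bytes
                or time.time() - start_time >= limits.max_seconds)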

2. Incremental spider

Unlike batch spiders, incremental spiders crawl continuously and update their captured webpages on a regular basis. Because webpages on the Internet are updated all the time, an incremental spider must be able to reflect those updates.
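
One simple way to picture an incremental spider is a recrawl schedule keyed by a per-page revisit interval. The heap-based scheduler below is only a sketch under that assumption; the fixed revisit_interval and the fetch_page helper are placeholders, not how any particular engine works.

    # Sketch of an incremental recrawl schedule: each URL is revisited after its
    # own interval, so updated pages get re-fetched over time. The loop runs
    # continuously, which is exactly the point of an incremental spider.
    import heapq, time

    def incremental_loop(urls, fetch_page, revisit_interval=3600.0):
        # Heap of (next_due_time, url); every URL is due immediately at start.
        schedule = [(0.0, url) for url in urls]
        heapq.heapify(schedule)
        while schedule:
            due, url = heapq.heappop(schedule)
            now = time.time()
            if due > now:
                time.sleep(due - now)    # wait until the page is due again
            fetch_page(url)              # refresh the stored copy
            heapq.heappush(schedule, (time.time() + revisit_interval, url))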

3. Vertical spider

This type of spider focuses only on webpages for a specific topic or industry. Taking a health website as an example, such a specialized spider crawls only health-related pages and ignores pages on other topics. The difficulty for this kind of spider is how to identify accurately which content belongs to the target industry. Many vertical industry websites currently need this type of spider to crawl them.
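
A crude way to illustrate topic filtering is keyword scoring before a page's links are expanded; real vertical spiders use far more accurate classifiers, so treat the keyword list and threshold below as placeholders invented for this sketch.

    # Toy topic filter for a vertical (focused) spider: only pages that score
    # above a threshold on health-related keywords are kept and expanded.
    HEALTH_KEYWORDS = {"health", "diet", "nutrition", "fitness", "disease", "symptom"}

    def is_on_topic(page_text, keywords=HEALTH_KEYWORDS, threshold=3):
        words = page_text.lower().split()
        score = sum(words.count(k) for k in keywords)
        return score >= threshold   # crawl and expand only on-topic pages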

III. Crawling policies

The spider expands outward from the seed URLs, building up a huge list of URLs to be crawled. Since that list is far too large to crawl all at once, how does the spider decide the order in which to crawl them? There are many crawling strategies, but the ultimate goal is the same: crawl important webpages first. Whether a page is important is something the spider evaluates from content originality, link weight analysis, and many other signals. Typical crawling policies are as follows:

1. Breadth-first policy

Breadth-first means that after crawling a webpage, the spider goes on to crawl, in order, the other pages that webpage links to. The idea looks simple but works well in practice, because most webmasters arrange their pages by importance and link to the important ones prominently, so those pages are reached first.
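
The crawl loop sketched under the crawler framework section already takes URLs from the front of a FIFO queue (popleft), which is exactly breadth-first order. The few lines below, a sketch under the same assumptions, show that switching to depth-first order is only a matter of which end of the frontier the next URL is taken from.

    # Breadth-first vs. depth-first frontier: the only difference is which end
    # of the queue of discovered-but-unvisited URLs the next URL comes from.
    from collections import deque

    frontier = deque(["https://example.com/"])   # hypothetical seed URL

    def next_url_breadth_first(frontier):
        return frontier.popleft()   # oldest discovered URL first (FIFO)

    def next_url_depth_first(frontier):
        return frontier.pop()       # newest discovered URL first (LIFO)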

2. PageRank policy

PageRank is a very famous link analysis method, mainly used to measure the weight (authority) of webpages; Google's PR is the typical PageRank implementation. Through the PageRank algorithm we can find out which pages are more important, and the spider then crawls those important pages first.
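
As a reminder of how PageRank itself can be computed, here is a minimal power-iteration sketch over a tiny link graph. The damping factor 0.85 is the commonly cited value; the example graph and the fixed iteration count are arbitrary choices for illustration, not how a production system would run it.

    # Minimal PageRank via power iteration. `links` maps each page to the pages
    # it links to (all link targets must also appear as keys).
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        ranks = {p: 1.0 / n for p in pages}          # start with a uniform score
        for _ in range(iterations):
            new_ranks = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                     # dangling page: spread evenly
                    for q in pages:
                        new_ranks[q] += damping * ranks[page] / n
                else:
                    share = damping * ranks[page] / len(outlinks)
                    for q in outlinks:
                        new_ranks[q] += share
            ranks = new_ranks
        return ranks

    # Example: B and C both link to A, so A ends up with the highest rank.
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))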

3. Big-site-first policy

This one is easy to understand: large websites usually have more content pages and higher quality. The spider first analyzes the website's category and attributes; if the site already has many pages indexed or carries a high weight in the search engine's system, its pages are crawled first.
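
One way to picture this policy is to sort the URL frontier by how many pages of the same host are already indexed. The indexed_count mapping below is a hypothetical stand-in for whatever site-size or weight signal the search engine actually uses.

    # Sketch of a "big site first" ordering: URLs whose host already has many
    # indexed pages (a proxy for site size and weight) are crawled first.
    from urllib.parse import urlparse

    def order_by_site_size(urls, indexed_count):
        # indexed_count: {host: number of pages already indexed for that host}
        def host_weight(url):
            return indexed_count.get(urlparse(url).netloc, 0)
        return sorted(urls, key=host_weight, reverse=True)

    urls = ["https://smallblog.example/post1", "https://bigportal.example/news/1"]
    print(order_by_site_size(urls, {"bigportal.example": 2_000_000, "smallblog.example": 40}))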

IV. Webpage update policies

Most pages on the Internet get updated, so the copies stored by the spider must also be updated in time to stay consistent with the live pages. To use an analogy: suppose a webpage ranks well but has actually been deleted; if it still appears in the results, the user experience is poor. Therefore the search engine must keep track of such changes and update its pages so that users are served the latest versions. There are three common webpage update policies: the historical reference policy, the user experience policy, and the clustering sampling policy.

1. Historical reference policy

This is an update policy based on an assumption: if your webpage has been updated on a regular schedule in the past, the search engine assumes it will continue to be updated frequently in the future, and the spider will crawl it regularly according to that pattern. This is why some webmasters have always stressed that website content should be updated regularly.
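
A toy version of this idea is to set a page's recrawl interval from how often it has changed in the past. The observed-change timestamps, the averaging rule, and the one-week default below are assumptions made only for this sketch.

    # Toy historical-reference policy: pages that changed often in the past get
    # a short recrawl interval; rarely changing pages get a long one.
    def recrawl_interval(change_timestamps, default=7 * 24 * 3600.0):
        # change_timestamps: sorted times (in seconds) at which the page was
        # observed to have changed. With fewer than two observations there is
        # no history to learn from, so fall back to a default interval.
        if len(change_timestamps) < 2:
            return default
        gaps = [b - a for a, b in zip(change_timestamps, change_timestamps[1:])]
        return sum(gaps) / len(gaps)      # average time between observed changes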

2. User experience policy

Generally, users only look at the content on the first three pages of search results, and few go any further. The user experience policy updates pages based on how users actually behave. For example, a webpage may have been published long ago and not updated for a while, yet users still find it useful and click through to it; in that case the search engine will not rush to treat this seemingly outdated page as a priority for updating. This is why the newest pages do not necessarily rank at the top of the results: ranking depends more on the quality of the page than on its update time.
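
Expressed as a toy scoring rule, an update policy of this kind might refresh first the pages that users still see and click. The impression/click inputs and the weights below are invented for illustration and do not correspond to any documented ranking or scheduling formula.

    # Toy "user experience" update priority: pages users still see and click are
    # refreshed first; pages nobody reaches can wait longer.
    def update_priority(impressions, clicks, days_since_crawl):
        # Hypothetical weighting: visibility and engagement dominate, with a
        # small boost the longer the stored copy has been aging.
        return impressions * 0.001 + clicks * 1.0 + days_since_crawl * 0.1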

3. Clustering sampling policy

The previous two policies rely mainly on a page's historical information, but storing large amounts of history is a burden for the search engine. And what if a newly included webpage has no history to refer to at all? The clustering sampling policy groups many similar webpages by the attributes they display, and all pages in a group are then updated according to the same rule.
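
As a rough sketch, the idea can be illustrated by grouping pages on a few surface attributes and measuring the update behaviour of a small sample from each group. The attribute key (template, section), the sample size, and the measure_interval helper are all assumptions for this example.

    # Toy clustering-sampling policy: group pages by surface attributes, then give
    # every page in a group the recrawl interval measured on a small sample of
    # that group, so no per-page history is needed.
    from collections import defaultdict
    import random

    def cluster_update_intervals(pages, measure_interval, sample_size=3):
        # pages: list of dicts like {"url": ..., "template": ..., "section": ...}
        # measure_interval(url): hypothetical helper that probes a page a few
        # times and returns an estimated change interval in seconds.
        groups = defaultdict(list)
        for page in pages:
            groups[(page["template"], page["section"])].append(page["url"])
        intervals = {}
        for key, urls in groups.items():
            sample = random.sample(urls, min(sample_size, len(urls)))
            estimate = sum(measure_interval(u) for u in sample) / len(sample)
            for url in urls:                 # same rule for the whole cluster
                intervals[url] = estimate
        return intervals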

From this look at how a search engine spider works, we can see that the relevance of a site's content, the update rhythm of the site and its pages, the distribution of links across pages, and the weight of the website all affect how efficiently the spider crawls it. Understand these, and the spider will come calling all the more eagerly!
