Do you understand how search engines work and the basic principles of crawling?

Source: Internet
Author: User

Recently, while reading a book, I realized that studying principles before any practice is dull, but returning to the principles after some practice brings a lot of insight. Take me as a cautionary example: I call myself a search engine optimization (SEO) worker, yet I did not understand how search engines work, the basic principles of crawling, or their update strategies. What about you? Below are my reading notes, meant only as a primer for newcomers.

Before introducing the search engine crawler, it helps to know that a crawler classifies web pages into four kinds:

1. Downloaded pages, including those that have expired

2. Pages waiting to be downloaded

3. Known pages (discovered but not yet crawled)

4. Unknown pages

Below I describe in detail how a search engine updates downloaded pages, how it downloads pages waiting to be downloaded, how it handles pages that are known but not yet crawled, and how it crawls unknown pages.

I. Processing pages waiting to be downloaded

Crawl strategy: from the pool of known pages, the search engine picks the URLs to crawl and arranges them into a queue. Each time, the scheduler takes one URL from the head of the queue and sends it to the page downloader to fetch the content. The URLs contained in each newly downloaded page are appended to the end of the crawl queue, forming a loop. This is the most basic algorithm, but not the only one.
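The basic loop described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not any search engine's actual implementation: `FAKE_WEB`, `fetch_links`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for real page downloads.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would download and parse pages over HTTP instead.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def fetch_links(url):
    """Stand-in for the page downloader: return the links found on a page."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls):
    """Crawl in plain queue order and return the URLs in download order."""
    queue = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)      # avoid queueing the same URL twice
    downloaded = []
    while queue:
        url = queue.popleft()  # the scheduler takes the URL at the head
        downloaded.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)  # new URLs go to the end of the queue
    return downloaded
```

Calling `crawl(["a"])` on this toy graph visits every reachable page once; the `seen` set is what keeps the loop from running forever when pages link back to each other.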

That approach crawls purely in queue order, but search engines generally prefer to crawl important pages first. Importance is mostly judged by a page's popularity. Google's official wording speaks of exposure; in plain terms this means backlinks. (This is why so many people build external links.)

There are generally four strategies for selecting important pages: breadth-first traversal, incomplete PageRank (not Google's published PR value), OPIC, and large-site-first.

1. Breadth-first traversal: the links contained in a newly downloaded page are appended directly to the end of the crawl URL queue. This looks mechanical, but it implies a priority policy: a page with many in-links is more likely to be reached early by breadth-first traversal, so the number of in-links indirectly reflects the page's importance. (This is why sites build links.)

2. Incomplete PageRank: the previous strategy ranks by link quantity; this one adds link quality.

Initial algorithm: the downloaded pages and the URLs in the download queue together form a set of pages. PR is computed over this set, the URLs waiting to be crawled are reordered by PR, and crawling proceeds in that order.

(Recomputing the order after every newly downloaded page is too inefficient.)

Instead, PR is recomputed only after every K newly downloaded pages. This raises a question: newly extracted pages have not been through a PR calculation and have no PR value, yet they may be more important than pages already in the queue. What then?

Solution: give each newly extracted page a temporary PR, equal to the sum of the PR values passed to it by the pages that link to it. If this temporary value is higher than that of pages already in the queue, the new page is crawled first. This is incomplete PageRank.

(Pages with high PR are crawled first, are indexed more, and have a better chance of ranking near the top, which is why so many people work to raise their PR.)
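A rough Python sketch of the idea behind incomplete PageRank follows. It assumes a textbook power-iteration PageRank with a damping factor of 0.85, which the notes above do not specify; `pagerank`, `temporary_pr`, `crawl_order`, and the toy graph are all hypothetical. Pages "a", "c", and "d" are treated as downloaded, "b" is waiting in the queue, and "x" is a new URL found after the last full recomputation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a small graph {page: [out-links]}."""
    pages = sorted(set(graph) | {t for links in graph.values() for t in links})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no known out-links: spread its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


def temporary_pr(in_links, rank, out_degree):
    """Temporary score for a new URL: sum of the PR shares its in-links pass on."""
    return sum(rank[src] / out_degree[src] for src in in_links)


def crawl_order(scores):
    """Sort pending URLs so that the highest score is crawled first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical link graph at the last full recomputation ("b" is only queued).
GRAPH = {"a": ["c"], "c": ["a", "d"], "d": ["a", "b"]}
RANK = pagerank(GRAPH)

# Since then, "a" and "c" were found to link to a new URL "x";
# their out-degrees now include that extra link.
TEMP_X = temporary_pr(["a", "c"], RANK, {"a": 2, "c": 3})

# "b" keeps its computed PR; "x" competes using its temporary PR.
ORDER = crawl_order({"b": RANK["b"], "x": TEMP_X})
```

In this toy example "x" is linked from two well-ranked pages, so its temporary score exceeds that of "b" and it moves to the front of the queue without waiting for the next full recomputation.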

3. OPIC (Online Page Importance Computation) strategy: computes page importance online; an improvement on the PR algorithm.

The algorithm gives every page the same amount of cash at the start. When a page is downloaded, its cash is divided evenly among the pages it links to, and its own cash is emptied. The linked pages are placed in the crawl queue and prioritized by how much cash they hold.

Differences from PR: PageRank does not empty a page's score after passing it on, and each iteration recalculates the whole set, whereas OPIC empties the page's cash and needs no global recalculation. PageRank also allows jumps between pages that have no link between them, whereas OPIC transfers cash only along links.
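The cash mechanism can be sketched in Python as below. This is a simplified one-pass illustration, assuming every page receives its initial cash when first seen and is never re-crawled; `opic_order` and `OPIC_GRAPH` are hypothetical names, and ties are broken alphabetically only to keep the result deterministic.

```python
def opic_order(graph, seed, initial_cash=1.0):
    """Return the crawl order when the queued page with the most cash goes first."""
    cash = {seed: initial_cash}
    queued = {seed}
    crawled = []
    done = set()
    while queued:
        # Pick the queued page holding the most cash (ties: alphabetical).
        page = max(sorted(queued), key=lambda p: cash[p])
        queued.remove(page)
        crawled.append(page)
        done.add(page)
        targets = graph.get(page, [])
        if targets:
            share = cash[page] / len(targets)  # split cash evenly over out-links
            for t in targets:
                cash[t] = cash.get(t, initial_cash) + share
                if t not in done:
                    queued.add(t)
        cash[page] = 0.0  # the downloaded page's own cash is emptied
    return crawled


# Hypothetical graph: "d" is linked from "a", "b", and "c".
OPIC_GRAPH = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": []}
```

With this graph, "d" collects cash from both "a" and "b" and is therefore crawled before "c", even though "c" entered the queue earlier. No global recomputation is needed; each download only touches the page's own out-links.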

4. Large-site-first: whichever site has the most pages waiting in the crawl queue is crawled first. (So a site should have plenty of pages and rich content.)
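A minimal Python sketch of large-site-first ordering, assuming that "site" simply means the host name of the URL; `large_site_first` and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def large_site_first(pending_urls):
    """Order pending URLs so hosts with the most pending pages come first."""
    hosts = [urlparse(u).netloc for u in pending_urls]
    per_host = Counter(hosts)
    # sorted() is stable, so URLs of the same host keep their queue order.
    return sorted(
        pending_urls,
        key=lambda u: (-per_host[urlparse(u).netloc], urlparse(u).netloc),
    )


PENDING = [
    "http://small.example.com/a",
    "http://big.example.com/1",
    "http://big.example.com/2",
    "http://mid.example.com/x",
    "http://big.example.com/3",
    "http://mid.example.com/y",
]
```

Here `large_site_first(PENDING)` moves the three `big.example.com` URLs to the front, followed by the two from `mid.example.com`, then the single URL from `small.example.com`.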

II. Updating downloaded pages

The above covers the search engine's crawl strategy. Crawled pages join the set of downloaded pages, which must be refreshed continually. So how does a search engine update them?

Common page update strategies: the historical reference strategy, the user experience strategy, and the cluster sampling strategy.

1. Historical reference: a page that was updated frequently in the past will probably be updated frequently now. A model is used to predict the time of the next update. Frequent changes to navigation bars and ads are ignored, so constantly changing the navigation is useless; what counts is the main content. (Now you know why updates should be kept up steadily: there is a pattern.)
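The notes do not say which model is used. One common assumption is that page changes follow a Poisson process, and the sketch below uses that assumption purely for illustration. `change_rate`, `prob_changed`, and `revisit_interval` are hypothetical names, and the change count is assumed to cover main-content changes only.

```python
import math


def change_rate(num_changes, days_observed):
    """Estimated changes per day from the page's history (main content only)."""
    return num_changes / days_observed


def prob_changed(rate, days_since_crawl):
    """Under a Poisson model, probability of at least one change since the last crawl."""
    return 1.0 - math.exp(-rate * days_since_crawl)


def revisit_interval(rate, target_prob=0.5):
    """Days to wait until the page has changed with probability target_prob."""
    if rate <= 0:
        return math.inf  # never observed to change: no basis for a revisit time
    return -math.log(1.0 - target_prob) / rate
```

For example, a page whose main content changed 10 times in 20 days gets `change_rate(10, 20) == 0.5`, and `revisit_interval(0.5)` suggests coming back after roughly 1.4 days. A page with no recorded changes gets an infinite interval, which is the weakness of relying on history alone.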

2. User experience: even if a page is outdated and needs updating, the search engine may update it later if doing so sooner would not affect the user experience. The algorithm estimates how much a page's update affects search quality (generally judged by ranking), and pages with a larger impact are updated sooner. To do this, the engine stores several historical versions of a page and uses the impact of past updates to estimate the impact of the next one.

Both strategies above share a weakness: they depend on history, so a large amount of historical data must be stored, which adds to the burden. Without historical records, the estimates are inaccurate.

3. Cluster sampling: pages are grouped into categories, and all pages in a category are updated at the update frequency observed for that category. The most representative pages are sampled and their update frequency is measured; the other pages of the same kind (for example, the same industry) then follow that frequency.
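A small Python sketch of cluster sampling follows. It assumes the clusters are already given and uses random sampling as a stand-in for choosing the "most representative" pages; `cluster_update_intervals`, `observed_interval`, and the sample data are hypothetical.

```python
import random
from statistics import mean


def cluster_update_intervals(clusters, observed_interval, sample_size=2, seed=0):
    """Assign every page the mean update interval of a sample from its cluster.

    clusters: {cluster_name: [page, ...]}
    observed_interval: function page -> days between updates; it is only
    called for the sampled pages, so no history is needed for the rest.
    """
    rng = random.Random(seed)
    schedule = {}
    for pages in clusters.values():
        sample = rng.sample(pages, min(sample_size, len(pages)))
        interval = mean(observed_interval(p) for p in sample)
        for page in pages:
            schedule[page] = interval
    return schedule


CLUSTERS = {
    "news": ["news/1", "news/2", "news/3"],
    "docs": ["docs/1", "docs/2"],
}
TRUE_INTERVAL_DAYS = {"news": 1, "docs": 30}  # made-up values for the demo


def observed_interval(page):
    """Pretend measurement: look up the interval of the page's cluster."""
    return TRUE_INTERVAL_DAYS[page.split("/")[0]]


SCHEDULE = cluster_update_intervals(CLUSTERS, observed_interval)
```

The point of the design is that only the sampled pages need measured update histories; every other page inherits the interval of its cluster.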

III. Crawling unknown pages

Unknown pages belong to the dark web (also called the deep web or hidden web), which search engines find hard to fetch by conventional means: for example, sites with no inbound links, or content held in databases. A product inventory query, for instance, may require entering a product name, region, model, and other text before it returns a stock quantity, and search engines struggle to crawl such pages. This is where query combination and the ISIT algorithm come in.

First, two concepts:

1. Information-rich query templates: take a query system as an example. A query template fixes which text boxes are filled in (model, region, product name, and so on), and different values form different query combinations. If the results returned for different combinations differ substantially, the template is an information-rich query template.

How are these templates determined? The crawler starts with one-dimensional templates, for example filling in only the region and nothing else, and checks whether each is information-rich. If it is, the template is extended to two dimensions, such as region + model. Dimensions are added in this way until no new information-rich template is found.
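The incremental search for templates might be sketched in Python as follows. This is an illustration under assumptions, not the actual ISIT implementation: a template counts as information-rich when the share of distinct result pages among all its value combinations reaches a threshold (0.75 here, chosen arbitrarily), and `is_informative`, `find_informative_templates`, `FORM_VALUES`, and `submit` are hypothetical.

```python
from itertools import product


def is_informative(template, values, submit, threshold=0.75):
    """A template is information-rich if most value combinations give distinct results."""
    combos = list(product(*(values[name] for name in template)))
    pages = {submit(dict(zip(template, combo))) for combo in combos}
    return len(pages) / len(combos) >= threshold


def find_informative_templates(values, submit, threshold=0.75):
    """Start from one-dimensional templates and extend only the informative ones."""
    informative = []
    frontier = [(name,) for name in sorted(values)]
    seen = set(frontier)
    while frontier:
        next_frontier = []
        for template in frontier:
            if not is_informative(template, values, submit, threshold):
                continue
            informative.append(template)
            for name in sorted(values):
                if name not in template:
                    bigger = tuple(sorted(template + (name,)))
                    if bigger not in seen:
                        seen.add(bigger)
                        next_frontier.append(bigger)
        frontier = next_frontier  # stops when no new template qualifies
    return informative


# Hypothetical inventory form: "sort" changes only the ordering, not the content.
FORM_VALUES = {
    "region": ["north", "south", "east"],
    "model": ["m1", "m2", "m3"],
    "sort": ["asc", "desc"],
}


def submit(assignment):
    """Stand-in for submitting the form: return a signature of the result content."""
    return (assignment.get("region", "*"), assignment.get("model", "*"))
```

On this toy form the search keeps `("model",)`, `("region",)`, and `("model", "region")`, and drops every template involving `sort`, because changing the sort order never produces new content.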

2. Query word combinations: you may wonder how the crawler knows whether an input box expects a region, a product name, or a time. At the start the crawler needs manual hints: someone provides an initial seed table of query terms. The crawler submits queries from this table and downloads the result pages, then analyzes those pages to mine new keywords automatically and form a new query list. It queries again and passes the results to the search engine, repeating until no new content appears.
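The seed-and-expand loop can be sketched in Python like this. It is a minimal illustration assuming that keyword mining is just splitting a result into words; `expand_queries`, `search`, `extract_terms`, and the `RECORDS` data are hypothetical.

```python
def expand_queries(seed_terms, search, extract_terms, max_rounds=10):
    """Query with known terms, mine new terms from the results, and repeat."""
    known = set(seed_terms)
    pending = list(seed_terms)
    pages = set()
    rounds = 0
    while pending and rounds < max_rounds:
        next_pending = []
        for term in pending:
            for page in search(term):
                if page in pages:
                    continue  # already downloaded, nothing new to mine
                pages.add(page)
                for new_term in extract_terms(page):
                    if new_term not in known:
                        known.add(new_term)
                        next_pending.append(new_term)
        pending = next_pending  # empty when no new keywords were found
        rounds += 1
    return pages, known


# Hypothetical hidden database behind a search form.
RECORDS = [
    "red bike beijing",
    "blue bike shanghai",
    "blue car shanghai",
    "green boat hainan",
]


def search(term):
    """Stand-in for submitting one query term to the form."""
    return [r for r in RECORDS if term in r.split()]


def extract_terms(page):
    """Stand-in for keyword mining: every word on the page is a candidate."""
    return page.split()
```

Starting from the single seed `"red"`, the loop reaches three of the four records through shared words. The record `"green boat hainan"` shares no word with the others and is never found, which shows why the quality of the manual seed table matters.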

This completes the crawl of the dark web.

The above is only a simple introduction to the crawler's crawl and update framework. The actual algorithms can be more complex; I will share more after studying them further.
