An SEOer who doesn't understand how search engines work is, so to speak, running naked.
Well, before the rambling ends, one more aside: China's first search engine based on a web page index was Tianwang ("Skynet") of Peking University.
OK, let's first take a quick look at the search engine's "three axes": data collection -> preprocessing (indexing) -> ranking.
Data collection
That is, the data collection phase: pages are gathered from the vast Internet into the search engine's own database for storage.
1. Crawling and maintenance strategy
Faced with huge amounts of data, many questions have to be settled in advance. For example, should data be crawled on demand ("instant crawl") or fetched ahead of time ("crawl in advance")? When maintaining the data, should the engine do a periodic full crawl (a deep crawl at intervals that replaces the original data) or an incremental crawl (keep the existing data as the base and swap old pages for new ones as they change)?
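The incremental approach can be sketched in a few lines. This is a minimal illustration only; the `store` and `crawled` shapes (URL mapped to a content hash plus content) are hypothetical, not any real engine's storage format:

```python
def incremental_update(store, crawled):
    """Incremental crawl maintenance: merge newly crawled pages into the
    existing store, replacing only entries whose content actually changed.

    store / crawled: {url: (content_hash, content)} -- hypothetical shapes.
    Returns the list of URLs that were added or updated.
    """
    changed = []
    for url, (content_hash, content) in crawled.items():
        old = store.get(url)
        if old is None or old[0] != content_hash:
            store[url] = (content_hash, content)
            changed.append(url)
    return changed
```

A periodic full crawl would instead throw the old store away and rebuild it, which is simpler but wastes work on pages that never changed.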
2. Link following
We all know that spiders crawl pages by following links. How to quickly reach the information that matters most to users while still achieving broad coverage is undoubtedly a question search engines must focus on.
First, how to crawl the important information.
To answer that, first understand how people subjectively judge whether a page is important (think about it yourself before reading on). Roughly, the signals are:
The page has accumulated historical weight (the domain is old, long-established, and of high quality); many pages mention it (inbound links); many people reference it (reprints or mirrors); users can reach it quickly while browsing (shallow link depth); fresh content appears often (frequent updates); and so on.
At the link-following stage, the only one of these signals a spider can already observe is "users can reach this page quickly (shallow link depth)"; the other signals are not yet available.
As for coverage, spiders have two strategies when following links: depth-first crawling and breadth-first crawling.
Even thinking with your butt, you can tell that breadth-first crawling helps gather a wider range of information, while depth-first crawling helps gather more complete information about a single site. Search engine spiders usually use both when crawling, but comparatively speaking, breadth-first crawling is used more than depth-first.
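The breadth-first strategy above can be sketched with an ordinary FIFO queue. This is a toy illustration; `fetch_links` stands in for a real downloader and link extractor, which are assumptions here:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: shallow pages are visited first, so coverage
    spreads wide before going deep into any one site.

    fetch_links(url) is a hypothetical callback that downloads a page
    and returns the URLs it links to.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()          # FIFO -> breadth-first order
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:       # never enqueue a URL twice
                seen.add(link)
                queue.append(link)
    return order
```

Swapping the `deque` for a stack (`pop()` instead of `popleft()`) would turn the same loop into a depth-first crawl.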
3. The address library
When a search engine is first built, it must start from a manually entered seed library; otherwise the spider would have nowhere to begin following links. Starting from these seeds, the spider can discover more and more links.
Of course, more than one search engine also offers a page-submission portal, so webmasters can submit their sites themselves.
It's worth mentioning, though, that search engines prefer the links they discover on their own.
4. File storage
Once link following is done, the information gathered needs to be stored. What gets stored is, first, the URL, and second, the page content (file size, last update time, HTTP status code, page source code, and so on).
About URLs: since I recently saw a site cheating with wildcard ports, a quick note here. A URL consists of a transport protocol, a domain name, a port, a path, a file name, and so on.
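Python's standard library can split a URL into exactly those components, which makes the anatomy easy to see (the example URL is made up):

```python
from urllib.parse import urlparse

# A made-up URL showing every component mentioned above.
url = "http://www.example.com:8080/path/page.html?id=1#top"
parts = urlparse(url)

print(parts.scheme)    # transport protocol: "http"
print(parts.hostname)  # domain name:        "www.example.com"
print(parts.port)      # port:               8080
print(parts.path)      # path + file name:   "/path/page.html"
```

A wildcard-port cheat works because every port on the same domain can serve a different "site"; parsing out `parts.port` is the first step in spotting that.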
Preprocessing (indexing)
Once the data has been crawled, it needs preprocessing (many people like to call this step indexing). This mainly involves text extraction, word segmentation, index building, link analysis, and so on.
1. Extract text
This is easy to understand: extract the text from the source code. Note that this includes meta information and some alternative text (such as ALT attributes).
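A bare-bones version of text extraction can be built on the standard library's HTML parser. This is a sketch, not how any real engine does it; it keeps visible text plus `alt` text and skips script/style blocks:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags; keep visible text plus alt text from <img> tags."""
    def __init__(self):
        super().__init__()
        self.pieces = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.pieces.append(alt)   # alternative text counts too

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.pieces.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.pieces)
```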
2. Word segmentation
At every step, I can't help sighing at how profound Chinese characters are. Ah!
Sigh over; let's keep walking.
Word segmentation is a step unique to Chinese: splitting the text into words according to the meaning the sentence is meant to express. In general, segmentation can be based on a dictionary or on statistics.
To make machine segmentation more effective, two approaches are usually combined: "forward maximum matching" and "reverse maximum matching". It's worth mentioning that reverse matching tends to recover more valuable information (think about why).
If you're interested in word segmentation, you may wish to take a look at this article.
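The two matching directions can be sketched with a toy dictionary (the dictionary below is illustrative, not a real lexicon). Note how, on the classic example 研究生命起源 ("study the origin of life"), reverse matching finds the more natural split, which is the intuition behind the claim above:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: scan left to right, greedily taking the
    longest dictionary word at each position; fall back to one character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                i += size
                break
    return words

def reverse_max_match(text, dictionary, max_len=4):
    """Reverse maximum matching: the same greedy idea, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            chunk = text[j - size:j]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                j -= size
                break
    return list(reversed(words))
```

Forward matching greedily grabs 研究生 ("graduate student") and mangles the rest, while reverse matching recovers 研究 / 生命 / 起源 ("study / life / origin").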
It should be stressed that, so the remaining phrases can better express the article's core meaning, stop words (filler words like 啊 and 嗯) are removed, and denoising strips out navigation, copyright notices, category links, and other boilerplate that has no bearing on the page's main meaning.
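Stop-word removal itself is a one-liner once you have a stop list (the list below is a tiny illustrative sample, not a real engine's list):

```python
# Tiny illustrative stop list: Chinese filler particles.
STOP_WORDS = {"的", "了", "啊", "嗯", "呢"}

def remove_stop_words(tokens):
    """Drop filler words that carry no topical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]
```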
3. Deduplication
The phrases left after stop-word removal and denoising can already express the page's main meaning quite well. So that the same content isn't indexed over and over, the search engine needs a deduplication algorithm.
For example, a well-known and commonly used one is the MD5 algorithm; follow the link to Baidu Encyclopedia and fill in the details with your own imagination.
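The MD5-based idea is simple: hash a page's salient tokens into a fixed-length fingerprint and refuse anything whose fingerprint was already seen. A minimal sketch (note that an exact hash only catches identical content; real engines also use near-duplicate techniques):

```python
import hashlib

def fingerprint(tokens):
    """MD5 fingerprint over a page's salient tokens: identical content
    yields an identical 32-character digest."""
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()

seen_fingerprints = set()

def is_duplicate(tokens):
    """True if this content's fingerprint was already recorded."""
    fp = fingerprint(tokens)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```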
4. Building the index
After deduplication comes what is so often mentioned: the forward index and the inverted index. The forward index maps each page to the words it contains; the inverted index maps each word to the pages that contain it, which is what makes lookup by query term fast.
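Both structures can be built in one pass over the segmented documents (a minimal sketch; real indexes also store positions, frequencies, and more):

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: [tokens]}.
    Forward index: doc -> set of words it contains.
    Inverted index: word -> set of docs containing it."""
    forward = {doc_id: set(tokens) for doc_id, tokens in docs.items()}
    inverted = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            inverted[token].add(doc_id)
    return forward, dict(inverted)
```

At query time the engine consults the inverted index: one dictionary lookup per query word yields the candidate pages, instead of scanning every document.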
5. Link analysis
At this stage, the link relationships between pages are also collected. To make it easier to review everything above, I went to a lot of trouble to put together a diagram.
Ranking
Once the index files are built, we're not far from ranking.
1. Query word processing
The search engine applies the same word segmentation to the query (think about why). Here, again, I can't help marveling at the profundity of Chinese characters.
On this point I want to add a concept: segmentation granularity. To avoid misleading anyone, I'll defer to Baidu's official explanation here.
2. File matching and subset selection
According to Baidu's official account, once the user's query has been segmented, matching files can be recalled from the index library. One thing to consider here is that users tend to look only at the first few pages of results. So, to conserve resources, search engines tend to return only part of the results (Baidu shows 76 pages, Google 100 pages) — that is, a subset of the files recalled from the index library.
3. Relevance calculation
In general, there are five factors that affect relevance.
This part is what we usually talk about as SEO optimization techniques and methods, so I won't repeat it here.
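The article doesn't enumerate the five factors, but term weighting is one classic relevance signal, and TF-IDF is its textbook form. A minimal sketch, purely illustrative of the idea:

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """docs: {doc_id: [tokens]}. Score each doc by the summed TF-IDF of
    the query terms -- one classic relevance signal among several."""
    n = len(docs)
    doc_freq = Counter()                       # how many docs contain each term
    for tokens in docs.values():
        for term in set(tokens):
            doc_freq[term] += 1
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)       # term frequency
                idf = math.log(n / doc_freq[term])    # rarity across the corpus
                score += tf * idf
        scores[doc_id] = score
    return scores
```

Pages that use the query terms densely score high, but terms that appear everywhere are discounted, which is why stuffing common words gains nothing.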
4, ranking filtration and adjustment
In fact, after the correlation calculation, the results have been generally determined. Just to punish some suspected of cheating site, the search engine will be in this section of the results of fine-tuning.
such as Baidu's 11-bit mechanism.
5, the results of the display
Take a deep breath and finally see the results.
The results returned include title, description, snapshot entry, snapshot date, url, and so on.
Here is worth mentioning is, not only describes the search engine can dynamically crawl, perhaps in the near future, title will also be dynamic crawl.
Original address: http://www.seosos.cn/seo-tips/search-engine-principle.html.