Preprocessing will be familiar to most readers; many webmasters and SEO resources simply call it "indexing". For a search engine, indexing is the most important step, and it is the direct link between crawling and ranking. Pages that have only been crawled cannot be used for ranking as they are: the volume of data on the Internet is huge, so when a user searches, the engine cannot retrieve and analyze every page in real time. Instead, it returns results from its own database, which has been processed in advance. That is why this step is called preprocessing.
We cannot see preprocessing, because it is carried out by the search engine's background programs. It has nine stages, and I plan to walk through each of them so that webmasters can get a general understanding. Because space is limited, this article covers the first four. If anything here is wrong, corrections are welcome.
First, extracting the text: Content on the Internet is still mainly text, so text remains the search engine's focus. The pages we see contain many images, videos, and JavaScript, none of which can be used for ranking directly. So the first thing a search engine does is extract the text from the page. Besides the ordinary body text, it also extracts the text in meta tags, the alt attributes of images, and so on. Anchor text is extracted as well, and it plays a very important role in page ranking.
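The extraction step described above can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, not how any real search engine does it; the class name `TextExtractor`, the helper `extract_text`, and the choice of which meta tags to keep are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects body text, meta content, image alt text, and anchor text."""

    SKIP_TAGS = {"script", "style"}  # content that is never treated as page text

    def __init__(self):
        super().__init__()
        self.body_text = []
        self.meta_text = []
        self.alt_text = []
        self.anchor_text = []
        self._skip_depth = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            # Assumption: only these two meta tags are of interest here.
            if attrs.get("content"):
                self.meta_text.append(attrs["content"])
        elif tag == "img" and attrs.get("alt"):
            self.alt_text.append(attrs["alt"])
        elif tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        self.body_text.append(text)
        if self._anchor_depth:
            # Anchor text is kept separately as well as in the body text.
            self.anchor_text.append(text)


def extract_text(html):
    """Hypothetical helper: parse an HTML string and return the filled extractor."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser
```

Calling `extract_text(page_html)` returns an object whose `body_text`, `meta_text`, `alt_text`, and `anchor_text` lists hold the four kinds of text mentioned above, while anything inside `<script>` or `<style>` is ignored.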
Second, Chinese word segmentation: Segmentation exists for Google too, but the term usually refers to Chinese. English text can simply be split on the spaces between words, whereas Chinese has no such delimiters and is often more complex. Chinese search engines, Baidu in particular, have to take the habits of Chinese users into account, so each handles segmentation in its own way. In site optimization there is little we can do about segmentation; we can only use bold text or H tags to hint to the search engine which characters belong together as one word.
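To make the difference between the two languages concrete, here is a small sketch. English is split on whitespace, while Chinese is segmented with forward maximum matching, a classic dictionary-based technique chosen here only for illustration. The article does not say which method Baidu or any other engine actually uses, and the function names and the tiny dictionary are assumptions.

```python
def split_english(text):
    """English: splitting on whitespace is enough for a first pass."""
    return text.lower().split()


def forward_max_match(text, dictionary, max_len=4):
    """Segment Chinese text by always taking the longest dictionary word.

    Characters that match no dictionary entry are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# A tiny illustrative dictionary; a real one holds a very large number of words.
DEMO_DICTIONARY = {"搜索", "引擎", "搜索引擎", "优化"}

english_tokens = split_english("Search engine optimization")
chinese_tokens = forward_max_match("搜索引擎优化", DEMO_DICTIONARY)
```

With this dictionary, "搜索引擎优化" is cut into "搜索引擎" and "优化" rather than into single characters, which is the kind of decision a Chinese search engine has to make for every page and every query.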
Third, removing stop words: In everyday life we often use interjections or auxiliary words to help express meaning, and the Internet is no different. In both Chinese and English there are words that appear very frequently but have no real effect on the content. These include auxiliary particles, interjections such as "ah" and "ha", and adverbs and prepositions such as "but" and "with". In search engines, these words without substance are collectively called stop words. Search engines remove stop words before indexing a page, which makes the page's topic more prominent and reduces the amount of computation.
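Stop word removal can be sketched as a simple filter over the tokens produced by segmentation. The list below is a small illustrative assumption (a few common Chinese particles and English function words); real search engines maintain their own, much larger lists.

```python
# Illustrative stop word list; the entries are assumptions, not an engine's real list.
STOP_WORDS = frozenset({
    "的", "地", "得",        # common Chinese auxiliary particles
    "啊", "哈", "呀",        # common Chinese interjections
    "ah", "ha",              # English interjections
    "but", "with", "the", "a",
})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token.lower() not in stop_words]


chinese_kept = remove_stop_words(["搜索引擎", "的", "优化"])
english_kept = remove_stop_words(["Fast", "but", "simple"])
```

The filter drops "的" from the first list and "but" from the second, so fewer tokens need to be indexed and the remaining ones describe the topic more directly.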
Fourth, noise elimination: You may not know what noise means here. On the Internet, noise refers to page elements that contribute nothing of substance to the page's topic, such as copyright notices, navigation bars, and advertisements. On many blogs, the article category lists and historical archive lists are also noise. The content of the Internet is huge, so the search engine cannot afford to index all of this content without substance. Before indexing, it uses the page's HTML tags to separate the main content from the rest and keeps only the main content. From this point of view, we should give search engines enough real text content rather than other elements.
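A very rough sketch of tag-based noise elimination follows. It assumes the page marks its navigation, header, footer, and sidebar with the HTML5 tags `nav`, `header`, `footer`, and `aside`; that assumption, the tag list, and the names `MainContentExtractor` and `main_text` are made up for the example. Real pages are far less tidy, and real engines rely on much more than tag names.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as noise.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainContentExtractor(HTMLParser):
    """Keeps only text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._noise_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._noise_depth:
            self.chunks.append(text)


def main_text(html):
    """Hypothetical helper: return the non-noise text of an HTML string."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page whose body is a `<nav>` block, an `<article>`, and a `<footer>`, `main_text` returns only the text of the article, which is the part the search engine would go on to segment and index.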
This article has covered four aspects of search engine preprocessing: text extraction, Chinese word segmentation, stop word removal, and noise elimination. It is only a brief outline; the real situation is much more complex and involves many more details. It is meant as a starting point, and I hope readers with a deeper understanding will share what they know so that we can make progress together. Search engine preprocessing has nine stages in total; this article summarizes the first four, and I will share the remaining five later.
That is all for this article. If you have good ideas, you are welcome to discuss them with me. This article comes from Shenzhen website construction, website: http://www.zijiren.net. If anything here is wrong, corrections are welcome. Reprinting is also welcome; please keep the link when you reprint. Thank you!