A Webmaster's Analysis of Search Engine Preprocessing from Nine Aspects (II)

Source: Internet
Author: User
Keywords Search

The previous article, "A Webmaster's Analysis of Search Engine Preprocessing from Nine Aspects (I)", covered four aspects of the preprocessing ("indexing") stage: text extraction, Chinese word segmentation, stop-word removal, and noise elimination. I believe those fundamentals are useful to everyone. Today's follow-up covers the remaining five: deduplication, the forward index, the inverted index, link relationship calculation, and special document processing.

Preprocessing is one of the more complex parts of a search engine. Across nine aspects, these two articles lay out some basic knowledge to give you a working understanding, which should help with future website design and SEO. Of course, this is only what I have learned myself; if anything is wrong, I hope you will correct me. Now, on to today's content.

5. Deduplication: Deduplication is an important step. The amount of information on the Internet is huge, and people like to share, so a great deal of content is duplicated. If a search engine did not deduplicate, it would crawl and index the same content many times over. A common method is to compute a fingerprint from a page's keywords, typically with the MD5 algorithm: the engine selects some of the keywords that best represent the page, computes a fingerprint from them, and uses it to judge whether the article is original. Fingerprints are often computed down to the paragraph level, so ordinary "pseudo-original" content (copied text with superficial changes) is usually detected, and the engine can easily conclude that it was copied.
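The keyword-fingerprint idea above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the function name `keyword_fingerprint`, the choice of the five most frequent words as the "representative" keywords, and the length filter standing in for stop-word removal are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def keyword_fingerprint(text, top_n=5):
    """Return an MD5 fingerprint of the top_n most frequent words in text.

    Hypothetical helper: real engines choose representative keywords
    with far more care than a simple frequency count.
    """
    words = [w.lower().strip(".,;:!?\"'()") for w in text.split()]
    # Crude stand-in for stop-word removal: drop very short words.
    words = [w for w in words if len(w) > 3]
    counts = Counter(words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    # Sort the chosen keywords so word order in the page does not matter.
    keywords = sorted(word for word, _ in top)
    return hashlib.md5(" ".join(keywords).encode("utf-8")).hexdigest()


original = "Search engines index pages. Search engines rank pages quickly."
shuffled = "Pages rank quickly; search engines index pages, search engines."
unrelated = "Cooking pasta requires boiling water and salt."

print(keyword_fingerprint(original) == keyword_fingerprint(shuffled))   # True
print(keyword_fingerprint(original) == keyword_fingerprint(unrelated))  # False
```

Because the fingerprint depends only on the selected keywords, merely reordering sentences does not change it, which is why lightly reworked copies are easy to catch. Running the same function on each paragraph separately would give the paragraph-level comparison mentioned above.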

6. Forward index: The forward index is often simply called the index. After the pages fetched by the spider have gone through text extraction, word segmentation, noise elimination, and deduplication, what remains is a set of keywords that reflect each page's subject. The search engine gathers these keywords into a collection and records, for each keyword, how many times it appears on the page, its format, its frequency, and so on, then stores the collection in the index library. In this large index library, each file corresponds to an ID, and its content is a series of keywords. In this way the search engine keeps enriching its index library and lays the groundwork for ranking.
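As a rough sketch of the structure just described, a forward index can be modelled as a mapping from file ID to that file's keywords and their statistics. The function name `build_forward_index` and the sample documents are invented for illustration, and whitespace splitting stands in for real word segmentation; format information (such as bold or heading text) is left out.

```python
from collections import Counter


def build_forward_index(docs):
    """Map each document ID to its keywords with count and frequency.

    Hypothetical helper: text.split() stands in for word segmentation.
    """
    forward = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        counts = Counter(words)
        forward[doc_id] = {
            word: {"count": n, "frequency": n / len(words)}
            for word, n in counts.items()
        }
    return forward


docs = {
    1: "seo tips for seo beginners",
    2: "search engine basics",
}
forward_index = build_forward_index(docs)
print(forward_index[1]["seo"])  # {'count': 2, 'frequency': 0.4}
```

Looking up a document by ID is cheap here, but finding every document that contains a given keyword means scanning every entry.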

7. Inverted index: The forward index described above cannot be used directly for ranking; ranking uses the inverted index. Consider what would happen if the forward index were used: when a user searches for a keyword, the engine would have to scan every file to find the ones containing it, which is an unrealistically large amount of work. Search engines therefore restructure the forward index library into an inverted index. In an inverted index, each keyword maps to the files that contain it, so when a user searches for a keyword, the engine only has to look up that keyword and retrieve its files. This is much faster and also easier to implement.
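The conversion from forward to inverted index can be illustrated as follows. This is a minimal sketch with made-up data: the `forward_index` literal and the function name `invert` are assumptions for this example, and a production index would also carry the per-keyword statistics along.

```python
from collections import defaultdict

# Hypothetical forward index: file ID -> keywords found in that file.
forward_index = {
    1: ["seo", "tips", "beginners"],
    2: ["search", "engine", "seo"],
    3: ["search", "ranking"],
}


def invert(forward_index):
    """Build keyword -> list of file IDs from file ID -> keywords."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for keyword in forward_index[doc_id]:
            if doc_id not in inverted[keyword]:
                inverted[keyword].append(doc_id)
    return dict(inverted)


inverted_index = invert(forward_index)
print(inverted_index.get("seo", []))     # [1, 2]
print(inverted_index.get("search", []))  # [2, 3]
```

A query now becomes a single dictionary lookup on the keyword instead of a scan over every file.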

8. Link relationship calculation: Link relationships are among the things webmasters care about most. Mainstream search engines now treat calculating the links between pages as a very important step, determining which links pass weight and which merely serve as navigation. Google's PR (PageRank) value in particular is based on this kind of link calculation; other search engines have similar calculations but do not call them PR. Link relationships are often very complex and take a long time to compute, so I will not go into depth here. The point is simply that link calculation is part of preprocessing.
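To give a feel for what "links passing weight" means, here is the simplified textbook form of PageRank computed by repeated iteration. It is not the calculation any search engine runs in production; the function name `pagerank`, the damping factor of 0.85, the fixed 50 iterations, and the three-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its weight evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its weight over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page site.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph "home" ends up with the highest score because both other pages link to it, while "blog", which receives only half of the weight passed on by "home", ends up lowest.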

9. Special document processing: Web content is often not just HTML; there are many other file types. Search engines actively crawl text-based files such as PDF, Word, and TXT, and we often see them in search results. For Flash and images, however, although search engines have been working hard, they are still far from being able to read the content directly. So for SEO, use images and Flash sparingly and rely on text where possible, so that search engines can crawl the page without obstacles.

That concludes this installment, which covered deduplication, the forward index, the inverted index, link relationship calculation, and special document processing. Together with the previous article, that makes nine aspects webmasters should understand, and I hope this article is helpful to everyone. If you have good ideas, you are welcome to discuss them with me. This article comes from: Shenzhen website construction, website: http://www.zijiren.net. Corrections are welcome if anything is wrong, and so is reprinting; when reprinting, please keep the link. Thank you!
