"Preprocessing" is an important stage in how a search engine handles the vast number of web pages it collects, much like a chef washing and chopping ingredients before cooking. Many friends ask why their new site has not been indexed yet. In fact, the search engine has already found you; your pages are just still in the preprocessing stage, so be patient. Let's look a little more closely at how demanding, and how necessary, this stage is.
The first step: strip ads and template content from the source file, and extract keywords that represent its content
Take any web page's source file and you will see that its code is quite complex. In addition to the text visible in a browser such as IE, it contains a large number of HTML tags. According to statistics, the size of a page's source file (in bytes) is usually about 4 times the size of its core content. Because HTML documents come from so many different sources, many pages are rather casual about content: the text may be non-standard or incomplete, and the page may carry a lot of information unrelated to the main content (advertisements, navigation bars, copyright notices, and so on). All of this makes effective retrieval harder. The search engine must therefore extract from the source file some features that represent the page's content. Both in theory and in practice, keywords are the best representation of such features. So a basic task of the preprocessing stage is to extract the keywords contained in the content portion of the source file. For Chinese, this means using a dictionary σ and so-called "word segmentation software" to cut out of the page text the words contained in σ. After that, a web page is essentially represented by a set of words, p = {t1, t2, ..., tn}. In general, many words are obtained, and the same word may appear in a page many times. For the sake of both effectiveness and efficiency, not every word should appear in a page's representation: words that carry no indication of content, such as "of" and "in", are removed. These are called stop words. After this, a page has about 200 valid words.
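The dictionary-based cutting and stop-word removal described above can be sketched roughly as follows. This is a minimal illustration that assumes forward maximum matching as the segmentation strategy; the article does not say which algorithm the "word segmentation software" uses. The function names, the tiny dictionary, and the stop-word list are all made up for the example.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    substring (up to max_len characters) found in the dictionary."""
    words = []
    i = 0
    while i < len(text):
        match = None
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            i += 1  # character not covered by the dictionary: skip it
        else:
            words.append(match)
            i += len(match)
    return words


def extract_keywords(text, dictionary, stop_words):
    """Represent a page as its distinct words, minus stop words,
    in order of first appearance."""
    keywords = []
    for word in segment(text, dictionary):
        if word not in stop_words and word not in keywords:
            keywords.append(word)
    return keywords


# Hypothetical dictionary and stop-word list, for illustration only.
DICTIONARY = {"搜索", "引擎", "搜索引擎", "预处理", "的", "在"}
STOP_WORDS = {"的", "在"}

page_text = "搜索引擎的预处理"
keywords = extract_keywords(page_text, DICTIONARY, STOP_WORDS)
```

Here the longest match wins, so "搜索引擎" is kept as one word rather than being split into "搜索" and "引擎", and the stop word "的" is dropped from the final representation.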
The second step: eliminate duplicate or reprinted web pages
The inherently digital and networked nature of web content makes pages easy to copy, reprint, modify, and republish, so a great deal of the information on the web is duplicated. For example, my own Express Network site (www.gx-banjia.com) was copied, and the result was that the copy was indexed while mine was not. A large-scale statistical analysis in 2009 showed that the average duplication rate of web pages was about 8. In other words, when you see a page on the Internet, on average 7 other pages offer the same or essentially similar content. For ordinary users this is a good thing, because it means more chances to reach the information. For search engines it is mainly a negative: collecting the duplicates consumes machine time and network bandwidth, and if they show up in query results they waste screen space and draw complaints from users: "So many duplicates; one would have been enough." Eliminating pages with duplicate content or duplicate topics is therefore an important task of the preprocessing stage.
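One common way to detect "the same or essentially similar content" is to compare overlapping word windows (shingles) between pages. The sketch below assumes that approach and a Jaccard similarity threshold of 0.8; neither is stated in the article, and production systems generally rely on compact fingerprints rather than this pairwise comparison.

```python
def shingles(words, k=3):
    """Return the set of k-word windows (shingles) in a word sequence."""
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8, k=3):
    """pages maps URL -> text. Keep the first page seen from each
    group of near-duplicates and return the kept URLs in order."""
    kept = {}
    for url, text in pages.items():
        signature = shingles(text.lower().split(), k)
        if all(jaccard(signature, other) < threshold for other in kept.values()):
            kept[url] = signature
    return list(kept)


# Hypothetical URLs and texts, for illustration only.
pages = {
    "original.example/page": "the quick brown fox jumps over the lazy dog",
    "copy.example/page": "The quick brown fox jumps over the lazy dog",
    "other.example/page": "search engines preprocess pages before indexing them",
}
unique_urls = deduplicate(pages)
```

Note that "keep the first one seen" is exactly the behavior complained about above: whichever copy the crawler reaches first is the one that survives.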
The third step: link analysis
As mentioned earlier, the large number of HTML tags causes some trouble for web page preprocessing, but it also brings new opportunities. From an information retrieval point of view, if the system only had plain text to work with, it could rely only on the set of keywords the content contains, that is, at most on statistics such as term frequency (TF) and document frequency (DF), the number of documents in the collection in which a word appears. Frequency information such as TF and DF can, to some extent, indicate the relative importance of a word in a document or its relevance to certain content. With HTML tags the situation can be improved further. For example, within the same document, the text between <h1> and </h1> is likely to be more important than the text between <h4> and </h4>. In particular, the links to other documents contained in HTML documents have received special attention in recent years: they are believed not only to express relationships between web pages but also to play an important role in judging a page's content.
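To make TF, DF, and tag-based weighting concrete, here is a small sketch. It assumes each page has already been reduced to a list of words (or of (tag, word) pairs), and the weights in TAG_WEIGHTS are arbitrary illustrative values, not figures used by any real search engine.

```python
from collections import Counter


def term_frequencies(doc_words):
    """TF: how many times each word occurs in one document."""
    return Counter(doc_words)


def document_frequencies(docs):
    """DF: in how many documents of the collection each word occurs."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word once per document
    return df


# Hypothetical weights: text in more prominent tags counts for more.
TAG_WEIGHTS = {"h1": 3.0, "h4": 1.5, "p": 1.0}


def weighted_tf(tagged_words):
    """tagged_words is an iterable of (tag, word) pairs; each occurrence
    adds the weight of its enclosing tag instead of a flat 1."""
    scores = Counter()
    for tag, word in tagged_words:
        scores[word] += TAG_WEIGHTS.get(tag, 1.0)
    return scores


docs = [["search", "engine", "search"], ["engine", "index"]]
tf = term_frequencies(docs[0])
df = document_frequencies(docs)
scores = weighted_tf([("h1", "seo"), ("p", "seo"), ("h4", "link")])
```

With these inputs, "search" has a TF of 2 in the first document but a DF of 1, while "engine" appears in both documents and so has a DF of 2.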
The fourth step: calculate the importance of web pages
What a search engine returns to the user is a list of results related to the user's query. The order of the entries in that list is an important issue. Given the variety of users and the natural-language style of queries, returning the same list for the same search request certainly cannot satisfy every user who submits it (or give every user the highest degree of satisfaction). What a search engine actually pursues, therefore, is satisfaction in a statistical sense.
Here we briefly explain the so-called "importance" factors that can be formed in the preprocessing stage. As the name suggests, because they are formed during preprocessing, they are independent of any user query. How can we tell that one web page is more important than another? Borrowing from the way the importance of scientific and technical literature is evaluated, the core idea is that "what is cited the most is important". The notion of "citation" maps neatly onto the HTML hyperlinks between pages. In addition, people have noted a difference between web pages and literature: some pages consist mainly of a large number of outgoing links and have no clear topical content of their own, while other pages are linked to by a large number of other pages. In a sense, this forms a dual relationship, which allows a second importance indicator to be defined for a web page. Some of these indicators can be calculated in the preprocessing stage and others at query time, but all of them feed into the result ranking performed at the query service stage.
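The article does not name a specific algorithm, but one well-known formalization of this dual relationship is the hub/authority scheme (HITS): a page's authority score grows with the hub scores of the pages linking to it, and its hub score grows with the authority scores of the pages it links to. The sketch below assumes that scheme; the link graph and page names are invented for the example.

```python
import math


def hits(links, iterations=20):
    """links maps each page to the list of pages it links to.
    Returns (hub, authority) score dictionaries."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: "directory" is mostly outgoing links,
# while "b" is linked to by every other page.
links = {"directory": ["a", "b", "c"], "a": ["b"], "c": ["b"]}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

In this toy graph the link-heavy "directory" page ends up with the highest hub score and the heavily cited page "b" with the highest authority score, which mirrors the two kinds of pages described above.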