Search Engine Principles: Data Preprocessing

Broadly speaking, a search engine's workflow has three stages: data collection, preprocessing, and query service. Here I want to share the data preprocessing stage. It involves a number of technical terms; in the original post on my blog they are linked with anchor text, but those links are missing here, so if anything is unclear you can go read the original.

[Figure: data preprocessing of a search engine (engine2.gif)]

In our "Data preprocessing" is mainly contains four aspects: keyword extraction, "Mirror page" and "reprint page" elimination, Link analysis and page importance of the calculation.

   Keyword extraction:
1) Every page contains a great deal of content unrelated to its subject, such as copyright notices. The task of keyword extraction is to pull out the keywords contained in the page's source file. The general method resembles word segmentation: split the content into an array of words, remove meaningless stop words such as "of" and "in", and keep what remains as the final keywords (a minimal sketch follows below). (Blogger's note: this is why keyword density, bolded keywords, and targeted anchor text matter; they make it easier for the search engine to identify the keywords.)
Later chapters will come back to this, and the DocView model will explain it in more detail; before keyword extraction there are also steps such as page purification. To keep the book's editing in order they are not explained here; if interested, follow the links to: DocView model, web page purification.
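As a rough illustration of the "cut the text into words, then drop the meaningless ones" step, here is a minimal Python sketch. The stop-word list and the regex tokenizer are simplifications of my own; a real engine would use a proper segmenter (especially for Chinese) and would strip navigation, ads, and copyright blocks first.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system uses a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}

def extract_keywords(page_text, top_n=10):
    """Split page text into words, drop stop words, and return the
    most frequent remaining terms as candidate keywords."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    meaningful = [w for w in words if w not in STOP_WORDS]
    return Counter(meaningful).most_common(top_n)

print(extract_keywords("The search engine extracts keywords in the page, "
                       "and the keywords guide indexing of the page."))
```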

   Elimination of mirrored or reprinted pages:
1) Statistics from Tianwang ("Skynet") in 2003 found that, on average, each web page was duplicated about 4 times; by now, in 2015, that figure has surely passed 10. For users, eliminating duplicates means a better chance of reaching useful information; for the search engine, it avoids wasting collection time and Internet bandwidth on duplicate pages. The concrete implementation will be discussed later; one common approach is sketched below.
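The book defers the concrete method, but one common family of techniques (my own illustration, not necessarily the one the book describes) uses an exact content hash to catch mirror pages and word-shingle similarity to catch reprints:

```python
import hashlib

def page_fingerprint(text):
    """Exact-match fingerprint: identical only for mirror pages."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=5):
    """Set of k-word shingles, used to compare near-duplicates."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "same main content here " * 20
reprint  = "editor note added. " + "same main content here " * 20

print(page_fingerprint(original) == page_fingerprint(reprint))  # False: not mirrors
print(jaccard(shingles(original), shingles(reprint)))           # high: likely a reprint
```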

   Link analysis:
1) Link analysis here involves two concepts. Term frequency (TF): how often a keyword occurs within the keyword collection extracted from a single page;
2) Document frequency (DF): across all documents, how many documents the keyword appears in;
3) The search engine also uses HTML text tags (headings, bold, and so on) to judge the importance of keywords. A sketch of the two counts follows below.
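A minimal sketch of the two counts, using hypothetical toy pages in place of a real collection:

```python
from collections import Counter

# Keyword lists as produced by the extraction step above (toy data).
docs = {
    "page1": ["search", "engine", "index", "search"],
    "page2": ["search", "crawler"],
    "page3": ["link", "analysis"],
}

def tf(term, doc_terms):
    """Term frequency: occurrences of the term in one page's keywords."""
    return Counter(doc_terms)[term]

def df(term, all_docs):
    """Document frequency: number of pages containing the term."""
    return sum(1 for terms in all_docs.values() if term in terms)

print(tf("search", docs["page1"]))  # 2
print(df("search", docs))           # 2
```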
   Calculation of page importance:
1) A search engine has to present results to the user as a ranked list, and that list should satisfy the user's search need as well as possible; hence the concept of "page importance".
2) How to judge importance: the core idea, borrowed from how people assess references, is that "the most cited is the most important". HTML hyperlinks embody this idea perfectly, and Google's PR value (which reflects both how many pages refer to a page and how important those referring pages are) is its best-known realization. (Blogger's note: building external links exploits exactly this algorithm.) (The PageRank algorithm; a sketch follows below.)
3) What point 2 does not capture is that some pages mostly point out to many other pages, while others are mostly pointed to by other pages, forming a dual relationship; this observation gave rise to the HITS algorithm. (The HITS algorithm, sketched in the term definitions below.)
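A minimal power-iteration sketch of the PageRank idea, where a page's score grows with the scores of the pages that point to it; the link graph here is a made-up toy, and real PR uses many refinements beyond this:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps every page to the list
    of pages it links to (every linked page must appear as a key)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for out in outs:
                    new_rank[out] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

# Toy graph: A and C both point to B, so B ends up most important.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```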

[Figure: SEO (seo-ganyan.jpg)]
       Some term definitions:

       "Inverted text: Using the key contained in the document (the page already collected) As the index of the index, the document is the landing page (the target document), common, like paper books, index is the article keyword, the specific content of the book or the page is the index target page.

Mirror page: a page whose content is exactly identical to another's, with no modification at all.
Reprinted page: a page whose main content is basically the same as another's, with a little extra editorial information added.

HITS algorithm: briefly, HITS distinguishes two kinds of pages, authority pages and hub (directory) pages. For an authority page A, the more hub pages point to it, the higher A's quality; likewise, for a hub page H, the more authority pages it points to, the higher H's quality.
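A minimal sketch of that mutual reinforcement between hubs and authorities; the toy graph and the fixed iteration count are my own simplifications:

```python
def hits(links, iterations=50):
    """HITS: a good hub points to good authorities; a good authority
    is pointed to by good hubs. Scores are normalized each round."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        # Hub score: sum of authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores do not blow up.
        a_norm = sum(auth.values()) or 1.0
        h_norm = sum(hub.values()) or 1.0
        auth = {p: s / a_norm for p, s in auth.items()}
        hub = {p: s / h_norm for p, s in hub.items()}
    return auth, hub

# Toy graph: H1 and H2 are hubs pointing at authority A.
auth, hub = hits({"H1": ["A"], "H2": ["A"], "A": []})
print(auth["A"], hub["H1"])
```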

To summarize: in this book on search engine principles, the introduction to data preprocessing covers four parts. In Chen Chen's view, link analysis exists to judge the importance of pages, so those two can be grouped into one, leaving three parts in total. In a word: first eliminate duplicated or reprinted pages, then extract the keywords, and finally use DF, TF, links, and the algorithms above to determine each page's importance.
