Search engine principles: data preprocessing

Source: Internet
Author: User
Broadly, the search engine workflow has three stages: data collection, preprocessing, and query service. Here I want to share the data preprocessing stage. A note up front: this involves some technical vocabulary; in my blog I have added anchor-text links for these terms, so if anything is unclear you can follow them to the original explanations.

[Figure: engine2.gif, data preprocessing workflow of a search engine]

In this post, "data preprocessing" consists of four aspects: keyword extraction, elimination of "mirror pages" and "reprinted pages", link analysis, and computation of page importance.

   Keyword extraction:
1 Every web page contains a lot of content unrelated to its subject, such as copyright notices. The task of keyword extraction is to pull out the keywords contained in the page's source file. The usual method resembles word segmentation: cut the content into an array of words, remove meaningless function words (stop words), and determine the final keywords. (Blogger's note: keyword density, bolded keywords, and directional anchor text exist largely for this reason — they make it easier for the search engine to judge the keywords.)
2 Later chapters covering the DocView model will explain this in more detail. Before keyword extraction there are also several preparatory steps, such as web page purification and restoring the text's reading order; I don't fully understand these myself, so interested readers can follow the links: DocView model, web page purification.
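The segmentation-plus-stop-word step above can be sketched as follows. This is a minimal illustration, not the book's actual method: the stop-word list is a tiny English subset chosen for the example, and the tokenizer is a simple regular expression.

```python
from collections import Counter
import re

# Illustrative stop-word subset; a real engine uses a much larger list
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def extract_keywords(text, top_n=5):
    """Cut text into words, drop stop words, return the most frequent terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]
```

For example, `extract_keywords("the cat and the dog and the cat")` keeps only `cat` and `dog`, ranked by frequency.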

   Elimination of duplicate or reprinted pages:
1 Skynet's (Tianwang's) 2003 statistics put the average duplication rate of web pages at 4; by now, in 2015, that number has surely passed 10. For netizens, duplicates make it harder to reach useful information; for search engines, they waste collection time and network bandwidth. The concrete implementation will be discussed later.
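One simple way to drop exact duplicates (mirror pages) is to fingerprint each page's normalized text and keep only the first page per fingerprint. This is a sketch of the general idea only — the book promises the concrete method later, and real engines detect near-duplicates (reprinted pages) with techniques like shingling or SimHash rather than a single hash.

```python
import hashlib

def content_fingerprint(text):
    """Hash the page text; normalize whitespace and case so trivially
    reformatted mirror copies produce the same fingerprint."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(pages):
    """pages: list of (url, text). Keep the first URL for each distinct text."""
    seen, unique = set(), []
    for url, text in pages:
        fp = content_fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append(url)
    return unique
```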

   Link analysis:
1 Link analysis involves two concepts. Term frequency (TF): how often a keyword occurs within the keyword set extracted from a single page.
2 Document frequency (DF): how widely a keyword occurs across the whole collection, i.e. the number of documents in which the keyword appears.
3 Search engines can also use HTML text tags to judge the importance of keywords.
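The TF and DF definitions above can be made concrete with a short sketch (my own illustration; each document is represented simply as a list of extracted keywords):

```python
from collections import Counter

def term_and_doc_frequency(docs):
    """docs: list of token lists, one per document.
    Returns (per-document TF counters, corpus-wide DF counter)."""
    tf = [Counter(doc) for doc in docs]
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts at most once per term
    return tf, df
```

Note the `set(doc)` when computing DF: a term repeated ten times in one document still contributes only 1 to that term's document frequency.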
   Computing the importance of web pages:
1 The search engine needs to present results to the user as an ordered list, and the ordering must satisfy the user's search intent; hence the concept of "page importance".
2 One way to determine importance borrows from how people assess the importance of literature: the most cited work is the most important. HTML hypertext links embody this idea perfectly, and Google's PR value (which weighs both how many pages cite a page and how important those citing pages are) is its classic realization. (Blogger's note: building external links exploits exactly this algorithm.) (PageRank algorithm)
3 What differs from point 2 is that some pages mainly point to many other pages, while others are mainly pointed to by many pages, forming a dual relationship; the HITS algorithm was introduced to capture this. (HITS algorithm)

       Some of the terms:

       Inverted index: uses the keywords contained in documents (the collected web pages) as the index, with the documents themselves as the index targets (target documents). It works just like a paper book: the index entries are the keywords, and the pages of the book are the index targets.
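The book-index analogy maps directly onto a data structure: each word points to the set of documents that contain it. A minimal sketch (documents represented as plain strings, split on whitespace for simplicity):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Maps each word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index
```

Looking up a query word then returns the matching documents in one step, instead of scanning every page at query time — which is why this structure is built during preprocessing.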

Mirror page: the page content is exactly the same, without any changes.
Reprinted page: the main content is basically the same, with only a small amount of added editorial information.

HITS algorithm: briefly, HITS distinguishes two kinds of pages: authority pages and hub (directory) pages. For an authority page A, the more hub pages point to it, the higher A's quality; likewise, the more authority pages a hub page H points to, the higher H's quality.
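That mutual reinforcement — authorities are ranked by the hubs pointing at them, hubs by the authorities they point to — can be sketched as iterative updates with normalization. This is my own minimal illustration of the standard HITS iteration, not the book's code.

```python
def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (authority, hub) scores."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of pages linked to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub
```

On a graph where two hubs point at `a` but only one points at `b`, page `a` scores higher as an authority, and the hub linking to more good authorities scores higher as a hub.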

To summarize: in this book's treatment of search engine principles, data preprocessing covers four aspects. In my view, since link analysis exists to serve the determination of page importance, those two can be grouped together, leaving three aspects in all. In one sentence: first eliminate duplicate and reprinted pages, then extract keywords, and finally use TF, DF, links, and the ranking algorithms to determine each page's importance.
