Chen Huaiyi: A summary of the three-stage workflow of search engines

Source: Internet
Author: User

The problem a search engine has to solve is to return, within an acceptable time, a list of pages that match the user's query. Each entry in the list has three parts: a title, a URL, and a description or summary.

Modern large-scale search engines generally use a three-stage workflow: Web page collection, preprocessing, and query service.

Here is a brief walk through the three stages:

  First, Web page collection

A search engine uses a crawler to collect Web pages from the Internet and store them in a database. It cannot wait until a user submits a query to start crawling; a batch of pages has to be collected in advance. The Web can be viewed as a graph. Collection starts from a given set S of starting URLs and follows the links in those pages, traversing the graph depth-first, breadth-first, or by some other strategy: repeatedly take a URL out of S, download the corresponding page, parse the hyperlink URLs in the page, check whether each one has been visited, and add the unvisited URLs to S. Pages can be collected in periodic batches, incrementally, or in response to URLs that users submit themselves. The collected batch of pages then has to be maintained so that changes on the Web are discovered in time: new pages, modified pages, and pages that no longer exist.
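
The traversal described above can be sketched in a few lines. This is only a minimal illustration, not how any particular crawler is built: the `TOY_WEB` dictionary, the `crawl` function, and the `toy_fetch_links` helper are all hypothetical names, and the "download and parse" step is replaced by a lookup in an in-memory link table so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
# A real crawler would download the page and parse its hyperlinks instead.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def toy_fetch_links(url):
    """Stand-in for 'download the page and extract its hyperlinks'."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first traversal from the seed set S; returns URLs in visit order."""
    frontier = deque(seed_urls)  # the set S of URLs waiting to be fetched
    seen = set(seed_urls)        # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        # popleft() gives breadth-first order; pop() would give depth-first.
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # only unvisited URLs join S
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://a.example/"], toy_fetch_links)
print(order)
```

The `seen` set is what keeps the crawler from looping forever on cycles such as the link from c back to a.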

  Second, preprocessing

Preprocessing mainly covers four things: keyword extraction, elimination of mirrored or reproduced Web pages, link analysis, and calculation of Web page importance.

1. Extraction of key words

This is the basic task of the preprocessing stage: extracting the keywords from the content part of a Web page's source file. For Chinese, this is done with so-called "word segmentation software", which uses a dictionary to cut the page text into the words the dictionary contains. After that, a page is represented mainly by a set of words, p = {a, b, c, ..., d}. In general this yields a lot of words, and the same word may appear several times in one page. Next, the "stop words" are removed, that is, very common words such as "in" and "is". Finally, statistics such as term frequency (TF) and document frequency (DF) are computed; they indicate how important a word is within a document and how relevant it is to a given topic.
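
As a rough sketch of these steps, the example below extracts terms, drops stop words, and counts TF and DF. It rests on several assumptions: the text is English, so a regular expression stands in for the dictionary-based word segmentation that Chinese needs; the `STOP_WORDS` set is a tiny illustrative list; and all function names are made up for this example.

```python
import re
from collections import Counter

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "in", "is", "and", "to"}


def extract_terms(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_frequencies(text):
    """TF: how many times each term occurs in one document."""
    return Counter(extract_terms(text))


def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for text in docs:
        df.update(set(extract_terms(text)))
    return df


docs = [
    "The crawler downloads the page and the links in the page",
    "The index maps a word to the page",
]
tf0 = term_frequencies(docs[0])
df = document_frequencies(docs)
print(tf0["page"], df["page"], df["crawler"])  # 2 2 1
```

A term with high TF in one page but low DF across the collection is usually a better keyword for that page than a term that appears everywhere.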

2. The elimination of mirrored or reproduced Web pages

The Web contains a lot of duplicated information, and this is a burden for search engines. Collecting it wastes machine time and bandwidth, displaying it wastes space on the user's screen, and it draws complaints from users, who feel that out of so many repeats one would have been enough. Eliminating this duplicate information is therefore an important preprocessing task.
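
The article does not say how duplicates are detected. One common approach, shown here purely as an assumption, is to compare pages by their overlapping word sequences ("shingles") and treat two pages as duplicates when the Jaccard similarity of their shingle sets passes a threshold. The function names, the shingle length `k`, and the `threshold` value are illustrative choices.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences that occur in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8):
    """Keep the first page of each group of near-duplicates.

    pages is a list of (url, text) pairs; the result is a list of URLs.
    """
    kept = []  # (url, shingle set) of the pages kept so far
    for url, text in pages:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for _, other in kept):
            kept.append((url, s))
    return [url for url, _ in kept]


pages = [
    ("http://a.example/post", "search engines collect pages and build an index"),
    ("http://mirror.example/post", "search engines collect pages and build an index"),
    ("http://b.example/other", "ranking orders the results for each query"),
]
unique = deduplicate(pages)
print(unique)
```

Comparing every page against every kept page is quadratic, so this sketch only suits small collections; it is meant to show the idea, not a scalable design.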

3. Link analysis

Besides analyzing content, a search engine also has to analyze links. Link information not only describes the relationships between pages, it also plays an important role in judging what a page is about. A site's internal and external links have a great impact on its ranking.

4. Computing the importance of Web pages

What a search engine returns to the user is a list of results related to the query, and the order of the items in that list matters a great deal. The search engine therefore has to present the results in a computed order, although no ordering will satisfy every user. How to evaluate the weight of a Web page is one of the central problems a search engine has to solve. Examples are Google's PageRank (PR) algorithm, whose core idea is that "the more a page is cited, the more important it is", and the HITS algorithm. Some of these scores are computed in the preprocessing stage and some in the query service stage; together they determine the final ordering.

  Third, query service
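
To make the "more cited means more important" idea concrete, here is a small sketch of the textbook PageRank iteration. It is a simplified formulation and not Google's production system; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on evenly to the pages it cites.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all cite C; C cites A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C comes out on top because three pages cite it, and A ranks above B and D because its single citation comes from the highly ranked C.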

Collection started from the set S, and after preprocessing the system holds an internal representation of a set of Web pages. For each page this includes at least the following: the original Web page document, the URL and title, a document number, and the set of important keywords together with their positions and other indicators. The system-wide set of keywords and the document numbers together form an inverted file, so that as soon as a keyword is entered, the set of document numbers that contain it can be output immediately. Query service has three main aspects: query method and matching, result ordering, and document summaries.

1. Query method and matching
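
A minimal sketch of such an inverted file, assuming each document has already been reduced to a list of keywords. The `build_inverted_index` function and the sample documents are hypothetical; a production index would be stored on disk and compressed.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> {document number: [positions]} from keyword lists.

    docs maps a document number to its list of keywords in page order.
    """
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: ["search", "engine", "crawler"],
    2: ["crawler", "index", "crawler"],
    3: ["search", "index"],
}
index = build_inverted_index(docs)
print(sorted(index["crawler"]))  # document numbers containing "crawler": [1, 2]
print(index["crawler"][2])       # positions of "crawler" in document 2: [0, 2]
```

Storing positions alongside the document numbers is what later allows phrase matching and query-dependent summaries without rereading the page.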

The usual way for a user to query is to "type in what you want". For the search engine this is a vague statement, since it cannot know what the user really wants, so it takes the query to mean "find the Web pages that contain these words or phrases". The user's query is segmented into words as well, forming a set Q. Each element of Q corresponds to an inverted list in the inverted file, that is, a set of document numbers. This makes it possible to match queries against documents.
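
The matching step can be sketched as an intersection of inverted lists. The sketch assumes AND semantics (a page must contain every query word), uses whitespace splitting in place of real word segmentation, and works on a small hand-written `INVERTED_INDEX`; all of these are illustrative assumptions.

```python
# Hypothetical inverted file: term -> set of document numbers.
INVERTED_INDEX = {
    "search": {1, 3, 4},
    "engine": {1, 4},
    "crawler": {1, 2},
}


def match(query, index):
    """Return the document numbers that contain every word of the query."""
    terms = query.lower().split()  # stand-in for segmenting the query into Q
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    postings.sort(key=len)         # start from the shortest list
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


print(match("search engine", INVERTED_INDEX))  # {1, 4}
print(match("search crawler", INVERTED_INDEX))  # {1}
```

Intersecting the shortest list first keeps the intermediate result small, which matters when some query words appear in millions of documents.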

2. Order of results

To give users the highest-quality Web information, the results must be sorted. Google's PageRank algorithm and Kleinberg's HITS algorithm are among the main methods current search engines use to rank query results.
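
For comparison with PageRank, here is a compact sketch of the HITS idea: every page gets an authority score (it is cited by good hubs) and a hub score (it cites good authorities), and the two are updated in turn. The iteration count, the normalization, and the sample graph are assumptions for illustration only.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries for a link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: A links to C and D, B links to C.
links = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # C A
```

HITS is normally run on the subgraph of pages related to one query, whereas PageRank is computed once over the whole collection.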

3. Document Summary

The result given by the search engine is an ordered list of entries, each with a title, a URL, and a summary. The summary has to be generated from the body of the page, and there are two ways to do it. The first is static: extract some text from the body, such as its first 512 bytes, or the first sentence of each paragraph joined together. Its disadvantage is that the summary may have nothing to do with the query words. The second is the dynamic summary: locate the query words in the document, extract the surrounding text, and display it with the query words highlighted. Most search engines currently use this approach. To keep queries efficient, the positions at which each keyword appears in a document need to be recorded during the preprocessing stage.
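
A dynamic summary can be sketched as follows. For simplicity this example finds the query words by scanning the text with a regular expression, whereas the approach described above looks up positions recorded during preprocessing. The `dynamic_summary` function, the `window` size, and the `**...**` highlight markers are illustrative choices.

```python
import re


def dynamic_summary(text, query_terms, window=30):
    """Return the text around the first query-word hit, with hits marked.

    Falls back to a static summary (the start of the text) if nothing matches.
    """
    terms = [t for t in query_terms if t]
    if not terms:
        return text[:2 * window]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hit = pattern.search(text)
    if hit is None:
        return text[:2 * window]
    # Cut a window of characters on both sides of the first hit.
    start = max(0, hit.start() - window)
    end = min(len(text), hit.end() + window)
    snippet = text[start:end]
    # Highlight every query word that falls inside the snippet.
    return pattern.sub(lambda m: "**" + m.group(0) + "**", snippet)


text = ("A crawler collects pages. The index maps each word to documents. "
        "Ranking orders the results.")
summary = dynamic_summary(text, ["index"], window=15)
print(summary)
```

With the positions already stored in the inverted file, the search for the first hit becomes a lookup instead of a scan, which is the efficiency point made above.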

This article comes from www.hbdsz.com; please credit the source when reprinting.
