Wandering Ghosts: Search Engine Workflow

Source: Internet
Author: User


The Internet is a treasure house, and the search engine is a key that opens it. However, the vast majority of Internet users know too little about search engines and lack the skills to use them well. A survey conducted abroad found that about 71% of respondents were disappointed, to varying degrees, with their search results. For the second most widely used service on the Internet, this situation ought to change. The rapid development of the Internet has led to explosive growth in online information. The world currently has more than 2 billion web pages, with 7.3 million new pages added every day. Finding information in such a vast ocean is as hard as looking for a needle in a haystack. Search engines are the technology that emerged to solve this problem of getting lost in information. A search engine's work consists of the following three processes:

1. Discover and collect web page information on the Internet;

2. Extract and organize that information to build an index library;

3. Based on the query terms the user enters, the retriever quickly finds matching documents in the index library, evaluates the relevance of each document to the query, sorts the output, and returns the query results to the user.
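
The three processes above can be sketched end to end in a few lines of Python. This is only a toy illustration, not how any particular search engine is built: the in-memory `PAGES` table stands in for the real web, and the names `collect`, `build_index`, and `search`, as well as the term-count relevance score, are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("search engines index web pages", ["b.example"]),
    "b.example": ("spiders collect web pages", ["a.example", "c.example"]),
    "c.example": ("users query the index", []),
}

def collect(start):
    """Process 1: follow links from a start page and gather page text."""
    seen, queue, docs = {start}, [start], {}
    while queue:
        url = queue.pop(0)
        text, links = PAGES[url]
        docs[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs

def build_index(docs):
    """Process 2: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.split():
            index[word].add(url)
    return index

def search(index, docs, query):
    """Process 3: find matching documents, score them, and sort the output."""
    scores = defaultdict(int)
    for word in query.split():
        for url in index.get(word, ()):
            # Assumed relevance score: how often the query word occurs.
            scores[url] += docs[url].split().count(word)
    return sorted(scores, key=lambda u: (-scores[u], u))

docs = collect("a.example")
index = build_index(docs)
results = search(index, docs, "web pages")
```

Each function corresponds to one numbered process, which makes it easy to see that retrieval depends entirely on what the first two processes collected and indexed.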

Discover and collect web information

Automatically gathering information from the Internet requires a high-performance "web spider" program. A typical web spider works by reading a page and extracting the relevant information, then following every link on that page to look for more, and so on. A web spider must be both fast and comprehensive. To browse the whole Internet quickly, web spiders usually use preemptive multithreading to gather information. With preemptive multithreading, the spider indexes a web page reached through a URL link and starts a new thread for each new URL link it finds, using that URL as a new starting point for indexing. Of course, the number of threads on the server cannot grow indefinitely; a balance must be found between keeping the server running normally and collecting web pages quickly. Search engine technology companies may differ in their algorithms, but the goal is the same: to browse web pages quickly and keep pace with the subsequent processing stages. Among search engine technology companies in China today, Baidu, for example, uses a customizable, highly scalable scheduling algorithm in its web spider so that the searcher can collect the largest amount of Internet information in a very short time and save it for building the index library and for user retrieval.
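
A minimal sketch of the multithreaded crawl described above, written in Python. It does not reproduce any company's scheduling algorithm. Instead of one unbounded thread per URL, it submits one task per newly discovered URL to a thread pool of fixed size, which is one simple way to respect the limit on server threads. The link graph `LINKS`, the function names `fetch_links` and `crawl`, and the default of 4 workers are all hypothetical; a real spider would download pages over the network.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical link graph standing in for the real web: URL -> outgoing links.
LINKS = {
    "root": ["news", "blog"],
    "news": ["root", "sports"],
    "blog": ["news"],
    "sports": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return LINKS.get(url, [])

def crawl(start, max_workers=4):
    """Crawl from start, one task per new URL, with a bounded thread pool."""
    seen = {start}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(fetch_links, start)}
        while pending:
            # Block until at least one page has been fetched.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for link in future.result():
                    if link not in seen:
                        # Each newly found URL becomes a new starting point.
                        seen.add(link)
                        pending.add(pool.submit(fetch_links, link))
    return seen

visited = crawl("root")
```

Only the main thread touches `seen`, so no lock is needed; the worker threads do nothing but fetch. Raising `max_workers` trades server load for crawl speed, which is the balance the paragraph above refers to.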

Establishment of Index Library

The index library determines whether users can find the most accurate and most extensive information as quickly as possible. Building it must also be fast: the web information the spider captures has to be indexed very quickly to keep the information timely. Web pages are evaluated for relevance by combining content analysis with hyperlink analysis, so that they can be ranked objectively and the search results largely match the user's query string. When indexing site data, the Sina search engine builds the index library according to where a keyword appears (the site title, the site description, the site URL, and so on) and according to the quality of the site, so that the search results match the user's query string. While building the index library, the Sina search engine processes all data with multiple processes in parallel and adds new information incrementally, which keeps indexing fast and lets the data be updated promptly. The Sina search engine also tracks the query strings users search for while building the index library, and creates cached pages for query strings that are queried frequently.
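
As a rough illustration of indexing by keyword position with incremental updates, the Python sketch below keeps an inverted index in which a keyword found in the title counts for more than one found in the description. The weights in `FIELD_WEIGHTS`, the class name `SiteIndex`, and the sample site records are assumptions made for the example; the source does not state the actual weights or data structures Sina uses, and site quality and hyperlink analysis are left out.

```python
import re
from collections import defaultdict

# Assumed weights for where a keyword appears in a site record.
FIELD_WEIGHTS = {"title": 3.0, "url": 2.0, "description": 1.0}

class SiteIndex:
    def __init__(self):
        # term -> {site_id: accumulated weight}
        self.postings = defaultdict(dict)

    def add(self, site_id, fields):
        """Incrementally index one site record without rebuilding the index."""
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 0.0)
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                entry = self.postings[term]
                entry[site_id] = entry.get(site_id, 0.0) + weight

    def lookup(self, term):
        """Return the ids of sites containing term, highest weight first."""
        entry = self.postings.get(term.lower(), {})
        return sorted(entry, key=lambda s: (-entry[s], s))

site_index = SiteIndex()
site_index.add("s1", {"title": "Python news",
                      "description": "daily python articles",
                      "url": "pynews.example"})
site_index.add("s2", {"title": "Tech blog",
                      "description": "python and go tips",
                      "url": "techblog.example"})
before = site_index.lookup("python")

# New information arrives later and is added without reindexing s1 or s2.
site_index.add("s3", {"title": "Python tutorials",
                      "description": "learn python",
                      "url": "learnpython.example"})
after = site_index.lookup("python")
```

Because `add` only touches the postings of the terms in the new record, the cost of an update depends on the size of that record rather than on the size of the whole index.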

User Retrieval process

This process is the test of the previous two: it shows whether the search engine can give the most accurate and most extensive information, and whether it can quickly deliver what the user most wants. For site data retrieval, the Sina search engine uses a client/server structure in which multiple processes search the index library, which greatly reduces the user's waiting time and keeps the server load from becoming too high at peak query times (the average retrieval time is about 0.3 seconds). For web page retrieval, Baidu, which supplies web search technology to many portal sites in China, uses advanced multithreading, an efficient search algorithm, and a stable UNIX platform, and so can greatly shorten the response time for user search requests. i-search2000, one of the application software products in the HC I series, uses very large dynamic caching so that first-level response coverage reaches more than 75%, and its unique self-learning ability can automatically raise second-level response coverage to more than 20%.
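
The idea of answering frequent queries from a cache can be sketched as follows in Python. This is not the caching design of i-search2000 or of any engine named above, whose internals the text does not describe. The class name `CachedSearcher`, the `threshold` parameter, and the stand-in backend function are hypothetical: a query is served from the cache once it has been seen a given number of times, and otherwise goes to the slower index lookup.

```python
from collections import Counter

class CachedSearcher:
    def __init__(self, backend, threshold=2):
        self.backend = backend      # function: query string -> list of results
        self.threshold = threshold  # assumed: cache a query once seen this often
        self.counts = Counter()
        self.cache = {}
        self.backend_calls = 0

    def search(self, query):
        # Normalize case and whitespace so equivalent queries share an entry.
        key = " ".join(query.lower().split())
        if key in self.cache:
            return self.cache[key]
        self.counts[key] += 1
        self.backend_calls += 1
        results = self.backend(key)
        if self.counts[key] >= self.threshold:
            self.cache[key] = results
        return results

def slow_backend(query):
    """Stand-in for a real index lookup."""
    return [query.upper()]

searcher = CachedSearcher(slow_backend, threshold=2)
searcher.search("web spider")            # first time: goes to the backend
searcher.search("Web   Spider")          # second time: backend, then cached
answer = searcher.search("web spider")   # third time: served from the cache
```

Tracking counts before caching means rare one-off queries do not fill the cache, while popular queries stop reaching the index after the first few requests.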

