A brief introduction to the search engine workflow (search engine technology)

The Internet is a treasure house, and search engines are a key to opening it. However, most Internet users lack knowledge of how search engines work and skill in using them. According to one overseas survey, about 71% of people are disappointed with their search results. For what is the second most widely used service on the Internet, this situation should change.
The rapid development of the Internet has led to explosive growth of online information. The number of web pages worldwide exceeds 2 billion, and roughly 7.3 million new pages appear every day. Finding information in this vast ocean is as hard as finding a needle in a haystack. Search engines are the technology that emerged to solve this "getting lost" problem.
A search engine's workflow consists of the following three processes:

1. Discover and collect webpage information on the Internet;
2. Extract that information and build an index database;
3. Finally, the searcher quickly looks up documents in the index database that match the keywords entered by the user, evaluates the relevance between each document and the query, sorts the results accordingly, and returns them to the user (see the sketch after this list).
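
The retrieval step in item 3 can be pictured with a short, hedged sketch in Python. It assumes a toy inverted index that maps each keyword to the documents containing it along with a per-document weight; the index layout and the additive scoring rule are illustrative assumptions, not how any particular engine ranks results.

```python
from collections import defaultdict

# Hypothetical toy index: keyword -> {document id: term weight}.
index = {
    "search": {"doc1": 0.5, "doc2": 0.25},
    "engine": {"doc1": 1.0, "doc3": 0.75},
}

def retrieve(query, index):
    """Look up each query keyword in the index, accumulate a relevance
    score per document, and return documents sorted by that score."""
    scores = defaultdict(float)
    for keyword in query.lower().split():
        for doc_id, weight in index.get(keyword, {}).items():
            scores[doc_id] += weight          # simple additive relevance
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(retrieve("search engine", index))
# doc1 matches both keywords and therefore ranks first.
```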

Discover and collect webpage information
A high-performance spider program is required to gather information from the Internet automatically. A typical web spider works by fetching a page and extracting the relevant information from it; it then follows every link on that page and repeats the process, continuing until no new pages remain. A spider must be both fast and comprehensive. To traverse the Internet quickly, spiders usually use preemptive multithreading: the spider indexes the page at one URL, then starts a new thread for each new link it finds, using that link as a fresh starting point. Of course, the number of threads opened against a server cannot grow without bound; a balance must be struck between keeping the server running normally and collecting pages quickly.

The algorithms differ from one search engine company to another, but the goal is the same: browse web pages quickly and feed the subsequent processes. Baidu's web crawlers, for example, use customizable, highly scalable scheduling algorithms that allow the spider to collect the largest possible amount of Internet information in a very short time and save it for indexing and user retrieval.
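
A minimal sketch of such a multithreaded spider is shown below. It assumes pages are reachable over plain HTTP, that links can be pulled out with a simple regular expression, and that a fixed pool of worker threads is an acceptable stand-in for the thread-count balance described above; a production crawler would add robots.txt handling, politeness delays, and real HTML parsing.

```python
import re
import threading
import queue
from urllib.request import urlopen
from urllib.parse import urljoin

MAX_THREADS = 8                      # cap on concurrent worker threads
frontier = queue.Queue()             # URLs waiting to be crawled
seen = set()                         # URLs already scheduled
seen_lock = threading.Lock()

def worker(store):
    while True:
        url = frontier.get()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            store(url, html)                     # hand the page to the indexer
            for link in re.findall(r'href="(http[^"]+)"', html):
                link = urljoin(url, link)
                with seen_lock:
                    if link not in seen:
                        seen.add(link)
                        frontier.put(link)       # follow each newly found link
        except Exception:
            pass                                 # skip pages that fail to load
        finally:
            frontier.task_done()

def crawl(start_urls, store):
    for url in start_urls:
        seen.add(url)
        frontier.put(url)
    for _ in range(MAX_THREADS):
        threading.Thread(target=worker, args=(store,), daemon=True).start()
    frontier.join()                              # wait until the frontier drains

# Example (placeholder URL): crawl(["https://example.com/"],
#                                   lambda url, html: print("fetched", url))
```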

Index database creation
How the index database is built determines whether users can quickly find the most accurate and comprehensive information. The index must also be created quickly, so that pages captured by the spider can be indexed promptly and the information stays fresh. Relevance is evaluated through both content analysis and hyperlink analysis, so that pages can be ranked objectively and the search results match the query string. When indexing website data, the Sina search engine, for example, builds its index database from the positions where keywords appear (such as the site title, site description, and site URL) together with the quality grade of the site, which helps keep the results consistent with the user's query. During index creation, Sina's engine processes all data with multiple parallel processes and uses an incremental method to index new information, so that the index is built rapidly and the data stays up to date. It also tracks the query strings users search for and creates cache pages for the most frequent queries.
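
The indexing ideas above can be sketched in the same toy style. The field weights, the incremental add_page update, and the query cache below are illustrative assumptions inspired by the description of location-based weighting and cached hot queries; they are not the actual design of Sina's or any other engine's indexer.

```python
from collections import defaultdict

# Assumed field weights: a keyword in the title counts for more than one
# appearing in the description or the URL.
FIELD_WEIGHTS = {"title": 3.0, "description": 1.5, "url": 1.0}

index = defaultdict(lambda: defaultdict(float))   # keyword -> {doc_id: weight}
query_cache = {}                                  # results cached for repeated queries

def add_page(doc_id, fields):
    """Incrementally merge one newly crawled page into the existing index
    instead of rebuilding the whole index from scratch."""
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for keyword in text.lower().split():
            index[keyword][doc_id] += weight
    query_cache.clear()            # new pages may invalidate cached results

def search(query):
    """Serve cached queries directly; otherwise score as in the retrieval
    sketch earlier and remember the result. A real engine would cache only
    the most frequent queries rather than every one."""
    if query in query_cache:
        return query_cache[query]
    scores = defaultdict(float)
    for keyword in query.lower().split():
        for doc_id, weight in index.get(keyword, {}).items():
            scores[doc_id] += weight
    results = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    query_cache[query] = results
    return results

add_page("page1", {"title": "search engine basics",
                   "description": "how a web spider collects pages",
                   "url": "example.com/search-engine"})
print(search("search engine"))     # page1 scores highly on both title keywords
```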
