Section 1 Basic working mechanism of search engines
The data centers of large Internet search engines generally run thousands or even hundreds of thousands of computers, and dozens of machines are added to the cluster every day to keep pace with the growth of the Web. The collection machines gather web page information automatically, at an average speed of dozens of pages per second, while the retrieval machines provide a fault-tolerant, scalable system architecture to answer queries from tens of millions or even hundreds of millions of users each day. Enterprise search engines can be deployed on anything from a single computer to a computer cluster, depending on the scale of the application.
The general workflow of a search engine is to first collect web pages from the Internet, then pre-process the collected pages and build a web index database; it then responds to users' query requests in real time, sorts the matching results according to certain rules, and returns them to users. An important function of search engines is to provide full-text retrieval of the text information on the Internet.
Figure 1 Workflow of the search engine
The search engine receives retrieval requests from users through a client program; currently, the most common client is a browser, although it can also be a much simpler network application developed by the user. A retrieval request entered by a user is generally one keyword, or several keywords connected by logical operators. The search server converts the query keywords into word IDs according to the system's keyword dictionary, looks them up in the index database (inverted files) to obtain the corresponding docID lists, scans the documents in the docID lists to match the word IDs, and extracts the qualifying web pages. It then computes the relevance between each page and the keywords, and returns the top K results (the number of results per page differs between search engines) to the user in order of relevance, as shown in process 1.
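This lookup-and-rank step can be illustrated with a small sketch (a simplified illustration, not any engine's actual implementation; the dictionary, inverted index, and scoring function below are invented for the example):

```python
from collections import defaultdict

# Toy keyword dictionary and inverted index (word_id -> {doc_id: term_frequency}).
dictionary = {"search": 0, "engine": 1, "index": 2}
inverted_index = {
    0: {10: 3, 11: 1},
    1: {10: 2, 12: 4},
    2: {11: 5},
}

def query(keywords, k=10):
    """Convert keywords to word IDs, merge their posting lists, and return the top-K docs."""
    scores = defaultdict(float)
    for word in keywords:
        word_id = dictionary.get(word)
        if word_id is None:          # keyword not in the dictionary
            continue
        for doc_id, tf in inverted_index.get(word_id, {}).items():
            scores[doc_id] += tf     # crude relevance: sum of term frequencies
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

print(query(["search", "engine"]))   # [(10, 5), (12, 4), (11, 1)]
```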
Figure 2 describes the general search engine system architecture, including the page collector, indexer, searcher, and index files. The following sections describe how the main functions are implemented.
Figure 2 Relationship between the components of a search engine
Figure 3 Web page crawling process of the search engine
I. Collector
The collector is a program, often called a robot or spider, that roams the Internet to discover and collect information. It collects diverse types of content, including HTML pages, XML documents, newsgroup articles, FTP files, word-processing documents, and multimedia information. The collector is a computer program that uses distributed and parallel processing techniques to improve the efficiency of information discovery and updating. The collectors of commercial search engines can gather millions of web pages or more every day. A collector generally needs to run continuously and gather as many types of new information as possible, as quickly as possible. Because information on the Internet is updated rapidly, previously collected information must also be refreshed regularly to avoid dead links and invalid links. In addition, because Web information changes dynamically, the collector, analyzer, and indexer must update the database periodically; the update cycle is usually on the order of weeks or months. The larger the index database, the more difficult it is to keep up to date.
There is too much information on the Internet for even a powerful collector to gather all of it. Therefore, the collector traverses the Internet and downloads documents according to a certain search policy; for example, collectors generally use a policy dominated by breadth-first search and supplemented by linear search.
When the collector is implemented, the system maintains a hyperlink queue or stack containing a set of starting URLs (for example, well-known directories such as DMOZ, the Yahoo directory, or Google Sitemaps). The collector downloads the corresponding pages from these URLs and extracts the new hyperlinks found in them into the queue or stack. This process repeats until the queue or stack is empty. To improve efficiency, the search engine partitions the web space by domain name, IP address, or country-code domain and uses multiple collectors working in parallel, so that each collector is responsible for searching one subspace. To facilitate future expansion of the service, the collector should also be able to change its search range.
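A minimal sketch of this queue-driven collection loop is shown below (simplified, assumed logic; the `fetch` and `extract_links` helpers are hypothetical stand-ins for real downloading and HTML-parsing code, and there is no error handling, robots.txt support, or politeness delay):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch(url):
    """Download a page and return its HTML (greatly simplified)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="ignore")

def extract_links(base_url, html):
    """Pull href targets out of the page and resolve them against the page URL."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)       # the hyperlink queue seeded with starting URLs
    seen = set(seed_urls)
    collected = {}
    while frontier and len(collected) < max_pages:
        url = frontier.popleft()      # FIFO order gives breadth-first collection
        html = fetch(url)
        collected[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected

# Example usage: crawl(["https://example.com/"], max_pages=5)
```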
1. Linear search strategy
The basic idea of linear search is to start from a starting IP address and search the information at each subsequent IP address by incrementing the address, without regard to the hyperlinks in each site's HTML files that point to other web sites. This policy is not suitable for large-scale searches (mainly because IP addresses may be assigned dynamically), but it can be used for comprehensive searches over a small range; a collector using this policy can find sources of new HTML files that are rarely or never referenced by other HTML files, as the small sketch below illustrates.
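A toy illustration of scanning an address range by incrementing the IP (purely illustrative; the start address, range size, and `probe` helper are invented for the example):

```python
import ipaddress

def probe(ip):
    """Placeholder for attempting an HTTP request to the host at this address."""
    print(f"would probe http://{ip}/")

start = ipaddress.IPv4Address("192.0.2.1")   # documentation range, chosen arbitrarily
for offset in range(16):                     # linear search: just increment the address
    probe(start + offset)
```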
2. Depth-first collection policy
The depth-first collection policy was used by many early collector developers, with the goal of reaching the leaf nodes of the structure being searched. Depth-first search follows a hyperlink on an HTML file, then a hyperlink on the resulting page, and so on; when a page has no further hyperlinks to follow, it returns to the previous HTML file and continues with that file's other hyperlinks. The search ends when no unvisited hyperlinks remain. Depth-first search is suitable for traversing a specific site or a deeply nested set of HTML files, but for large-scale searches it may descend far down a branch and never finish, because the Web structure is very deep.
3. Breadth-first collection strategy
The breadth-first collection strategy searches all the content in one layer first, and then moves on to the next layer. If an HTML file contains three hyperlinks, the collector selects one of them and processes the corresponding HTML file (here, "processing" means retrieving the file content), then returns, selects the second hyperlink of the first page, processes the corresponding HTML file, and returns again. Only after all hyperlinks at the same layer have been processed does the collector start searching the hyperlinks in the HTML files it has just processed; this is what defines breadth-first linking.
This ensures that shallow pages are processed first, so that when an endlessly deep branch is encountered, the collector does not get trapped in it. Breadth-first collection is easy to implement and widely used, but it takes a long time to reach deeply nested HTML files. The sketch below contrasts the two traversal orders.
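The difference between breadth-first and depth-first order comes down to whether the frontier behaves as a queue or as a stack, as this small sketch (with a made-up link graph) illustrates:

```python
from collections import deque

# A made-up link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def traverse(seed, breadth_first=True):
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        # Queue (popleft) gives breadth-first order; stack (pop) gives depth-first order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(traverse("A", breadth_first=True))    # ['A', 'B', 'C', 'D', 'E']
print(traverse("A", breadth_first=False))   # ['A', 'C', 'E', 'B', 'D']
```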
4. User-submission collection policy
Some web pages can be collected through user submission. For example, some commercial websites submit an application to the search engine requesting to be indexed; the collector can then collect the web page information of the submitted site directly and add it to the search engine's index database.
II. Analyzer
The web page information gathered by the collector, or other downloaded documents, must first be analyzed before it can be indexed. Document analysis techniques generally include word segmentation (some engines extract words only from certain parts of a document, as AltaVista does), filtering (using a stopword list, or stoplist), and conversion (some engines normalize entries, for example converting between singular and plural forms, removing suffixes, and mapping synonyms). These techniques are often closely tied to the specific language and to the system's index model.
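A minimal sketch of such an analysis pipeline (the stopword list and the crude suffix-stripping rule are invented for the example; real systems use language-specific segmenters and stemmers):

```python
import re

STOPLIST = {"the", "a", "an", "of", "and", "to", "in"}   # tiny illustrative stoplist

def analyze(text):
    """Segment, filter, and normalize a document into index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # segmentation (English: split on non-alphanumerics)
    tokens = [t for t in tokens if t not in STOPLIST]     # filtering with the stoplist
    normalized = []
    for t in tokens:                                       # conversion: strip a couple of common suffixes
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        normalized.append(t)
    return normalized

print(analyze("The engines of the search engine index pages"))
# ['engin', 'search', 'engine', 'index', 'pag']
```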
III. Indexer
The indexer analyzes and processes the information gathered by the collector, extracts index items from it to represent each document, and generates the index table of the document library. There are two types of index items: metadata index items and content index items. For example, the author name, URL, update time, encoding, length, and link popularity are metadata index items. Content index items reflect the content of a document, such as keywords and their weights, phrases, and single words. Content index items can be divided into single index items and multiple index items (also called phrase index items). For English, a single index item is a word, which is easy to extract because there are natural separators (spaces) between words; in Chinese and other languages written without separators, the words must first be segmented. In a search engine, each single index item is usually assigned a weight that indicates how well the index item distinguishes the document, and this weight is also used to calculate the relevance of query results. Statistical, information-theoretic, and probabilistic methods are generally used for this. The methods for extracting phrase index items include statistical, probabilistic, and linguistic methods.
To find specific information quickly, building an index database is the usual approach: documents are converted into a representation convenient for retrieval and stored in the index database. The format of the index database is a special data storage format that depends on the index mechanism and algorithm. Index quality is one of the key factors in the success of a Web information retrieval system. A good index model should be easy to implement and maintain, fast to search, and economical in space. Search engines generally adopt the index models of traditional information retrieval, including inverted files, the vector space model, and probabilistic models. For example, in the vector space index model, each document d is represented as a normalized vector V(d) = (t1, w1(d); ...; ti, wi(d); ...; tn, wn(d)), where ti is an index term and wi(d) is the weight of ti in d, generally defined as a function of the frequency tfi(d) with which ti occurs in d.
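A small sketch of building such document vectors, using tf-idf as one possible choice of weight function (the documents and the exact weighting formula here are only an illustration, not the model prescribed by the text):

```python
import math
from collections import Counter

docs = {
    "d1": ["search", "engine", "index", "search"],
    "d2": ["engine", "crawler", "index"],
}

def document_vectors(docs):
    """Represent each document as {term: weight}, with weight = tf * idf, then L2-normalize."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))   # document frequency
    vectors = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        vec = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}   # w_i(d) as a function of tf_i(d)
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[name] = {t: w / norm for t, w in vec.items()}    # normalized vector V(d)
    return vectors

print(document_vectors(docs)["d1"])
```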
The output of the indexer is the index table, which generally uses an inverted-list organization, that is, each index item points to the corresponding documents. The index table may also record the positions where an index item appears in a document, so that the searcher can compute the adjacency or proximity relationships between index items. The indexer can use a centralized or a distributed indexing algorithm. When the volume of data is large, instant indexing must be implemented; otherwise the system cannot keep up with the dramatic growth in the amount of information. The indexing algorithm has a great impact on the performance of the engine (for example, the response speed under large-scale peak query loads). The effectiveness of a search engine depends to a large extent on the quality of its index.
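A sketch of an inverted list that also records term positions, so proximity between index items can be computed (the documents are made up for the illustration):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} from tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term][doc_id].append(pos)
    return index

docs = {
    "d1": ["search", "engine", "index", "search", "engine"],
    "d2": ["engine", "crawler"],
}
index = build_positional_index(docs)

# Proximity check: are "search" and "engine" adjacent somewhere in d1?
adjacent = any(p + 1 in index["engine"]["d1"] for p in index["search"]["d1"])
print(dict(index["engine"]), adjacent)   # {'d1': [1, 4], 'd2': [0]} True
```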
IV. Searcher
The function of the searcher is to quickly look up documents in the index database according to the user's query, evaluate the relevance between documents and the query, sort the results to be output, and implement some form of user relevance feedback. The information retrieval models commonly used by the searcher include set-theoretic models, algebraic models, probabilistic models, and hybrid models. Users can query for any word in the text, whether it appears in the title or the body.
The searcher finds the documents relevant to the user's query request in the index, processing the query in much the same way as the analyzer and indexer process documents. For example, in the vector space index model, the user's query q is first expressed as a vector V(q) = (t1, w1(q); ...; ti, wi(q); ...; tn, wn(q)), and the relevance between the query and each document in the index database is then computed by some method. The relevance can be expressed as the cosine of the angle between the query vector V(q) and the document vector V(d); in practice, factors such as the file content and the quality and number of links to the page may also be taken into account. All documents whose relevance exceeds a threshold are arranged in descending order of relevance and returned to the user. Of course, the search engine's judgment of relevance does not necessarily match the user's actual needs.
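A minimal sketch of the cosine-similarity step, reusing document vectors of the kind shown above (the threshold and data are invented for the example):

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

doc_vectors = {
    "d1": {"search": 0.8, "engine": 0.6},
    "d2": {"crawler": 0.9, "engine": 0.4},
}
query_vector = {"search": 1.0, "engine": 1.0}

threshold = 0.3
results = sorted(
    ((d, cosine(query_vector, v)) for d, v in doc_vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
# Keep only documents whose relevance exceeds the threshold, in descending order.
print([(d, round(s, 3)) for d, s in results if s > threshold])
```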
V. User Interfaces
The function of the user interface is to provide users with a visual interface for query input and result output, making it convenient to enter query conditions, display query results, and give relevance feedback. Its main purpose is to make the search engine easy to use, so that users can obtain useful information from it efficiently and in a variety of ways. The design and implementation of the user interface should be based on the theory and methods of human-computer interaction, so that it suits human thinking and usage habits.
On the query page, users set the terms to be searched and various simple or advanced search conditions using the search engine's query syntax. A simple interface provides only a text box for entering the query string; a complex interface lets users constrain the query, for example with logical operators (AND, OR, NOT), proximity relations (adjacent, NEAR), domain-name ranges (such as .edu, .com), location within the page (such as title or body), time information, length information, and so on. At present, some companies and organizations are considering developing standards for query options.
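A sketch of how logical operators in such a query syntax might be evaluated against an inverted index using set operations (the index and this tiny two-term evaluator are invented for the example):

```python
# Toy inverted index: term -> set of doc IDs containing the term.
index = {
    "search": {1, 2, 3},
    "engine": {2, 3, 4},
    "crawler": {3, 5},
}

def evaluate(term_a, op, term_b):
    """Evaluate a single binary query 'term_a OP term_b' with AND / OR / NOT."""
    a, b = index.get(term_a, set()), index.get(term_b, set())
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "NOT":                 # documents containing term_a but not term_b
        return a - b
    raise ValueError(f"unsupported operator: {op}")

print(evaluate("search", "AND", "engine"))   # {2, 3}
print(evaluate("search", "NOT", "crawler"))  # {1, 2}
```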
On the query output page, the search engine displays the search results as a linear list of documents containing the document title, abstract, snapshot, and hyperlink information. Because relevant and irrelevant documents are mixed together in the results, the user has to browse them one by one to find the documents actually needed.