This is a brief summary of what I learned while studying and developing search engines in 2004. The article outlines the main modules of a search engine, namely the spider, the word cutter, the indexer, and the query device, and gives some detail on each. I hope it can be of some help to beginners in search engines, and perhaps offer a little inspiration to others as well. It may be relatively superficial; I am still doing research in this area and have reached some newer conclusions since writing it, which I will write up and share when I have time.
(Some figures are not shown because they could not be copied out of the Word document.)
The Word document can be downloaded here: http://www.vchelp.net/itbookreview/view_paper.asp?paper_id=1539
Contents
1. Search engine overview
   Development history of search engines
   Search engine classification
   Search engine composition and working principle
2. Network spider
   Overview
   Main composition
   Key technologies
   Experience summary
3. Word cutter
   Overview
   Word-cutting principle
   Experience summary
4. Indexer
   Overview
   Implementation principle
   Experience summary
5. Query device
   Overview
   Implementation principle
   Experience summary
6. Analysis of the system's key points
7. References
1. Search engine overview

Development history of search engines

In the early days of the Internet there were relatively few websites, and finding information was easy. With the explosive growth of the Internet, however, finding the information one needed became like looking for a needle in a haystack, and professional search sites emerged to meet the public's need for information retrieval.

The ancestor of modern search engines is Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal. Although the World Wide Web did not yet exist, files were transferred over the network quite frequently, and because large numbers of files were scattered across many separate FTP hosts, finding them was very inconvenient. Alan Emtage therefore set out to build a system that could locate files by name, and the result was Archie. Archie worked much like today's search engines: it relied on a script program to automatically gather the names of files available online, then indexed that information so users could query it with simple expressions.

Inspired by Archie's popularity, the University of Nevada's System Computing Services group developed a very similar search tool in 1993, which could retrieve web pages in addition to indexing files. At the time, the word "robot" was very popular among programmers: a computer "robot" is a software program that can perform a task without interruption, at a speed humans cannot match. Because a robot program dedicated to retrieving information crawls across the network like a spider, the search engine's robot program came to be called a "spider" program.

The world's first robot program for monitoring the growth of the Internet was the World Wide Web Wanderer, developed by Matthew Gray. At first it was used only to count the number of servers on the Internet; later it was extended to retrieve site domain names. In contrast to the Wanderer, Martin Koster created Aliweb in October 1993, an HTTP counterpart of Archie. Aliweb did not use a robot; instead it relied on websites voluntarily submitting their own information to build its link index, similar in spirit to what we now know as Yahoo.

As the Internet grew rapidly, it became harder and harder to retrieve all the new web pages, so some programmers improved on the working principle of the traditional spider program, building on Matthew Gray's Wanderer. The idea was that since every web page may contain links to other sites, one can crawl the entire Internet by starting from one website and following its links. By the end of 1993 several search engines based on this principle had appeared, of which JumpStation, the World Wide Web Worm (the predecessor of Goto, today's Overture), and the Repository-Based Software Engineering (RBSE) spider were the best known. JumpStation and the WWW Worm, however, simply listed results in the order in which the search tool found matching information in its database, so the results carried no notion of relevance. RBSE was the first engine to introduce keyword-string matching into the ranking of search results.

The earliest search engine in the modern sense appeared in July 1994, when Michael Mauldin connected John Leavitt's spider program to his indexing program, creating Lycos, which everyone knows today.
In April of the same year, two PhD candidates at Stanford University, David Filo and the Chinese-American Jerry Yang, co-founded the super directory index Yahoo, and succeeded in making the concept of the search engine popular. From then on, search engines entered a period of rapid development. Today the well-known search engines on the Internet number in the hundreds, and the volume of information they index is far beyond what it used to be. Google, for example, which has recently been in the limelight, has a database of up to 3 billion web pages, and Baidu indexes more than 600 million pages. With the rapid expansion of the Internet, a single search engine can no longer keep up with the market on its own, so search engines have begun to divide the work and cooperate, and specialized providers of search technology and search databases have appeared. Inktomi abroad (since acquired by Yahoo), for example, is not a search engine that faces users directly; instead it provides full-text web search services to other engines, including Overture (formerly Goto, also acquired by Yahoo), LookSmart, MSN, and HotBot. Baidu in China also belonged to this category, and Sohu and Sina used its technology. In this sense, they are the search engines behind the search engines.

When search engines are mentioned today, people usually think of Google, Baidu, Yahoo, Sohu, and so on. So what exactly is a search engine? A search engine is a kind of web retrieval tool that lets people perform full-text searches on the Internet using keywords.

Search engine classification

Search engines fall into three main categories: full-text search engines, directory indexes (search indexes/directories), and meta search engines. The full-text search engine is the most widely used; generally speaking, "search engine" refers to the full-text kind.

Full-text search engine

A full-text search engine is a search engine in the true sense. Representative engines abroad include Google, Fast/AllTheWeb, AltaVista, Inktomi, Teoma, and WiseNut; well-known domestic ones include Baidu and China Search. They retrieve, from a database built from information extracted from websites on the Internet (mainly web page text), the records that match the user's query, and then return the results to the user in a certain order, so they are real search engines. Judged by the source of their results, full-text engines can be subdivided into two kinds. One kind has its own retrieval program, commonly known as a "spider" or "robot" program, and builds its own web database, so its results are drawn directly from its own database; the seven engines mentioned above are of this kind. The other kind rents the database of another engine and presents the results in its own customized format, as the Lycos engine does.

Directory index

Although a directory index offers a search function, strictly speaking it is not a real search engine but merely a list of website links organized by category. Users can find the information they need by browsing the category directory without entering any keywords at all. The most representative directory index is the famous Yahoo; other notable ones are the Open Directory Project (DMOZ), LookSmart, and About. The domestic Sohu, Sina, and NetEase directories also belong to this category.

Meta search engine

When a meta search engine receives a user's query, it searches several other engines at the same time and returns the combined results to the user.
Well-known meta search engines include InfoSpace, Dogpile, and Vivisimo, and there are representative Chinese meta search engines as well. In how results are ranked, some meta engines arrange the results directly by source engine, such as Dogpile, while others re-rank the results according to their own rules, such as Vivisimo.

Besides the three main categories above, there are several non-mainstream forms:

1. Integrated search engines: for example, the engine HotBot launched at the end of 2002. It resembles a meta search engine, but instead of querying multiple engines at the same time, it lets the user choose among the four engines it provides, so it is more accurately called an "integrated" search engine.
2. Portal search engines: for example, AOL Search and MSN Search. Although they provide search services, they have neither a category directory nor a web database of their own; their search results come entirely from other engines.
3. Free-For-All link lists (FFA): such sites generally just list link entries in scrolling form, sometimes with a few simple categories, but they are much smaller in scale than a directory index such as Yahoo.

Because all of the above sites provide users with search and query services, for convenience we usually refer to them all as search engines.

Search engine composition and working principle

A search engine system generally consists of a spider (also called a web crawler), a word cutter, an indexer, and a query device. The spider is responsible for crawling web page information; it normally works together with the word cutter and the indexer, which cut the crawled page content into words and index it automatically, building the index database. Based on the user's query, the query device searches the index database, sorts and combines the matched results (for example, union and intersection operations), then extracts a short summary of each page from the document database and returns it to the user.

Functionally, the Google search engine is divided into three parts: web crawling, indexing into the database, and user query. The web-crawling part is responsible for fetching pages and consists of the URL server, the crawler, the store server, the parser, and the URL resolver, with the crawler at its core. The indexing part is responsible for analyzing page content, tagging documents, and storing them in the database; it consists of the indexer and the classifier, involves a large number of files and data, and the operations on the buckets are at its core. The user-query part analyzes the search expression entered by the user, matches the relevant documents, and returns the results to the user; it consists of the query device and the page-rank evaluator, with page-rank computation at its core. The overall system structure is shown in the figure below.

The main workflow of a search engine is as follows. It starts with the spider: at a fixed interval (for Google, generally every 28 days) the spider program starts automatically and reads the list of page URLs from the URL server, then crawls the site each URL designates, using a depth-first or breadth-first algorithm. Each crawled page is assigned a unique document ID (docID) and stored in the document database, usually compressed before being written, and the hyperlinks found on the page are saved back to the URL server. Meanwhile, the word cutter and indexer process the crawled documents, computing a weight for each word according to its position and frequency in the page, and writing the results of word cutting into the index database. After the whole crawl and the indexing work are finished, the index database and document database are updated as a whole, so that users can query the newest page information. The query device first cuts the user's input into words, retrieves all records containing the query terms, ranks the matched records by computing page weights and page rank, and then extracts summary information from the document database and returns it to the user.
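To make the cut-word / index / query flow above more concrete, here is a minimal sketch in Python. It is not code from any real engine; the function names (cut_words, index_document, term_weight, search) and the toy weighting formula are assumptions made purely for illustration of the data flow described above.

from collections import defaultdict

document_db = {}                 # docID -> page text (the "document database")
index_db = defaultdict(dict)     # term -> {docID: [positions]} (the "index database")

def cut_words(text):
    # Stand-in for the word cutter: naive lower-cased whitespace split.
    return text.lower().split()

def index_document(doc_id, text):
    # Indexer: record every term's positions inside the document.
    document_db[doc_id] = text
    for pos, term in enumerate(cut_words(text)):
        index_db[term].setdefault(doc_id, []).append(pos)

def term_weight(term, doc_id):
    # Toy weight: occurrence count plus a small bonus if the term appears early,
    # mimicking "weight by frequency and position" described above.
    positions = index_db[term][doc_id]
    return len(positions) + (0.5 if positions[0] < 10 else 0.0)

def search(query_text):
    # Query device: cut the query into words, intersect the posting lists,
    # rank the matches by total term weight, and return short summaries.
    terms = cut_words(query_text)
    postings = [set(index_db.get(t, {})) for t in terms]
    matches = set.intersection(*postings) if postings else set()
    ranked = sorted(matches, key=lambda d: sum(term_weight(t, d) for t in terms), reverse=True)
    return [(d, document_db[d][:60]) for d in ranked]    # (docID, summary)

index_document(1, "search engine spider crawls the web and builds an index")
index_document(2, "the spider reads a web page and extracts hyperlinks")
print(search("spider web"))

In a real engine the index and document databases live on disk and the weighting is far more elaborate (page rank, anchor text, and so on), but the overall data flow is the same as described above.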
2. Network spider

Overview

A spider (that is, a web spider) is in fact a web application based on the HTTP protocol. A web spider finds web pages by following the links in pages: starting from one page of a site (usually the home page), it reads the page's content, extracts the other hyperlinks it contains, and then follows those links to find further pages, continuing this cycle until every page of the site has been crawled. When crawling, web spiders generally use one of two strategies: breadth-first or depth-first. Breadth-first means the spider first crawls all pages linked from the start page, then selects one of those linked pages and crawls all pages linked from it, and so on. This is the most common approach, because it lets the spider process pages in parallel and thus crawl faster. Depth-first means the spider starts from the start page and follows one chain of links to its end before moving on to the next start page and following its links. An advantage of this approach is that the spider is easier to design.

Main composition

According to the crawl process, a spider consists of three main functional modules. The first is the page-reading module, which reads the content of pages on remote web servers. The second is the hyperlink-analysis module, which analyzes the hyperlinks in a page, extracts all of them, and puts them into the list of URLs to crawl. The third is the content-analysis module, which analyzes the page content and strips out all hyperlinks, keeping only the text.

The spider's main workflow is as follows. First the spider reads the list of site URLs to crawl, takes out one site URL, and puts it into the list of unvisited URLs (the UVURL list). As long as the UVURL list is not empty, it takes out a URL and checks whether it has already been visited; if not, it reads that page, performs hyperlink analysis and content analysis, stores the page into the document database, and puts the URL into the visited URL list (the VURL list). When the UVURL list is empty it moves on to crawl the next site, looping until every site in the URL list has been crawled. A minimal code sketch of this loop is given after the flowchart below.
[Figure: spider main workflow flowchart - read the site URL list; check whether the site URL list is empty; put the URL into the UVURL list; deposit the page into the document library; delete the URL and add it to the VURL list]
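As a companion to the workflow and flowchart above, here is a minimal breadth-first crawling sketch, again in Python and again purely illustrative: the UVURL list is modeled as a FIFO queue and the VURL list as a set, and fetch_page and extract_links are hypothetical helpers standing in for the page-reading and hyperlink-analysis modules (content analysis, i.e. stripping markup to keep only text, is omitted for brevity).

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch_page(url):
    # Page-reading module: download the raw HTML of one page.
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="ignore")

def extract_links(base_url, html):
    # Hyperlink-analysis module: pull href targets out of the page.
    # (A real spider would parse the HTML properly; a regex is enough for a sketch.)
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl_site(start_url, max_pages=100):
    # Breadth-first crawl: UVURL is a FIFO queue of unvisited URLs,
    # VURL is the set of visited URLs, documents is the document library.
    uvurl = deque([start_url])
    vurl = set()
    documents = {}
    while uvurl and len(documents) < max_pages:
        url = uvurl.popleft()
        if url in vurl:                       # already visited: skip it
            continue
        vurl.add(url)
        try:
            html = fetch_page(url)
        except OSError:
            continue                          # unreachable page: move on
        documents[len(documents) + 1] = html  # assign a docID and store the page
        for link in extract_links(url, html):
            if link not in vurl:
                uvurl.append(link)            # queue newly found links for crawling
    return documents

Replacing uvurl.popleft() with uvurl.pop() would turn this into the depth-first strategy mentioned above, since the list of unvisited URLs would then behave as a stack instead of a queue.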