Design and Implementation Analysis of a Web Search Engine

1. Introduction

With the rapid development of the Internet, people rely more and more on the network to find the information they need. However, because information sources on the Internet are so numerous and scattered, we face what is often called the problem of "rich data, poor information." How to find the information we need effectively has therefore become a key issue, and search engines were born to solve it.

There are now many search engines on the Web; well-known ones include AltaVista, Yahoo, InfoSeek, MetaCrawler, SavvySearch, and so on. Many search engines have also been established in China, such as Sohu, Sina and Polaris. Because they have not existed for long, their recall and precision still leave room for improvement.

AltaVista is a fast search engine; its powerful hardware configuration allows it to handle complex queries. It is mainly keyword based and crawls both the Web and Usenet. It supports the Boolean operators "and", "or" and "not", as well as the proximity operator "NEAR", and it allows wildcard and "backward" searches (for example, you can find all Web pages that link to a given page). You can decide whether to add weights to the search phrases and whether to match them in any part of the document. The advantage of phrase queries over simple word queries is obvious: suppose we want to find the phrase "to be or not to be". If we break it into individual words, every word is a stop word and the query returns nothing; if we treat it as a whole phrase, it easily returns relevant results, such as pages about Hamlet or Shakespeare. The system scores each result page according to how many of the search phrases it contains, where they appear in the document, and the distance between them within the document. The returned results can also be translated into other languages.

Excite is called an "intelligent" search engine because it builds a concept-based index. Its so-called "intelligence" is, of course, a flexible application of probability statistics. It can index by concept and by keyword at the same time, and it indexes the Web, Usenet and classified ads. Boolean operators such as "and", "or" and "not" are supported, and the symbols "+" and "-" can also be used. Its disadvantage is that the size and format of each page are not reported in the returned results.

InfoSeek is a simple but powerful index. One of its advantages is a scalable taxonomy for topic-oriented search: your search phrases can be cross-referenced with the subject phrases of similar categories, and those subject phrases are automatically added to your query, making the search more topically relevant. It also supports image queries. It crawls the Web, Usenet, Usenet FAQs, and so on. Boolean operators are not supported, but the symbols "+" and "-" (equivalent to "and" and "not") can be used.

Yahoo cannot really be called a search engine; rather, it provides a layered subject index that lets you move from a general topic to a specific one, backed by an effective organization and classification of Web pages. For example, suppose you want to create a Web page but do not know how. To find relevant information on Yahoo, you first choose a top-level topic, "Computers and the Internet"; under it you find subtopics such as Web page authoring, CGI programming, Java, HTML and Web design; selecting the relevant subtopic eventually gives you links to all the pages associated with it. In other words, if you already know which topic to look under, this directory-style lookup is more accurate than a general search engine. You can search Yahoo's index, but you are not actually searching the whole Web. Yahoo does offer the option of forwarding the query to another search engine, AltaVista. Be aware, though, that Yahoo classifies and organizes only a small portion of the Web, so its coverage is limited.

The basic principle of a search engine is that a network robot crawls Web pages periodically, discovers new pages and brings them back to a local database; user queries are then answered by searching this local database. Yahoo, for example, finds about 5 million new pages a day.

Search engines are generally implemented in one of two ways. The first is to index Web pages by hand: Yahoo's pages, for example, are classified manually. Its disadvantages are that Web coverage is relatively low and the freshest information cannot be guaranteed, and query matching is done against the keywords supplied by the user and the description and title of each page rather than against the full text. The second is automatic indexing of Web pages; AltaVista, for instance, is implemented entirely through automatic indexing. This makes automatic document classification possible, essentially by applying information extraction techniques, although the classification accuracy may not match that of manual classification.

A search engine generally has a robot that visits sites periodically to check for changes while also looking for new sites. A site usually has a robots.txt file describing the areas the server does not want the robot to visit, and the robot must obey these rules. With automatic indexing, once the robot has fetched a page it must index the page according to its content and, based on its keywords, place it in some category. Page information is saved as metadata; typical metadata includes the title, the IP address, a brief description of the page, keywords or index phrases, the file size and the date of the last update. Although there are standards for metadata, many sites use their own templates. The document extraction mechanism and the indexing strategy have a great influence on the effectiveness of a Web search engine. Advanced search options generally include Boolean operators, phrase matching and natural language processing. Query results are ranked according to the extraction mechanism and presented to the user in order of relevance, with the most relevant first; the metadata of each extracted document, including the URL where it resides, is displayed to the user.

There are also a number of specialized engines devoted to a single topic; they only search and process content on that topic, so their recall and precision are relatively high.

At the same time, there is a class of search engines that do not use a robot to collect Web pages periodically. SavvySearch and MetaCrawler, for example, work by sending the query to several search engines simultaneously and returning the combined results to the user. SavvySearch can in fact analyze and compare the capabilities of the various search engines and submit each user query to the engines best suited to handle it; users can also specify which search engines to use.

An excellent search engine must deal with the following issues: (1) Web page classification; (2) natural language processing; (3) scheduling and coordination of the search strategy; (4) searches targeted at specific users. Many search engines therefore apply artificial intelligence techniques, to varying degrees, to solve these problems.

2. Implementation of the network spider

There are many articles introducing and analyzing Web search engines, but very few describe an implementation in detail. Here we introduce the implementation of a Web engine with basic functionality. This article describes, in the form of C++-style class definitions, how the engine collects Web pages and stores them in a database, and then how it queries the database with the keywords entered by the user and returns the related pages.

2.1 Database structure

First, we build database tables to store the pages we fetch. In general the following tables are needed:

1. A dictionary table, which records the meaningful words contained in a document and the frequency with which they appear.

This table (WORDDICTIONARYTBL) consists of three main fields and stores the words related to each Web page:

URL_ID   unique ID number of each URL
WORD     the stemmed word found in the page at this URL
INTAG    the number of occurrences of the word in the page

2. A table (URLTBL) storing the information about each URL. Its main fields are:

REC_ID            unique ID number of each record
STATUS            status of fetching the URL's content; for example, HTTP_STATUS_TIMEOUT means the download exceeded the maximum allowed time
URL               the URL string itself
CONTENT_TYPE      content type of the document
LAST_MODIFIED     time of the most recent modification
TITLE             title of the URL
DOCSIZE           file size of the URL
LAST_INDEX_TIME   time of the last indexing
NEXT_INDEX_TIME   time of the next indexing
TAG               label indicating the page type: text, HTML, picture, and so on
HOPS              number of failed attempts to fetch the file
KEYWORDS          keywords associated with the page
DESCRIPTION       description of the page content
LANG              language used by the document

3. Many words in Web pages are prepositions, particles or other very common words that carry little meaning by themselves, for example "about", "in", "at", "we", "this" in English, and their counterparts in Chinese. We collectively call them stop words, and we create a table (STOPWORDTBL) to hold them. It has two main fields:

WORD   char(32)   the stop word itself
LANG   char(2)    the language it belongs to

4. We also build a table for robot information. As mentioned earlier, Web sites generally have a robots.txt file describing which parts of the site a network robot may access. The table (ROBOTTBL) has the following main fields:

HOSTINFO   host information of the Web site
PATH       directories the robot is not allowed to visit

5. A table (FORBIDDENWWWTBL) of the pages we need to screen out (for example, sites that are objectionable or unnecessary to search); its main field is the URL of the page.

6. In addition, we create a table (FILETYPETBL) of the file types we want to fetch. For a simple Web search engine, for example, we may only need files with the suffixes .html, .htm, .shtml and .txt, and we simply ignore everything else. Its main fields are the file type and a description.

The contents of the stop-word table come from statistics we compile for each language, collecting the words that carry little meaning. The tables for document words, URLs and robot rules are filled in dynamically as Web pages are fetched.
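As a rough illustration (not part of the original design, and with column types that are our own assumptions), the definitions of two of the tables above might look like the following SQL, kept as C++ string constants so they can be executed through whatever database interface the engine uses:

// Sketch of possible DDL for two of the tables described above.
// Column names follow the field lists in the text; the types are assumptions.
static const char *kCreateWordDictionaryTbl =
    "CREATE TABLE WORDDICTIONARYTBL ("
    "  URL_ID INTEGER,"      /* unique ID of the URL the word came from */
    "  WORD   VARCHAR(32),"  /* stemmed word extracted from the page    */
    "  INTAG  INTEGER"       /* number of occurrences in the page       */
    ")";

static const char *kCreateStopWordTbl =
    "CREATE TABLE STOPWORDTBL ("
    "  WORD CHAR(32),"       /* the stop word itself */
    "  LANG CHAR(2)"         /* language code        */
    ")";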

2.2 Description of the Web page acquisition algorithm

The steps for fetching Web pages are as follows:

We can set the maximum number of threads the search program may open; these threads then crawl the Web in parallel. Based on the information already in the database, they determine which pages need to be updated (deciding which pages need updating is itself a worthwhile research topic; there are many heuristic and intelligent algorithms, mostly modeled on statistical rules, but the simplest approach is to re-fetch pages that were last crawled before a set time). The crawler then checks whether each of those pages appears in the table of screened sites; if so, the corresponding URL record is deleted. Otherwise, it connects to the corresponding WWW site to fetch the file specified by the URL (note that different URLs require different protocols: FTP sites use the FTP protocol, HTTP sites the HTTP protocol, news sites the NNTP protocol, and so on). In practice, we first fetch the header information of the page. If the page's latest modification time equals the time recorded at the last fetch, the page content has not changed, so we do not fetch its content again and only update its last-checked time to the current time. If the page has been modified, we download and analyze it: we extract the links it contains and add them to the appropriate tables, and we check whether the other files it references, such as text files, image files, sound files and other multimedia files, are files we need; if so, they are added to the corresponding tables as well. At the same time, all meaningful words in the page and their numbers of occurrences are extracted and stored in the corresponding tables. To describe this process better, we now look at the main objects and data structures involved. The objects are organized in three layers: the first layer corresponds to the WWW server, the second to each individual page, and the third to the full-text index of each page.
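To make the update decision above concrete, here is a minimal sketch (our own illustration, not code from the article) of the two checks it involves: whether a URL is due for re-indexing at all, and whether its body must actually be downloaded again.

#include <ctime>
#include <string>

// Simplified view of one URLTBL record, reduced to the fields these checks need.
struct UrlRecord {
    std::string url;
    std::time_t last_modified;    // last modification time stored in URLTBL
    std::time_t next_index_time;  // when the page is scheduled to be re-indexed
};

// True if the crawler should contact the server for this URL now.
bool DueForReindex(const UrlRecord &rec, std::time_t now) {
    return now >= rec.next_index_time;
}

// True if the body must be downloaded again; false means the page is unchanged,
// so only the timestamps in URLTBL need to be refreshed, as described above.
bool NeedsDownload(const UrlRecord &rec, std::time_t server_last_modified) {
    return server_last_modified != rec.last_modified;
}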

2.3 Main class objects related to the implementation and their functions

The following class corresponds to a Web site:

class CServer {
    // Main attributes:
    char *url;            // URL name of the WWW site
    char *proxy;          // name of the proxy used
    char *basic_auth;     // string used for basic HTTP authentication
    int  proxy_port;      // port number of the proxy
    int  period;          // re-indexing cycle
    int  net_errors;      // number of failed network connections
    int  max_net_errors;  // maximum number of network errors allowed
    int  read_timeout;    // maximum wait time allowed when downloading a file
    int  maxhops;         // maximum link depth the URL may be followed to
    int  userobots;       // whether to obey the conventions in robots.txt
    int  bodyweight;      // weight of words between <body> ... </body>
    int  titleweight;     // weight of words between <title> ... </title>
    int  urlweight;       // weight of words in the document's URL
    int  descweight;      // weight of words in <META name="Description" content="...">
    int  keywordweight;   // weight of words in <META name="Keywords" content="...">

    // Main methods:
    FindServer();            // find out whether the server exists and can be connected to
    FillDefaultAttribute();  // fill in the default attributes for all WWW servers
};

The member variables of this object hold the parameter settings for a particular site. We keep a default configuration for all sites, but special settings can be made for individual sites; these settings are given in a configuration file.
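As a small sketch of this idea (our own illustration, not the article's code; the field names and numeric values below are assumptions), the defaults could be kept in a plain structure that is copied for each site before any per-site overrides from the configuration file are applied:

// Global default settings shared by all WWW servers (values are made up).
struct ServerSettings {
    int period       = 7 * 24 * 3600;  // re-indexing cycle, in seconds
    int read_timeout = 30;             // download timeout, in seconds
    int maxhops      = 5;              // maximum link depth
    int userobots    = 1;              // obey robots.txt
    int titleweight  = 10;             // weight of words inside <title>
    int bodyweight   = 1;              // weight of words inside <body>
};

// Start every site from the defaults, then apply any per-site overrides
// that were read from the configuration file.
ServerSettings SettingsForSite(const ServerSettings &defaults, bool frequently_updated) {
    ServerSettings s = defaults;
    if (frequently_updated)
        s.period = 24 * 3600;  // hypothetical override: re-index such sites daily
    return s;
}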
The following are the main data members of the document structure:

class CNetDocument {
    // Main attributes:
    int  url_id;            // ID number of the URL
    int  status;            // status of fetching the document
    int  size;              // size of the document
    int  tag;               // label indicating whether the document is HTML, text or another type
    int  hops;              // number of URL hops
    char *url;              // URL of the document
    char *content_type;     // content type of the document
    char *last_modified;    // time of the last update
    char *title;            // title of the document
    char *last_index_time;  // time of the last indexing
    char *next_index_time;  // time of the next indexing
    char *keywords;         // keywords of the document
    char *description;      // description of the document

    // Main methods:
    FillDocInfo(...);         // get the document's information from the database
    AddHref(...);             // add a new link found in the page
    DeleteUrl(...);           // delete an existing URL record
    CanGetThisUrl(...);       // decide, according to the configuration, whether to fetch the page
    // The following three methods fetch a document using the protocol appropriate to its URL:
    NntpGet(...);
    FtpGet(...);
    HttpGet(...);
    ParseHead(...);           // for HTTP, analyze the header information
    ParseMainBody(...);       // analyze the body of the fetched document
    ServerResponseType(...);  // get the server's response message
    UpdateUrlDb(...);         // write the updated data into the database
};
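The choice among NntpGet, FtpGet and HttpGet depends only on the scheme part of the URL. A self-contained sketch of that dispatch (our own illustration; the enum and function names are not from the article) might look like this:

#include <string>

enum class FetchProtocol { Http, Ftp, Nntp, Unsupported };

// Map a URL's scheme to the protocol the crawler should use to fetch it.
FetchProtocol ProtocolForUrl(const std::string &url) {
    if (url.rfind("http://", 0) == 0)
        return FetchProtocol::Http;
    if (url.rfind("ftp://", 0) == 0)
        return FetchProtocol::Ftp;
    if (url.rfind("news://", 0) == 0 || url.rfind("nntp://", 0) == 0)
        return FetchProtocol::Nntp;
    return FetchProtocol::Unsupported;  // e.g. a mailto: link we simply ignore
}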

In fact, when we want to fetch a Web page, we create a CNetDocument object, then analyze the page and put the relevant content into the object's member variables. The following are the main data members of the structure used for full-text indexing of a page:
class CIndexer {
    // Main attributes:
    char *url;     // URL of the document being processed
    int  mwords;   // maximum number of words per page, set in advance
    int  nwords;   // actual number of words obtained
    int  swords;   // number of words already sorted
    WORD *word;    // array holding all the words
    char *buf;     // buffer allocated for the document

    // Main methods:
    InitIndexer(...);   // perform the initial settings and allocations
    ParseGetFile(...);  // build the full-text index of the fetched page
    AddWord(...);       // add an indexable word of the page to the word array
    IntoDb(...);        // store the page's full-text index information in the database
};

Before extracting a Web page, we create a CIndexer object, which is mainly used to build the full-text index of the page. In general we only build a full-text index for two content types, text/html and text/plain. The data structure WORD is as follows:
typedef struct word_struct {
    int  count;   // number of times the word appears
    int  code;    // normal form of the word: for example the word may be "encouraging",
                  // whose normal form is "encourage"; this is essentially stemming,
                  // keeping only the main part of the word
    char *word;   // the text of the word
} WORD;
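The stemming mentioned in the comment can be done in many ways. The sketch below (our own simplification, not the Porter algorithm and not the article's code) just strips a few common English suffixes, which is enough to show the idea of reducing a word to its main part:

#include <string>

// Very crude suffix-stripping stemmer, for illustration only.
std::string Stem(const std::string &word) {
    static const char *suffixes[] = {"ingly", "edly", "ing", "ed", "es", "s"};
    for (const char *suf : suffixes) {
        std::string s(suf);
        if (word.size() > s.size() + 2 &&
            word.compare(word.size() - s.size(), s.size(), s) == 0)
            return word.substr(0, word.size() - s.size());
    }
    return word;
}

// Example: Stem("encouraging") and Stem("encouraged") both give "encourag";
// a full stemmer such as Porter's would also map "encourage" to the same stem.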

The following structure holds the data related to the links found in a Web page:
typedef struct href_struct {
    char *href;    // the link itself
    int  hops;     // number of hops at which it was found
    int  stored;   // whether it has already been stored in the database
} HREF;


All updated and newly discovered URLs are placed in this structure and written to the database in a batch once their number exceeds a certain threshold.
The data structure for URLs is as follows:

typedef struct url {
    char *schema;    // protocol of the URL, such as HTTP, FTP, NNTP, etc.
    char *specific;  // host name plus path
    char *hostinfo;  // host name plus the protocol port
    char *hostname;  // name of the host
    char *path;      // path on the host
    char *filename;  // name of the file
    char *anchor;    // anchor part of the URL
    int  port;       // protocol port
} URL;

This structure describes the relevant properties of a URL. Note that the database stores only the description of each Web page, the index information for plain-text pages and the keywords of HTML pages; we do not store the actual content of the pages.
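To show how the fields of this structure relate to the text of a URL, here is a small parsing sketch (an illustration under our own assumptions, not the article's parser) that splits out the scheme, host, port and path; anchor, file name and error handling are omitted.

#include <string>

struct ParsedUrl {
    std::string schema;    // e.g. "http"
    std::string hostname;  // e.g. "www.example.com"
    int         port = 80; // assumed default port
    std::string path;      // e.g. "/docs/index.html"
};

// Parse "scheme://host[:port]/path" into its parts.
ParsedUrl ParseUrl(const std::string &url) {
    ParsedUrl out;
    std::string::size_type p = url.find("://");
    if (p == std::string::npos) return out;
    out.schema = url.substr(0, p);

    std::string rest = url.substr(p + 3);
    std::string::size_type slash = rest.find('/');
    std::string hostport = (slash == std::string::npos) ? rest : rest.substr(0, slash);
    out.path = (slash == std::string::npos) ? "/" : rest.substr(slash);

    std::string::size_type colon = hostport.find(':');
    if (colon == std::string::npos) {
        out.hostname = hostport;
    } else {
        out.hostname = hostport.substr(0, colon);
        out.port = std::stoi(hostport.substr(colon + 1));  // port is assumed numeric
    }
    return out;
}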
3. Implementation of user queries

We now analyze how a user-submitted query request is handled.

A user who wants information on a certain subject generally supplies several keywords related to that domain.

Let us look at the data structures and classes related to user queries.

Below is the basic structure pairing a word with its weight:

typedef struct word_weight_pair {
    char word[WORD_LEN];
    int  weight;
} WORD_WEIGHT_PAIR;


The following class is used to process and analyze a user's query:
class CUserQuery {
    char m_userquery[MAX_QUERYLEN];  // the user's query expression
    CPtrArray word_weight_col;       // dynamic array of WORD_WEIGHT_PAIR structures
    int m_maxreturnsum;              // maximum number of pages the user wants returned
    int search_mode;
    CObArray m_returndoc;            // dynamic array of CNetDocument objects
    NormalizeWord(char *oneword);    // normalize the word, i.e. stem it
    Find(char *odbcname);            // perform the database lookup and matching
};

The basic steps of the system implementation are as follows:

1. Analyze the query expression entered by the user. In fact, the documents collected earlier by the spider are represented by their keywords; each document can be represented as a collection of items of the form

<item> ::= <word or phrase name> <word or phrase weight>

which is just the vector space representation of the document.

We use the same vector space representation for the query expression entered by the user. We assume that the order of the user's keywords reflects their importance, so words appearing earlier receive a relatively higher weight. Treating each phrase or word as the smallest atom, we perform a STEM operation on all of them; as mentioned above, the word "encouraging", for example, is converted into the form "encourage". We then remove the stop words, such as "is", "as" and so on, which are stored in the STOPWORDTBL table. Finally, all the normalized items are put into the dynamic array word_weight_col.
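A minimal sketch of this step (our own illustration; the stop-word list, the weighting scheme and the stub stemmer are assumptions) could look like the following: split the query, drop stop words, stem each remaining word, and give earlier words higher weights before pushing them into the word/weight array.

#include <set>
#include <sstream>
#include <string>
#include <vector>

struct WordWeight {
    std::string word;
    int weight;
};

// Stand-in for the stemmer sketched earlier; here it is the identity function.
std::string Stem(const std::string &w) { return w; }

// Turn a raw query string into stemmed (word, weight) pairs,
// with earlier words getting higher weights.
std::vector<WordWeight> NormalizeQuery(const std::string &query) {
    static const std::set<std::string> stopwords = {"is", "as", "the", "a", "of"};
    std::vector<WordWeight> result;
    std::istringstream in(query);
    std::string token;
    int weight = 100;                          // hypothetical starting weight
    while (in >> token) {
        if (stopwords.count(token)) continue;  // drop stop words (cf. STOPWORDTBL)
        result.push_back({Stem(token), weight});
        if (weight > 10) weight -= 10;         // later words matter a little less
    }
    return result;
}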

2. For every element of the dynamic array word_weight_col, i.e. every WORD_WEIGHT_PAIR structure (containing a word and its weight), we look up the records related to that word in the table WORDDICTIONARYTBL. Together these records should cover all the words in word_weight_col.

We then compute how well each Web page matches the query. The matching calculation proceeds as follows. First, all retrieved records are sorted by URL, since one URL may correspond to several records. Each page is then scored: the contribution of the word in each record is initscore * weight + (totaltimes - 1) * weight * increment, where initscore is the base score of a word, totaltimes is the number of times the word appears in the page, weight is the weight the word carries in that particular part of the page (for example in the keywords field, the title, or the body), and increment is the amount added for each additional occurrence of the word.
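As a worked example of this formula (a sketch with assumed constants, not the article's actual values):

// Score contributed by one word occurrence record, following the formula above:
// initscore * weight + (totaltimes - 1) * weight * increment.
int WordScore(int initscore, int weight, int totaltimes, int increment) {
    return initscore * weight + (totaltimes - 1) * weight * increment;
}

// With assumed constants initscore = 10 and increment = 2, a query word that
// appears 3 times in the <title> (weight 10) contributes
//   10 * 10 + (3 - 1) * 10 * 2 = 140,
// while the same word appearing 3 times in the body (weight 1) contributes
//   10 * 1 + (3 - 1) * 1 * 2 = 14.
// A page's total score is the sum of these contributions over all matching records.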

3. According to the user-specified m_maxreturnsum, display the m_maxreturnsum pages with the highest matching scores.

4. Concluding remarks

Using the mechanisms discussed above, we implemented the page-collection process of a Web search engine with VC++ and SQL Server on the Windows NT operating system. Once this basic search engine framework has been built, our own algorithms can be implemented on top of it, for example better spider scheduling, better document classification and better understanding of user queries, giving the Web search engine more intelligent and personalized behaviour.

