TSE is short for Tiny Search Engine, a small ("micro") search engine produced by the Peking University Network Lab, the same lab that launched the famous Peking University Skynet (Tianwang) search. Skynet trained a generation of search technology experts in the early Chinese Internet era, and the technical path of Baidu (BD) is similar to that of TSE.

TSE includes web page crawling, word segmentation, inverted index generation, and other modules; it can be regarded as a pocket edition of Skynet. The code is written in C++ and is short, lean, and efficient. In my experience it actually works better than some open-source spiders, and it is easy to modify.
Start with the main function, in Main.cpp.

If the console has one parameter, run the search:

CSearch iSearch;
iSearch.DoSearch();

If there are two console parameters, run the web crawler:

CCrawl iCrawl(argv[2], "visited.all");
iCrawl.DoCrawl();

where argv[2] is the input file and "visited.all" is the output file.
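That dispatch can be sketched as a standalone function (the mode names and the helper itself are illustrative, not TSE code):

```cpp
#include <cassert>
#include <string>

// Illustrative dispatch mirroring Main.cpp: argc == 2 (one console
// parameter) selects the search, argc == 3 (two parameters) the crawler.
std::string DispatchMode(int argc) {
    if (argc == 2) return "search";
    if (argc == 3) return "crawl";
    return "usage";
}
```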
In the DoCrawl function, initialization adds the already-visited URLs to a set. The GetVisitedUrlMD5() function reads them from a file:

while (getline(ifsMD5, strMD5)) {
    setVisitedUrlMD5.insert(strMD5);
}

setVisitedUrlMD5 is the set of visited URL digests.
Similarly, GetVisitedPageMD5(), GetIpBlock(), and GetUnreachHostMD5() read file contents into setVisitedPageMD5, mapIpBlock, and setUnreachHostMD5 respectively:

set<string> setVisitedUrlMD5;
set<string> setVisitedPageMD5;
map<unsigned long, unsigned long> mapIpBlock;
set<string> setUnreachHostMD5;

Here setUnreachHostMD5 holds the unreachable hosts read from the file: for each one an MD5 digest of the URL is computed and kept in memory.
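This restore-state-from-digest-files pattern can be exercised standalone; the sketch below (helper name and stream handling are my own, not TSE code) loads one digest per line into a set, as GetVisitedUrlMD5() does:

```cpp
#include <cassert>
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Load one digest per line into a set, mirroring GetVisitedUrlMD5().
// Taking an istream lets the crawler pass an ifstream while a test
// can pass an istringstream.
std::set<std::string> LoadDigests(std::istream& in) {
    std::set<std::string> digests;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty())
            digests.insert(line);   // duplicates collapse automatically
    }
    return digests;
}
```

In the crawler this would be called as LoadDigests on an ifstream opened on the persisted MD5 file.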
After this preliminary work is completed, the crawler reads the seed file, that is, the starting URLs to crawl from. First, all the output stream files are opened; the function is OpenFilesForOutput(). These are the files the crawl results will be written to: the index files (opened by dedicated classes), visited.url, link4SE.url, link4History.url (this may store HTML image links), the unreachable-host file, the visited-URL MD5 file, and the visited-page MD5 file.
for (unsigned int i = 0; i < NUM_WORKERS; i++) {
    if (pthread_create(&tids[i], NULL, start, this))
        cerr << "create threads error" << endl;
}

This starts 10 worker threads. Each thread executes the start function, with the this pointer as its argument.
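The reason start takes the this pointer is that pthread_create only accepts a plain function taking void*; the thread body casts the pointer back to the object. A minimal, self-contained sketch of that trampoline idiom (class and member names are illustrative):

```cpp
#include <cassert>
#include <pthread.h>

struct Crawler {
    int pages_fetched;
    void Fetch() { pages_fetched = 42; }  // stand-in for the real fetch loop
};

// Trampoline: pthread can only call a C-style function, so the object
// pointer travels through the void* argument and is cast back here.
static void* start(void* arg) {
    static_cast<Crawler*>(arg)->Fetch();
    return NULL;
}

// Launch one worker and wait for it, checking pthread_create's result
// the same way the loop above does.
int RunOneWorker(Crawler* c) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, start, c))
        return -1;
    pthread_join(tid, NULL);
    return 0;
}
```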
When the threads are started, the URLs in the seed file are read:

while (getline(ifsSeed, strUrl)) {
    // process the URL that was read, then call AddUrl()
    AddUrl(strUrl.c_str());
}

The file of not-yet-visited URLs is read the same way:

while (getline(ifsUnvisitedUrl, strUrl)) {
    AddUrl(strUrl.c_str());
}
Now see what the AddUrl function does:

AddUrl(const char *url)

In this function, it first determines whether the link is an image link. If so, it is appended to the m_ofsLink4HistoryFile stream and the function returns:

string strUrl = url;
if (iUrl.IsImageUrl(strUrl))
{
    if (m_ofsLink4HistoryFile) {
        pthread_mutex_lock(&mutexLink4HistoryFile);
        m_ofsLink4HistoryFile << strUrl << endl;
        pthread_mutex_unlock(&mutexLink4HistoryFile);
    }
    return;
}
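TSE's own IsImageUrl() is not shown here; to illustrate the idea, a simple suffix-based check could look like the following (the extension list and function name are assumptions, not the TSE implementation):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical suffix check in the spirit of CUrl::IsImageUrl():
// lowercase the URL, then test it against common image extensions.
bool LooksLikeImageUrl(std::string url) {
    for (size_t i = 0; i < url.size(); i++)
        url[i] = std::tolower(static_cast<unsigned char>(url[i]));
    const char* exts[] = {".jpg", ".jpeg", ".gif", ".png", ".bmp"};
    for (size_t i = 0; i < sizeof(exts) / sizeof(exts[0]); i++) {
        std::string e(exts[i]);
        if (url.size() >= e.size() &&
            url.compare(url.size() - e.size(), e.size(), e) == 0)
            return true;
    }
    return false;
}
```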
The iUrl.ParseUrlEx(strUrl) function parses the various pieces of information represented by strUrl, such as the host, scheme, port, and request path.

If it is not an image link, the MD5 digest of the host (taken from the member iUrl.m_sHost) is added to the setUnvisitedUrlMD5 set:

CMD5 iMD5;
iMD5.GenerateMD5((unsigned char *)iUrl.m_sHost.c_str(), iUrl.m_sHost.size());
string strDigest = iMD5.ToString();
setUnvisitedUrlMD5.insert(strDigest);

Note that it is the MD5 digest that is saved in the set. The multimap mmapUrls, by contrast, stores the not-yet-visited URLs themselves, not digests: the set above is only used for membership checks, while the actual worker threads consume the multimap.
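That split — a set of digests for fast membership tests, a multimap keyed by host for the actual work queue — can be sketched as a small frontier structure (the names echo the ones above, but this is a simplified illustration, not TSE's code):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Simplified crawl frontier: digests deduplicate, while the multimap
// holds real URLs keyed by host (a multimap allows many URLs per host).
struct Frontier {
    std::set<std::string> seenDigests;             // role of setUnvisitedUrlMD5
    std::multimap<std::string, std::string> urls;  // role of mmapUrls

    // Returns true if the URL was new and was queued.
    bool Add(const std::string& host, const std::string& url,
             const std::string& digest) {
        if (!seenDigests.insert(digest).second)
            return false;                          // digest already seen, skip
        urls.insert(std::make_pair(host, url));
        return true;
    }
};
```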
Let's take a look at the thread function start mentioned above:

start(arg) --> fetch(arg)

The parameter arg is the CCrawl object pointer.

In the fetch function, the Skynet (Tianwang) index file that will store the results is opened first:

string ofsName = DATA_TIANWANG_FILE + "." + CStrFun::itos(pthread_self());
CTianwangFile tianwangFile(ofsName);

A link4SE file (link4SE.raw) is opened the same way:

ofsName = DATA_LINK4SE_FILE + "." + CStrFun::itos(pthread_self());
CLink4SEFile link4SEFile(ofsName);

The fetch function then iterates over the multimap:

multimap<string, string>::iterator it = mmapUrls.begin();

retrieves a URL:

string strUrl = (*it).second;

and calls the download function:

((CCrawl *)arg)->DownloadFile(&tianwangFile, &link4SEFile, iUrl, ngsock);
This starts the download of the actual web page. In the DownloadFile function, the real download is performed by:

file_length = http.Fetch(strUrlLocation, &downloaded_file, &fileHead, &location, &nSock);

with the buffers declared as:

char *downloaded_file = NULL,
     *fileHead = NULL,
     *location = NULL;

The downloaded page is stored in the downloaded_file, fileHead, and location buffers. The content is then passed as constructor arguments to a CPage object:

CPage iPage(iUrl.m_sUrl, strUrlLocation, fileHead, downloaded_file, file_length);

The crawled page's URL is then processed and its MD5 digest inserted into the set of crawled URLs:

iMD5.GenerateMD5((unsigned char *)iUrl.m_sUrl.c_str(), iUrl.m_sUrl.length());
strDigest = iMD5.ToString();
if (setVisitedUrlMD5.find(strDigest) != setVisitedUrlMD5.end()) {
    // the digest is already in the set,
    // so this page was crawled before
    cout << "!vurl: ";  // crawled already
    return;             // return immediately
}
setVisitedUrlMD5.insert(strDigest);
SaveVisitedUrlMD5(strDigest);  // append the crawled URL's digest to file
Then the page content is written to the Skynet-format file:

SaveTianwangRawData(pTianwangFile, &iUrl, &iPage);
Next, hyperlinks are extracted from the crawled page content. This is done with lex; a previous article already described the lex processing in TSE.

First:

struct uri page_uri;
uri_parse_string(iPage.m_sUrl.c_str(), &page_uri);

parses the page's own URL string into a uri struct. Later:

hlink_detect_string(iPage.m_sContent.c_str(), &page_uri, onfind, &p);

is called; this is the page-processing code described in the previous article.

In TSE, the link URIs in HTML are extracted with lex analysis. The lex-related files in TSE are hlink.l and uri.l: uri.l processes an extracted URI, while hlink.l extracts the links from HTML.
The code flow: in the DownloadFile() method of the crawl class, once the content of a page (HTML) has been obtained, it is stored in a CPage object, and the analysis of that object begins. First the page's own URI is parsed with uri_parse_string(iPage.m_sUrl.c_str(), &page_uri) and saved in page_uri. Then:

hlink_detect_string(iPage.m_sContent.c_str(), &page_uri, onfind, &p);

is called. hlink_detect_string is the key function: the page content (iPage.m_sContent.c_str()) is passed in, and whenever link information is found, the onfind callback is invoked with &p as its argument.

The hlink_detect_string function is defined in hlink.l:

int hlink_detect_string(const char *string, const struct uri *pg_uri, onfind_t onfind, void *arg)

The string parameter is the page content; pg_uri is the address of the page itself; onfind is the function to execute whenever a link is found; and arg is an opaque parameter that is carried along to every callback.
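The callback shape is easiest to see with a toy stand-in: the caller supplies a function pointer plus an opaque arg, and the scanner invokes the callback once per hit. The toy scanner below just looks for href="..." spans with strstr; it only demonstrates the onfind contract and is not the real lex-generated code:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Same shape as onfind_t in hlink.l: element name, attribute name,
// the found URI, and the caller's opaque pointer.
typedef void (*onfind_t)(const char* elem, const char* attr,
                         const char* uri, void* arg);

// Toy stand-in for hlink_detect_string(): report every href="..." span.
int toy_detect_string(const char* html, onfind_t onfind, void* arg) {
    int n = 0;
    const char* p = html;
    while ((p = std::strstr(p, "href=\"")) != NULL) {
        p += 6;                                  // skip past href="
        const char* end = std::strchr(p, '"');
        if (!end)
            break;
        std::string uri(p, end - p);
        onfind("a", "href", uri.c_str(), arg);   // one callback per link
        ++n;
        p = end + 1;
    }
    return n;
}
```

A typical arg is a pointer to a container that the callback fills, which is exactly how the &p parameter is used above.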
if (buf = yy_scan_string(string))
{
    yy_switch_to_buffer(buf);
    _base_uri = (struct uri *)pg_uri;
    _is_our_base = 0;
    _onfind = onfind;
    _arg = arg;
    BEGIN INITIAL;
    n = yylex();
}

buf = yy_scan_string(string) followed by yy_switch_to_buffer(buf) tells lex that the text to analyze is the parameter string; the remaining input parameters are saved in global variables. The scanner then enters the INITIAL state, and the yylex() call starts the analysis.
When the scanner, in the INITIAL state, matches the expression "<"{CDATA}/{BLANK}|">" — that is, something like "<a " or "<a>" — it assigns the table row corresponding to that element name to the global _cur_elem pointer. For example, if "<a" is found, {"a", _elem_a_attr} is assigned to _cur_elem. The related code:

/* Element names are case-insensitive. */
for (yyleng = 0; _elems[yyleng].name; yyleng++)
{
    if (strcasecmp(yytext + 1, _elems[yyleng].name) == 0)
    {
        _cur_elem = _elems + yyleng;
        break;
    }
}
The table is defined as:

static struct _elem _elems[] = {
    {"a",      _elem_a_attr},
    {"area",   _elem_area_attr},
    {"base",   _elem_base_attr},
    {"frame",  _elem_frame_attr},
    {"iframe", _elem_iframe_attr},
    {"img",    _elem_img_attr},
    {"link",   _elem_link_attr},
    {"meta",   _elem_meta_attr},
    {NULL, }
};
static const struct _elem *_cur_elem;
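The table walk can be exercised on its own; the snippet below mirrors the structure shown, with the attribute lists trimmed to one entry each, so it is a cut-down illustration rather than the verbatim source:

```cpp
#include <cassert>
#include <cstring>
#include <strings.h>  // strcasecmp (POSIX)

static const char* _elem_a_attr[]   = {"href", NULL};
static const char* _elem_img_attr[] = {"src", NULL};

struct _elem { const char* name; const char** attrs; };

static struct _elem _elems[] = {
    {"a",   _elem_a_attr},
    {"img", _elem_img_attr},
    {NULL, NULL}
};

// Case-insensitive element lookup, as in the INITIAL-state action above.
static const struct _elem* find_elem(const char* name) {
    for (int i = 0; _elems[i].name; i++)
        if (strcasecmp(name, _elems[i].name) == 0)
            return &_elems[i];
    return NULL;
}
```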
The scanner then enters the ATTRIBUTE state:

yy_push_state(ATTRIBUTE);

In the ATTRIBUTE state it looks for an attribute assignment — for example, for a href= it first extracts href (without the "="). The matched regular expression is:

{CDATA}{BLANK}{0,512}"="{BLANK}{0,512}

so when the matched yytext is the attribute name followed by optional blanks and the equals sign, the blanks and "=" are stripped off:

yyleng = 0;
while (!HLINK_ISBLANK(yytext[yyleng]) && yytext[yyleng] != '=')
    yyleng++;
yytext[yyleng] = '\0';
Then the attribute name ("href") is recorded in _cur_attr, and _curpos is set to the start of the _buffer array that will hold the URI:

for (yyleng = 0; _cur_elem->attrs[yyleng]; yyleng++)
{
    if (strcasecmp(yytext, _cur_elem->attrs[yyleng]) == 0)
    {
        _curpos = _buffer;
        _cur_attr = _cur_elem->attrs[yyleng];
        break;
    }
}
The scanner enters the URI state, ready to extract the URI:

BEGIN URI;

In the URI state, a double quotation mark switches to the DOUBLE_QUOTED state, and a single quotation mark to SINGLE_QUOTED:

<URI>\"{BLANK}{0,512}    BEGIN DOUBLE_QUOTED;
<URI>"'"{BLANK}{0,512}   BEGIN SINGLE_QUOTED;

At this point the scanner has read up to <a href=" and is ready to read the actual URI, so it usually lands in this action:

<UNQUOTED,DOUBLE_QUOTED,SINGLE_QUOTED,ENTITY>.|\n

Here .|\n matches any character except the closing quotation mark (a closing quote means the xxx of href="xxx" has been fully read); special HTML entities such as &lt; are handled separately. For each character read, the key statement is *_curpos++ = *yytext;: it copies the character into the buffer that _curpos points into and advances _curpos by one, accumulating characters until the end of the URI (the closing quote, and so on) is reached.
<DOUBLE_QUOTED>{BLANK}{0,512}\" |
<SINGLE_QUOTED>{BLANK}{0,512}"'" |
<UNQUOTED>{BLANK}|">" {

Once the string has been read, the terminating characters are appended:

*(_curpos + 1) = *_curpos = '\0';

a pointer is set to the start of the URI buffer:

ptr = _buffer;

the character array is parsed into a uri struct:

yyleng = uri_parse_buffer(ptr, _curpos - ptr + 2, &uri);

and the extracted (possibly relative) URI is merged with the page's base URI into the final result:

uri_merge(&uri, _base_uri, result);
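uri_merge() resolves a possibly-relative reference against the page's base URI. The full RFC 3986 algorithm handles many more cases, but a simplified sketch of the common path cases (my own cut-down version, not TSE's uri_merge) looks like this:

```cpp
#include <cassert>
#include <string>

// Simplified relative-reference resolution in the spirit of uri_merge():
// handles absolute references, root-relative paths, and sibling-relative
// paths, but not ".." segments or query/fragment subtleties.
std::string MergeUri(const std::string& base, const std::string& ref) {
    if (ref.find("://") != std::string::npos)
        return ref;                                  // already absolute
    std::string::size_type scheme = base.find("://");
    std::string::size_type path =
        base.find('/', scheme == std::string::npos ? 0 : scheme + 3);
    std::string origin =
        (path == std::string::npos) ? base : base.substr(0, path);
    if (!ref.empty() && ref[0] == '/')
        return origin + ref;                         // root-relative
    // sibling-relative: replace everything after the last '/'
    std::string::size_type slash = base.rfind('/');
    if (scheme != std::string::npos && slash <= scheme + 2)
        return base + "/" + ref;                     // base has no path yet
    return base.substr(0, slash + 1) + ref;
}
```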
Finally, the onfind callback is invoked:

_onfind(_cur_elem->name, _cur_attr, result, _arg);

_onfind is called once for every valid URI found, so scanning one HTML file calls onfind many times.
The onfind function stores the obtained URI in the m_ofsLink4HistoryFile file if it came from an IMG element; any other type is handed to the AddUrl function for processing.

That is the general flow of the lex-based HTML analysis in TSE, and with it the overall flow of TSE's main crawling code.