TSE is short for Tiny Search Engine, a small ("micro") search engine produced by the Peking University Network Lab, the same lab that launched the famous Peking University Skynet (Tianwang) search. Skynet trained a generation of search technology experts in the early Chinese Internet era, and the technical path of Baidu (BD) is similar to that of TSE.

TSE includes web page crawling, word segmentation, inverted index generation, and other modules; it can be regarded as a pocket edition of Skynet. The code is written in C++ and is short, lean, and efficient. In my experience it actually works better than some open-source spiders, and it is easy to modify.
Start with the main function, in Main.cpp.

If the console has one parameter, run the search:

CSearch iSearch;
iSearch.DoSearch();

If there are two console parameters, run the web crawler:

CCrawl iCrawl(argv[2], "visited.all");
iCrawl.DoCrawl();

where argv[2] is the input file and "visited.all" is the output file.
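That dispatch can be sketched as a standalone function (the mode names and the helper itself are illustrative, not TSE code):

```cpp
#include <cassert>
#include <string>

// Illustrative dispatch mirroring Main.cpp: argc == 2 (one console
// parameter) selects the search, argc == 3 (two parameters) the crawler.
std::string DispatchMode(int argc) {
    if (argc == 2) return "search";
    if (argc == 3) return "crawl";
    return "usage";
}
```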
In the DoCrawl function, initialization adds the already-visited URLs to a set. The GetVisitedUrlMD5() function reads them from a file:

while (getline(ifsMD5, strMD5)) {
    setVisitedUrlMD5.insert(strMD5);
}

setVisitedUrlMD5 is the set of visited URL digests.
Similarly, GetVisitedPageMD5(), GetIpBlock(), and GetUnreachHostMD5() read file contents into setVisitedPageMD5, mapIpBlock, and setUnreachHostMD5 respectively:

set<string> setVisitedUrlMD5;
set<string> setVisitedPageMD5;
map<unsigned long, unsigned long> mapIpBlock;
set<string> setUnreachHostMD5;

Here setUnreachHostMD5 holds the unreachable hosts read from the file: for each one an MD5 digest of the URL is computed and kept in memory.
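This restore-state-from-digest-files pattern can be exercised standalone; the sketch below (helper name and stream handling are my own, not TSE code) loads one digest per line into a set, as GetVisitedUrlMD5() does:

```cpp
#include <cassert>
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Load one digest per line into a set, mirroring GetVisitedUrlMD5().
// Taking an istream lets the crawler pass an ifstream while a test
// can pass an istringstream.
std::set<std::string> LoadDigests(std::istream& in) {
    std::set<std::string> digests;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty())
            digests.insert(line);   // duplicates collapse automatically
    }
    return digests;
}
```

In the crawler this would be called as LoadDigests on an ifstream opened on the persisted MD5 file.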
After this preliminary work is completed, the crawler reads the seed file, that is, the starting URLs to crawl from. First, all the output stream files are opened; the function is OpenFilesForOutput(). These are the files the crawl results will be written to: the index files (opened by dedicated classes), visited.url, link4SE.url, link4History.url (this may store HTML image links), the unreachable-host file, the visited-URL MD5 file, and the visited-page MD5 file.
for (unsigned int i = 0; i < NUM_WORKERS; i++) {
    if (pthread_create(&tids[i], NULL, start, this))
        cerr << "create threads error" << endl;
}

This starts 10 worker threads. Each thread executes the start function, with the this pointer as its argument.
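The reason start takes the this pointer is that pthread_create only accepts a plain function taking void*; the thread body casts the pointer back to the object. A minimal, self-contained sketch of that trampoline idiom (class and member names are illustrative):

```cpp
#include <cassert>
#include <pthread.h>

struct Crawler {
    int pages_fetched;
    void Fetch() { pages_fetched = 42; }  // stand-in for the real fetch loop
};

// Trampoline: pthread can only call a C-style function, so the object
// pointer travels through the void* argument and is cast back here.
static void* start(void* arg) {
    static_cast<Crawler*>(arg)->Fetch();
    return NULL;
}

// Launch one worker and wait for it, checking pthread_create's result
// the same way the loop above does.
int RunOneWorker(Crawler* c) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, start, c))
        return -1;
    pthread_join(tid, NULL);
    return 0;
}
```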
When the threads are started, the URLs in the seed file are read:

while (getline(ifsSeed, strUrl)) {
    // process the URL that was read, then call AddUrl()
    AddUrl(strUrl.c_str());
}

The file of not-yet-visited URLs is read the same way:

while (getline(ifsUnvisitedUrl, strUrl)) {
    AddUrl(strUrl.c_str());
}
Now see what the AddUrl function does:

AddUrl(const char *url)

In this function, it first determines whether the link is an image link. If so, it is appended to the m_ofsLink4HistoryFile stream and the function returns:

string strUrl = url;
if (iUrl.IsImageUrl(strUrl))
{
    if (m_ofsLink4HistoryFile) {
        pthread_mutex_lock(&mutexLink4HistoryFile);
        m_ofsLink4HistoryFile << strUrl << endl;
        pthread_mutex_unlock(&mutexLink4HistoryFile);
    }
    return;
}
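TSE's own IsImageUrl() is not shown here; to illustrate the idea, a simple suffix-based check could look like the following (the extension list and function name are assumptions, not the TSE implementation):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical suffix check in the spirit of CUrl::IsImageUrl():
// lowercase the URL, then test it against common image extensions.
bool LooksLikeImageUrl(std::string url) {
    for (size_t i = 0; i < url.size(); i++)
        url[i] = std::tolower(static_cast<unsigned char>(url[i]));
    const char* exts[] = {".jpg", ".jpeg", ".gif", ".png", ".bmp"};
    for (size_t i = 0; i < sizeof(exts) / sizeof(exts[0]); i++) {
        std::string e(exts[i]);
        if (url.size() >= e.size() &&
            url.compare(url.size() - e.size(), e.size(), e) == 0)
            return true;
    }
    return false;
}
```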
The iUrl.ParseUrlEx(strUrl) function parses the various pieces of information represented by strUrl, such as the host, scheme, port, and request path.

If it is not an image link, the MD5 digest of the host (taken from the member iUrl.m_sHost) is added to the setUnvisitedUrlMD5 set:

CMD5 iMD5;
iMD5.GenerateMD5((unsigned char *)iUrl.m_sHost.c_str(), iUrl.m_sHost.size());
string strDigest = iMD5.ToString();
setUnvisitedUrlMD5.insert(strDigest);

Note that it is the MD5 digest that is saved in the set. The multimap mmapUrls, by contrast, stores the not-yet-visited URLs themselves, not digests: the set above is only used for membership checks, while the actual worker threads consume the multimap.
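That split — a set of digests for fast membership tests, a multimap keyed by host for the actual work queue — can be sketched as a small frontier structure (the names echo the ones above, but this is a simplified illustration, not TSE's code):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Simplified crawl frontier: digests deduplicate, while the multimap
// holds real URLs keyed by host (a multimap allows many URLs per host).
struct Frontier {
    std::set<std::string> seenDigests;             // role of setUnvisitedUrlMD5
    std::multimap<std::string, std::string> urls;  // role of mmapUrls

    // Returns true if the URL was new and was queued.
    bool Add(const std::string& host, const std::string& url,
             const std::string& digest) {
        if (!seenDigests.insert(digest).second)
            return false;                          // digest already seen, skip
        urls.insert(std::make_pair(host, url));
        return true;
    }
};
```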
Let's take a look at the thread function start mentioned above:

start(arg) --> fetch(arg)

The parameter arg is the CCrawl object pointer.

In the fetch function, the Skynet (Tianwang) index file that will store the results is opened first:

string ofsName = DATA_TIANWANG_FILE + "." + CStrFun::itos(pthread_self());
CTianwangFile tianwangFile(ofsName);

A link4SE file (link4SE.raw) is opened the same way:

ofsName = DATA_LINK4SE_FILE + "." + CStrFun::itos(pthread_self());
CLink4SEFile link4SEFile(ofsName);

The fetch function then iterates over the multimap:

multimap<string, string>::iterator it = mmapUrls.begin();

retrieves a URL:

string strUrl = (*it).second;

and calls the download function:

((CCrawl *)arg)->DownloadFile(&tianwangFile, &link4SEFile, iUrl, ngsock);
This starts the download of the actual web page. In the DownloadFile function, the real download is performed by:

file_length = http.Fetch(strUrlLocation, &downloaded_file, &fileHead, &location, &nSock);

with the buffers declared as:

char *downloaded_file = NULL,
     *fileHead = NULL,
     *location = NULL;

The downloaded page is stored in the downloaded_file, fileHead, and location buffers. The content is then passed as constructor arguments to a CPage object:

CPage iPage(iUrl.m_sUrl, strUrlLocation, fileHead, downloaded_file, file_length);

The crawled page's URL is then processed and its MD5 digest inserted into the set of crawled URLs:

iMD5.GenerateMD5((unsigned char *)iUrl.m_sUrl.c_str(), iUrl.m_sUrl.length());
strDigest = iMD5.ToString();
if (setVisitedUrlMD5.find(strDigest) != setVisitedUrlMD5.end()) {
    // the digest is already in the set,
    // so this page was crawled before
    cout << "!vurl: ";  // crawled already
    return;             // return immediately
}
setVisitedUrlMD5.insert(strDigest);
SaveVisitedUrlMD5(strDigest);  // append the crawled URL's digest to file
Then the page content is written to the Skynet-format file:

SaveTianwangRawData(pTianwangFile, &iUrl, &iPage);
Next, hyperlinks are extracted from the crawled page content. This is done with lex; a previous article already described the lex processing in TSE.

First:

struct uri page_uri;
uri_parse_string(iPage.m_sUrl.c_str(), &page_uri);

parses the page's own URL string into a uri struct. Later:

hlink_detect_string(iPage.m_sContent.c_str(), &page_uri, onfind, &p);

is called; this is the page-processing code described in the previous article.

In TSE, the link URIs in HTML are extracted with lex analysis. The lex-related files in TSE are hlink.l and uri.l: uri.l processes an extracted URI, while hlink.l extracts the links from HTML.
The code flow: in the DownloadFile() method of the crawl class, once the content of a page (HTML) has been obtained, it is stored in a CPage object, and the analysis of that object begins. First the page's own URI is parsed with uri_parse_string(iPage.m_sUrl.c_str(), &page_uri) and saved in page_uri. Then:

hlink_detect_string(iPage.m_sContent.c_str(), &page_uri, onfind, &p);

is called. hlink_detect_string is the key function: the page content (iPage.m_sContent.c_str()) is passed in, and whenever link information is found, the onfind callback is invoked with &p as its argument.

The hlink_detect_string function is defined in hlink.l:

int hlink_detect_string(const char *string, const struct uri *pg_uri, onfind_t onfind, void *arg)

The string parameter is the page content; pg_uri is the address of the page itself; onfind is the function to execute whenever a link is found; and arg is an opaque parameter that is carried along to every callback.
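The callback shape is easiest to see with a toy stand-in: the caller supplies a function pointer plus an opaque arg, and the scanner invokes the callback once per hit. The toy scanner below just looks for href="..." spans with strstr; it only demonstrates the onfind contract and is not the real lex-generated code:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Same shape as onfind_t in hlink.l: element name, attribute name,
// the found URI, and the caller's opaque pointer.
typedef void (*onfind_t)(const char* elem, const char* attr,
                         const char* uri, void* arg);

// Toy stand-in for hlink_detect_string(): report every href="..." span.
int toy_detect_string(const char* html, onfind_t onfind, void* arg) {
    int n = 0;
    const char* p = html;
    while ((p = std::strstr(p, "href=\"")) != NULL) {
        p += 6;                                  // skip past href="
        const char* end = std::strchr(p, '"');
        if (!end)
            break;
        std::string uri(p, end - p);
        onfind("a", "href", uri.c_str(), arg);   // one callback per link
        ++n;
        p = end + 1;
    }
    return n;
}
```

A typical arg is a pointer to a container that the callback fills, which is exactly how the &p parameter is used above.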
if (buf = yy_scan_string(string))
{
    yy_switch_to_buffer(buf);
    _base_uri = (struct uri *)pg_uri;
    _is_our_base = 0;
    _onfind = onfind;
    _arg = arg;
    BEGIN INITIAL;
    n = yylex();
}

buf = yy_scan_string(string) followed by yy_switch_to_buffer(buf) tells lex that the text to analyze is the parameter string; the remaining input parameters are saved in global variables. The scanner then enters the INITIAL state, and the yylex() call starts the analysis.
When the scanner, in the INITIAL state, matches the expression "<"{CDATA}/{BLANK}|">" — that is, something like "<a " or "<a>" — it assigns the table row corresponding to that element name to the global _cur_elem pointer. For example, if "<a" is found, {"a", _elem_a_attr} is assigned to _cur_elem. The related code:

/* Element names are case-insensitive. */
for (yyleng = 0; _elems[yyleng].name; yyleng++)
{
    if (strcasecmp(yytext + 1, _elems[yyleng].name) == 0)
    {
        _cur_elem = _elems + yyleng;
        break;
    }
}
The table is defined as:

static struct _elem _elems[] = {
    {"a",      _elem_a_attr},
    {"area",   _elem_area_attr},
    {"base",   _elem_base_attr},
    {"frame",  _elem_frame_attr},
    {"iframe", _elem_iframe_attr},
    {"img",    _elem_img_attr},
    {"link",   _elem_link_attr},
    {"meta",   _elem_meta_attr},
    {NULL, }
};
static const struct _elem *_cur_elem;
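The table walk can be exercised on its own; the snippet below mirrors the structure shown, with the attribute lists trimmed to one entry each, so it is a cut-down illustration rather than the verbatim source:

```cpp
#include <cassert>
#include <cstring>
#include <strings.h>  // strcasecmp (POSIX)

static const char* _elem_a_attr[]   = {"href", NULL};
static const char* _elem_img_attr[] = {"src", NULL};

struct _elem { const char* name; const char** attrs; };

static struct _elem _elems[] = {
    {"a",   _elem_a_attr},
    {"img", _elem_img_attr},
    {NULL, NULL}
};

// Case-insensitive element lookup, as in the INITIAL-state action above.
static const struct _elem* find_elem(const char* name) {
    for (int i = 0; _elems[i].name; i++)
        if (strcasecmp(name, _elems[i].name) == 0)
            return &_elems[i];
    return NULL;
}
```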
The scanner then enters the ATTRIBUTE state:

yy_push_state(ATTRIBUTE);

In the ATTRIBUTE state it looks for an attribute assignment — for example, for a href= it first extracts href (without the "="). The matched regular expression is:

{CDATA}{BLANK}{0,512}"="{BLANK}{0,512}

so when the matched yytext is the attribute name followed by optional blanks and the equals sign, the blanks and "=" are stripped off:

yyleng = 0;
while (!HLINK_ISBLANK(yytext[yyleng]) && yytext[yyleng] != '=')
    yyleng++;
yytext[yyleng] = '\0';
Then the attribute name ("href") is recorded in _cur_attr, and _curpos is set to the start of the _buffer array that will hold the URI:

for (yyleng = 0; _cur_elem->attrs[yyleng]; yyleng++)
{
    if (strcasecmp(yytext, _cur_elem->attrs[yyleng]) == 0)
    {
        _curpos = _buffer;
        _cur_attr = _cur_elem->attrs[yyleng];
        break;
    }
}
The scanner enters the URI state, ready to extract the URI:

BEGIN URI;

In the URI state, a double quotation mark switches to the DOUBLE_QUOTED state, and a single quotation mark to SINGLE_QUOTED:

<URI>\"{BLANK}{0,512}    BEGIN DOUBLE_QUOTED;
<URI>"'"{BLANK}{0,512}   BEGIN SINGLE_QUOTED;

At this point the scanner has read up to <a href=" and is ready to read the actual URI, so it usually lands in this action:

<UNQUOTED,DOUBLE_QUOTED,SINGLE_QUOTED,ENTITY>.|\n

Here .|\n matches any character except the closing quotation mark (a closing quote means the xxx of href="xxx" has been fully read); special HTML entities such as &lt; are handled separately. For each character read, the key statement is *_curpos++ = *yytext;: it copies the character into the buffer that _curpos points into and advances _curpos by one, accumulating characters until the end of the URI (the closing quote, and so on) is reached.
<DOUBLE_QUOTED>{BLANK}{0,512}\" |
<SINGLE_QUOTED>{BLANK}{0,512}"'" |
<UNQUOTED>{BLANK}|">" {

Once the string has been read, the terminating characters are appended:

*(_curpos + 1) = *_curpos = '\0';

a pointer is set to the start of the URI buffer:

ptr = _buffer;

the character array is parsed into a uri struct:

yyleng = uri_parse_buffer(ptr, _curpos - ptr + 2, &uri);

and the extracted (possibly relative) URI is merged with the page's base URI into the final result:

uri_merge(&uri, _base_uri, result);
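uri_merge() resolves a possibly-relative reference against the page's base URI. The full RFC 3986 algorithm handles many more cases, but a simplified sketch of the common path cases (my own cut-down version, not TSE's uri_merge) looks like this:

```cpp
#include <cassert>
#include <string>

// Simplified relative-reference resolution in the spirit of uri_merge():
// handles absolute references, root-relative paths, and sibling-relative
// paths, but not ".." segments or query/fragment subtleties.
std::string MergeUri(const std::string& base, const std::string& ref) {
    if (ref.find("://") != std::string::npos)
        return ref;                                  // already absolute
    std::string::size_type scheme = base.find("://");
    std::string::size_type path =
        base.find('/', scheme == std::string::npos ? 0 : scheme + 3);
    std::string origin =
        (path == std::string::npos) ? base : base.substr(0, path);
    if (!ref.empty() && ref[0] == '/')
        return origin + ref;                         // root-relative
    // sibling-relative: replace everything after the last '/'
    std::string::size_type slash = base.rfind('/');
    if (scheme != std::string::npos && slash <= scheme + 2)
        return base + "/" + ref;                     // base has no path yet
    return base.substr(0, slash + 1) + ref;
}
```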
Finally, the onfind callback is invoked:

_onfind(_cur_elem->name, _cur_attr, result, _arg);

_onfind is called once for every valid URI found, so scanning one HTML file calls onfind many times.
The onfind function stores the obtained URI in the m_ofsLink4HistoryFile file if it came from an IMG element; any other type is handed to the AddUrl function for processing.

That is the general flow of the lex-based HTML analysis in TSE, and with it the overall flow of TSE's main crawling code.