[TSE Study Notes: Peking University Skynet Search Engine] Section 9 -- Displaying Search Results

Last Update:2018-12-03 Source: Internet

Author: User

Tags web database


This section describes step 6 of the search entry program tsesearch.cpp: displaying the search results. The program first defines idisplayrst, an object of the result display class cdisplayrst, and then calls three member functions of the class, showtop, showmiddle and showbelow, to output the header, middle, and bottom of the result page respectively (the areas marked ① to ③ in Figure 1). Next, let's look at the source code of these three functions. The code includes detailed comments (the comments starting with "lb_c" were added by me). Each listing is followed by some questions that came up while analyzing the source.

1. showtop

    bool cdisplayrst::showtop()
    {
        string strhost = string(getenv("HTTP_HOST"));

        // lb_c: top of the result page
        cout << "<body bgcolor=#ffffff topmargin=2 marginheight=2>"
             << "<table class=border=0 width=100% cellspacing=0 cellpadding=0 height=29>" << endl
             << "<tr>" << endl
             // lb_c: the "Skynet search" logo image in the upper left corner. The image links to
             // strhost/YC/TSE/ (strhost is the root directory of the website, that is, the website
             // directory configured in Apache, previously set to /var/www/html/). Clicking the logo
             // therefore opens the index.html in strhost/YC/TSE/. That page is actually the same as
             // the index.html in strhost, which is also the search home page, so why not link
             // directly to the index.html in strhost?
             << "<td width=36% rowspan=2 height=1>"
             << "<a href=http://" << strhost << "/YC/TSE/> </a></td>" << endl
             // lb_c: "Search home page" and "help" at the top of the page. The former has the same
             // link as the logo above; the latter links to a help manual on the network.
             << "<td width=64% height=33><font size=2><a href=http://" << strhost
             << "/YC/TSE/>Search home page</a> | <a href=http://e.pku.edu.cn/gbhelp.htm>help</a></font><br></td>" << endl
             << "</tr>" << endl;

        // lb_c: a new query form is built at the top of the page. It contains the search box,
        // the search button, and a hidden page-number key-value pair.
        cout << "<tr>" << endl
             << "<td><p align=\"left\">" << endl
             << "<form method=\"get\" action=\"/YC-cgi-bin/index/tsesearch\" name=\"tw\">" << endl
             // lb_c: search input box
             << "<input type=\"text\" name=\"word\" size=\"55\">" << endl
             // lb_c: "New query" button; note that it has no name
             << "<input type=\"submit\" value=\"New QUERY\">" << endl
             // lb_c: the appended key-value pair is named "start" and its value is 1,
             // meaning page 1 of the search results
             << "<input type=\"hidden\" name=\"start\" value=\"1\">" << endl
             << "</form>" << endl
             << "</tr>" << endl
             << "</table>" << endl;

        // lb_c: the blue horizontal bar in the middle, labeled "image"
        cout << "<table border=0 width=100% cellspacing=1 cellpadding=0 height=1>" << endl
             << "<tr>" << endl
             << "<td width=68 align=center bgcolor=#000066 valign=middle><font size=2><b><font color=#ffffff>image</font></b></font></td>" << endl
             << "</tr>" << endl
             << "<tr>"
             << "<td width=100% align=left colspan=3 height=0>"
             << "</td></tr>" << endl
             << "</table>" << endl;

        return true;
    }

[Analysis 1]: Section 6 noted that a new search started from the search results page is displayed normally. Why? Look at the new query form in the code. The form contains three inputs, and only two of them have names. So after you click the "New query" button, the submitted URL contains two key-value pairs (word and start):

http://localhost:8080/YC-cgi-bin/index/tsesearch?word=%B1%B1%BE%A9%B4%F3%D1%A7&start=1

Therefore, htmlinputs[1] corresponds to the start key-value pair. The value of htmlinputs[1].value is 1, so m_istart is set to 1, and page 1 of the result set is displayed. If the input tag of the "New query" button had a name, the URL would contain three key-value pairs and the second one would belong to the button. htmlinputs[1] would then no longer be the start pair, and an error would occur here as well. You can test this yourself.
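The positional dependence described in Analysis 1 can be reproduced in a few lines. The following is only a sketch, not the TSE parser: parse_query and find_value are hypothetical helpers that do not appear in the TSE source, and percent-decoding is left out. It splits a query string into ordered key-value pairs, which shows that index 1 holds start only as long as no other named input comes before it. Looking the key up by name removes that dependence.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the TSE source): split "k1=v1&k2=v2" into
// key-value pairs, preserving the order in which they appear in the URL.
std::vector<std::pair<std::string, std::string>> parse_query(const std::string& query) {
    std::vector<std::pair<std::string, std::string>> pairs;
    std::string::size_type pos = 0;
    while (pos < query.size()) {
        std::string::size_type amp = query.find('&', pos);
        if (amp == std::string::npos) amp = query.size();
        const std::string item = query.substr(pos, amp - pos);
        const std::string::size_type eq = item.find('=');
        if (eq == std::string::npos) {
            pairs.emplace_back(item, "");  // key without a value
        } else {
            pairs.emplace_back(item.substr(0, eq), item.substr(eq + 1));
        }
        pos = amp + 1;
    }
    return pairs;
}

// Hypothetical helper: look a value up by key, so the order of the inputs
// in the form no longer matters. Returns fallback when the key is absent.
std::string find_value(const std::vector<std::pair<std::string, std::string>>& pairs,
                       const std::string& key, const std::string& fallback) {
    for (const auto& kv : pairs) {
        if (kv.first == key) return kv.second;
    }
    return fallback;
}
```

With "word=abc&start=3", element 1 is the start pair. With a named button, for example "word=abc&btn=New&start=3", element 1 becomes the btn pair, while find_value(pairs, "start", "1") still returns "3".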

2. showmiddle

    // lb_c: strquery is the user's original query string, fusedmsec is the time the search took,
    // irstnum is the total number of search results, and start is the page of the result set to display
    bool cdisplayrst::showmiddle(string strquery, float fusedmsec, unsigned irstnum, unsigned start)
    {
        // lb_c: ipagenum is the total number of result pages; rstperpage is the number of
        // entries displayed on each page and is a constant
        unsigned ipagenum = 0;
        if (irstnum % rstperpage == 0) {
            ipagenum = irstnum / rstperpage;
        } else {
            ipagenum = irstnum / rstperpage + 1;
        }

        // lb_c: display the prompt line: the user's query string, the time consumed by the search,
        // the total number of results, and the range X to Y currently displayed
        cout << "<title>Tse search</title>\n"
             << "<font color=#008080 size=2>" << endl
             << "Search: <b><font color=\"#000000\" size=\"2\">" << strquery << "</b></font>" << endl
             << "Time consumed <b><font color=\"#000000\" size=\"2\">" << fusedmsec
             << "</font></b> millisecond, <b><font color=\"#000000\" size=\"2\">" << irstnum
             << "</font></b>, <b><font color=\"#000000\" size=\"2\">";

        if (irstnum == 0) {
            cout << "0</font></b> to <b><font color=\"#000000\" size=\"2\">"
                 << "0</font></b><br>" << endl;
            return true;
        }

        cout << (start - 1) * rstperpage + 1
             << "</font></b> to <b><font color=\"#000000\" size=\"2\">";
        if (irstnum >= start * rstperpage) {
            cout << start * rstperpage << "</font></b>" << endl;
        } else {
            cout << irstnum << "</font></b>" << endl;
        }

        // lb_c: links to the result pages the user can select
        cout << "Select page:";
        for (unsigned i = 0; i < ipagenum; i++) {
            // lb_c: the page number of the current page is not a link
            if (i + 1 == start) {
                cout << i + 1 << "</a>";
            }
            // lb_c: the other page numbers are links; note that the link format is
            // "/YC-cgi-bin/index/tsesearch?word=***&start=***" [Analysis 2]
            else {
                cout << "<a href=\"/YC-cgi-bin/index/tsesearch?word=" << strquery
                     << "&start=" << i + 1 << "\">" << i + 1 << "</a> ";
            }
        }

        return true;
    }
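The paging arithmetic in showmiddle is easy to check in isolation. The following is a minimal sketch, assuming 1-based page numbers as in the listing; page_count and page_range are hypothetical names of mine, and per_page stands in for the constant rstperpage.

```cpp
#include <algorithm>
#include <utility>

// Number of result pages: total / per_page, rounded up.
// Assumes per_page > 0.
unsigned page_count(unsigned total, unsigned per_page) {
    return total / per_page + (total % per_page != 0 ? 1 : 0);
}

// 1-based numbers of the first and last result shown on page `start`
// (pages are numbered from 1). Returns {0, 0} when there are no results.
std::pair<unsigned, unsigned> page_range(unsigned total, unsigned per_page, unsigned start) {
    if (total == 0) return {0, 0};
    const unsigned first = (start - 1) * per_page + 1;
    const unsigned last = std::min(total, start * per_page);
    return {first, last};
}
```

For 25 results at 10 per page, page_count gives 3 pages, and page_range for page 3 gives results 21 to 25, because the last page is cut off at the total.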

[Analysis 2]: The same question from section 6 applies to the page-number links: why are the results displayed normally when a page number is clicked on the search results page? The code above shows that each page-number link has the format "/YC-cgi-bin/index/tsesearch?word=***&start=***". The second key-value pair is clearly start, so htmlinputs[1].value is the page number the user clicked, m_istart gets the correct value, and the results are displayed normally.
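A link in the word=***&start=*** format can be assembled as in the sketch below. The listing writes strquery into the link as is; whether the query needs encoding at that point depends on how the CGI program receives and decodes its input, which this section does not show. The sketch therefore makes its own assumption and percent-encodes every byte outside the URL-unreserved set. percent_encode and page_link are hypothetical helpers, not TSE functions.

```cpp
#include <string>

// Hypothetical helper: percent-encode every byte that is not an
// unreserved URL character (letters, digits, '-', '_', '.', '~').
std::string percent_encode(const std::string& raw) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : raw) {
        const bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                                (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                                c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: build the href for one page-number link, with
// "word" first and "start" second, as in the format discussed above.
std::string page_link(const std::string& query, unsigned page) {
    return "/YC-cgi-bin/index/tsesearch?word=" + percent_encode(query) +
           "&start=" + std::to_string(page);
}
```

Keeping word first and start second preserves the order that the positional lookup htmlinputs[1] relies on.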

In addition, the page-number links show that clicking a page number runs the CGI program /YC-cgi-bin/index/tsesearch again. In other words, the search is repeated instead of the matching page being taken from the previous result set. Why? This seems unreasonable: the result set already exists, and only a different page of it needs to be displayed.

3. showbelow

    bool cdisplayrst::showbelow(vector<string> &vecquery, set<string> &setrelevantrst,
                                vector<docidx> &vecdocidx, unsigned start)
    {
        cout << "<ol>" << endl;

        set<string>::iterator it = setrelevantrst.begin();
        unsigned idocnumber = 0;

        // lb_c: start is the page number of the result set selected by the user, and rstperpage
        // is the number of records displayed on each page, so the first and last index of the
        // results to display are computed here. The result records from irstbegin to irstend
        // are displayed. This also shows that the start page number should begin at 1.
        unsigned irstbegin = (start - 1) * rstperpage;
        unsigned irstend = start * rstperpage - 1;

        vector<string> vecrefurl;
        vector<string>::iterator itvecrefurl;

        cout << "<tr bgcolor=#e7eefc>";
        bool bcolor = true;

        // lb_c: open the raw web page database. Clicking "[webpage snapshot]" reads the page
        // from this file and displays it. This also shows that a snapshot is historical data
        // stored on the server, not the live page fetched from the web site.
        ifstream ifs(rawpage_file_name.c_str());
        if (!ifs) {
            cout << "cannot open " << rawpage_file_name << " for input\n";
            return false;
        }

        for (; it != setrelevantrst.end(); ++it, idocnumber++) {
            // lb_c: these two checks on the sequence number select the records from
            // irstbegin to irstend in setrelevantrst
            if (idocnumber < irstbegin) continue;
            if (idocnumber > irstend) break;

            cout << "<li><font color=black size=2>" << endl;

            // lb_c: get the docid of the result record
            int docid = atoi((*it).c_str());

            // lb_c: as described for the main function, vecdocidx is the web page index table
            // (it records the mapping from docid to offset). Subtracting the offsets of two
            // adjacent pages in the raw web page database gives the length of this page.
            int length = vecdocidx[docid + 1].offset - vecdocidx[docid].offset;

            // lb_c: allocate the buffer pcontent and read the page data from the raw web page
            // database file
            char *pcontent = new char[length + 1];
            memset(pcontent, 0, length + 1);
            ifs.seekg(vecdocidx[docid].offset);
            ifs.read(pcontent, length);

            char *s;
            s = pcontent;

            string url, tmp = pcontent;
            string::size_type idx1 = 0, idx2 = 0;

            // lb_c: extract the URL from the page data
            idx1 = tmp.find("url:");
            if (idx1 == string::npos) continue;
            idx2 = tmp.find("\n", idx1);
            if (idx1 == string::npos) continue;
            url = tmp.substr(idx1 + 5, idx2 - idx1 - 5);

            // lb_c: vecquery holds the keywords produced in the main function by splitting the
            // search string. They are joined with "+" and shown in the web page snapshot as a
            // hint to the user.
            string word;
            for (unsigned int i = 0; i < vecquery.size(); i++) {
                word = word + "+" + vecquery[i];
            }
            word = word.substr(1);

            // ====================================================================
            // lb_c: the content of each result record is output below: page link, page length,
            // snapshot link, and page abstract
            // lb_c: the snapshot link points to another CGI program, /YC-cgi-bin/index/snapshot.
            // After the user clicks "[webpage snapshot]", that CGI program handles the
            // request. [Analysis 3]
            cout << "<a href=" << url << ">" << url << "</a>, " << length
                 << "<font color=#008080> bytes</font>" << ", "
                 << "<a href=/YC-cgi-bin/index/snapshot?"
                 << "word=" << word << "&" << "url=" << url
                 << " target=_blank>"
                 << "[webpage snapshot]</a>" << endl
                 << "<br>";

            if (length > 400 * 1024) {  // if more than 400 KB
                delete[] pcontent;
                continue;
            }

            // lb_c: extract the body from the page data, then extract the abstract from the
            // body and display it. I will not explain this part in detail.

            // skip HEAD
            int bytesread = 0, newlines = 0;
            while (newlines != 2 && bytesread != HEADER_BUF_SIZE - 1) {
                if (*s == '\n')
                    newlines++;
                else
                    newlines = 0;
                s++;
                bytesread++;
            }
            if (bytesread == HEADER_BUF_SIZE - 1) continue;

            // skip header
            bytesread = 0, newlines = 0;
            while (newlines != 2 && bytesread != HEADER_BUF_SIZE - 1) {
                if (*s == '\n')
                    newlines++;
                else
                    newlines = 0;
                s++;
                bytesread++;
            }
            if (bytesread == HEADER_BUF_SIZE - 1) continue;

            cdocument idocument;
            idocument.removetags(s);
            idocument.m_sbodynotags = s;

            delete[] pcontent;

            string line = idocument.m_sbodynotags;
            cstrfun::replacestr(line, "&nbsp;", " ");
            cstrfun::emptystr(line);  // set "\t\r\n" to " "

            // abstract
            string reserve;
            if ((unsigned char)line.at(48) < 0x80) {
                reserve = line.substr(0, 48);
            } else {
                reserve = line.substr(0, 48 + 1);
            }
            reserve = "[" + reserve + "]";

            unsigned int resnum = 128;
            if (vecquery.size() == 1) resnum = 256;

            for (unsigned int i = 0; i < vecquery.size(); i++) {
                string::size_type idx = 0, cur_idx;
                idx = line.find(vecquery[i], idx);
                if (idx == string::npos) continue;

                if (idx > resnum) {
                    cur_idx = idx - resnum;
                    while ((unsigned char)line.at(cur_idx) > 0x80 && cur_idx != idx) {
                        cur_idx++;
                    }
                    reserve += line.substr(cur_idx + 1, resnum * 2);
                } else {
                    reserve += line.substr(idx, resnum * 2);
                }
                reserve += "...";

                // highlight
                string newkey = "<font color=#e10900>" + vecquery[i] + "</font>";
                cstrfun::replacestr(reserve, vecquery[i], newkey);
            }

            line = reserve;
            cout << line << endl;
            // ====================================================================
        }

        cout << "</ol>";
        cout << "<br><hr><br>";
        cout << "2004 Peking University Network lab<br>\n";
        cout << "</center></body>\n<html>";

        return true;
    }
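Two steps of showbelow are worth isolating: computing a record's length from adjacent offsets, and pulling the URL out of the record header. The following is a sketch under my own assumptions rather than the TSE code. An in-memory string stands in for the raw web page database file, the offset table is a plain vector with one extra trailing entry marking the end of the data, and record_length and extract_url are hypothetical names. The "url: " header line is taken from the listing above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of record docid: the distance between its offset and the next
// record's offset. Assumes offsets has one extra trailing entry that
// marks the end of the data, so offsets[docid + 1] always exists.
std::size_t record_length(const std::vector<std::size_t>& offsets, std::size_t docid) {
    return offsets[docid + 1] - offsets[docid];
}

// Extract the value of the "url: " header line from one record.
// Returns an empty string when the header or its line end is missing.
std::string extract_url(const std::string& record) {
    const std::string key = "url: ";
    const std::string::size_type begin = record.find(key);
    if (begin == std::string::npos) return "";
    const std::string::size_type end = record.find('\n', begin);
    if (end == std::string::npos) return "";
    return record.substr(begin + key.size(), end - begin - key.size());
}
```

A record is then db.substr(offsets[docid], record_length(offsets, docid)), and extract_url on that record returns the text between "url: " and the end of that line. Both find results are checked before substr is called, so a record with a missing header or line end yields an empty string.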

[Analysis 3]: This series of articles does not describe snapshot.cpp, which implements the web page snapshot function, in detail. One point is worth raising here, though. According to the snapshot.cpp source code, the CGI program behind the snapshot function reads the page data from the raw web page database based on the page URL passed in, and the lookup is fairly involved: it first loads the URL index file, then finds the corresponding docid in that file from the MD5 value of the URL, and then reads the page data from the raw web page database for display. Why? showbelow has already read the page data of each result page. Couldn't that data be cached and displayed directly when the snapshot is requested?
