Objective:
Extract every page of the search results for later processing.
Analyzing the Baidu search URL
Name | Value | Description
wd | Any text | Search keywords
rn | Optional; default 10, minimum 1, maximum 50; any value in that range is accepted | Number of results per page
pn | Baidu shows at most 760 results by default, so with rn=10 the last page is pn=750 | Index of the first result on the page
Example:
https://www.baidu.com/s?wd=tiger&pn=10&rn=3
Keyword: tiger; the listing starts at record 10; each page shows 3 results. Record 10 therefore falls on the fourth page of results for "tiger".
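The parameter arithmetic above can be sketched in plain C++. This is only an illustration, not code from the original project: the helper names buildSearchUrl, pageOfResult and pageOffsets are hypothetical, pn is assumed to be a zero-based index, and the keyword is assumed to need no percent-encoding.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: assemble a search URL from the three parameters.
// Assumption: `keyword` is already URL-safe (no percent-encoding is done here).
std::string buildSearchUrl(const std::string &keyword, int pn, int rn)
{
    return "https://www.baidu.com/s?wd=" + keyword +
           "&pn=" + std::to_string(pn) +
           "&rn=" + std::to_string(rn);
}

// 1-based page that contains the result at zero-based index `pn`
// when every page holds `rn` results.
int pageOfResult(int pn, int rn)
{
    return pn / rn + 1;
}

// Every `pn` offset needed to walk `total` results, `rn` per page.
std::vector<int> pageOffsets(int total, int rn)
{
    std::vector<int> offsets;
    for (int pn = 0; pn < total; pn += rn)
        offsets.push_back(pn);
    return offsets;
}
```

Calling pageOffsets(760, 10) yields the offsets 0, 10, ..., 750, one per page to download, and pageOfResult(10, 3) gives 4, matching the example above.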
HTML source analysis
The freshly downloaded HTML source is hard to read; running it through an online HTML formatting tool makes it readable.
For my purposes, the <script> and <style> elements in the HTML file can be skipped entirely. What remains is to find where the search results are located.
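Skipping the <script> and <style> elements could look roughly like the sketch below. It is a naive string scan written for illustration, not the approach used later in this post: stripElement and stripScriptsAndStyles are hypothetical names, tag matching is by prefix only, and an element that is never closed causes the rest of the input to be dropped.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case copy, used only so that tag searches are case-insensitive.
static std::string toLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: remove every <tag ...>...</tag> block for one tag name.
std::string stripElement(const std::string &html, const std::string &tag)
{
    const std::string lower = toLower(html);   // same length as html, so indices match
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t start = lower.find(open, pos);
        if (start == std::string::npos) {
            out.append(html, pos, std::string::npos);  // keep the remaining text
            break;
        }
        out.append(html, pos, start - pos);            // keep text before the element
        const std::size_t end = lower.find(close, start);
        if (end == std::string::npos)
            break;                                     // unterminated element: drop the rest
        pos = end + close.size();
    }
    return out;
}

// Remove both kinds of element that carry no search results.
std::string stripScriptsAndStyles(const std::string &html)
{
    return stripElement(stripElement(html, "script"), "style");
}
```

Stripping these blocks first leaves much less markup to read through when looking for the result entries.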
Extracting search results (Qt implementation)
In Qt, parsing the HTML file with QDomDocument or QXmlStreamReader failed. The reason is that both classes are designed for parsing XML, and HTML is not XML: it tolerates markup, such as unclosed tags, that an XML parser rejects.
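To make the HTML versus XML mismatch concrete, here is a small stdlib-only sketch that checks whether tags nest strictly, which is one of the things an XML parser insists on. It is not how Qt's parsers work internally; tagsNestLikeXml is a hypothetical name, and the check assumes no '>' appears inside attribute values or comments.

```cpp
#include <string>
#include <vector>

// Naive check: do element tags open and close in strict nesting order?
bool tagsNestLikeXml(const std::string &markup)
{
    std::vector<std::string> openTags;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return false;                       // tag is never terminated
        const std::string body = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (body.empty())
            return false;                       // "<>" is not a tag
        if (body[0] == '!' || body[0] == '?')
            continue;                           // doctype, comment, processing instruction
        if (body.back() == '/')
            continue;                           // self-closing tag such as <br/>
        const bool closing = (body[0] == '/');
        const std::size_t nameStart = closing ? 1 : 0;
        const std::size_t nameEnd = body.find_first_of(" \t\r\n", nameStart);
        const std::string name = body.substr(
            nameStart,
            nameEnd == std::string::npos ? std::string::npos : nameEnd - nameStart);
        if (!closing) {
            openTags.push_back(name);           // remember the open element
        } else {
            if (openTags.empty() || openTags.back() != name)
                return false;                   // closes something that is not innermost
            openTags.pop_back();
        }
    }
    return openTags.empty();                    // everything opened was closed
}
```

A browser happily renders <p>a<br>b</p>, but this check rejects it because <br> is never closed, while <p>a<br/>b</p> passes. Real pages are full of the first kind of markup.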
After some searching, I found that the Tidy library (libtidy) solves the problem.
Tidy is a console application for Mac OS X, Linux, Windows, UNIX, and more. It corrects and cleans up HTML and XML documents by fixing markup errors and upgrading legacy code to modern standards.
libtidy is a C static and dynamic library that developers can integrate into their applications in order to bring all of Tidy's power to your favorite tools. libtidy is used today in desktop applications, web servers, and more.
The code that uses libtidy is as follows:
#include <tidy.h>
#include <tidybuffio.h>   // named <buffio.h> in older libtidy releases
#include <QByteArray>
#include <QDomDocument>

// Cleans the raw HTML with libtidy, converts it to XHTML, and loads the
// result into the QDomDocument member `doc`. Returns true on success.
bool HtmlParse::setDatas(const QByteArray &datas)
{
    bool result = false;
    TidyBuffer output = {0};
    TidyBuffer errbuf = {0};
    int rc = -1;
    Bool ok;

    TidyDoc tdoc = tidyCreate();                        // Initialize "document"
    ok = tidyOptSetBool(tdoc, TidyXhtmlOut, yes);       // Convert to XHTML
    if (ok)
        rc = tidySetErrorBuffer(tdoc, &errbuf);         // Capture diagnostics
    if (rc >= 0)
        rc = tidyParseString(tdoc, datas.data());       // Parse the input
    if (rc >= 0)
        rc = tidyCleanAndRepair(tdoc);                  // Tidy it up!
    if (rc >= 0)
        rc = tidyRunDiagnostics(tdoc);                  // Kvetch
    if (rc > 1)                                         // If error, force output
        rc = (tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1);
    if (rc >= 0)
        rc = tidySaveBuffer(tdoc, &output);             // Pretty print
    if (rc >= 0) {
        // QDomDocument doc; (class member)
        if (doc.setContent(QByteArray((char *)output.bp))) {
            result = true;
        }
    }

    tidyBufFree(&output);
    tidyBufFree(&errbuf);
    tidyRelease(tdoc);
    return result;
}
Baidu Search Results HTML analysis