Baidu Search Results HTML analysis

Source: Internet
Author: User
Tags: tidy

Objective:

Extract every page of the search results for later processing.

Baidu Search URL Analysis

Name   Value                            Description
wd     Any text                         Search keywords
rn     Optional; 1 to 50, default 10    Number of results shown per page
pn     Up to 750 (see note)             Index position of the first result

Note: by default Baidu displays at most 760 results, so the last page is pn=750.
Example:

https://www.baidu.com/s?wd=tiger&pn=10&rn=3

Keyword: tiger; the first result shown is record 10; each page shows 3 results. So the results returned for "tiger" start at a record that falls on the fourth page.
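The parameters above can also be combined in code. The following is a minimal sketch in standard C++, not code from the original article: the helper names percentEncode, buildSearchUrl, pageNumber, and pageOffsets are hypothetical, the 1 to 50 range for rn and the 760-result cap are taken from the parameter descriptions above, and pn is assumed to be a zero-based offset.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Percent-encode every byte except the RFC 3986 unreserved characters,
// so that keywords containing spaces or non-ASCII text stay valid in a URL.
std::string percentEncode(const std::string& text) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Build a search URL from the three parameters described above.
std::string buildSearchUrl(const std::string& keyword, int pn, int rn) {
    assert(pn >= 0);
    assert(rn >= 1 && rn <= 50);  // range stated in the parameter description
    return "https://www.baidu.com/s?wd=" + percentEncode(keyword) +
           "&pn=" + std::to_string(pn) + "&rn=" + std::to_string(rn);
}

// One-based page on which the record at offset pn falls
// (assumption: pn counts from zero).
int pageNumber(int pn, int rn) {
    assert(pn >= 0 && rn >= 1);
    return pn / rn + 1;
}

// Every pn value needed to walk through all pages of a query.
// maxResults defaults to the 760-result limit mentioned above.
std::vector<int> pageOffsets(int rn, int maxResults = 760) {
    assert(rn >= 1);
    std::vector<int> offsets;
    for (int pn = 0; pn < maxResults; pn += rn) {
        offsets.push_back(pn);
    }
    return offsets;
}
```

With these helpers, buildSearchUrl("tiger", 10, 3) reproduces the example URL, pageNumber(10, 3) gives 4, and pageOffsets(10) ends at 750.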

HTML Source File Analysis

The freshly downloaded HTML source is hard to read as-is; an online HTML formatting tool can be used to make it readable.

For my purposes, the <script> and <style> elements in the HTML file can be skipped entirely. What remains is to find where the search results sit in the document.
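Skipping the <script> and <style> elements can be done before any real parsing. The sketch below is one possible approach in standard C++ and is not taken from the original article; stripElement and stripScriptsAndStyles are hypothetical names, and the matching is deliberately simple (it does not handle a closing tag written with inner whitespace such as "</script >").

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Remove every <tag ...>...</tag> block from html, matching the tag name
// case-insensitively. An unterminated block is left untouched.
std::string stripElement(const std::string& html, const std::string& tag) {
    assert(!tag.empty());

    // Lower-cased copy used only for searching; offsets match the original.
    std::string lower(html.size(), '\0');
    std::transform(html.begin(), html.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";

    std::string out;
    std::size_t copyFrom = 0;    // start of the text still to be kept
    std::size_t searchFrom = 0;  // where to look for the next opening tag
    std::size_t start;
    while ((start = lower.find(open, searchFrom)) != std::string::npos) {
        const std::size_t after = start + open.size();
        // Reject longer names that merely begin with the tag, e.g. <styled>.
        if (after < lower.size() && lower[after] != '>' && lower[after] != '/' &&
            !std::isspace(static_cast<unsigned char>(lower[after]))) {
            searchFrom = after;
            continue;
        }
        const std::size_t end = lower.find(close, after);
        if (end == std::string::npos) break;
        out.append(html, copyFrom, start - copyFrom);
        copyFrom = searchFrom = end + close.size();
    }
    out.append(html, copyFrom, std::string::npos);
    return out;
}

// Drop both kinds of element that the analysis above chooses to skip.
std::string stripScriptsAndStyles(const std::string& html) {
    return stripElement(stripElement(html, "script"), "style");
}
```

Working on a lower-cased copy keeps the original casing of the remaining markup intact while still catching tags written as <SCRIPT> or <Style>.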

Extracting Search Results (Qt Implementation)

In Qt, parsing the HTML file with QDomDocument or QXmlStreamReader failed. The reason is that QDomDocument and QXmlStreamReader are designed for parsing XML files, and HTML is not XML: real-world HTML is often not well-formed (unclosed tags, unquoted attribute values), which an XML parser rejects.
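To illustrate why an XML parser gives up on ordinary HTML, here is a deliberately simplified sketch in standard C++. It is not how QDomDocument or QXmlStreamReader work internally; it only checks one XML well-formedness rule (every start tag needs a matching, properly nested end tag). The function names tagsAreBalanced and demoBalance are hypothetical, and comments, CDATA sections, and '>' characters inside attribute values are not handled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true when every start tag is closed by a matching end tag in
// properly nested order. Self-closing tags such as <br/> count as closed.
bool tagsAreBalanced(const std::string& markup) {
    std::vector<std::string> openTags;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;   // tag never terminated
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (tag.empty()) return false;                // "<>" is not a tag
        if (tag[0] == '!' || tag[0] == '?') continue; // declarations are skipped
        if (tag.back() == '/') continue;              // self-closing element

        if (tag[0] == '/') {
            // End tag: must match the most recently opened element.
            const std::string name = tag.substr(1);
            if (openTags.empty() || openTags.back() != name) return false;
            openTags.pop_back();
        } else {
            // Start tag: remember the element name without its attributes.
            const std::size_t nameEnd = tag.find_first_of(" \t\r\n");
            openTags.push_back(tag.substr(0, nameEnd));
        }
    }
    return openTags.empty();
}

// Usage sketch: XML-style markup passes, typical HTML does not.
void demoBalance() {
    assert(tagsAreBalanced("<p>line<br/></p>"));      // XHTML style
    assert(!tagsAreBalanced("<p>line<br></p>"));      // HTML void element
    assert(!tagsAreBalanced("<ul><li>a<li>b</ul>"));  // omitted end tags
}
```

Markup such as <br> or an unclosed <li> is valid HTML but fails this check, which is why the page has to be converted to XHTML before an XML parser can load it.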

After some searching, I found that the Tidy library (libtidy) can solve the problem.

Tidy is a console application for Mac OS X, Linux, Windows, UNIX, and more. It corrects and cleans up HTML and XML documents by fixing markup errors and upgrading legacy code to modern standards.

libtidy is a C static and dynamic library that developers can integrate into their applications in order to bring all of Tidy's power to your favorite tools. libtidy is used today in desktop applications, web servers, and more.

The following code uses libtidy to convert the downloaded HTML to XHTML and load it into a QDomDocument:

    #include <tidy.h>
    #include <tidybuffio.h>   // named buffio.h in older libtidy releases
    #include <QByteArray>
    #include <QDomDocument>

    // doc is a QDomDocument member of HtmlParse.
    bool HtmlParse::setDatas(const QByteArray &datas)
    {
        bool result = false;
        TidyBuffer output = {0};
        TidyBuffer errbuf = {0};
        int rc = -1;
        Bool ok;

        TidyDoc tdoc = tidyCreate();                        // Initialize "document"
        ok = tidyOptSetBool(tdoc, TidyXhtmlOut, yes);       // Convert to XHTML
        if (ok)
            rc = tidySetErrorBuffer(tdoc, &errbuf);         // Capture diagnostics
        if (rc >= 0)
            rc = tidyParseString(tdoc, datas.data());       // Parse the input
        if (rc >= 0)
            rc = tidyCleanAndRepair(tdoc);                  // Tidy it up!
        if (rc >= 0)
            rc = tidyRunDiagnostics(tdoc);                  // Kvetch
        if (rc > 1)                                         // If error, force output.
            rc = (tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1);
        if (rc >= 0)
            rc = tidySaveBuffer(tdoc, &output);             // Pretty print
        if (rc >= 0)
        {
            // Load the repaired XHTML into the XML DOM.
            if (doc.setContent(QByteArray((char *)output.bp)))
            {
                result = true;
            }
        }
        tidyBufFree(&output);
        tidyBufFree(&errbuf);
        tidyRelease(tdoc);
        return result;
    }
