Crawling web page data in bulk with HtmlAgilityPack

Source: Internet
Author: User

[Repost] Using HtmlAgilityPack to crawl web data in bulk


Handling login. Some web data can only be extracted after logging in. Here, ieHTTPHeaders is used to capture the information the browser submits at login, so that the same data can be sent by the crawler.
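A minimal sketch of the login step, assuming the site uses a plain form POST and a cookie-based session. It relies on the standard .NET `HttpClient` with a `CookieContainer` rather than on any HtmlAgilityPack loading method, and only hands the resulting HTML to `HtmlDocument.LoadHtml`. The class name `LoginCrawler`, the method `LoadAfterLoginAsync`, and its parameters are hypothetical; the real form fields should be taken from the request captured at login.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class LoginCrawler
{
    // Hypothetical helper: posts the login form, then requests the data page
    // through the same handler so the session cookies are sent along.
    public static async Task<HtmlDocument> LoadAfterLoginAsync(
        string loginUrl, Dictionary<string, string> formFields, string dataPageUrl)
    {
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
        using (var client = new HttpClient(handler))
        using (var form = new FormUrlEncodedContent(formFields))
        {
            // Submit the captured login fields; cookies set by the response
            // are stored in the handler's CookieContainer.
            HttpResponseMessage loginResponse = await client.PostAsync(loginUrl, form);
            loginResponse.EnsureSuccessStatusCode();

            // Fetch the page that requires login and parse it.
            string html = await client.GetStringAsync(dataPageUrl);
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc;
        }
    }
}
```

The dictionary passed as `formFields` would hold whatever name/value pairs appear in the captured login request; sites that add hidden tokens or redirects to the login flow would need extra handling not shown here.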

Crawling Web pages

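HtmlAgilityPack.HtmlDocument htmlDoc;

// htmlWeb is the author's page loader; loginUrl, loginPostData and dataPageUrl are placeholders.
if (!string.IsNullOrEmpty(loginUrl))
{
    // Submit the captured login data, then fetch the data page.
    // This three-argument Load appears to be the author's own wrapper,
    // not a stock HtmlAgilityPack overload.
    htmlDoc = htmlWeb.Load(loginUrl, loginPostData, dataPageUrl);
}
else
{
    htmlDoc = htmlWeb.Load(dataPageUrl);
}

// XPaths of the data cells, relative to each loop node.
// A path starting with "//" is evaluated from the document root, not from the row.
ArrayList list = new ArrayList();
list.Add("./td[1]");
list.Add("./td[2]");

// Get the nodes to loop over by XPath, for example: //table/tr
HtmlNodeCollection repeatNodes = htmlDoc.DocumentNode.SelectNodes("//table/tr");

// SelectNodes returns null when nothing matches
if (repeatNodes != null)
{
    // Loop over the nodes
    foreach (HtmlNode node in repeatNodes)
    {
        // Fetch each data item
        foreach (string dataPath in list)
        {
            HtmlNode dataNode = node.SelectSingleNode(dataPath);
            if (dataNode != null)
            {
                string text = dataNode.InnerText;
            }
        }
    }
}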

If the extracted text comes out garbled, set the encoding to match the page, for example gb2312 or UTF-8:

htmlWeb.DefaultEncoding = System.Text.Encoding.GetEncoding(strEncode);
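With the stock HtmlAgilityPack `HtmlWeb` class, one way to force the encoding is its `OverrideEncoding` property. The following is a sketch under assumptions: the page is taken to be served as gb2312, and the URL is a placeholder. On .NET Core and later, legacy code pages such as gb2312 have to be registered before `Encoding.GetEncoding` will return them.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class EncodingExample
{
    public static void Main()
    {
        // Needed on .NET Core / .NET 5+ so that gb2312 can be resolved
        // (older targets may need the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var web = new HtmlWeb
        {
            // Decode the response with this encoding instead of the detected one.
            OverrideEncoding = Encoding.GetEncoding("gb2312")
        };

        // Placeholder URL; replace with the page to crawl.
        HtmlDocument doc = web.Load("http://example.com/");
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```

For pages served as UTF-8, `Encoding.UTF8` can be assigned instead and no provider registration is required.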


-------------------------------------------------------------------------------------------

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualStudio.TestTools.WebTesting;
using HtmlAgilityPack;

public class WebTest1Coded : WebTest
{
    public override IEnumerator<WebTestRequest> GetRequestEnumerator()
    {
        WebTestRequest request1 = new WebTestRequest("http://www.microsoft.com/");
        request1.ValidateResponse += new EventHandler<ValidationEventArgs>(request1_ValidateResponse);
        yield return request1;
    }

    void request1_ValidateResponse(object sender, ValidationEventArgs e)
    {
        // Load the response body string as an HtmlAgilityPack.HtmlDocument
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(e.Response.BodyString);

        // Locate the "Nav" element (null if the page has no such id)
        HtmlNode navNode = doc.GetElementbyId("Nav");

        // Pick the first <li> element inside it
        HtmlNode firstNavItemNode = navNode?.SelectSingleNode(".//li");

        // Validate that the first list item in the Nav element says "Windows"
        e.IsValid = firstNavItemNode?.InnerText == "Windows";
    }
}
