Crawler Technology (VI): Using HtmlAgilityPack to Get Page Links (with C# Code and Plugin Download)

Source: Internet
Author: User

This is a beginner's first experience with HtmlAgilityPack, so the code is very basic.

The Html Agility Pack is an open source project that provides a standard DOM API and XPath navigation for web pages. Pages downloaded with WebBrowser or HttpWebRequest can be parsed with the Html Agility Pack.

The HtmlAgilityPack documentation ships in CHM format, and the CHM file sometimes cannot be read: after opening it, the viewer reports that it cannot link to the page you requested or that the "page cannot be displayed". To fix this, right-click the CHM file, open Properties, and click the "Unblock" button at the bottom of the dialog. The file then displays correctly.

To get the library, download HtmlAgilityPack 1.4.0, unzip it, find HtmlAgilityPack.dll, and add it to the project as a reference.

The classes in HtmlAgilityPack.dll are located in the HtmlAgilityPack namespace.

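The HtmlDocument class represents a complete HTML document. A web page is loaded with a Load method: HtmlWeb.Load fetches a page from a URL, while HtmlDocument.Load reads from a file or a stream.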
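As a small illustration of the DOM and XPath API described above, here is a minimal sketch that parses an HTML string held in memory instead of downloading a page. The class name LinkExtractor, the method ExtractLinks, and the sample markup are hypothetical and used only for this example; it assumes a console project that references HtmlAgilityPack.dll and does not import System.Windows.Forms.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse markup held in memory; no network access

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode anchor in anchors)
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        return links;
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        string html = "<html><body><a href=\"/p/1.html\">One</a><a name=\"top\">No href</a></body></html>";
        foreach (string link in LinkExtractor.ExtractLinks(html))
            Console.WriteLine(link);
    }
}
```

With the sample markup, only the first anchor has an href attribute, so the program prints /p/1.html. The null check matters because SelectNodes does not return an empty collection when no node matches.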

Here is a first attempt at using HtmlAgilityPack.

The goal: when the button is clicked, print all the links found on the page at the given URL. The code is as follows:

 1 using System;
 2 using System.Collections.Generic;
 3 using System.ComponentModel;
 4 using System.Data;
 5 using System.Drawing;
 6 using System.Linq;
 7 using System.Text;
 8 using System.Windows.Forms;
 9 using HtmlAgilityPack;
10
11 namespace HtmlAgilityPackDemo1
12 {
13     public partial class Form1 : Form
14     {
15         public Form1()
16         {
17             InitializeComponent();
18         }
19
20         private void Form1_Load(object sender, EventArgs e)
21         {
22
23         }
24
25         private void button1_Click(object sender, EventArgs e)
26         {
27             HtmlWeb webClient = new HtmlWeb(); // downloads and parses pages
28             HtmlAgilityPack.HtmlDocument doc = webClient.Load("http://www.cnblogs.com/lmei");
29
30             HtmlNodeCollection hrefList = doc.DocumentNode.SelectNodes(".//a[@href]"); // every <a> with an href
31
32             if (hrefList != null) // SelectNodes returns null when nothing matches
33             {
34                 foreach (HtmlNode href in hrefList)
35                 {
36                     HtmlAttribute att = href.Attributes["href"];
37                     Console.WriteLine(att.Value);
38
39                 }
40
41             }
42
43         }
44     }
45 }

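If line 28 of the code above (the declaration of doc) is written as follows,

HtmlDocument doc = webClient.Load("http://www.cnblogs.com/lmei");

the compiler reports an error, because both System.Windows.Forms and HtmlAgilityPack define a class named HtmlDocument, so the unqualified name is ambiguous. Qualify the type with its namespace instead:

HtmlAgilityPack.HtmlDocument doc = webClient.Load("http://www.cnblogs.com/lmei");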

Next, run the program and look at the console output: the hyperlinks on the page are printed.
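The href values printed this way may be relative paths or non-web links such as mailto:, while a crawler usually needs absolute URLs to visit next. The following is a minimal sketch of one way to normalize them using only System.Uri. The class name LinkResolver, the method ToAbsolute, and the sample href values are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;

static class LinkResolver
{
    // Resolves raw href values against the page URL and keeps unique http/https links.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl);
        var seen = new HashSet<string>();
        var result = new List<string>();

        foreach (string href in hrefs)
        {
            Uri absolute;
            // TryCreate accepts both relative ("/p/1.html") and absolute hrefs.
            if (!Uri.TryCreate(baseUri, href, out absolute))
                continue;
            // Skip non-web schemes such as mailto: or javascript:.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // HashSet.Add returns false for a URL that was already collected.
            if (seen.Add(absolute.AbsoluteUri))
                result.Add(absolute.AbsoluteUri);
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical href values, used only for illustration.
        var hrefs = new[]
        {
            "/lmei/p/3485649.html",
            "http://www.cnblogs.com/lmei/p/3485649.html",
            "mailto:someone@example.com"
        };
        foreach (string url in LinkResolver.ToAbsolute("http://www.cnblogs.com/lmei", hrefs))
            Console.WriteLine(url);
    }
}
```

In this example the first two values resolve to the same absolute URL and the mailto: link is skipped, so a single URL is printed.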

If you also want to crawl the body text of the page, the loaded content may come out garbled. In that case, specify the encoding explicitly when loading:

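HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
Encoding encoder = Encoding.GetEncoding("utf-8");
// HtmlDocument.Load takes a file path or a stream, not a URL, so open the page as a stream first.
using (System.IO.Stream stream = new System.Net.WebClient().OpenRead("http://www.cnblogs.com/lmei/p/3485649.html"))
    htmlDoc.Load(stream, encoder);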
