Use ScrapySharp to quickly capture data from a web page

Source: Internet
Author: User

ScrapySharp is a library for quickly implementing web data collection. It provides two main functions:

    1. Downloading HTML from a URL
    2. Parsing HTML nodes using CSS selectors

Installation:

ScrapySharp can be installed directly from NuGet by entering the following command in the Package Manager Console:

PM> Install-Package ScrapySharp

HTML download

First, let's take a look at its HTML download function, which is implemented by the ScrapingBrowser class:

var browser = new ScrapingBrowser();
var html = browser.DownloadString(new Uri("http://www.cnblogs.com/"));

This is only a simple example. ScrapingBrowser is in fact quite full-featured: common features such as charset detection, automatic redirects, caching, proxies, cookies, user agents, and form submission are all well supported, which makes it more convenient than HttpClient for fetching pages.
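As a rough illustration of a few of those features, the sketch below configures a browser and loads a page as a WebPage object instead of a raw string. It assumes the ScrapySharp package is referenced. The AllowAutoRedirect and Encoding properties and the NavigateToPage method are named as they appear in the library versions this sketch was written against; they may differ in yours, so treat it as a starting point rather than a reference. The class name BrowserSketch is purely illustrative.

```csharp
using System;
using System.Linq;
using System.Text;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

class BrowserSketch
{
    static void Main()
    {
        var browser = new ScrapingBrowser
        {
            AllowAutoRedirect = true,  // follow HTTP redirects automatically
            Encoding = Encoding.UTF8   // assumption: UTF-8 suits the target site
        };

        // NavigateToPage returns a WebPage; its Html property is the parsed root node.
        WebPage page = browser.NavigateToPage(new Uri("http://www.cnblogs.com/"));

        // Count the links on the page with a CSS selector.
        int linkCount = page.Html.CssSelect("a").Count();
        Console.WriteLine("Links found: " + linkCount);
    }
}
```

Working with a WebPage rather than a string avoids a separate parsing step when the next thing you want to do is query the document.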

HTML parsing

ScrapySharp's HTML parsing is built on the well-known HtmlAgilityPack. It mainly adds two extension methods, CssSelect and CssSelectAncestors:

static IEnumerable<HtmlNode> CssSelect(this HtmlNode node, string expression);
static IEnumerable<HtmlNode> CssSelect(this IEnumerable<HtmlNode> nodes, string expression);
static IEnumerable<HtmlNode> CssSelectAncestors(this HtmlNode node, string expression);
static IEnumerable<HtmlNode> CssSelectAncestors(this IEnumerable<HtmlNode> nodes, string expression);

Compared with the hierarchical traversal and XPath methods provided by HtmlAgilityPack, CSS selectors are simpler and quicker to write. Take parsing the titles on the blog home page as an example: first use the browser's developer tools to locate a title and inspect its HTML structure.

The parsing code is as follows:

var doc = new HtmlDocument();
doc.LoadHtml(html);

var docNode = doc.DocumentNode;
var nodes = docNode.CssSelect("...");   // CSS selector for the title elements
foreach (var htmlNode in nodes)
{
    Console.WriteLine(htmlNode.InnerText);
}

The key code is just the single docNode.CssSelect("...") call, which is very concise. In addition, because CSS selectors are flexible, the titles can also be obtained in the following ways:

var nodes = docNode.CssSelect(".post_item_body > h3");
var nodes = docNode.CssSelect("div#post_list").CssSelectAncestors("h3");
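To try these selectors without a network request, the following minimal sketch runs them against an inline HTML fragment. The markup is hypothetical: it is modelled only on the id and class names that appear in the selectors above, not copied from the real page. It assumes the HtmlAgilityPack and ScrapySharp packages are referenced, and the class name SelectorSketch is illustrative.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class SelectorSketch
{
    static void Main()
    {
        // Hypothetical markup mimicking the structure the selectors imply.
        const string html = @"
<div id=""post_list"">
  <div class=""post_item"">
    <div class=""post_item_body""><h3><a href=""/p/1"">First post</a></h3></div>
  </div>
  <div class=""post_item"">
    <div class=""post_item_body""><h3><a href=""/p/2"">Second post</a></h3></div>
  </div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var docNode = doc.DocumentNode;

        // Single-node overload: h3 elements that are direct children of .post_item_body.
        var titles = docNode.CssSelect(".post_item_body > h3").ToList();

        // Sequence overload: narrow to the list container first, then select inside it.
        var viaContainer = docNode.CssSelect("div#post_list").CssSelect("h3").ToList();

        foreach (var title in titles)
        {
            Console.WriteLine(title.InnerText.Trim());
        }

        // Both routes are expected to reach the same h3 elements in this fragment.
        Console.WriteLine("Same count: " + (titles.Count == viaContainer.Count));
    }
}
```

Calling ToList() materializes the results once, so the sequence is not re-evaluated each time it is enumerated.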

Finally, here is a list of commonly used CSS queries for reference:

html.CssSelect("div");                       // all div elements
html.CssSelect("div.content");               // all div elements with CSS class 'content'
html.CssSelect("div.widget.monthlist");      // all div elements with both CSS classes
html.CssSelect("#postPaging");               // all HTML elements with the id postPaging
html.CssSelect("div#postPaging.testClass");  // all div elements with the id postPaging and CSS class testClass
html.CssSelect("div.content > p");           // p elements that are direct children of div elements with CSS class 'content'
html.CssSelect("input[type=text].login");    // text boxes with CSS class login
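Once a query returns nodes, the usual next step is reading their text or attributes. The sketch below applies two of the queries above to a hypothetical fragment and prints what they match. GetAttributeValue and InnerText come from HtmlAgilityPack; the markup, the attribute names in it, and the class name AttributeSketch are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class AttributeSketch
{
    static void Main()
    {
        // Hypothetical markup; only the selectors themselves come from the list above.
        const string markup = @"
<div class=""content"">
  <p>Intro paragraph</p>
  <input type=""text"" class=""login"" name=""user"" />
  <input type=""password"" class=""login"" name=""pass"" />
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(markup);
        var html = doc.DocumentNode;

        // Attribute selector combined with a class selector.
        foreach (var input in html.CssSelect("input[type=text].login"))
        {
            // The second argument is the fallback returned when the attribute is absent.
            Console.WriteLine(input.GetAttributeValue("name", "(no name)"));
        }

        // Child combinator: paragraphs directly inside div.content.
        foreach (var p in html.CssSelect("div.content > p"))
        {
            Console.WriteLine(p.InnerText.Trim());
        }
    }
}
```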

For more CSS selectors, refer to the W3 CSS Selector reference manual.
