Use ScrapySharp to quickly scrape data from web pages
ScrapySharp is a library that helps us collect web data quickly. It mainly provides the following two functions:
- Get HTML data from a URL
- Parse HTML nodes using CSS selectors
Installation:
ScrapySharp can be installed directly from NuGet by entering the following command in the Package Manager Console:
PM> Install-Package ScrapySharp
HTML download
First, let's look at its HTML download function, which is implemented by the ScrapingBrowser class:
var browser = new ScrapingBrowser();
var html = browser.DownloadString(new Uri("http://www.cnblogs.com/"));
This is only a simple example. ScrapingBrowser is in fact quite full-featured: common features such as charset detection, auto-redirect, caching, proxies, cookies, user-agent settings and form submission are all well supported, which makes it more convenient than HttpClient for fetching pages.
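As a rough sketch of how a couple of these options might be set, the snippet below turns on redirect handling and then fetches the page as a WebPage object rather than a raw string. The AllowAutoRedirect and AllowMetaRedirect properties and the NavigateToPage method are written as they appear in the ScrapySharp README; treat them as assumptions and check them against the version you install. The BrowserFactory class and its Create method are hypothetical names used only for this illustration.

```csharp
using System;
using ScrapySharp.Network;

static class BrowserFactory
{
    // Hypothetical helper: builds a browser with redirect handling enabled.
    public static ScrapingBrowser Create()
    {
        return new ScrapingBrowser
        {
            AllowAutoRedirect = true,   // follow HTTP 3xx redirects
            AllowMetaRedirect = true    // follow <meta http-equiv="refresh"> redirects
        };
    }
}

class DownloadExample
{
    static void Main()
    {
        var browser = BrowserFactory.Create();

        // NavigateToPage returns a WebPage whose Html property is the parsed root node.
        WebPage page = browser.NavigateToPage(new Uri("http://www.cnblogs.com/"));
        Console.WriteLine(page.Html.OuterHtml.Length);
    }
}
```

Use DownloadString when only the raw text is needed; a WebPage is handier when the parsed document is going to be queried straight away.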
HTML parsing
ScrapySharp's HTML parsing is built on the well-known HtmlAgilityPack. It mainly provides two extension methods, CssSelect and CssSelectAncestors:
static IEnumerable<HtmlNode> CssSelect(this HtmlNode node, string expression);
static IEnumerable<HtmlNode> CssSelect(this IEnumerable<HtmlNode> nodes, string expression);
static IEnumerable<HtmlNode> CssSelectAncestors(this HtmlNode node, string expression);
static IEnumerable<HtmlNode> CssSelectAncestors(this IEnumerable<HtmlNode> nodes, string expression);
Compared with the hierarchical traversal and XPath approaches that HtmlAgilityPack provides, CSS selectors are simpler and quicker to write. Take parsing the post titles on the cnblogs home page as an example: first use the browser's developer tools to locate a title and inspect its HTML structure.
The parsing code is as follows:
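var doc = new HtmlDocument();
doc.LoadHtml(html);
var docNode = doc.DocumentNode;
// titleSelector: the CSS selector string for the title element located with the developer tools
var nodes = docNode.CssSelect(titleSelector);
foreach (var htmlNode in nodes)
{
    Console.WriteLine(htmlNode.InnerText);
}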
The key code is just the single docNode.CssSelect call, which is very concise. Because CSS selectors are flexible, the titles can also be obtained in the following ways:
var nodes = docNode.CssSelect(".post_item_body > h3");
var nodes = docNode.CssSelect("div#post_list").CssSelectAncestors("h3");
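To try the selector approach without touching the network, here is a minimal sketch that parses an inline HTML string. The sample markup is hypothetical; it is only loosely modelled on the class and id names in the selectors above and is not the real page structure. TitleExtractor and ExtractTitles are likewise illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

static class TitleExtractor
{
    // Hypothetical sample markup (an assumption, not the actual cnblogs HTML).
    public const string SampleHtml = @"
<div id='post_list'>
  <div class='post_item_body'><h3><a href='/p/1'>First post</a></h3></div>
  <div class='post_item_body'><h3><a href='/p/2'>Second post</a></h3></div>
</div>";

    // Returns the trimmed text of every h3 that is a direct child of
    // an element with the CSS class post_item_body.
    public static List<string> ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
                  .CssSelect(".post_item_body > h3")
                  .Select(node => node.InnerText.Trim())
                  .ToList();
    }
}

class Program
{
    static void Main()
    {
        foreach (var title in TitleExtractor.ExtractTitles(TitleExtractor.SampleHtml))
        {
            Console.WriteLine(title);
        }
    }
}
```

Keeping the parsing step in a function that takes a plain string makes it easy to feed it either a saved file or the result of DownloadString.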
Finally, here is a list of commonly used CSS queries for later reference:
html.CssSelect("div"); // all div elements
html.CssSelect("div.content"); // all div elements with CSS class 'content'
html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes
html.CssSelect("#postPaging"); // all HTML elements with the id postPaging
html.CssSelect("div#postPaging.testClass"); // all div elements with the id postPaging and CSS class testClass
html.CssSelect("div.content > p"); // p elements that are direct children of div elements with CSS class 'content'
html.CssSelect("input[type=text].login"); // text box with CSS class login
For more CSS selectors, refer to the W3 CSS selector reference manual.