Scraping a NetEase News Page with HtmlAgilityPack and ScrapySharp

Source: Internet
Author: User

I have been working on web crawlers recently, and I came across some articles online about pairing HtmlAgilityPack with ScrapySharp, so I decided to give it a try.

So I went to https://www.nuget.org/packages/ScrapySharp to take a look.

The page shows this installation tip: To install ScrapySharp, run the following command in the Package Manager Console

PM> Install-Package ScrapySharp

Next I looked up how to open the Package Manager Console (http://docs.nuget.org/docs/start-here/using-the-package-manager-console).

Instructions: from the Tools menu, select Library Package Manager, and then click Package Manager Console.

It turned out that NuGet was not installed yet!!

So the next step was to install the extension, following this blog post: http://www.cnblogs.com/baiyu/archive/2011/09/07/2170028.html

First, install NuGet

1. Visual Studio 2012 -> Tools -> Extension Manager.

2. Select Online Gallery, type NuGet into the search box in the top right corner, and then follow the prompts to install it.

3. After installation, Package Manager Console appears under View -> Other Windows. It is a console tool integrated into Visual Studio.

Note: when choosing a ScrapySharp version, also check which HtmlAgilityPack version it goes with.

Attached: ScrapySharp package link: https://www.nuget.org/packages/ScrapySharp

With NuGet in place, go back to Tools -> Library Package Manager -> Package Manager Console and run:

PM> Install-Package HtmlAgilityPack
(HtmlAgilityPack 1.4.6 was added to WindowsFormsDemo0320.)
PM> Install-Package ScrapySharp
(ScrapySharp 2.2.x, together with HtmlAgilityPack 1.4.6, was added to WindowsFormsDemo0320.)

Now we can start scraping.

The target is a NetEase news page: http://news.163.com/14/0413/18/9PNVIBV000014JB6.html

The goal is to grab the content of the title tag and the body text, that is, the content of the <p></p> elements inside <div id="endText">...</div>.

When grabbing the title, keep in mind that a page sometimes contains more than one pair of title tags!!

On the NetEase news page, however, the headline that is actually displayed is stored in an h1 tag with the id h1title.

So the core code for extracting the headline is

string title = doc.DocumentNode.SelectSingleNode("//h1[@id='h1title']").InnerText;
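The one-liner above throws a NullReferenceException if the page has no matching h1, because SelectSingleNode returns null when nothing matches. A minimal, more defensive sketch is shown here; the class name TitleExtractor, the method ExtractTitle, and the fallback to the title tag are illustrative assumptions, not part of the original code.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper (name chosen for illustration only).
public static class TitleExtractor
{
    // Returns the text of <h1 id="h1title">, falling back to <title>,
    // or null when neither element exists in the markup.
    public static string ExtractTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var node = doc.DocumentNode.SelectSingleNode("//h1[@id='h1title']")
                   ?? doc.DocumentNode.SelectSingleNode("//title");

        if (node == null)
        {
            return null;
        }

        // Decode entities such as &nbsp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Calling TitleExtractor.ExtractTitle(html1) with the downloaded page source would then replace the direct SelectSingleNode(...).InnerText access.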

The core code for capturing body content:

html.CssSelect("p").CssSelectAncestors("div#endText");

Here's the HTML code for the body part of the News page:

<div id="endText">
<p>Lanzhou April 13, the Daily News of Lanzhou City held a press conference this afternoon, preliminarily identified the cause of the flow of water in the flow of benzene exceeded. According to the preliminary analysis of environmental experts, the surrounding underground oily water is the direct cause of the exceeding of benzene in the water body of artesian ditch.</p>
<p>According to the current investigation, the reasons for the formation of oily water in the vicinity of the artesian ditch are three points: first, the raw material power plant crude oil distillation workshop r205a# slag oil tank had a physical blasting accident on December 28, 1987 8:50, the tank burst caused 90 cubic residue oil, Among them, 34 tons of residue is not recycled, infiltration into the ground; second, raw materials power plant crude oil distillation workshop pump B-113 export Manager on April 3, 2002, a cracking fire, leakage of residual oil and fire in the process of the production of a large number of fire-fighting sewage infiltration into the ground.</p>
<p>According to the Xinhua Beijing, April 13, Lanzhou&nbsp;Lanzhou City "4 11" local tap water benzene index exceeding accident emergency treatment lead group deputy leader Keung 13th, said the investigation team from 11th 3 o'clock in the afternoon began investigation work, the excavation of deep pits, the method to find the cause of benzene exceeding the direction of the water. According to the preliminary analysis of environmental experts, the surrounding underground oily water is the direct cause of the exceeding of benzene in the water body of artesian ditch.</p>
<p>Lanzhou official report said, according to the current investigation of the preliminary determination, artesian ditch around the underground oily water formation reasons are two:</p>
<p>First, the raw material power plant crude oil distillation workshop r205a# slag oil tank (the site was originally a raw power plant 2.5 million tons/year refinery plant, the device was built in 1982, 2003 discontinued, 2006 demolition. After demolition, the existing 400,000-ton/year aromatic extraction device was built in the original site, the tank area was designed to store distillate oil, light oil and residue oil, and a physical blasting accident occurred at 8:50 on December 28, 1987, and the tank burst caused 90 cubic residue discharge, of which 34 tons of residue was not recovered and infiltrated the underground.</p>
<p><!--ad200x300_2 --><div class="gg200x300"><iframe src="http://g.163.com/r?site=netease&affiliate=news&cat=article&type=logo300x250&location=13" width="…" height="…" frameborder="no" border="0" marginwidth="0" marginheight="0" scrolling="no"> </iframe></div>
<p>The second is the raw material power plant crude oil distillation workshop pump B-113 export manager had a cracking fire on April 3, 2002, the leakage of residual oil (the specific quantity was not counted at the time) and the fire in the process of the production of a large number of sewage infiltration into the ground.</p>
<p>Keung said that at present, Lanzhou petrochemical existing production equipment and tank area operation is normal, no material and product leakage phenomenon. There is no leakage phenomenon in the water seal well of the production area, and there is a small amount of floating oil found in the fire well.</p>
<p>The next step of investigation of the accident investigation team is to test the component of the oil-containing wastewater extracted from the excavation pit, and to further verify the correlation between the underground oily water and the excess of benzene in the self-artesian ditch from the technical aspect. At the same time, the specific leakage points in the 4th and 3rd self-flow ditch are verified on the spot, and the related responsible units and the responsible persons who cause the local water benzene exceeding the incident are investigated and verified. Finish</p>
<div class="ep-source Cdgray"> <span class="left"><a href="http://news.163.com/"><img src="http://img1.cache.netease.com/cnews/css13/img/end_news.png" alt="NetEase" width="…" height="…" class="icon"></a>The source of this article: Gao Xiang, Yinyan, Miao Liangjun</span> <span class="ep-editor">Editor: NN102</span>
</div> </div>

The core code is shown below (some of the code written along the way is not posted).

Add: using ScrapySharp.Extensions;

using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace HtmlAgilityDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download the page source
            var uri = new Uri("http://news.163.com/14/0413/18/9PNVIBV000014JB6.html");
            var browser1 = new ScrapingBrowser();
            var html1 = browser1.DownloadString(uri);

            // Parse it into a DOM tree
            var doc = new HtmlDocument();
            doc.LoadHtml(html1);
            var html = doc.DocumentNode;

            // Print every title element found on the page
            var title = html.CssSelect("title");
            foreach (var htmlNode in title)
            {
                Console.WriteLine(htmlNode.InnerText);
            }

            // Print the body content under div#endText
            var ps = html.CssSelect("p").CssSelectAncestors("div#endText");
            foreach (var htmlNode in ps)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }
        }
    }
}

Post-run output:

Lanzhou official announced the reason for the formation of oil-contaminated water around the self-flowing ditch _ NetEase News Center, Lanzhou, April 13, Lanzhou City held a press conference this afternoon, preliminarily identified the cause of the flow of water in the self-flow ditch benzene exceeded. According to the preliminary analysis of environmental experts, the surrounding underground oily water is the direct cause of the exceeding of benzene in the water body of artesian ditch. According to the current investigation, the reasons for the formation of oily water in the vicinity of the artesian ditch are three points: first, the raw material power plant crude oil distillation workshop r205a# slag oil tank had a physical blasting accident on December 28, 1987 8:50, the tank burst caused 90 cubic residue oil, Among them, 34 tons of residue oil is not recycled, infiltration into the ground; second, raw material power plant crude oil distillation workshop pump b-113 of the export mains had cracked fire on April 3, 2002, leakage of residual oil and a large amount of fire-fighting sewage generated during the firefighting process infiltrated the ground. According to Beijing, Lanzhou, April 13 electric &nbsp; Lanzhou City "4· One"Local tap water benzene index exceeding the accident Emergency Management lead group deputy leader Keung 13th said, the investigation team from 11th 3 o'clock in the afternoon began investigation work, the excavation of the deep pit method, found the cause of the water body benzene exceeded the azimuth. According to the preliminary analysis of environmental experts, the surrounding underground oily water is the direct cause of the exceeding of benzene in the water body of artesian ditch. 
Lanzhou official report said, according to the current investigation of the preliminary determination, artesian ditch around the formation of oil-contaminated water two: first, the original company raw materials power plant crude oil distillation workshop r205a# slag tank (the site is the original plant of the company's raw materials 2.5 million tons/year refinery plant, the device was built in 1982, Discontinued in 2003, dismantled in 2006. After demolition, the existing 400,000-ton/year aromatic extraction device was built in the original site, the tank area was designed to store distillate oil, light oil and residue oil, and a physical blasting accident occurred at 8:50 on December 28, 1987, and the tank burst caused 90 cubic residue discharge, of which 34 tons of residue was not recovered and infiltrated the underground.<!--ad200x300_2 --The second is the raw material power plant crude oil distillation workshop pump B-113 export manager had a cracking fire on April 3, 2002, the leakage of residual oil (the specific quantity was not counted at the time) and the fire in the process of the production of a large number of sewage infiltration into the ground. Keung said that at present, Lanzhou petrochemical existing production equipment and tank area operation is normal, no material and product leakage phenomenon. There is no leakage phenomenon in the water seal well of the production area, and there is a small amount of floating oil found in the fire well. The next step of investigation of the accident investigation team is to test the component of the oil-containing wastewater extracted from the excavation pit, and to further verify the correlation between the underground oily water and the excess of benzene in the self-artesian ditch from the technical aspect. 
At the same time, the specific leakage points in the 4th and 3rd self-flow ditch are verified on the spot, and the related responsible units and the responsible persons who cause the local water benzene exceeding the incident are investigated and verified. (end) This article source: Gao Xiang, Yinyan, Miaoliang ArmyEditor: NN102

Looking at the output, there is a leftover "<!--ad200x300_2 -->".

Part of an HTML comment was not removed, so let's deal with it. My first attempt:

foreach (var nodeScript in node.Descendants("script").ToList())
{
    nodeScript.Remove();
}
foreach (var nodeStyle in node.Descendants("style").ToList())
{
    nodeStyle.Remove();
}
foreach (var nodeComment in node.Descendants("//comment()").ToList())
{
    nodeComment.Remove();
}

Comment nodes nested inside the content cannot be cleaned up with the method above: Descendants(name) matches nodes by name and does not evaluate XPath, so "//comment()" matches nothing there.

So use the following method instead:

foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (var comment in doc.DocumentNode.SelectNodes("//comment()").ToArray())
{
    comment.Remove();
}

After running it again, the output was clean.

The code above removes all script and style tags (and comments) from the DOM tree. An iterator cannot remove elements from the collection it is currently enumerating, so ToArray() is used to copy the nodes into an array first, and the loop iterates over that array.
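The same snapshot idea can be shown without any HTML at all. The sketch below uses a plain List<int> as a stand-in for the node collection (an analogy, not HtmlAgilityPack code); the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class: List<int> stands in for a collection of DOM nodes.
public static class SnapshotRemoval
{
    // Removes every even number while looping over a copy of the list,
    // so the list itself can be modified safely inside the loop.
    public static List<int> RemoveEvens(List<int> items)
    {
        foreach (var item in items.ToArray())
        {
            if (item % 2 == 0)
            {
                items.Remove(item);
            }
        }
        return items;
    }

    // Runs the same loop directly over the list and reports whether the
    // enumerator rejected the modification with InvalidOperationException.
    public static bool DirectRemovalThrows(List<int> items)
    {
        try
        {
            foreach (var item in items)
            {
                if (item % 2 == 0)
                {
                    items.Remove(item);
                }
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

With a list such as { 1, 2, 3 }, the direct loop fails as soon as 2 is removed, while the snapshot version finishes and leaves only the odd numbers.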

HtmlAgilityPack uses XPath syntax; "//comment()" means "all comment nodes" in XPath.
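As a rough sketch, the three clean-up loops can be folded into a single XPath union. The helper name HtmlCleaner.Strip is an assumption made for this example. The null check is there because SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (name chosen for illustration only).
public static class HtmlCleaner
{
    // Returns the markup with script elements, style elements and comments removed.
    public static string Strip(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One XPath union selects all three kinds of nodes.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");

        // SelectNodes yields null when there are no matches.
        if (unwanted != null)
        {
            // Iterate over a snapshot so removal does not disturb the loop.
            foreach (var node in unwanted.ToArray())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

This keeps the ToArray() snapshot from the original loops and only adds the guard for pages that contain none of the unwanted nodes.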

[Supplement] Getting content from meta tags and other elements in HTML

Some related statements:

1. Get the page title: doc.DocumentNode.SelectSingleNode("//title").InnerText;

Explanation: "//title" in XPath matches all title nodes. SelectSingleNode returns the first node that satisfies the condition.

2. Get all hyperlinks: doc.DocumentNode.Descendants("a")

3. Get the input whose name is kw, which is roughly equivalent to getElementsByName():

var kwBox = doc.DocumentNode.SelectSingleNode("//input[@name='kw']");

Explanation: "//input[@name='kw']" is also XPath syntax; it selects the input tag whose name attribute equals kw.

4. Other:

var divs = html.CssSelect("div"); // all div elements

var nodes = html.CssSelect("div.content"); // all div elements with CSS class 'content'

var nodes = html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes

var nodes = html.CssSelect("#postPaging"); // all HTML elements with the id postPaging

var nodes = html.CssSelect("div#postPaging.testClass"); // all HTML elements with the id postPaging and CSS class testClass

var nodes = html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with CSS class 'content'

var nodes = html.CssSelect("input[type=text].login"); // text box with CSS class login

We can also select ancestors of elements:

var nodes = html.CssSelect("p.para").CssSelectAncestors("div.content > div.widget");
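For the meta-tag case that this supplement is about, a minimal sketch might look like the following. The class MetaReader, the method GetMetaContent, and the example name "keywords" are assumptions for illustration. The sketch also assumes the name contains no apostrophes, since it is placed directly into the XPath expression.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper (name chosen for illustration only).
public static class MetaReader
{
    // Returns the content attribute of <meta name="..."> or null when
    // no such meta tag exists in the markup.
    public static string GetMetaContent(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumes 'name' has no apostrophes; it is spliced into the XPath as-is.
        var meta = doc.DocumentNode.SelectSingleNode("//meta[@name='" + name + "']");
        if (meta == null)
        {
            return null;
        }

        // GetAttributeValue falls back to the given default if the attribute is missing.
        return meta.GetAttributeValue("content", string.Empty);
    }
}
```

For example, MetaReader.GetMetaContent(html1, "keywords") would return the keywords string of the downloaded page if it declares one.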

Reference Links:

http://www.cnblogs.com/rupeng/archive/2012/02/07/2342012.html

http://www.cnblogs.com/cappuccino/p/3403495.html

http://www.cnblogs.com/dc-lancer/archive/2013/03/27/2985163.html

http://www.cnblogs.com/sswwsw/archive/2012/12/06/2805097.html

http://www.cnblogs.com/linfei721/archive/2013/05/08/3066697.html

http://www.cnblogs.com/cxlings/archive/2013/05/31/3110858.html
