Crawl web pages with HttpWebRequest and HtmlAgilityPack (no garbled text, no regular expressions)


Without further ado, here's the requirement.

The company's website needs to crawl articles from other sites. The task wasn't originally assigned to me; a colleague spent a whole afternoon on it and got nowhere. Having just joined the company and wanting to prove myself, I took the job over. I had done this before and figured it would be simple, but the moment I started I nearly collapsed: the string that came back from the HTTP request was garbled. After round after round of Baidu (Google kept failing on me), I finally found the cause. The page I wanted to crawl is served compressed, so what I was grabbing was the compressed bytes, and they must be decompressed first; if you don't, the result is garbage no matter which encoding you try. Straight to the code:

public Encoding GetEncoding(string characterSet)
{
    switch (characterSet)
    {
        case "gb2312": return Encoding.GetEncoding("gb2312");
        case "utf-8": return Encoding.UTF8;
        default: return Encoding.Default;
    }
}
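One caveat, and this is my own addition rather than something from the original post: HttpWebResponse.CharacterSet can come back empty, and Encoding.GetEncoding throws an ArgumentException for names it doesn't recognize. A slightly more defensive variant (the GetEncodingSafe name is hypothetical) might look like this:

public Encoding GetEncodingSafe(string characterSet)
{
    // Fall back to UTF-8 when the server sends no charset at all.
    if (string.IsNullOrEmpty(characterSet))
        return Encoding.UTF8;
    try
    {
        // Encoding.GetEncoding understands names like "gb2312", "utf-8", "iso-8859-1".
        return Encoding.GetEncoding(characterSet.Trim());
    }
    catch (ArgumentException)
    {
        // Unknown charset name: fall back rather than crash the whole crawl.
        return Encoding.UTF8;
    }
}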
// Requires: using System.IO; using System.IO.Compression; using System.Net; using System.Text;
public string HttpGet(string url)
{
    string responseStr = "";
    HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
    req.Accept = "*/*";
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1";
    using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
    {
        Stream stream;
        if (response.ContentEncoding.ToLower().Contains("gzip"))
        {
            // Body is gzip-compressed: wrap the raw stream in a decompressing GZipStream.
            stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
        }
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
        {
            stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress);
        }
        else
        {
            // Uncompressed body: read it as-is.
            stream = response.GetResponseStream();
        }
        using (StreamReader reader = new StreamReader(stream, GetEncoding(response.CharacterSet)))
        {
            responseStr = reader.ReadToEnd();
        }
    }
    return responseStr;
}
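As an aside, you don't strictly have to branch on gzip/deflate yourself: HttpWebRequest exposes an AutomaticDecompression property that lets the framework unwrap the body for you. A minimal sketch of that shortcut, reusing the GetEncoding helper above (the HttpGetAuto name is mine, not the original author's):

public string HttpGetAuto(string url)
{
    HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0";
    // Let the framework decompress gzip/deflate responses transparently.
    req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
    using (StreamReader reader = new StreamReader(response.GetResponseStream(), GetEncoding(response.CharacterSet)))
    {
        return reader.ReadToEnd();
    }
}

Setting AutomaticDecompression also makes the request advertise Accept-Encoding: gzip, deflate, so the server knows it is allowed to compress the response.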


Calling HttpGet gets you the source of any URL. Once we have the source, it's time to parse the HTML with the sharp weapon HtmlAgilityPack; nothing to it, this thing is a godsend. The boss doesn't have to worry about my regular expressions anymore.

As for how to use this godsend, there are already plenty of very detailed articles on cnblogs, so I won't repeat them here.

Here's how to grab the article list from the cnblogs home page:

string html = HttpGet("http://www.cnblogs.com/");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Get the article list
var artList = doc.DocumentNode.SelectNodes("//div[@class='post_item']");
foreach (var item in artList)
{
    HtmlDocument aDoc = new HtmlDocument();
    aDoc.LoadHtml(item.InnerHtml);
    var html_a = aDoc.DocumentNode.SelectSingleNode("//a[@class='titlelnk']");
    // The format string was lost from the original post; "<a href='{1}'>{0}</a><br/>" fits the two arguments.
    Response.Write(string.Format("<a href='{1}'>{0}</a><br/>", html_a.InnerText, html_a.Attributes["href"].Value));
}
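A small simplification worth mentioning (my suggestion, not part of the original post): there's no need to reload each item's InnerHtml into a second HtmlDocument, because HtmlAgilityPack can run a relative XPath query directly on a node. Assuming the same post_item/titlelnk markup, and using Console.WriteLine so the sketch runs outside an ASP.NET page, the loop shrinks to:

foreach (var item in artList)
{
    // ".//" scopes the query to the current node instead of the whole document.
    var link = item.SelectSingleNode(".//a[@class='titlelnk']");
    if (link != null)
    {
        Console.WriteLine("{0} -> {1}", link.InnerText, link.Attributes["href"].Value);
    }
}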

Run results: the article titles from the cnblogs home page come out as clickable links.

Done, time to knock off.
