Crawling web pages with HttpWebRequest and HtmlAgilityPack (no garbled text, no regular expressions)

Source: Internet
Author: User

Without further ado, here's the requirement.

The company's website needed to scrape articles from other sites. The task wasn't originally assigned to me; a colleague spent a whole afternoon on it and got nowhere. Having just joined the company and wanting to prove myself, I took it over. I had done this kind of thing before and figured it would be simple, but when I actually started, I hit a wall: the string returned by the HTTP request was garbled. After much searching on Baidu (Google kept failing on me), I finally found the cause: the page I wanted to scrape is served compressed, so what I received were the compressed bytes. They must be decompressed first; otherwise, no matter which encoding you try, the result is garbage. Straight to the code:

public Encoding GetEncoding(string characterSet)
{
    switch (characterSet)
    {
        case "gb2312": return Encoding.GetEncoding("gb2312");
        case "utf-8": return Encoding.UTF8;
        default: return Encoding.Default;
    }
}
public string HttpGet(string url)
{
    string responseStr = "";
    HttpWebRequest req = HttpWebRequest.Create(url) as HttpWebRequest;
    req.Accept = "*/*";
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1";
    using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
    {
        Stream stream;
        if (response.ContentEncoding.ToLower().Contains("gzip"))
        {
            // Response body is gzip-compressed: wrap it in a decompression stream
            stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
        }
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
        {
            stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress);
        }
        else
        {
            stream = response.GetResponseStream();
        }
        using (StreamReader reader = new StreamReader(stream, GetEncoding(response.CharacterSet)))
        {
            responseStr = reader.ReadToEnd();
        }
    }
    return responseStr;
}
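As a side note, the manual GZipStream/DeflateStream branches can also be avoided: HttpWebRequest has an AutomaticDecompression property that makes the framework send the Accept-Encoding header and hand back an already-decompressed stream. Below is a minimal sketch of that variant; the hard-coded UTF-8 fallback is an assumption for brevity, not the charset logic used in the method above.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class AutoDecompressDemo
{
    // Sketch: let the framework decompress gzip/deflate responses for us,
    // instead of wrapping the response stream manually.
    public static string HttpGetAuto(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        // Transparently negotiates and decompresses gzip/deflate bodies.
        req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
        using (StreamReader reader = new StreamReader(
            response.GetResponseStream(), Encoding.UTF8)) // assumed fallback encoding
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this property set, ContentEncoding branching becomes unnecessary; the trade-off is that you give up control over which decompression stream is used.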

Calling HttpGet returns the page source for a URL. With the source in hand, we can parse the HTML with that sharp tool, HtmlAgilityPack. This library really is a gem; the boss no longer has to worry about my regular expressions.

As for how to use the library itself, there are plenty of detailed articles on cnblogs already, so I won't repeat them here.

Here is an example that scrapes the article list from the cnblogs home page:

string html = HttpGet("");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Get the article list
var artList = doc.DocumentNode.SelectNodes("//div[@class='post_item']");
foreach (var item in artList)
{
    HtmlDocument aDoc = new HtmlDocument();
    aDoc.LoadHtml(item.InnerHtml);
    var html_a = aDoc.DocumentNode.SelectSingleNode("//a[@class='titlelnk']");
    // Format string reconstructed (the original was lost in extraction):
    // render each title as a link to its article
    Response.Write(string.Format("<a href='{1}'>{0}</a><br/>", html_a.InnerText, html_a.Attributes["href"].Value));
}

Run results

Done. Calling it a day.
