C#: Using HtmlAgilityPack to Scrape Web Page Content
A few days ago I saw a blog post: C# Crawling Novels
The blogger used regular expressions to extract the novel's title, chapter list, and content.
The following rewrites the original blogger's code using HtmlAgilityPack:
Before using HtmlAgilityPack, familiarize yourself with XPath.
The code is as follows:
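As a quick illustration of the kind of XPath query used in this post, here is a minimal sketch that runs against an in-memory HTML string instead of a live page. The sample markup, the class name XPathDemo, and the helper ExtractLinks are made up for illustration; only the XPath expression comes from the post itself. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical sample markup, shaped roughly like a chapter index page.
    public const string SampleHtml =
        "<html><body><h1>Sample Novel</h1>" +
        "<table>" +
        "<tr><td><a href=\"/chapter1.html\">Chapter 1</a></td></tr>" +
        "<tr><td><a href=\"/chapter2.html\">Chapter 2</a></td></tr>" +
        "<tr><td><a>No link here</a></td></tr>" +
        "</table></body></html>";

    // Returns the href of every <a> that has an href attribute
    // and sits directly under table/tr/td.
    public static List<string> ExtractLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var links = new List<string>();
        HtmlNodeCollection nodes =
            document.DocumentNode.SelectNodes("//table/tr/td/a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            return links;
        }

        foreach (HtmlNode node in nodes)
        {
            links.Add(node.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    public static void Main()
    {
        foreach (string href in ExtractLinks(SampleHtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

The predicate [@href] filters out anchors without a link, so the third row in the sample is skipped. The null check matters in practice: if the page layout changes and the query matches nothing, iterating the result directly would throw.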
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

namespace HtmlAgilityPackDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlWeb htmlWeb = new HtmlWeb();
            // Load the novel's chapter index page.
            HtmlDocument document = htmlWeb.Load("http://www.23us.so/files/article/html/13/13655/index.html");
            FileStream fs = new FileStream("Xinjiang.txt", FileMode.Append, FileAccess.Write);
            StreamWriter sr = new StreamWriter(fs, Encoding.UTF8);
            try
            {
                // Select every <a> element that has an href attribute inside the table cells.
                HtmlNodeCollection nodeCollection = document.DocumentNode.SelectNodes("//table/tr/td/a[@href]");
                foreach (var node in nodeCollection)
                {
                    HtmlAttribute attribute = node.Attributes["href"];
                    string val = attribute.Value;
                    // Load each chapter page once and query it twice.
                    HtmlNode chapter = htmlWeb.Load(val).DocumentNode;
                    var title = chapter.SelectNodes("//h1")[0].InnerText; // chapter title
                    var doc = chapter.SelectNodes("//dd[@id='contents']"); // chapter content
                    var content = doc[0].InnerHtml.Replace("&nbsp;", "").Replace("<br>", "\r\n");
                    sr.WriteLine("\r\n" + title + "\r\n" + content); // write the chapter to the file
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
            finally
            {
                sr.Close();
                fs.Close();
            }
            Console.WriteLine("OK");
            Console.ReadKey(true);
        }
    }
}
This achieves the same result as the original blogger's code!
The code is for reference only!
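One detail worth noting: the listing replaces only the literal string "<br>" when turning chapter HTML into plain text. Whether the target page also emits variants such as "<br />" or "<BR>" is an assumption on my part, but if it does, a small helper like the sketch below would cover them. The names ChapterText and Clean are hypothetical and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChapterText
{
    // Converts chapter markup to plain text:
    // every <br> variant becomes a Windows line break, and &nbsp; entities are dropped.
    public static string Clean(string innerHtml)
    {
        // Matches <br>, <br/>, <br /> and their uppercase forms.
        string text = Regex.Replace(innerHtml, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
        return text.Replace("&nbsp;", "");
    }

    public static void Main()
    {
        Console.WriteLine(Clean("&nbsp;&nbsp;First line<br />Second line<BR>Third line"));
    }
}
```

In the loop above, the two chained Replace calls could then be swapped for a single call such as ChapterText.Clean(doc[0].InnerHtml).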