C# HTML parsing tool: Html Agility Pack
It's a bit late today. My project needs to crawl movie recommendations from Douban, so I need to parse the crawled HTML. I used to do this kind of parsing in Python, but I am currently working in C#. I don't think C# is any worse than Python, and with Microsoft backing it there is nothing to worry about on the language side; the gap is mainly in the ecosystem. I looked around and found that Html Agility Pack is a good choice. There are other libraries too, but I won't go into them here, since this is the one I will mainly use.
Official website (you can download the DLL yourself):
http://html-agility-pack.net/select-nodes
Reference: Introduction to the basic Html Agility Pack classes and how to use them
Code Design:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program
{
    // Local file that the downloaded page is saved to.
    const string MoviePath = @"E:\Program file\C# program code\Validate\ConsoleApplication1\movie.txt";

    static void complete(object o, AsyncCompletedEventArgs e)
    {
        // Start parsing the HTML.
        var doc = new HtmlDocument();
        doc.Load(MoviePath, Encoding.UTF8);
        List<string> movie = new List<string>();

        // Movie titles. Attributes need the @ prefix in XPath,
        // and SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//ul/li[@class=\"title\"]");
        if (nodeCollection != null)
        {
            foreach (HtmlNode n in nodeCollection)
            {
                Console.WriteLine(n.InnerHtml.Trim());
                movie.Add(n.InnerText.Trim());
            }
        }

        // Titles of the most popular reviews.
        HtmlNodeCollection nodeCollection1 = doc.DocumentNode.SelectNodes("//div[@class=\"review-bd\"]/h3");
        if (nodeCollection1 != null)
        {
            foreach (HtmlNode n in nodeCollection1)
            {
                Console.WriteLine(n.InnerHtml.Trim());
                movie.Add(n.InnerText.Trim());
            }
        }

        foreach (var m in movie)
        {
            Console.WriteLine(m);
        }
        File.Delete(MoviePath);
    }

    static void Main(string[] args)
    {
        Console.BufferHeight = 10000;
        Console.BufferWidth = 10000;
        WebClient wc = new WebClient();
        wc.UseDefaultCredentials = true;
        // Subscribe before starting the download so the completion event cannot be missed.
        wc.DownloadFileCompleted += new AsyncCompletedEventHandler(complete);
        wc.DownloadFileAsync(new Uri("https://movie.douban.com/"), MoviePath);
        Console.Read();
    }
}
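The two things that are easiest to get wrong in the code above are the XPath attribute syntax (it has to be @class, not class) and the fact that SelectNodes gives back null instead of an empty collection when nothing matches. Here is a minimal sketch that isolates just that part, assuming the HtmlAgilityPack NuGet package is installed. The helper name ExtractText and the sample markup are made up for illustration; the markup is not Douban's real page structure.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: HtmlAgilityPack is referenced via NuGet

static class TitleExtractor
{
    // Hypothetical helper: returns the trimmed text of every node matched by xpath.
    public static List<string> ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // DeEntitize turns entities such as &amp; back into plain characters.
            result.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return result;
    }

    static void Main()
    {
        // Made-up markup for illustration only.
        string html =
            "<ul>" +
            "<li class=\"title\"> Movie A </li>" +
            "<li class=\"title\">Movie B &amp; C</li>" +
            "<li>something else</li>" +
            "</ul>";

        foreach (string title in ExtractText(html, "//ul/li[@class='title']"))
        {
            Console.WriteLine(title);
        }
    }
}
```

Parsing from a string like this also makes it easy to try out an XPath expression on a small snippet before running it against the whole downloaded page.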
For the WebClient documentation, see https://msdn.microsoft.com/zh-cn/library/system.net.webclient(v=vs.110).aspx
I have to say that the documentation on the official Microsoft website is really thorough! I had heard people say that Microsoft's solutions and documentation are very comprehensive, but I had always just searched for information on Baidu. This time I changed my approach and checked the official Microsoft site instead. The examples there are classics!