I'm only getting started on this today, which is a little late. My graduation project needs to crawl Douban's movie recommendations, so I need to parse the HTML I download. I used to do this kind of parsing in Python, but at the moment I'm working in C#. I don't think C# is any worse than Python: with Microsoft behind it the language itself is nothing to worry about, and the main question is the ecosystem. After some searching I found that Html Agility Pack works well. There are other libraries too, but I won't cover them here; this is the one I mainly use.
Official address (you can download the DLL yourself):
http://html-agility-pack.net/select-nodes
Reference: Html Agility Pack Basic class Introduction and application
Code Design:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program
{
    // Local file the downloaded page is saved to
    const string MoviePath = @"e:\Program Files\c# program code\validate\consoleapplication1\movie.txt";

    static void Complete(object o, AsyncCompletedEventArgs e)
    {
        // Start parsing HTML
        var doc = new HtmlDocument();
        doc.Load(MoviePath, Encoding.UTF8);
        List<string> movie = new List<string>();

        // Get the movie titles ("@class" tests the attribute)
        HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//ul/li[@class=\"title\"]");
        if (nodeCollection != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode n in nodeCollection)
            {
                Console.WriteLine(n.InnerHtml.Trim());
                movie.Add(n.InnerText.Trim());
            }
        }

        // Get the most popular Douban film reviews
        HtmlNodeCollection nodeCollection1 = doc.DocumentNode.SelectNodes("//div[@class=\"review-bd\"]/h3");
        if (nodeCollection1 != null)
        {
            foreach (HtmlNode n in nodeCollection1)
            {
                Console.WriteLine(n.InnerHtml.Trim());
                movie.Add(n.InnerText.Trim());
            }
        }

        foreach (var m in movie)
        {
            Console.WriteLine(m);
        }
        File.Delete(MoviePath);
    }

    static void Main(string[] args)
    {
        Console.BufferHeight = 10000;
        Console.BufferWidth = 10000;
        WebClient wc = new WebClient();
        wc.UseDefaultCredentials = true;
        // Attach the handler before starting the download so the completion event is not missed
        wc.DownloadFileCompleted += new AsyncCompletedEventHandler(Complete);
        wc.DownloadFileAsync(new Uri("https://movie.douban.com/"), MoviePath);
        Console.Read();
    }
}
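To try out an XPath expression without downloading anything, the same SelectNodes call can be run against an in-memory string. This is only a minimal sketch: the class name XPathSketch, the helper ExtractTitles, and the sample markup are all made up for illustration and are not taken from the real Douban page. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class XPathSketch
{
    // Hypothetical helper: returns the trimmed text of every <li class="title"> inside a <ul>.
    public static List<string> ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        var titles = new List<string>();
        // "@class" tests the attribute; a bare "class" would look for a child element named class.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//ul/li[@class='title']");
        if (nodes == null)
        {
            return titles; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode n in nodes)
        {
            titles.Add(n.InnerText.Trim());
        }
        return titles;
    }

    public static void Main()
    {
        // Made-up sample markup for illustration only.
        string html = "<ul><li class='title'> Movie A </li><li class='poster'>x</li><li class='title'>Movie B</li></ul>";
        foreach (string t in ExtractTitles(html))
        {
            Console.WriteLine(t);
        }
    }
}
```

Keeping the XPath in a small helper like this makes it easy to check a selector against a saved snippet of the page before wiring it into the download callback.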
For the WebClient documentation, see https://msdn.microsoft.com/zh-cn/library/system.net.webclient(v=vs.110).aspx
I have to say, the documentation on Microsoft's official site is really conscientious work! I had heard people say that Microsoft's solutions and documentation are very complete, but whenever I needed to look something up I went straight to Baidu. Now I've changed my habit and look things up directly on the Microsoft site. It really is conscientious, and the examples are classic too!
C# HTML parsing tool: Html Agility Pack