Data Acquisition Practice Learning II (C#)

Source: Internet
Author: User

In the previous article I did not get the data by parsing HTML. Instead I analyzed the request URL and simulated the request to fetch the data, which is just one approach. It was a roundabout route, taken with the help of other people's articles, because I could not get the data by parsing the HTML. I was glad to have reached my goal and thought that was the end of it. Today, however, I found another approach, one that does work by parsing HTML. When I saw it I could hardly believe it: I had spent so much time on this without success, so how could it suddenly work? My interest was piqued, so I tried it out right away. This article is the happy accident that resulted.

First, the idea behind the implementation: it works by driving the WebBrowser control. No wonder it can obtain the HTML and then parse it for the data. Whatever the page does, whether dynamic rendering or Ajax, I am now behaving as a browser, so none of it escapes me. It really is a good approach.

A quick outline: there are three parts.

A class that fetches the HTML to be parsed, an event class, and a call site. Last time I used an adult site for the experiment, and everyone called me vulgar. In fact I am a decent person who just wanted to keep everyone motivated: when you are tired of writing code, looking at some pictures perks you up, and I don't believe nobody is interested in beauty. Learning should be fun, so entertain yourself along the way. OK, to avoid giving anyone ideas this time, I experiment on our own Blog Park (cnblogs). I fetch only the first three pages; more pages would show the same effect, and three are enough to show that the method is feasible.

Now for the code.
The HTML-fetching class

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WebBrowserCrawlerDemo
{
    // This feels best suited to fetching a single page,
    // but it can also crawl multiple pages, such as the Blog Park data.
    // http://www.cnblogs.com/rookey/p/5019090.html

    /// <summary>
    /// Crawls web data through WebBrowser.
    /// WebBrowserCrawler webBrowserCrawler = new WebBrowserCrawler();
    /// Example: File.WriteAllText(Server.MapPath("Sample.txt"), webBrowserCrawler.GetReult("http://www.in2.cc/sample/waterfalllab.htm"));
    /// </summary>
    public class WebBrowserCrawler
    {
        // Assumed placeholder: the original default value was lost in this copy of the code.
        private const int DefaultMaxWaitSeconds = 60;

        // WebBrowser
        private WebBrowser _webBrowder;
        // The final result
        private string _result { get; set; }
        // Web site
        private string _path { get; set; }
        // Maximum time to wait while the data is being crawled (in seconds)
        private int _maxWaitSeconds { get; set; }

        public delegate bool MyDelegate(object sender, TestEventArgs e);

        /// <summary>
        /// Whether the stop-loading condition has been reached
        /// </summary>
        public event MyDelegate IsStopEvent;

        /// <summary>
        /// The publicly exposed method
        /// </summary>
        /// <param name="url">URL path</param>
        /// <param name="maxWaitSeconds">Maximum wait seconds</param>
        /// <returns></returns>
        public string GetReult(string url, int maxWaitSeconds = DefaultMaxWaitSeconds)
        {
            _path = url;
            _maxWaitSeconds = maxWaitSeconds <= 0 ? DefaultMaxWaitSeconds : maxWaitSeconds;
            var mThread = new Thread(FatchDataToResult);
            // An apartment is a logical container for objects that share the same thread-access requirements.
            // All objects in the same apartment can receive calls from any thread in that apartment.
            // The .NET Framework does not use apartments; managed objects must be used in a thread-safe way.
            // Because COM classes use apartments, the common language runtime must create and initialize
            // an apartment when it calls a COM object through COM interop.
            // A managed thread can create and enter a single-threaded apartment (STA), which allows only one thread,
            // or a multithreaded apartment (MTA), which contains one or more threads.
            // Setting the thread's ApartmentState to one of the ApartmentState enumeration values controls which kind is created.
            // Because a given thread can initialize a COM apartment only once, the apartment type cannot be
            // changed after the first call into unmanaged code.
            // From: http://msdn.microsoft.com/zh-tw/library/system.threading.apartmentstate
            mThread.SetApartmentState(ApartmentState.STA);
            mThread.Start();
            mThread.Join();
            return _result;
        }

        /// <summary>
        /// Uses _webBrowder to fetch the information.
        /// Called on the worker thread.
        /// </summary>
        private void FatchDataToResult()
        {
            _webBrowder = new WebBrowser();
            _webBrowder.ScriptErrorsSuppressed = true;
            _webBrowder.Navigate(_path);
            DateTime firstTime = DateTime.Now;
            // DoEvents processes all Windows messages currently in the message queue.
            // If you call DoEvents in your code, your application can handle other events. For example, if your form
            // adds data to a ListBox and the code calls DoEvents, the form repaints when another window is dragged over it.
            // If you remove DoEvents, the form does not repaint until the button's click event handler has finished.
            // Loop until the whole page has loaded, then take the information we want.
            // This can be combined with a parser such as Jumony.
            while ((DateTime.Now - firstTime).TotalSeconds <= _maxWaitSeconds)
            {
                if (_webBrowder.Document != null
                    && _webBrowder.Document.Body != null
                    && !string.IsNullOrEmpty(_webBrowder.Document.Body.OuterHtml)
                    && this.IsStopEvent != null)
                {
                    string html = _webBrowder.Document.Body.OuterHtml;
                    bool rs = this.IsStopEvent(null, new TestEventArgs(html));
                    if (rs)
                    {
                        this._result = html;
                        break;
                    }
                }
                Application.DoEvents();
            }
            _webBrowder.Dispose();
        }
    }
}
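The wait loop in FatchDataToResult follows a general pattern: poll a snapshot until a caller-supplied predicate accepts it, or give up when a time limit expires. A minimal sketch of that pattern on its own, without the WebBrowser control, might look like the following. The names Poller, WaitFor, readSnapshot, isDone and pollIntervalMs are hypothetical and not part of the original code. Thread.Sleep stands in for Application.DoEvents here; the real loop has to pump messages instead of sleeping so that the control can finish loading.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poller
{
    // Calls readSnapshot repeatedly until isDone accepts a non-empty snapshot
    // or maxWaitSeconds elapses. Returns the accepted snapshot, or null on timeout.
    public static string WaitFor(
        Func<string> readSnapshot,
        Func<string, bool> isDone,
        double maxWaitSeconds,
        int pollIntervalMs = 50)
    {
        // Stopwatch is monotonic, so the timeout is not affected by clock changes.
        Stopwatch clock = Stopwatch.StartNew();
        while (clock.Elapsed.TotalSeconds <= maxWaitSeconds)
        {
            string snapshot = readSnapshot();
            if (!string.IsNullOrEmpty(snapshot) && isDone(snapshot))
            {
                return snapshot;
            }
            // Stand-in for Application.DoEvents(): yield briefly before polling again.
            Thread.Sleep(pollIntervalMs);
        }
        return null;
    }
}
```

A call such as Poller.WaitFor(() => currentHtml, h => h.Contains("<div class=\"post_item\">"), 10) mirrors what GetReult does with IsStopEvent, where currentHtml is whatever source of HTML the caller has. Returning null on timeout matches the original, where _result is simply never assigned.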

Event class

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace WebBrowserCrawlerDemo
{
    public class TestEventArgs : EventArgs
    {
        // The HTML captured from the browser at the time the event is raised
        public string Html { get; set; }

        public TestEventArgs(string html2)
        {
            this.Html = html2;
        }
    }
}

The call site is a form with a test button (btnTest). Its code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WebBrowserCrawlerDemo
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        public void test(int num)
        {
            WebBrowserCrawler obj = new WebBrowserCrawler();
            obj.IsStopEvent += new WebBrowserCrawler.MyDelegate((sender, e) =>
            {
                // Return true once the data I want is already loaded in the current HTML
                // return e.Html.Contains("<div id=\"post_list\">");
                return e.Html.Contains("<div class=\"post_item\">");
            });
            string url = string.Format("http://www.cnblogs.com/#p{0}", num);
            string html = obj.GetReult(url); // get the collected data
            if (!string.IsNullOrEmpty(html))
            {
                // process the data
                Write(html);
            }
        }

        private void btnTest_Click(object sender, EventArgs e)
        {
            // Fetch the first three pages
            for (int i = 1; i < 4; i++)
            {
                test(i);
            }
        }

        // http://www.cnblogs.com/akwwl/p/3240813.html
        public void Write(string html)
        {
            string path = @"d:\Practice\mypicturedownloader\webbrowsercrawlerdemo\bin\debug\test\test.txt";
            FileStream fs = new FileStream(path, FileMode.Append);
            // Get the byte array
            byte[] data = System.Text.Encoding.Default.GetBytes(html);
            // Start writing
            fs.Write(data, 0, data.Length);
            // Empty the buffer and close the stream
            fs.Flush();
            fs.Close();
        }
    }
}

To explain: I save the data to a TXT file without extracting any particular target data. I just take the whole page and save it in append mode.
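As an aside, the append step can be written more compactly. The sketch below is an alternative, not the author's method: PageStore and Append are hypothetical names, and it writes UTF-8 rather than the system default encoding used in the original Write method.

```csharp
using System;
using System.IO;
using System.Text;

public static class PageStore
{
    // Appends one page of HTML to the file at path, creating the file if needed.
    public static void Append(string path, string html)
    {
        // File.AppendAllText opens, writes, and closes the file in one call,
        // so the file handle is released even if the write throws.
        File.AppendAllText(path, html + Environment.NewLine, Encoding.UTF8);
    }
}
```

Calling PageStore.Append once per fetched page gives the same one-file, page-after-page layout as the FileStream version, with a line break between pages.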

Why not use e.Html.Contains("<div id=\"post_list\">") as the check? When I used it, I got no data. Here is why.

At that point the returned HTML contains only the empty element structure: the request has not finished and the data has not arrived, so of course there is nothing in it. After switching to the check shown in the code above (post_item), the data comes through.
In the result, since I fetch only three pages, there are three <body> tags. I also compared the content, and it is indeed three pages of data.
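The difference between the two checks can be sketched with two snapshots of the page. The HTML strings below are hypothetical and only illustrate the shape of the markup; they were not captured from the real site. StopConditions and its members are likewise hypothetical names.

```csharp
using System;

public static class StopConditions
{
    // Hypothetical snapshot taken before the Ajax response arrives:
    // the list container exists but is still empty.
    public const string Skeleton =
        "<body><div id=\"post_list\"></div></body>";

    // Hypothetical snapshot taken after the posts have been rendered.
    public const string Loaded =
        "<body><div id=\"post_list\"><div class=\"post_item\">...</div></div></body>";

    // True as soon as the container exists, even while it is empty.
    public static bool HasListContainer(string html)
    {
        return html.Contains("<div id=\"post_list\">");
    }

    // True only once at least one post entry has been rendered.
    public static bool HasPostItems(string html)
    {
        return html.Contains("<div class=\"post_item\">");
    }
}
```

HasListContainer already accepts the empty skeleton, so the loop stops too early, while HasPostItems only accepts the loaded page. Exact substring matching also depends on how the control serializes the markup, so the literal may need adjusting for a particular page.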

If you also want to extract the target data, you can use an HTML parsing library such as Jumony or HtmlAgilityPack.

Well, it's time to leave work, and the content has been covered.

References:
http://www.cnblogs.com/rookey/p/5019090.html
http://www.cnblogs.com/akwwl/p/3240813.html
