.NET Information Collection of AJAX Data

Last Update:2018-08-26 Source: Internet

Author: User

Tags baseuri xpath


There is plenty of material on collecting web information with .NET, but how do you collect data from a site that loads its content asynchronously through Ajax? While building my own information collector I ran into several problems, and I would like to share what I learned.

Several ways to collect web pages, with their pros and cons:

1. HttpWebRequest

The built-in HttpWebRequest class can fetch a site's content, and its advantage is speed. However, if the site loads its data asynchronously through Ajax, that data will not be in the response. The same holds for a site that does not use Ajax but generates its content with JavaScript, for example by writing it into the page with document.write. You also need to know the target site's encoding (the <meta charset="utf-8"/> declaration in the page head); if you decode with the wrong encoding, the collected content comes out garbled. That is a minor problem, though: while researching I found a helper that someone else had written to handle it. I am sorry that I do not know who the author is; I provide the download link for the code below. The real limitation is that JavaScript and Ajax have to be executed by a browser, so HttpWebRequest alone cannot reach such content.

Help.HttpHelp.HttpRequest("URL to collect");

Source
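The implementation of the HttpHelp helper is not shown here, so the following is only a minimal sketch of the encoding step described above, written with HttpClient instead of that helper. The names PageFetcher, DetectEncoding, and FetchAsync are hypothetical. It prefers the charset from the Content-Type header, then looks for a charset declaration in the first bytes of the page, and finally falls back to UTF-8.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Matches charset=... in a Content-Type value or in a <meta> tag.
    private static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    // Picks an encoding: header charset first, then <meta charset>, then UTF-8.
    public static Encoding DetectEncoding(string headerCharset, byte[] body)
    {
        Encoding fromHeader = TryGetEncoding(headerCharset);
        if (fromHeader != null) return fromHeader;

        // The <meta> tag is plain ASCII in the common encodings, so decoding
        // the first bytes as Latin-1 is enough to read the declaration.
        int probeLength = Math.Min(body.Length, 4096);
        string head = Encoding.GetEncoding("ISO-8859-1").GetString(body, 0, probeLength);
        Match match = CharsetPattern.Match(head);
        Encoding fromMeta = match.Success ? TryGetEncoding(match.Groups[1].Value) : null;
        return fromMeta ?? Encoding.UTF8;
    }

    // Returns null when the name is missing or not supported on this platform.
    private static Encoding TryGetEncoding(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return null;
        try { return Encoding.GetEncoding(name.Trim().Trim('"', '\'')); }
        catch (ArgumentException) { return null; }
    }

    // Downloads the raw bytes and decodes them with the detected encoding.
    public static async Task<string> FetchAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            byte[] body = await response.Content.ReadAsByteArrayAsync();
            string headerCharset = response.Content.Headers.ContentType?.CharSet;
            return DetectEncoding(headerCharset, body).GetString(body);
        }
    }
}
```

On .NET Core and later, encodings such as GB2312 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without it this sketch silently falls back to UTF-8 for those pages. Like HttpWebRequest, it still returns only the raw HTML, so Ajax-loaded data remains out of reach.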

2. Browser Controls

I was developing a client/server (C/S) desktop application at the time, and I expect most people would build this feature the same way. Appearance aside, a C/S application here means WinForms. The browser control that ships with WinForms is not very good, so I used GeckoFX, a browser control based on the Firefox engine. There is very little documentation for it, and I ran into some problems with no ready-made solution, though I solved them in the end. With this control you can obtain data loaded asynchronously through Ajax: after the page finishes loading, wait a few seconds and then read the page content. The drawback is that it is slower than the first approach, because a browser control has to render the HTML, execute the JavaScript, and so on.

GeckoFX download

GeckoWebBrowser webBrowser = null;

private void Form1_Load(object sender, EventArgs e)
{
    string xulrunnerPath = AppDomain.CurrentDomain.BaseDirectory + "\\bin";
    Xpcom.Initialize(xulrunnerPath);
    // Set to 3 to block all pop-up windows
    GeckoPreferences.User["privacy.popups.disable_from_plugins"] = 3;
    // Disallow loading of images
    GeckoPreferences.User["permissions.default.image"] = 2;
    webBrowser = new GeckoWebBrowser();
    webBrowser.Navigate("http://www.baidu.com");
    webBrowser.DocumentCompleted += DocumentCompleted;
}

private void DocumentCompleted(object sender, Gecko.Events.GeckoDocumentCompletedEventArgs e)
{
    // Wait a moment after the load event so Ajax requests can finish
    var time = new System.Windows.Forms.Timer();
    time.Interval = 2000; // value lost in the source; 2000 ms assumed for the "few seconds" delay
    time.Tick += (a, b) =>
    {
        time.Stop();
        string html = "";
        // Page load complete
        GeckoHtmlElement element = null;
        var geckoDomElement = webBrowser.Document.DocumentElement;
        if (geckoDomElement != null && geckoDomElement is GeckoHtmlElement)
        {
            element = (GeckoHtmlElement)geckoDomElement;
            // Web page content
            html = element.InnerHtml;
            txtHtml.Text = html;
            /*
            // Find the element with class btnLogin using XPath
            GeckoNode btnLogin = webBrowser.Document.SelectFirst(".//*[@class='btnLogin']");
            if (btnLogin != null)
            {
                GeckoHtmlElement ie = btnLogin as GeckoHtmlElement;
                // Manually trigger a click event
                ie.Click();
            }*/
        }
    };
    time.Start();
}

3. PhantomJS

PhantomJS can be thought of as a headless browser control: it uses QtWebKit as its browser core, and WebKit compiles and executes the JavaScript code. With it you can easily obtain page content, including Ajax-loaded data. For paginated content, the first page needs no delay, but from the second page onward you need to wait before reading the content. It can also capture page snapshots (screenshots) very conveniently. For its other features, consult the documentation.

PhantomJS

IWebDriver driver = null;

private void btnGo_Click(object sender, EventArgs e)
{
    string phantomjsDire = AppDomain.CurrentDomain.BaseDirectory;
    PhantomJSDriverService service = PhantomJSDriverService.CreateDefaultService(phantomjsDire);
    service.IgnoreSslErrors = true;
    service.LoadImages = false;
    service.ProxyType = "none";
    // Pass the configured service to the driver; passing only the directory
    // would start PhantomJS without the settings above
    driver = new PhantomJSDriver(service);
    // IWindow iWindow = driver.Manage().Window;
    // iWindow.Size = new Size(10, 10);
    // iWindow.Position = new Point(0, ...); // second coordinate lost in the source
    driver.Navigate().GoToUrl(textBox1.Text);
    string html = driver.PageSource;
    txtHtml.Text = html;
    // driver.Close();
    // driver.Quit();
}

private void btnPage_Click(object sender, EventArgs e)
{
    // Example XPath expressions:
    //   .//*[@class='next'][text()='next page']
    //   .//*[@class='text']
    //   .//*[@class='button']
    // IWebElement element = driver.FindElement(By.XPath(".//*[@class='text']"));
    // Assign a value to a text box in the web page
    // element.SendKeys("4");
    IWebElement btnElement = driver.FindElement(By.XPath(".//*[@class='next'][text()='next page']"));
    btnElement.Click();
    // Wait for the next page's Ajax data before reading the source
    var time = new System.Windows.Forms.Timer();
    time.Interval = 2 * 1000; // second factor garbled in the source; 1000 (ms) assumed
    time.Tick += (a, b) =>
    {
        time.Stop();
        string html = driver.PageSource;
        txtHtml.Text = html;
    };
    time.Start();
}
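A fixed timer delay either wastes time or fires before the Ajax data has arrived. One possible alternative, sketched here under the assumption that the caller can express "the new page is ready" as a boolean check, is to poll for that condition with a timeout. Poll and WaitUntil are hypothetical names, not part of Selenium or PhantomJS.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poll
{
    // Evaluates condition repeatedly until it returns true or timeout elapses.
    // Returns true if the condition was met, false if the wait timed out.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (clock.Elapsed >= timeout) return false;
            Thread.Sleep(interval); // blocks the calling thread between checks
        }
    }
}
```

In the paging handler, one could record driver.PageSource before calling Click() and then wait with Poll.WaitUntil(() => driver.PageSource != before, TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(200)). Because the loop blocks, it is better run on a background thread than on the WinForms UI thread. Selenium's support package also ships a WebDriverWait class that serves the same purpose.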

If a URL in the page content is a relative address such as ../../a.html and you want the absolute address, you can use the following method:

/// <summary>
/// Get the absolute URL
/// </summary>
/// <param name="baseUri">Current page address</param>
/// <param name="relativeUri">Relative path</param>
/// <returns>The absolute URL, or relativeUri unchanged if it cannot be resolved</returns>
public static string GetRealUrl(string baseUri, string relativeUri)
{
    try
    {
        baseUri = System.Web.HttpUtility.UrlDecode(baseUri);
        relativeUri = System.Web.HttpUtility.UrlDecode(relativeUri);
        Uri baseUriModel = new Uri(baseUri);
        Uri uri = new Uri(baseUriModel, relativeUri);
        string result = uri.ToString();
        baseUriModel = null;
        uri = null;
        return result;
    }
    catch (Exception)
    {
        // Fall through and return the relative address as-is
    }
    return relativeUri;
}
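To make the resolution rule concrete, here is a small variant that uses Uri.TryCreate so that no exception handling is needed. It is a sketch rather than a drop-in replacement for the method above: it skips the UrlDecode step and returns null instead of the relative address when resolution fails. UrlResolver and Resolve are hypothetical names, and example.com stands in for a real site.

```csharp
using System;

public static class UrlResolver
{
    // Resolves relativeUrl against pageUrl; returns null if either is malformed.
    public static string Resolve(string pageUrl, string relativeUrl)
    {
        Uri baseUri;
        Uri resolved;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri)) return null;
        if (!Uri.TryCreate(baseUri, relativeUrl, out resolved)) return null;
        return resolved.AbsoluteUri;
    }
}
```

With the page address http://example.com/news/2018/list.html, the relative address ../../a.html resolves to http://example.com/a.html, and a root-relative address such as /img/x.png resolves to http://example.com/img/x.png. An address that is already absolute is returned unchanged.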

Summary:

Approaches 2 and 3 above can both obtain content loaded asynchronously through Ajax. Both can also locate elements in the page with XPath, such as pagination links and buttons; once an element is found you can trigger its click event, which solves the pagination problem easily. Sites behave differently on the last page, so you need to handle that case yourself: some hide the next-page button, some disable it, and so on.

After obtaining the page content, you can extract what you need with the HtmlAgilityPack library, which locates content using XPath.
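As an illustration of that last step, the following sketch loads an HTML string into HtmlAgilityPack and collects link addresses with an XPath query. It assumes the HtmlAgilityPack NuGet package is referenced; LinkExtractor, ExtractLinks, and the class name passed in are hypothetical and would need to match the target page.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the href of every <a> element whose class attribute equals cssClass.
    public static List<string> ExtractLinks(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[@class='" + cssClass + "'][@href]");
        if (nodes == null) return links;

        foreach (HtmlNode node in nodes)
        {
            links.Add(node.GetAttributeValue("href", ""));
        }
        return links;
    }
}
```

The html argument can be the string read from element.InnerHtml or driver.PageSource in the earlier approaches. The class name is concatenated into the XPath expression for brevity, so it must not contain a single quote.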

I will publish the information collection system I developed later.

Any form of reprint is welcome, but please be sure to indicate the source.

My writing skills are limited and writing this took effort, so please be kind if it is not to your taste. If anything in the article or the code is described incorrectly, please do not hesitate to point it out.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
