Simulating Ajax to implement a web crawler: HtmlUnit

Source: Internet
Author: User
Tags: throw exception

I was recently using Jsoup to crawl data from a website, but some of its pages are generated dynamically by Ajax requests. When I asked in a group chat, an expert there said that simulating the Ajax requests would work. I searched online, found this article, and decided to try it first.

The reposted article follows:

There are many ways to implement a web crawler, but many of them do not support Ajax. As Brother Li said: "Simulation is the kingly way." Indeed, if you can emulate a browser without a user interface, what can't you do? There are many frameworks for handling Ajax sites; I chose HtmlUnit (official website: http://htmlunit.sourceforge.net/). HtmlUnit can be described as a Java browser without a graphical interface: it can do almost anything a browser does, and much of that is neatly wrapped for you. The code below is the result of several days of hard work, recorded here for reference.

package com.lanyotech.www.wordbank;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;

public class WorldBankCrawl {

    private static final String TARGET_URL = "http://databank.worldbank.org/ddp/home.do";

    // NOTE: the letter case of the site-specific element ids, frame names and
    // JavaScript function names below was damaged when this listing was
    // reposted. They are case-sensitive, so check them against the live page.

    public static void main(String[] args)
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // Simulate a browser
        WebClient webClient = new WebClient();
        // Configure the WebClient (older setter API; newer HtmlUnit releases
        // expose most of these through webClient.getOptions())
        webClient.setJavaScriptEnabled(true);
        webClient.setCssEnabled(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setTimeout(35000);
        webClient.setThrowExceptionOnScriptError(false);

        // Open the target URL in the simulated browser
        HtmlPage rootPage = webClient.getPage(TARGET_URL);
        // Get the database selection box
        HtmlSelect hs = (HtmlSelect) rootPage.getElementById("Lstcubes");
        // Select the first database
        hs.getOption(0).setSelected(true);

        // Simulate clicking the Next button by running the JS event it triggers
        System.out.println("Jumping...");
        ScriptResult sr = rootPage.executeJavaScript("javascript:setcubedata(2,-1,4,'/ddp');");

        // Second page: select countries
        HtmlPage countrySelect = (HtmlPage) sr.getNewPage();
        // Get the frame page holding the selection box with all countries
        HtmlPage framePage = (HtmlPage) countrySelect.getFrameByName("FrmTree1").getEnclosedPage();
        // Trigger the JS event of the Select All button
        framePage.executeJavaScript("javascript:transferlistall('countrylst','countrylstselected','no');"
                + "setselectedcount('countrylstselected','tdcount');");
        // Trigger the JS event of the Next button
        ScriptResult electricityScriptResult = framePage.executeJavaScript("javascript:wrappersetcube('/ddp')");
        System.out.println("Jumping...");

        // Third page: select the electricity series
        HtmlPage electricitySelect = (HtmlPage) electricityScriptResult.getNewPage();
        // Get the frame page for the electricity selection
        HtmlPage electricityFrame = (HtmlPage) electricitySelect.getFrameByName("FrmTree1").getEnclosedPage();
        // Get the selection box
        HtmlSelect seriesSelect = (HtmlSelect) electricityFrame.getElementById("countrylst");
        // Get all of its options
        List<HtmlOption> optionList = seriesSelect.getOptions();
        // Select the desired option
        optionList.get(1).setSelected(true);
        // Simulate clicking the Select button
        electricityFrame.executeJavaScript("javascript:transferlist('countrylst','countrylstselected','no');"
                + "setselectedcount('countrylstselected','tdcount');");
        // Get the box that lists the selected options
        HtmlSelect electricitySelected = (HtmlSelect) electricityFrame.getElementById("countrylstselected");
        List<HtmlOption> list = electricitySelected.getOptions();
        // Simulate clicking Next to jump to the time selection page
        ScriptResult timeScriptResult = electricityFrame.executeJavaScript("javascript:wrappersetcube('/ddp')");
        System.out.println("Jumping...");

        HtmlPage timeSelectPage = (HtmlPage) timeScriptResult.getNewPage();
        // Get the frame page holding the time selection box
        timeSelectPage = (HtmlPage) timeSelectPage.getFrameByName("FrmTree1").getEnclosedPage();
        // Select all time periods
        timeSelectPage.executeJavaScript("javascript:transferlistall('countrylst','countrylstselected','no');"
                + "setselectedcount('countrylstselected','tdcount');");
        // Click the Next button
        ScriptResult exportResult = timeSelectPage.executeJavaScript("javascript:wrappersetcube('/ddp')");
        System.out.println("Jumping...");

        // Go to the export page
        HtmlPage exportPage = (HtmlPage) exportResult.getNewPage();
        // Click the Export button on the page to go to the download page
        ScriptResult downResult = exportPage.executeJavaScript(
                "javascript:exportdata('/ddp','ext_bulk','wdi_time=51||wdi_series=1||wdi_ctry=244||');");
        System.out.println("Jumping...");
        HtmlPage downloadPage = (HtmlPage) downResult.getNewPage();
        // Click the Excel icon to start the download
        ScriptResult downloadResult = downloadPage.executeJavaScript("javascript:exportdata('/ddp','BULKEXCEL');");

        // Download the Excel file; both streams are closed automatically
        try (InputStream is = downloadResult.getNewPage().getWebResponse().getContentAsStream();
             OutputStream fos = new FileOutputStream("D://test.xls")) {
            byte[] buffer = new byte[1024 * 30];
            int len;
            while ((len = is.read(buffer)) != -1) {
                fos.write(buffer, 0, len);
            }
        }
        System.out.println("success!");
    }
}
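The listing above drives the site by passing hand-written JavaScript call strings to executeJavaScript, and a misplaced quote or parenthesis in one of those strings is easy to miss. As a minimal sketch, a small helper could assemble the call strings instead. JsCall, quote and call are hypothetical names introduced here for illustration; they are not part of HtmlUnit, and the function names in the example are copied as they appear in the listing, so their letter case is not guaranteed to match the real page.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of HtmlUnit) that builds JavaScript call strings. */
final class JsCall {
    private JsCall() {}

    /** Quotes a Java string as a single-quoted JavaScript string literal. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash
                case '\'': sb.append("\\'"); break;  // single quote
                case '\n': sb.append("\\n"); break;  // newline
                case '\r': sb.append("\\r"); break;  // carriage return
                default: sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    /** Renders name(arg1,arg2,...); numbers and booleans stay bare, everything else is quoted. */
    static String call(String name, Object... args) {
        StringJoiner joined = new StringJoiner(",", name + "(", ");");
        for (Object arg : args) {
            if (arg instanceof Number || arg instanceof Boolean) {
                joined.add(String.valueOf(arg));
            } else {
                joined.add(quote(String.valueOf(arg)));
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Function names are taken from the listing as transcribed (assumed casing).
        System.out.println(call("setcubedata", 2, -1, 4, "/ddp"));
        System.out.println(call("transferlistall", "countrylst", "countrylstselected", "no"));
    }
}
```

The returned string can then be handed to executeJavaScript in place of a hand-typed literal. The helper only covers numbers, booleans and strings, which is all the listing needs.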

 

Comments:

  /** Request a web page with HtmlUnit */
  WebClient wc = new WebClient();
  wc.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter (the default is true)
  wc.getOptions().setCssEnabled(false); // disable CSS support
  wc.getOptions().setThrowExceptionOnScriptError(false); // do not throw an exception when a script fails
  wc.getOptions().setTimeout(10000); // connection timeout, here 10 s; 0 means wait indefinitely
  HtmlPage page = wc.getPage("http://cq.qq.com/baoliao/detail.htm?294064");

