Recently, while using Jsoup to crawl data from a website, I found that some pages are generated dynamically by AJAX requests. I asked in a chat group, and one of the experts there said that simulating the Ajax requests would do the job. I searched online, found this article, and decided to try it first.
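One way to read "simulating the Ajax request" is to call the endpoint that the page's script calls, instead of loading the page itself. The following is only a minimal sketch of that idea, assuming Java 11 or later; the class name `AjaxRequestSketch`, the method `buildAjaxPost`, the example.com URL and the `page=1` form body are all made up for illustration and do not come from any real site.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class AjaxRequestSketch {

    /**
     * Builds a POST request that resembles a browser-issued XMLHttpRequest.
     * The endpoint and form body are supplied by the caller; finding the real
     * ones usually means watching the browser's network panel.
     */
    static HttpRequest buildAjaxPost(String endpoint, String formBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .timeout(Duration.ofSeconds(35))
                // Many servers use this header to recognise Ajax calls
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and parameters, used only to show the request shape
        HttpRequest request = buildAjaxPost("http://example.com/ajax/list", "page=1");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Sending the request with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` would return the raw response, which could then be handed to Jsoup or a JSON parser. This only works when the endpoint and its parameters can be worked out by hand; when the page's own scripts have to run, a browser simulation is the more general approach.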
Reposted as follows:
There are many web crawler implementations described online, but most of them do not support Ajax. As Brother Li said, "Simulation is the way to go." Indeed, if you can emulate a browser without a user interface, what is there that you cannot do? There are many frameworks for handling Ajax sites; I chose HtmlUnit (official website: http://htmlunit.sourceforge.net/). HtmlUnit can be described as a headless browser written in Java: it can do almost anything, and much of its functionality is neatly encapsulated. The code below is the result of several days of hard work, recorded here.
package com.lanyotech.www.wordbank;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;

// Element IDs, frame names and JavaScript function names below are specific to the
// target site. JavaScript is case-sensitive, so check them against the page source.
public class WorldBankCrawl {

    private static String TARGET_URL = "http://databank.worldbank.org/ddp/home.do";

    public static void main(String[] args)
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // Simulate a browser
        WebClient webClient = new WebClient();
        // Configure the WebClient (the snippet in the comments section sets
        // the same options through getOptions())
        webClient.setJavaScriptEnabled(true);
        webClient.setCssEnabled(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setTimeout(35000);
        webClient.setThrowExceptionOnScriptError(false);

        // Open the target URL in the simulated browser
        HtmlPage rootPage = webClient.getPage(TARGET_URL);
        // Get the database selection box
        HtmlSelect hs = (HtmlSelect) rootPage.getElementById("lstcubes");
        // Select the first database, as required
        hs.getOption(0).setSelected(true);
        // Simulate a click on the Next button to jump to the second page
        System.out.println("Jumping ...");
        // Run the JS event that the button triggers
        ScriptResult sr = rootPage.executeJavaScript("javascript:setcubedata(2,-1,4,'/DDP');");

        // Second page: select countries
        HtmlPage countrySelect = (HtmlPage) sr.getNewPage();
        // Get the frame page holding the selection box with all countries
        HtmlPage framePage = (HtmlPage) countrySelect.getFrameByName("frmTree1").getEnclosedPage();
        // Trigger the JS event of the Select All button
        framePage.executeJavaScript("javascript:transferlistall('countrylst','countrylstselected','no');"
                + "setselectedcount('countrylstselected','tdcount');");
        // Trigger the JS event of the Next button
        ScriptResult electricityScriptResult = framePage.executeJavaScript("javascript:wrappersetcube('/DDP')");
        System.out.println("Jumping ...");

        // Next page: select the electricity series
        HtmlPage electricitySelect = (HtmlPage) electricityScriptResult.getNewPage();
        // Get the iframe with the electricity selection
        HtmlPage electricityFrame = (HtmlPage) electricitySelect.getFrameByName("frmTree1").getEnclosedPage();
        // Get the selection box
        HtmlSelect seriesSelect = (HtmlSelect) electricityFrame.getElementById("countrylst");
        // Get all options of the selection box
        List<HtmlOption> optionList = seriesSelect.getOptions();
        // Select the required option
        optionList.get(1).setSelected(true);
        // Simulate a click on the Select button
        electricityFrame.executeJavaScript("javascript:transferlist('countrylst','countrylstselected','no');"
                + "setselectedcount('countrylstselected','tdcount');");
        // Get the box that lists the selected items
        HtmlSelect electricitySelected = (HtmlSelect) electricityFrame.getElementById("countrylstselected");
        List<HtmlOption> list = electricitySelected.getOptions();
        // Simulate a click on the Next button to jump to the time selection page
        ScriptResult timeScriptResult = electricityFrame.executeJavaScript("javascript:wrappersetcube('/DDP')");
        System.out.println("Jumping ...");

        HtmlPage timeSelectPage = (HtmlPage) timeScriptResult.getNewPage();
        // Get the frame page with the time selection box
        timeSelectPage = (HtmlPage) timeSelectPage.getFrameByName("frmTree1").getEnclosedPage();
        // Select all time periods
        timeSelectPage.executeJavaScript("javascript:transferlistall('countrylst','countrylstselected','no');"
                + "setselectedcount('countrylstselected','tdcount');");
        // Click the Next button
        ScriptResult exportResult = timeSelectPage.executeJavaScript("javascript:wrappersetcube('/DDP')");
        System.out.println("Jumping ...");

        // Export page
        HtmlPage exportPage = (HtmlPage) exportResult.getNewPage();
        // Click the Export button on the page to go to the download page
        ScriptResult downResult = exportPage.executeJavaScript(
                "javascript:exportdata('/DDP','ext_bulk','wdi_time=51||wdi_series=1||wdi_ctry=244||');");
        System.out.println("Jumping ...");
        HtmlPage downloadPage = (HtmlPage) downResult.getNewPage();
        // Click the Excel icon to start the download
        ScriptResult downloadResult = downloadPage.executeJavaScript("javascript:exportdata('/DDP','BULKEXCEL');");

        // Download the Excel file
        InputStream is = downloadResult.getNewPage().getWebResponse().getContentAsStream();
        OutputStream fos = new FileOutputStream("D://test.xls");
        byte[] buffer = new byte[1024 * 30];
        int len = -1;
        while ((len = is.read(buffer)) != -1) {
            fos.write(buffer, 0, len);
        }
        // Close both the output file and the response stream
        fos.close();
        is.close();
        System.out.println("success!");
    }
}
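The last step of the listing copies the response stream into a file by hand. Below is a minimal sketch of the same copy written with try-with-resources, so that both streams are closed even if the copy fails partway. The names `DownloadSketch` and `saveTo` are illustrative, and a `ByteArrayInputStream` stands in for the stream that HtmlUnit would return, so the sketch runs without any network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class DownloadSketch {

    /**
     * Copies everything from the given stream into the target file and
     * returns the number of bytes written. Both streams are closed when
     * the method returns, whether or not the copy succeeded.
     */
    static long saveTo(InputStream in, Path target) throws IOException {
        try (InputStream source = in;
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[30 * 1024];
            long total = 0;
            int len;
            // read() returns -1 at end of stream
            while ((len = source.read(buffer)) != -1) {
                out.write(buffer, 0, len);
                total += len;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the crawler this would be the response content stream
        Path target = Files.createTempFile("test", ".xls");
        long written = saveTo(new ByteArrayInputStream(new byte[] {1, 2, 3}), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

In the crawler, the first argument would be the stream obtained from `getWebResponse().getContentAsStream()`, and the target would be the path of the Excel file.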
Comments:
/** Requesting a web page with HtmlUnit */
WebClient wc = new WebClient();
wc.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter (true by default)
wc.getOptions().setCssEnabled(false); // disable CSS support
wc.getOptions().setThrowExceptionOnScriptError(false); // whether to throw an exception when a script fails
wc.getOptions().setTimeout(10000); // connection timeout, 10 s here; 0 waits indefinitely
HtmlPage page = wc.getPage("http://cq.qq.com/baoliao/detail.htm?294064");
Simulating Ajax to implement a web crawler with HtmlUnit