Requirement:
We need to collect pages as they look after JavaScript has run, because some web pages build their content with JS and the raw HTML response does not contain it.
Implementation:
Based on HtmlUnit:
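To illustrate the requirement, here is a minimal sketch of how one might guess that a raw HTML response still needs JavaScript rendering. The class `JsRenderCheck`, its methods, and the `minTextChars` threshold are all hypothetical names introduced for this example, and the rule itself (a script is present but almost no visible text) is only a rough heuristic, not part of HtmlUnit or Nutch.

```java
import java.util.regex.Pattern;

public class JsRenderCheck {
    // Matches <script>...</script> and <style>...</style> blocks, bodies included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    /** Returns the text left after removing scripts, styles and tags. */
    public static String visibleText(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    /**
     * Hypothetical heuristic (an assumption of this sketch): the page
     * contains a script but fewer than minTextChars characters of visible
     * text, so its content is probably filled in by JavaScript.
     */
    public static boolean looksScriptRendered(String html, int minTextChars) {
        boolean hasScript = html.toLowerCase().contains("<script");
        return hasScript && visibleText(html).length() < minTextChars;
    }

    public static void main(String[] args) {
        String staticPage = "<html><body><p>This paragraph is already present "
                + "in the raw HTML response.</p></body></html>";
        String jsPage = "<html><body><div id=\"app\"></div><script>"
                + "document.getElementById('app').innerHTML = 'filled in later';"
                + "</script></body></html>";
        System.out.println(looksScriptRendered(staticPage, 20)); // false
        System.out.println(looksScriptRendered(jsPage, 20));     // true
    }
}
```

A page flagged this way is a candidate for fetching with a JavaScript-capable client instead of a plain HTTP request. The threshold is arbitrary and would need tuning per site.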
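- import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
- import com.gargoylesoftware.htmlunit.WebClient;
- import com.gargoylesoftware.htmlunit.html.HtmlPage;
- public static void getAjaxPage() throws Exception {
-     WebClient webClient = new WebClient();
-     // Run the page's scripts; skip CSS, which is not needed to extract content.
-     webClient.setJavaScriptEnabled(true);
-     webClient.setCssEnabled(false);
-     // Re-synchronize AJAX requests so their results are present in the returned page.
-     webClient.setAjaxController(new NicelyResynchronizingAjaxController());
-     webClient.setTimeout(Integer.MAX_VALUE);
-     // Do not abort when a script on the page throws an error.
-     webClient.setThrowExceptionOnScriptError(false);
-     HtmlPage rootPage = webClient.getPage("http://tt.mop.com/read_14304066_1_0.html");
-     System.out.println(rootPage.asXml());
- }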
Maven dependencies:
- <dependency>
-     <groupId>net.sourceforge.htmlunit</groupId>
-     <artifactId>htmlunit-core-js</artifactId>
-     <version>2.9</version>
-     <scope>compile</scope>
- </dependency>
- <dependency>
-     <groupId>net.sourceforge.htmlunit</groupId>
-     <artifactId>htmlunit</artifactId>
-     <version>2.9</version>
-     <scope>compile</scope>
- </dependency>
Notes:
Nutch plugin: nutch-htmlunit, which replaces Nutch's built-in HTTP fetch component
Using HtmlUnit in Java to crawl JS-rendered pages