Crawling Ajax web pages (Cobra)

Source: Internet
Author: User
http://lobobrowser.org/cobra.jsp

Pages that rely on JavaScript logic are a major obstacle for web crawlers. The DOM tree is complete only after the page's JavaScript has run, and sometimes the crawler needs to parse the DOM tree as modified by that JavaScript. After a lot of searching, I found the open-source project Cobra. Cobra includes a JavaScript engine: it embeds Mozilla's Rhino and uses the Rhino API to interpret and execute the JavaScript embedded in HTML. Test case:

js.html

<html>
<title>test JavaScript</title>
<script language="JavaScript">
var go = function () {
    document.getElementById("GG").innerHTML = "google";
};
</script>
<body onload="javascript:go();">
<a id="GG" onclick="javascript:go();" href="#">Baidu</a>
</body>
</html>
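To see why the script matters, consider what a crawler gets when it reads js.html as plain text without running go(). The following is a minimal sketch using only the JDK; the class StaticAnchorText and its regex-based helper rawAnchorText are hypothetical names introduced for illustration and are not part of Cobra. The markup is held in a string so the sketch runs without a web server.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Cobra.
class StaticAnchorText {

    // Same markup as js.html, inlined so no web server is needed.
    static final String JS_HTML =
        "<html><title>test JavaScript</title>"
        + "<script language=\"JavaScript\">"
        + "var go = function () {"
        + " document.getElementById(\"GG\").innerHTML = \"google\"; };"
        + "</script>"
        + "<body onload=\"javascript:go();\">"
        + "<a id=\"GG\" onclick=\"javascript:go();\" href=\"#\">Baidu</a>"
        + "</body></html>";

    // Returns the text between <a id="..."> and </a> in the raw markup,
    // or null if no such anchor exists. No script is executed here.
    static String rawAnchorText(String html, String id) {
        Pattern p = Pattern.compile(
            "<a\\b[^>]*\\bid\\s*=\\s*\"" + Pattern.quote(id) + "\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Prints the static text, not the text that go() would write.
        System.out.println(rawAnchorText(JS_HTML, "GG"));
    }
}
```

Run as is, this prints Baidu, because the onload handler never fires. Getting the text that the script writes requires an engine that actually executes the JavaScript.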

Test.java

package net.cooleagle.test.cobra;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import org.lobobrowser.html.UserAgentContext;
import org.lobobrowser.html.domimpl.HTMLDocumentImpl;
import org.lobobrowser.html.parser.DocumentBuilderImpl;
import org.lobobrowser.html.parser.InputSourceImpl;
import org.lobobrowser.html.test.SimpleUserAgentContext;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Test {
    private static final String TEST_URI = "http://localhost/js.html";

    public static void main(String[] args) throws Exception {
        UserAgentContext uacontext = new SimpleUserAgentContext();
        DocumentBuilderImpl builder = new DocumentBuilderImpl(uacontext);
        URL url = new URL(TEST_URI);
        InputStream in = url.openConnection().getInputStream();
        try {
            Reader reader = new InputStreamReader(in, "ISO-8859-1");
            InputSourceImpl inputSource = new InputSourceImpl(reader, TEST_URI);
            // Parse the page with Cobra's document builder.
            Document d = builder.parse(inputSource);
            HTMLDocumentImpl document = (HTMLDocumentImpl) d;
            // Read the anchor text from the resulting DOM.
            Element ele = document.getElementById("GG");
            System.out.println(ele.getTextContent());
        } finally {
            in.close();
        }
    }
}

Execution result:

google

The test is successful.
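Because the parser in Test.java returns a standard org.w3c.dom.Document, a crawler can walk the result with the ordinary W3C DOM API. The following is a minimal sketch of collecting every link from such a document. The class AnchorCollector and its methods are hypothetical names, and the JDK's XML parser stands in for Cobra so that the example runs without the Cobra jar; it therefore accepts only well-formed markup. Whether Cobra matches tag names with the same case rules is an assumption to verify against the library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Cobra.
class AnchorCollector {

    // Returns one "text -> href" entry per <a> element, in document order.
    static List<String> collectAnchors(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            result.add(a.getTextContent().trim() + " -> " + a.getAttribute("href"));
        }
        return result;
    }

    // Stand-in for Cobra: parses well-formed markup with the JDK's XML parser.
    static Document parseWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed state of js.html after go() has run.
        String rendered = "<html><body><a id=\"GG\" href=\"#\">google</a></body></html>";
        for (String line : collectAnchors(parseWellFormed(rendered))) {
            System.out.println(line);
        }
    }
}
```

In a real crawler, collectAnchors would receive the Document produced by builder.parse(inputSource) in Test.java instead of the stand-in document built here.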

==============================================================

I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. I was looking for a better engine for extracting the HTML of rendered pages and found the Cobra toolkit that is part of the Lobo project. This project includes the Cobra toolkit, which renders HTML, and the Lobo browser built on this toolkit. The code is pure Java.

My initial comparison of JRex and Cobra found the following salient facts:

  • JRex seems to be an abandoned project while the Lobo project is active. The forums for the Lobo project are more active than those for JRex.
  • While JRex appears to be abandoned, Gecko is a world-class rendering engine. Cobra still seems to be in development.
  • JRex crashes the JVM when loading certain pages, and Cobra does not.
  • Cobra can be run headless while JRex/Gecko cannot. Cobra seems faster since it doesn't have to actually render the HTML page to a graphics context.
  • By default, JRex/Gecko includes a Flash plug-in while Cobra does not. (Since the plug-in mechanism for the Lobo browser requires Java code, plug-ins for other browsers will not work. Until a Java Flash plug-in is available, Cobra will not handle Flash.) The JavaScript in some pages will cause a modified page to be loaded if Flash isn't present. In some data mining tasks, being able to examine the <object> and <embed> tags is useful, and this might not be available in Cobra unless a plug-in for Flash is installed.
  • JRex/Gecko seems to handle less well-formed HTML than Cobra. A missing <HTML> or
