Crawling AJAX Web Pages (Cobra)

Last updated: 2018-12-07  Source: Internet

Uploaded by: User


http://lobobrowser.org/cobra.jsp

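Pages that rely on JavaScript logic are a major obstacle for web crawlers. The complete DOM tree exists only after the page's scripts have run, and sometimes it is precisely the DOM as modified by JavaScript that needs to be parsed. After a good deal of searching, I found an open-source project called Cobra. Cobra supports JavaScript: its built-in engine is Mozilla's Rhino, and it uses the Rhino API to interpret and execute the JavaScript embedded in an HTML page. Test case: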

js.html

<html>
<title>test javascript</title>
<script language="javascript">
var go = function(){
    document.getElementById("gg").innerHTML="google";
}
</script>
<body onLoad="javascript:go();">
<a id = "gg" onClick="javascript:go();" href="#">baidu</a>
</body>
</html>

Test.java

package net.cooleagle.test.cobra;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import org.lobobrowser.html.UserAgentContext;
import org.lobobrowser.html.domimpl.HTMLDocumentImpl;
import org.lobobrowser.html.parser.DocumentBuilderImpl;
import org.lobobrowser.html.parser.InputSourceImpl;
import org.lobobrowser.html.test.SimpleUserAgentContext;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Test {

    // The test page js.html is assumed to be served by a local web server.
    private static final String TEST_URI = "http://localhost/js.html";

    public static void main(String[] args) throws Exception {
        // Cobra's DOM builder, configured with a default user-agent context.
        UserAgentContext uacontext = new SimpleUserAgentContext();
        DocumentBuilderImpl builder = new DocumentBuilderImpl(uacontext);

        URL url = new URL(TEST_URI);
        InputStream in = url.openConnection().getInputStream();
        try {
            Reader reader = new InputStreamReader(in, "ISO-8859-1");
            InputSourceImpl inputSource = new InputSourceImpl(reader, TEST_URI);

            // Parse the page into a DOM; Cobra executes the embedded JavaScript.
            Document d = builder.parse(inputSource);
            HTMLDocumentImpl document = (HTMLDocumentImpl) d;

            // Read the link text after the script has had a chance to change it.
            Element ele = document.getElementById("gg");
            System.out.println(ele.getTextContent());
        } finally {
            in.close();
        }
    }
}

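Output:

google

The test succeeds: the static markup contains "baidu", so printing "google" shows that the onLoad script ran before the DOM was read.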
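For contrast, here is a minimal sketch of what a crawler sees when it parses the same page without executing any script. It uses only the JDK's XML parser, which works here only because this particular test page happens to be well-formed XML; real-world HTML usually is not. The class name StaticParseDemo and the helper staticLinkText are hypothetical names introduced for this illustration, and the markup is inlined so that no web server is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class StaticParseDemo {

    // Same markup as js.html, inlined as a string (assumption: no server needed).
    static final String JS_HTML =
        "<html>"
        + "<title>test javascript</title>"
        + "<script language=\"javascript\">"
        + "var go = function(){"
        + "document.getElementById(\"gg\").innerHTML=\"google\";"
        + "}"
        + "</script>"
        + "<body onLoad=\"javascript:go();\">"
        + "<a id = \"gg\" onClick=\"javascript:go();\" href=\"#\">baidu</a>"
        + "</body>"
        + "</html>";

    // Parses the markup with the JDK XML parser (no JavaScript engine involved)
    // and returns the text of the first <a> element.
    static String staticLinkText(String markup) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        Element link = (Element) doc.getElementsByTagName("a").item(0);
        return link.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The script is never run, so the link keeps its original text.
        System.out.println(staticLinkText(JS_HTML)); // prints "baidu"
    }
}
```

The only difference from the Cobra run is that nothing executes go(), so the link text stays "baidu". That gap is what a JavaScript-aware parser closes.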

============================================

I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. I was looking for a better engine for extracting the HTML of rendered pages and found the Cobra Toolkit that is part of the Lobo Project. This project includes the Cobra Toolkit that renders HTML and the LoboBrowser built on this toolkit. The code is pure Java.

My initial comparison of JRex and Cobra found the following salient facts:

JRex seems to be an abandoned project while the Lobo Project is active. The forums for this project are more active than for JRex.
While JRex appears to be abandoned, Gecko is a world-class rendering engine. Cobra still seems to be in development.
JRex crashes the Java JVM when loading certain pages, and Cobra does not.
Cobra can be run headless while JRex/Gecko cannot. Cobra seems faster since it doesn't have to actually render the HTML page to a graphics context.
By default, JRex/Gecko includes a Flash plug-in while Cobra does not. (Since the plug-in mechanism for the LoboBrowser requires Java code, plug-ins for other browsers will not work. Until a Java Flash plug-in is available, Cobra will not handle Flash.) The JavaScript in some pages will cause a modified page to be loaded if Flash isn't present. In some data mining tasks, being able to examine the <OBJECT> and <EMBED> tags is useful and might not be available in Cobra unless a plug-in for Flash is installed.
JRex/Gecko seems to tolerate malformed HTML better than Cobra. A missing <HTML> or <HEAD> tag can cause Cobra to quit before building the complete DOM. But since the LoboBrowser does properly render one of my test pages that Cobra fails on, perhaps this is less of a problem than I think.
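The last two points suggest a cheap pre-check that a crawler could run on the raw markup before handing a page to Cobra. The sketch below is only a heuristic under stated assumptions: it uses regular expressions rather than a real parser, the class and method names (MarkupPrecheck, hasHtmlAndHead, hasPluginTags) are hypothetical, and nothing in the comparison above confirms that pages passing the check will parse cleanly in Cobra.

```java
import java.util.regex.Pattern;

final class MarkupPrecheck {

    // An opening <html> or <head> tag, with or without attributes.
    // The trailing character class keeps <head from matching <header>.
    private static final Pattern HTML_TAG =
        Pattern.compile("<html[\\s>]", Pattern.CASE_INSENSITIVE);
    private static final Pattern HEAD_TAG =
        Pattern.compile("<head[\\s>]", Pattern.CASE_INSENSITIVE);

    // An opening <object> or <embed> tag, as used for Flash content.
    private static final Pattern PLUGIN_TAG =
        Pattern.compile("<(object|embed)[\\s>/]", Pattern.CASE_INSENSITIVE);

    // True if the markup contains both an <html> and a <head> opening tag.
    static boolean hasHtmlAndHead(String markup) {
        return HTML_TAG.matcher(markup).find()
            && HEAD_TAG.matcher(markup).find();
    }

    // True if the raw markup mentions an <object> or <embed> element.
    static boolean hasPluginTags(String markup) {
        return PLUGIN_TAG.matcher(markup).find();
    }

    public static void main(String[] args) {
        String page = "<html><body><EMBED src=\"movie.swf\"></body></html>";
        System.out.println(hasHtmlAndHead(page)); // prints "false" (no <head>)
        System.out.println(hasPluginTags(page));  // prints "true"
    }
}
```

A crawler could log pages that fail hasHtmlAndHead for manual review, and could record hasPluginTags from the raw markup so that <OBJECT> and <EMBED> information is not lost when the rendered DOM omits it.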

This article was originally written in Chinese and published on aliyun.com.
