1, Introduction
This article explains how to download a content extractor through the GooSeeker API, with example programs written in Java and JavaScript. What is a content extractor, and why do it this way? The idea comes from the Python Instant Web Crawler open-source project: automatically generating the content extractor saves programmers a great deal of time. See the definition of "content extractor" for details.
2, Download the content extractor in Java
This is one of a series of example programs. Among today's programming languages, Java is not particularly well suited to web content extraction: besides being less flexible and convenient, its ecosystem in this area is not very active and the selection of libraries grows slowly. In addition, extracting content from dynamic, JavaScript-driven web pages is inconvenient in Java and requires a JavaScript engine. To download the content extractor with JavaScript instead, skip ahead to Part 3.
Specific implementation
Annotations:
With the Java library jsoup (1.8.3 or above), obtaining the web page DOM is quick and convenient.
Obtain the XSLT through the GooSeeker API (see "Quickly generate XSLT for web content extraction in one minute").
Use Java's built-in TransformerFactory class to perform the web content transformation.
The complete source code is as follows:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

// wrapper class; any class name will do
public class GetExtractorDemo {

    public static void main(String[] args) {
        InputStream xslt = null;
        try {
            String grabUrl = "http://m.58.com/cs/qiuzu/22613961050143x.shtml"; // URL to crawl
            String resultPath = "F:/temp/xslt/result.xml"; // where the extraction result file is stored
            // get the XSLT via the GooSeeker API
            xslt = getGsExtractor();
            // convert the crawled web content and write the result file
            convertXml(grabUrl, xslt, resultPath);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (xslt != null)
                    xslt.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * @description DOM conversion
     */
    public static void convertXml(String grabUrl, InputStream xslt, String resultPath) throws Exception {
        // doc here is a jsoup document object: org.jsoup.nodes.Document
        org.jsoup.nodes.Document doc = Jsoup.parse(new URL(grabUrl).openStream(), "UTF-8", grabUrl);
        W3CDom w3cDom = new W3CDom();
        // w3cDoc here is a W3C DOM document object: org.w3c.dom.Document
        org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(doc);
        Source srcSource = new DOMSource(w3cDoc);
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer(new StreamSource(xslt));
        transformer.transform(srcSource, new StreamResult(new FileOutputStream(resultPath)));
    }

    /**
     * @description get the result returned by the API
     */
    public static InputStream getGsExtractor() {
        // API endpoint
        String apiUrl = "http://www.gooseeker.com/api/getextractor";
        // request parameters
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("key", "xxx");    // API key applied for in the GooSeeker member center
        params.put("theme", "xxx");  // extractor name, i.e. the rule name defined in GooSeeker's MS rule editor
        params.put("middle", "xxx"); // rule number, required if several rules are defined under the same rule name
        params.put("bname", "xxx");  // sorting box name, required if the rule contains more than one sorting box
        String httpArg = urlParam(params);
        apiUrl = apiUrl + "?" + httpArg;
        InputStream is = null;
        try {
            URL url = new URL(apiUrl);
            HttpURLConnection urlCon = (HttpURLConnection) url.openConnection();
            urlCon.setRequestMethod("GET");
            is = urlCon.getInputStream();
        } catch (ProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return is;
    }

    /**
     * @description build the request parameter string
     */
    public static String urlParam(Map<String, Object> data) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> entry : data.entrySet()) {
            try {
                sb.append(entry.getKey()).append("=")
                  .append(URLEncoder.encode(entry.getValue() + "", "UTF-8")).append("&");
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
        return sb.toString();
    }
}
3, Download the content extractor with JavaScript
Note that if the JavaScript code in this example runs inside a web page, it cannot crawl content from pages on other sites because of cross-domain restrictions. It therefore needs to run on a privileged JavaScript engine, such as a browser extension, a self-developed browser, or a JavaScript engine embedded in your own program.
To keep the experiment simple, this example still runs in a web page. To bypass the cross-domain restriction, the target page is saved locally and modified so that the JavaScript can be inserted into it. These manual steps are only for experimentation; for production use, other approaches should be considered.
Specific implementation
Annotations:
- Reference the jQuery library (jQuery 1.9.0 or above)
- Insert the JavaScript code into the target web page
- Use the GooSeeker API to download the content extractor; the content extractor is an XSLT program, and the example below fetches the XSLT with jQuery's ajax method
- Perform content extraction with an XSLTProcessor
Below is the source code:
The target page URL is http://m.58.com/cs/qiuzu/22613961050143x.shtml. Save the page as a local HTML file in advance and insert the following code into it:

$(document).ready(function () {
    $.ajax({
        type: "GET",
        url: "http://www.gooseeker.com/api/getextractor?key=appKey&theme=<rule theme name for the request>",
        dataType: "xml",
        success: function (xslt) {
            var result = convertXml(xslt, window.document);
            // result is a document object; alert it here just to confirm the extraction ran
            alert("result: " + result);
        }
    });
});

/* convert the DOM into an XML object using XSLT */
function convertXml(xslt, dom) {
    // create the XSLTProcessor object
    var xsltProcessor = new XSLTProcessor();
    xsltProcessor.importStylesheet(xslt);
    // transform with transformToDocument
    var result = xsltProcessor.transformToDocument(dom);
    return result;
}
The returned results are as follows
(Screenshot: the XML result returned by the conversion)
4, Outlook
You can also use Python to fetch the content of a specified web page, and Python's syntax is more concise. A Python example will be added to this series later; interested readers are welcome to join the study.
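As a rough preview of that follow-up (not the project's official code), here is a minimal Python sketch of the same workflow: download the XSLT extractor from the GooSeeker API and apply it to a fetched page. It assumes the requests and lxml packages are available; the key, theme, and URL values are placeholders, just like in the Java example above.

# minimal sketch only; assumes requests and lxml are installed
import requests
from lxml import etree

def get_gs_extractor(api_key, theme):
    # download the XSLT content extractor from the GooSeeker API
    resp = requests.get("http://www.gooseeker.com/api/getextractor",
                        params={"key": api_key, "theme": theme})
    return etree.XSLT(etree.fromstring(resp.content))

def extract(page_url, transform):
    # fetch the target page, parse it into a DOM, and apply the XSLT extractor
    html = requests.get(page_url).content
    dom = etree.HTML(html)
    return transform(dom)

if __name__ == "__main__":
    transform = get_gs_extractor("xxx", "xxx")  # placeholders, as in the Java example
    result = extract("http://m.58.com/cs/qiuzu/22613961050143x.shtml", transform)
    print(etree.tostring(result, pretty_print=True, encoding="unicode"))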
5, Related documents
1, Python instant web crawler: API description
6, GooSeeker (Jisouke) source code download
1, GooSeeker open-source Python web crawler: GitHub source
7, Document modification history
1, 2016-06-27: V1.0
This article comes from the "fullerhua" blog. Please do not reprint without permission.
API example: Download the content extractor with Java/JavaScript