API Example: Download content extractor with Java/JavaScript

Source: Internet
Author: User
Tags xslt xslt processor python web crawler

1, Introduction

This article explains how to use Java and JavaScript to download a content extractor through the GooSeeker API; it is one of a series of example programs. What is a content extractor, and why download it this way? The idea comes from the open source project "Python instant web crawler", which saves programmers' time by generating content extractors. See the definition of content extractor for details.

2, Download the content extractor in Java

This is one of a series of example programs. Judging by the current state of programming languages, Java is not a good fit for web content extraction: the language is less flexible and convenient, the ecosystem is not very active, and the choice of class libraries grows slowly. In addition, extracting content from JavaScript-driven dynamic web pages is inconvenient in Java and requires a JavaScript engine. To download the content extractor with JavaScript instead, skip to part 3.

Specific implementation

Annotations:

    • The Java class library jsoup (1.8.3 or later) makes it quick and convenient to get the web page DOM.

    • Get the XSLT through the GooSeeker API (refer to "1-minute quick build XSLT for Web page content extraction").

    • Perform the web content conversion with Java's built-in class TransformerFactory.
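Before reading the full program, the last annotation (the TransformerFactory step) can be tried in isolation, without jsoup or network access. The following is only a minimal sketch: the class name XsltStepSketch, the SAMPLE_XSLT stylesheet and the sample page are hypothetical stand-ins, not a real extractor downloaded from GooSeeker, and the input is assumed to be well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltStepSketch {

    // Hypothetical stand-in for a downloaded extractor:
    // it copies the text of the first <h1> into <item><title>.
    static final String SAMPLE_XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/\">"
        + "<item><title><xsl:value-of select=\"//h1\"/></title></item>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the XSLT text to a well-formed XML document and returns the result as a string.
    static String transform(String xml, String xslt) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(xslt)));
        // Leave out the <?xml ...?> header so only the extracted content is returned.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Hypothetical sample page; a real page would first be cleaned into a DOM by jsoup.
        String page = "<html><body><h1>Room wanted</h1><p>details</p></body></html>";
        System.out.println(transform(page, SAMPLE_XSLT));
    }
}
```

A real extractor is a much longer XSLT program, but it is applied in the same way: the stylesheet goes into newTransformer and the page DOM goes in as the source.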

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.UnsupportedEncodingException;
    import java.net.HttpURLConnection;
    import java.net.ProtocolException;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.HashMap;
    import java.util.Map;

    import javax.xml.transform.Source;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    import org.jsoup.Jsoup;
    import org.jsoup.helper.W3CDom;

    // Wrapper class; the name is arbitrary
    public class GsExtractorDemo
    {
        public static void main(String[] args)
        {
            InputStream xslt = null;
            try
            {
                String grabUrl = "http://m.58.com/cs/qiuzu/22613961050143x.shtml"; // URL to crawl
                String resultPath = "F:/temp/xslt/result.xml"; // storage path of the result file
                // Get the XSLT through the GooSeeker API
                xslt = getGsExtractor();
                // Crawl the web page and write the conversion result to the file
                convertXml(grabUrl, xslt, resultPath);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
            finally
            {
                try
                {
                    if (xslt != null)
                        xslt.close();
                }
                catch (IOException e)
                {
                    e.printStackTrace();
                }
            }
        }

        /**
         * @description DOM conversion
         */
        public static void convertXml(String grabUrl, InputStream xslt, String resultPath) throws Exception
        {
            // The doc object here is the Document object in jsoup
            org.jsoup.nodes.Document doc = Jsoup.parse(new URL(grabUrl).openStream(), "UTF-8", grabUrl);
            W3CDom w3cDom = new W3CDom();
            // The w3cDoc object here is the Document object in org.w3c.dom
            org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(doc);
            Source srcSource = new DOMSource(w3cDoc);
            TransformerFactory tFactory = TransformerFactory.newInstance();
            Transformer transformer = tFactory.newTransformer(new StreamSource(xslt));
            // try-with-resources closes the result file after the transformation
            try (OutputStream out = new FileOutputStream(resultPath))
            {
                transformer.transform(srcSource, new StreamResult(out));
            }
        }

        /**
         * @description Get the API return result
         */
        public static InputStream getGsExtractor()
        {
            // API interface
            String apiUrl = "http://www.gooseeker.com/api/getextractor";
            // Request parameters
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("key", "xxx");    // API key applied for in the GooSeeker member center
            params.put("theme", "xxx");  // extractor name, i.e. the name of the rule defined with GooSeeker's MS tool
            params.put("middle", "xxx"); // rule number; required if more than one rule is defined under the same rule name
            params.put("bname", "xxx");  // sorting box name; required if the rule contains more than one sorting box
            String httpArg = urlParam(params);
            apiUrl = apiUrl + "?" + httpArg;
            InputStream is = null;
            try
            {
                URL url = new URL(apiUrl);
                HttpURLConnection urlCon = (HttpURLConnection) url.openConnection();
                urlCon.setRequestMethod("GET");
                is = urlCon.getInputStream();
            }
            catch (ProtocolException e)
            {
                e.printStackTrace();
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
            // null if the request failed
            return is;
        }

        /**
         * @description Request parameters
         */
        public static String urlParam(Map<String, Object> data)
        {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Object> entry : data.entrySet())
            {
                try
                {
                    sb.append(entry.getKey()).append("=").append(URLEncoder.encode(entry.getValue() + "", "UTF-8")).append("&");
                }
                catch (UnsupportedEncodingException e)
                {
                    e.printStackTrace();
                }
            }
            return sb.toString();
        }
    }

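The parameter comments in the listing say that middle and bname only need to be filled in when a rule name covers several rules or several sorting boxes. A possible variant of the URL-building step is sketched below, on the assumption (not confirmed by the API description here) that the API accepts requests in which those two parameters are simply left out. The class name ExtractorUrlSketch and the method buildUrl are hypothetical names introduced for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractorUrlSketch {

    // API address taken from the listing above.
    static final String API_URL = "http://www.gooseeker.com/api/getextractor";

    // Builds the GET URL; parameters that are null or empty are skipped
    // (assumption: the API treats middle and bname as optional).
    static String buildUrl(String key, String theme, String middle, String bname)
            throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("key", key);
        params.put("theme", theme);
        params.put("middle", middle);
        params.put("bname", bname);

        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty()) {
                continue;
            }
            // "?" before the first parameter, "&" before the others, so no separator trails.
            query.append(query.length() == 0 ? "?" : "&");
            query.append(entry.getKey()).append('=').append(URLEncoder.encode(value, "UTF-8"));
        }
        return API_URL + query;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "myKey" and "my rule" are placeholder values, not real credentials or rule names.
        System.out.println(buildUrl("myKey", "my rule", null, null));
    }
}
```

URL-encoding matters most for the theme value, since rule names may contain spaces or non-ASCII characters.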
3, Download the content extractor with JavaScript

Note that if the JavaScript code in this example runs inside an ordinary web page, cross-domain restrictions prevent it from crawling pages from other sites. It therefore has to run on a privileged JavaScript engine, such as a browser extension, a self-developed browser, or a JavaScript engine embedded in your own program.

To make the experiment easier, this example still runs in a web page. To get around the cross-domain problem, the target page is saved locally and modified by inserting the JavaScript into it. These manual steps are only for the sake of the experiment; for real use, other means need to be considered.

Specific implementation

Annotations:

    • Reference the jQuery class library (jQuery 1.9.0 or later).

    • Insert the JavaScript code into the target web page.

    • Use the GooSeeker API to download the content extractor. The content extractor is an XSLT program; the following example uses jQuery's ajax method to get the XSLT.

    • Extract the content with an XSLT processor.

Below is the source code:

The target page URL is http://m.58.com/cs/qiuzu/22613961050143x.shtml. Save it locally as an HTML file beforehand and insert the following code:

    $(document).ready(function () {
        $.ajax({
            type: "get",
            // Replace YOUR_APP_KEY with your API key and YOUR_RULE_THEME_NAME with the rule theme name to request
            url: "http://www.gooseeker.com/api/getextractor?key=YOUR_APP_KEY&theme=YOUR_RULE_THEME_NAME",
            dataType: "xml",
            success: function (xslt) {
                var result = convertXml(xslt, window.document);
                // Serialize the result document so that the alert shows the XML text
                alert("Result: " + new XMLSerializer().serializeToString(result));
            }
        });
    });

    /* Convert the DOM to an XML object with XSLT */
    function convertXml(xslt, dom) {
        // Define the XSLTProcessor object
        var xsltProcessor = new XSLTProcessor();
        xsltProcessor.importStylesheet(xslt);
        // Use the transformToDocument method
        var result = xsltProcessor.transformToDocument(dom);
        return result;
    }

The returned results are as follows:
[Figure: screenshot of the returned result]

4, Outlook

You can also use Python to get the content of a specified web page; Python's syntax feels more concise. A Python example will be added later, and anyone interested is welcome to join the study.

5, related documents

1, Python instant web crawler: API description

6, GooSeeker source code download

1, GooSeeker open source Python web crawler GitHub source

7, Document modification history

1, 2016-06-27: V1.0


This article is from the "Fullerhua blog"; reprinting is declined.
