PhantomJS Overview

Source: Internet
Author: User
Tags: java se
Through discussions in the crawler and natural language processing group (320349384), I came across PhantomJS, CasperJS, and other frameworks and collection solutions that are newer than HttpClient. A short investigation showed the approach to be feasible, so I spent the three days of the Qingming holiday building on top of it and applied it to a Baidu meta-search information collection project, which achieved the desired effect. The next step is to apply it to Tencent Weibo collection and to a ticket-grabbing project. A step-by-step introduction follows.

First, Introduction to PhantomJS

(1) It is a headless browser based on the WebKit engine. It is a real browser, just without a UI, so human interactions such as clicking and paging have to be designed and implemented in code.

(2) It provides a JavaScript API, so a JS program can interact directly with the WebKit engine. On top of this it can be combined with Java, with Java invoking the JS scripts and related operations. This removes the earlier restriction that a high-quality WebKit-based collector was best developed in C++.

(3) Installation packages are provided for Windows, Linux, Mac, and other operating systems, so collection projects, automated testing projects, and the like can be built on top of it on different platforms.

Second, Introduction to common PhantomJS APIs

Over the past few days of study I went through a lot of material, including the official website, but learning resources are still relatively scarce. Many problems only became clear after repeated testing, which cost a lot of time. Reading the official website together with this post should give better results.

(1) Commonly used built-in objects

    // The system object: command-line arguments, PhantomJS system settings, and similar information.
    var system = require('system');

    // The object that operates on the DOM or web page. This is the core object: it opens the page and receives the page content, requests, and response parameters. Note that require('webpage') returns the module; create() returns the page itself.
    var page = require('webpage').create();

    // The file system object, for file operations on the operating system: read, write, move, copy, delete, and so on.
    var fs = require('fs');
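A minimal sketch of the three objects working together, assuming the script is run as `phantomjs save.js <url> <output-file>`. The script name, the helper `parseArgs`, and the default URL and file name are illustrative assumptions, not part of the original project. The PhantomJS-only part is guarded by a check for the global `phantom` object.

```javascript
// Pick the target URL and output path from the command-line arguments.
// args[0] is the script name; the defaults are illustrative assumptions.
function parseArgs(args) {
    return {
        url: args.length > 1 ? args[1] : 'http://example.com/',
        out: args.length > 2 ? args[2] : 'page.html'
    };
}

// Only runs inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var fs = require('fs');
    var page = require('webpage').create();
    var opts = parseArgs(system.args);

    page.open(opts.url, function (status) {
        if (status === 'success') {
            // page.content holds the HTML of the loaded page.
            fs.write(opts.out, page.content, 'w');
        }
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```

Keeping the argument handling in a plain function makes it easy to check without starting a browser.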

(2) Common API

    // Opens the URL through the page object. The callback fires once the URL has finished opening, that is, once every resource requested by the page has loaded. Ajax requests issued later have no bearing on load completion.
    page.open(url, function (status) {});

    // Runs first after page.open is called. Parameters or functions can be preset here.
    page.onLoadStarted = function () {};

    // Handles the various failures that occur while the page's resources are loading.
    page.onResourceError = function (resourceError) {};

    // Called when the page initiates a request for a resource it needs to load.
    page.onResourceRequested = function (requestData, networkRequest) {};

    // Called for each resource as its response arrives, starting with the HTTP header part. The core callback object is response, from which headers such as cookies can be read.
    page.onResourceReceived = function (response) {};

    // Receives whatever the web page writes to the console while it executes, so that it can be printed.
    page.onConsoleMessage = function (msg) {};

    // PhantomJS has no interface, so an alert cannot pop up directly. Alert calls made while the page executes are delivered to this callback instead.
    page.onAlert = function (msg) {};

    // Called when a JavaScript error occurs while the page executes. Resource-level failures such as a 404 or an unreachable site surface through page.onResourceError instead.
    page.onError = function (msg, trace) {};

    // Called when the URL opened by page.open changes, including redirects that happen while it is opening.
    page.onUrlChanged = function (targetUrl) {};

    // Called once the target URL of page.open has actually finished opening, before the open callback itself. In-page operations such as paging can be done here.
    page.onLoadFinished = function (status) {};

    // Executes the function inside the loaded web page, for operations such as paging, clicking, and scrolling.
    page.evaluate(function () {});

    // Renders the current page state as an image and writes it to the specified file.
    page.render("");
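The callbacks listed above can be wired together roughly as follows. This is a sketch, not the author's collector: the helper `attachHandlers`, the `log` parameter, the URL, and the output file name `page.png` are all assumptions made for illustration.

```javascript
// Attach logging callbacks to a PhantomJS page object.
// attachHandlers and log are hypothetical names used for illustration.
function attachHandlers(page, log) {
    page.onLoadStarted = function () { log('load started'); };
    page.onResourceRequested = function (requestData, networkRequest) {
        log('request: ' + requestData.url);
    };
    page.onResourceReceived = function (response) {
        // Fired more than once per resource; stage is 'start' or 'end'.
        if (response.stage === 'end') {
            log('received: ' + response.status + ' ' + response.url);
        }
    };
    page.onResourceError = function (resourceError) {
        log('resource error: ' + resourceError.url +
            ' (' + resourceError.errorString + ')');
    };
    page.onConsoleMessage = function (msg) { log('console: ' + msg); };
    page.onAlert = function (msg) { log('alert: ' + msg); };
    page.onError = function (msg, trace) { log('page error: ' + msg); };
    page.onUrlChanged = function (targetUrl) { log('url changed: ' + targetUrl); };
    page.onLoadFinished = function (status) { log('load finished: ' + status); };
    return page;
}

// Only runs inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = attachHandlers(require('webpage').create(), function (line) {
        console.log(line);
    });
    page.open('http://example.com/', function (status) {
        if (status === 'success') {
            page.render('page.png'); // screenshot of the current page state
        }
        phantom.exit();
    });
}
```

Logging every callback like this is a quick way to see the order in which events fire for a given site before writing the real collection logic.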


(3) Points to note

1. Distinguish between PhantomJS objects and the objects of the opened web page. Names such as document and window exist on both sides, so pay attention to which scope you are in: code inside page.evaluate runs in the web page, and code outside it runs in PhantomJS. Mixing the two up easily leads to problems that are hard to track down when debugging.

2. The difference between page.injectJs and page.includeJs: the former is aimed at local JS files and is tied to libraryPath, while the latter is aimed at JS files on the network. You will run into this often when bringing in jQuery and other third-party libraries.

3. Encoding. There are two important parameters, --output-encoding and --script-encoding. The former sets the output encoding; the latter sets the encoding of the JS scripts and parameter configuration files in use. For convenience, UTF-8 is recommended for both. Also pay attention to the encoding of the target files being processed, to avoid strange errors that are hard to diagnose.
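The first two notes can be illustrated with a short sketch. The helpers `collectLinks` and `loadLibrary`, the URL, and the local file name `jquery.min.js` are assumptions for illustration only. `collectLinks` runs inside the web page and receives its input as an argument, because it cannot see variables of the PhantomJS script; `loadLibrary` picks injectJs for local paths and includeJs for network URLs.

```javascript
// Runs INSIDE the web page: `document` here is the page's DOM, not a
// PhantomJS object. Closure variables from the outer script are not
// visible, so the selector is passed in as an argument.
function collectLinks(selector) {
    var nodes = document.querySelectorAll(selector);
    var hrefs = [];
    for (var i = 0; i < nodes.length; i++) {
        hrefs.push(nodes[i].href);
    }
    return hrefs; // plain data only; it is serialized on the way back
}

// Hypothetical helper: local files go through injectJs, URLs through includeJs.
function loadLibrary(page, source, done) {
    if (/^https?:\/\//.test(source)) {
        // includeJs fetches the script over the network, asynchronously.
        page.includeJs(source, function () { done(true); });
    } else {
        // injectJs loads a local file synchronously and returns true or
        // false; page.libraryPath is used for lookup if needed.
        done(page.injectJs(source));
    }
}

// Only runs inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        // jquery.min.js is assumed to sit next to this script.
        loadLibrary(page, 'jquery.min.js', function (ok) {
            console.log('library loaded: ' + ok);
            // Extra arguments to evaluate are passed into the page function.
            var links = page.evaluate(collectLinks, 'a') || [];
            console.log(links.join('\n'));
            phantom.exit(0);
        });
    });
}
```

Defining the in-page function separately from the script logic makes it visually obvious which code belongs to which scope.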

Third, Baidu meta-search collector

This is mainly a Java SE + JS + PhantomJS application:

(1) Write the JS script, expose every configurable parameter, and have it take those parameters from a JSON file.

(2) In the Java program, define the relevant parameters and generate the corresponding JSON file.

(3) From Java, use the command-line invocation API to run the phantomjs command, passing in the JS script and the configuration file path. This starts the crawler.

(4) First collect the set of links from the keyword search result pages, then traverse that set to collect the individual target pages.
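The JS side of steps (1) and (3) might look like the sketch below. It is not the author's script: the script name `collect.js`, the helper `parseConfig`, and every configuration field (`keyword`, `maxPages`, `outputDir`) are hypothetical. The Java side is assumed to write the JSON file and then launch the command shown in the comment.

```javascript
// Merge the parameters from the JSON config file over a set of defaults.
// All field names here are hypothetical examples.
function parseConfig(jsonText) {
    var defaults = { keyword: '', maxPages: 1, outputDir: '.' };
    var given = JSON.parse(jsonText);
    var config = {};
    for (var key in defaults) {
        config[key] = given.hasOwnProperty(key) ? given[key] : defaults[key];
    }
    return config;
}

// Assumed invocation from the Java side (file names are illustrative):
//   phantomjs --output-encoding=utf8 --script-encoding=utf8 collect.js config.json
// Only runs inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var fs = require('fs');
    if (system.args.length < 2) {
        console.log('usage: phantomjs collect.js <config.json>');
        phantom.exit(1);
    } else {
        // system.args[0] is the script name, system.args[1] the config path.
        var config = parseConfig(fs.read(system.args[1]));
        console.log('collecting keyword: ' + config.keyword);
        phantom.exit(0);
    }
}
```

Passing a single file path on the command line keeps the Java-to-JS interface small, and unknown keys in the JSON file are simply ignored.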


Fourth, Application summary

Having been tested in the project above, this approach should be very convenient for simulated login on sites such as Weibo, e-commerce sites, Xiaomi, and train ticket booking. The next plan is to combine it with such projects and develop more interesting applications.

You are welcome to join the technology group 320349384, whose theme is crawlers and natural language processing. Further questions and suggestions are welcome.
