In the previous section we learned about Selenium, which lets Python drive a browser and is best suited to web automation testing. For a crawler, though, a browser with a graphical interface is not a good fit. So what can we do? Don't worry: this section introduces a browser without an interface, PhantomJS. It is a headless, scriptable WebKit browser engine that supports a variety of web standards: DOM manipulation, CSS selectors, JSON, Canvas, and SVG.
You may ask: why use a browser to crawl page data, instead of fetching the page directly with urllib and parsing it as before? The reason is that urllib only returns the raw HTML of a page, whereas many pages use JS to render their content, for example by fetching data through Ajax and then adding new elements to the page. With urllib alone we cannot execute the JS code in the page, so we need a tool that gives us the fully rendered page without losing data. That tool is PhantomJS.
1. Installing PhantomJS
There are two ways to install PhantomJS: download the source code and compile it yourself, or download a precompiled binary. Compiling takes a long time and requires a lot of disk space, so downloading the binary is recommended. Select the download for your platform at the link below.
http://phantomjs.org/download.html
My environment is CentOS 7 x64, so I downloaded phantomjs-2.1.1-linux-x86_64.tar.bz2.
I installed it under /usr/local: first extract the archive into /usr/local/, then rename the directory, and finally make the phantomjs executable available on the PATH through a symbolic link in /usr/local/bin.
sudo tar -jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /usr/local/
sudo mv /usr/local/phantomjs-2.1.1-linux-x86_64 /usr/local/phantomjs
sudo ln -s /usr/local/phantomjs/bin/phantomjs /usr/local/bin/phantomjs
To test the installation, run phantomjs -v in any directory and check that it prints the version number.
2. Hello World
Create a file named hello.js with the following content:
console.log('Hello, world!');
phantom.exit();
Run it with phantomjs hello.js.
The program prints Hello, world! and the second statement terminates PhantomJS.
Note: the phantom.exit(); statement is very important. Without it the program never terminates.
3. Page loading
PhantomJS can be used to load a page. The following example loads a page and saves it as an image.
First create a webpage object, then open the site's home page and check the response status. If loading succeeded, save the page as example.png.
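The steps above can be sketched as follows. This is a minimal sketch rather than the original listing: the target URL, the output file name, and the helper loadedOk are assumptions chosen for illustration. The PhantomJS calls are wrapped in a check for the phantom object so that the helper can also be loaded outside PhantomJS.

```javascript
// render.js: run with `phantomjs render.js`
// TARGET_URL and OUTPUT_FILE are illustrative assumptions.
var TARGET_URL = 'http://example.com/';
var OUTPUT_FILE = 'example.png';

// Hypothetical helper: page.open passes the string 'success' or 'fail'
// to its callback; turn that into a boolean.
function loadedOk(status) {
    return status === 'success';
}

// Only touch the PhantomJS API when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open(TARGET_URL, function (status) {
        console.log('Status: ' + status);
        if (loadedOk(status)) {
            // Save the rendered page as an image.
            page.render(OUTPUT_FILE);
        }
        // Always exit, otherwise the script never terminates.
        phantom.exit();
    });
}
```

Calling phantom.exit() inside the callback, not after page.open, matters because the page is loaded asynchronously.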
After the script runs, a file named example.png is generated.
You can also set the size of the window and of the image: viewportSize sets the browser's window size, and clipRect sets the region of the page that is captured in the image.
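A possible way to apply both settings is sketched below. The sizes, the URL, and the helper makeClipRect are assumptions for illustration; viewportSize and clipRect are properties of the webpage object.

```javascript
// size.js: run with `phantomjs size.js`
// The dimensions and the URL below are illustrative assumptions.
var VIEWPORT = { width: 1024, height: 768 };

// Hypothetical helper: build a clipRect that starts at the top-left corner.
function makeClipRect(width, height) {
    return { top: 0, left: 0, width: width, height: height };
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Size of the virtual browser window used for layout.
    page.viewportSize = VIEWPORT;
    // Region of the page that ends up in the image.
    page.clipRect = makeClipRect(1024, 768);
    page.open('http://example.com/', function (status) {
        if (status === 'success') {
            page.render('example.png');
        }
        phantom.exit();
    });
}
```

Both properties are set before page.open so that they are already in effect when the page is laid out and rendered.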
3. Testing page load speed
This example measures the loading speed of a page and also uses command-line arguments. The program checks the number of arguments and terminates if there are not enough. It then records the time before opening the page, records the current time again once the page has been loaded, and reports the difference as the page load time.
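The procedure described above might look like the sketch below. The script name loadspeed.js, the usage message, and the helper parseAddress are assumptions for illustration; system.args is the PhantomJS list of command-line arguments, with the script name in the first position.

```javascript
// loadspeed.js: run with `phantomjs loadspeed.js <URL>`

// Hypothetical helper: args[0] is the script name, args[1] the URL.
// Returns null when no URL was given.
function parseAddress(args) {
    return args.length >= 2 ? args[1] : null;
}

if (typeof phantom !== 'undefined') {
    var system = require('system');
    var address = parseAddress(system.args);
    if (address === null) {
        console.log('Usage: phantomjs loadspeed.js <URL>');
        phantom.exit(1);
    } else {
        var page = require('webpage').create();
        var start = Date.now();  // time just before the request
        page.open(address, function (status) {
            if (status !== 'success') {
                console.log('Failed to load ' + address);
            } else {
                // Difference between now and the start time, in milliseconds.
                console.log('Loading ' + address);
                console.log('Loading time: ' + (Date.now() - start) + ' msec');
            }
            phantom.exit();
        });
    }
}
```

For example, phantomjs loadspeed.js http://www.baidu.com would report the time taken to load that page.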
Testing the load speed of Baidu this way gives a time that includes JS rendering, and the result of course also depends on network speed.
4. Code Evaluation
Using the evaluate method we can run JavaScript code in the context of the web page. This execution is "sandboxed": the code cannot access JavaScript objects and variables outside the page's own context. The evaluate method can return an object, but the return value is limited to simple objects and cannot contain functions or closures.
Below is an example that shows the title of the page, where url holds the address of the page to open:
var page = require('webpage').create();
page.open(url, function(status) {
    var title = page.evaluate(function() {
        return document.title;
    });
    console.log('Page title is ' + title);
    phantom.exit();
});
Console messages that come from the web page, including those from code inside evaluate(), are not displayed by default.
To display them, override this behavior with the onConsoleMessage callback function.
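A sketch of the title example changed to use onConsoleMessage is given below. The URL, the message prefix, and the helper formatConsoleMessage are assumptions for illustration.

```javascript
// console.js: run with `phantomjs console.js`

// Hypothetical helper: prefix messages that originate inside the page.
function formatConsoleMessage(msg) {
    return 'Page console: ' + msg;
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Forward console output from the page to the PhantomJS console.
    page.onConsoleMessage = function (msg) {
        console.log(formatConsoleMessage(msg));
    };
    page.open('http://example.com/', function (status) {
        page.evaluate(function () {
            // This runs inside the page; without the callback above
            // its output would not be shown.
            console.log(document.title);
        });
        phantom.exit();
    });
}
```

The callback is assigned before page.open so that messages logged while the page loads are forwarded as well.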
5. Network Monitoring
Because PhantomJS can inspect network traffic, it is also suitable for analyzing network behavior.
Override the onResourceRequested and onResourceReceived callback functions to monitor each resource request and each received response. The request and response information can then be printed in JSON form.
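One way to wire up the two callbacks is sketched below. The URL and the helper describe are assumptions for illustration; the callbacks receive plain data objects, which is why they can be passed to JSON.stringify.

```javascript
// netlog.js: run with `phantomjs netlog.js`

// Hypothetical helper: label a request/response object and
// pretty-print it as JSON with 4-space indentation.
function describe(label, data) {
    return label + ' ' + JSON.stringify(data, undefined, 4);
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Called for every resource the page asks for.
    page.onResourceRequested = function (request) {
        console.log(describe('Request', request));
    };
    // Called as responses arrive.
    page.onResourceReceived = function (response) {
        console.log(describe('Receive', response));
    };
    page.open('http://example.com/', function () {
        phantom.exit();
    });
}
```

Even a simple page can produce a lot of output, since every image, stylesheet, and script triggers both callbacks.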
6. Manipulating the DOM
Scripts run as if they were in a browser, so standard JavaScript DOM manipulation and CSS selectors work as well.
For example, a script can change the User-Agent, open httpuseragent.org (a site that reports the User-Agent of the current visitor), then find the page element that displays the User-Agent and print its value.
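The example described above could be written roughly as follows. The User-Agent string 'SpecialAgent', the element id 'myagent', and the helper readTextById are assumptions for illustration; the id depends on the current markup of the target site and may need adjusting.

```javascript
// useragent.js: run with `phantomjs useragent.js`

// Hypothetical helper, executed inside the page via page.evaluate:
// return the text of the element with the given id, or null.
function readTextById(id) {
    var el = document.getElementById(id);
    return el ? el.textContent : null;
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    console.log('The default user agent is ' + page.settings.userAgent);
    // Replace the User-Agent sent with every request.
    page.settings.userAgent = 'SpecialAgent';
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            // 'myagent' is an assumed element id on the target page.
            var ua = page.evaluate(readTextById, 'myagent');
            console.log(ua);
        }
        phantom.exit();
    });
}
```

The id is passed as an extra argument to page.evaluate because the function runs in the page's sandbox and cannot see variables from the outer script.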
7. Using additional libraries
Since version 1.6, PhantomJS allows external JS libraries to be added to a page. For example, a script can include jQuery and then execute jQuery code.
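A minimal sketch of including jQuery is shown below. The target URL, the jQuery version, the CDN address, and the helper jqueryUrl are assumptions for illustration; includeJs loads the script into the page and then runs the callback.

```javascript
// jquery.js: run with `phantomjs jquery.js`

// Hypothetical helper: build a CDN URL for a given jQuery version.
function jqueryUrl(version) {
    return 'http://ajax.googleapis.com/ajax/libs/jquery/' + version + '/jquery.min.js';
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) {
        if (status !== 'success') {
            console.log('Unable to load the page');
            phantom.exit(1);
            return;
        }
        // Inject jQuery into the page, then use it inside evaluate.
        page.includeJs(jqueryUrl('1.6.1'), function () {
            var linkCount = page.evaluate(function () {
                return $('a').length;  // jQuery code running in the page
            });
            console.log('Number of links: ' + linkCount);
            phantom.exit();
        });
    });
}
```

The jQuery code has to run inside page.evaluate, because the library is loaded into the page and not into the PhantomJS script itself.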