Web Automation Testing and Intelligent Crawler Tool: An Introduction to PhantomJS in Practice

Source: Internet
Author: User
Tags: fast, web

Some readers may not have heard of this tool, so let's start with a brief introduction to its background and purpose.

1. What is PhantomJS?

PhantomJS is a WebKit-based, server-side JavaScript API. It provides full web support without requiring a browser, and natively supports a range of web standards such as DOM handling, JavaScript, CSS selectors, JSON, Canvas, and Scalable Vector Graphics (SVG). PhantomJS is scripted in JavaScript or CoffeeScript, which control WebKit's modules for CSS selectors, SVG, HTTP networking, and so on. PhantomJS supports three platforms, Windows, Mac OS, and Linux, and provides a binary package for each.

The usage scenarios for PhantomJS are as follows:

    • Headless web testing: fast web testing without a browser, with support for many test frameworks such as YUI Test, Jasmine, WebDriver, Capybara, QUnit, Mocha, etc.

    • Page automation: access and manipulate web pages using the standard DOM API or JavaScript libraries such as jQuery.

    • Screen capture: programmatically capture page content, including CSS, SVG, and Canvas, for web crawler applications, or build server-side web graphics applications such as screenshot services and vector/raster graphics applications.

    • Network monitoring: automate network performance monitoring, track page loading, and export the monitoring data in the standard HAR format.

PhantomJS has grown a rich ecosystem. Related projects include:

    • CasperJS: an open source navigation scripting and advanced testing tool

    • Poltergeist: a driver for the testing tool Capybara

    • PhantomRobot: a Robot test framework for PhantomJS

    • mocha-phantomjs: a client for the JavaScript test framework Mocha

In addition, the ecosystem includes a number of PhantomJS-based screenshot tools, such as capturejs, pageres, phantomjs-screenshots, manet, and screenshot-app; API bindings for Node.js, Django, PHP, Sinatra, and other languages and frameworks; and many other tools such as Confess, GhostStory, and Grover.

The latest PhantomJS version is currently 2.0. Except for the Linux binary, which has not been released yet, binary and source packages are available for the other platforms. The test environment used in this article is the Windows binary of version 2.0.

2. PhantomJS vs. Selenium

Last year, the article "Introduction and Application of the Web Automation Test Tool Selenium" covered the use and features of Selenium. It is also a web automation testing tool: an acceptance testing tool written by ThoughtWorks specifically for web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9), Mozilla Firefox, Mozilla Suite, and more. Its main features include: browser compatibility testing (check whether your application works well on different browsers and operating systems); functional testing (create regression tests to verify software functionality and user requirements); and automatic recording of actions, with automatic generation of test scripts in languages such as .NET, Java, and Perl.

Anyone who has used Selenium will have noticed that it essentially depends on a browser: every step directly manipulates a graphical browser, so both performance and programmability suffer. PhantomJS, introduced today, is different. Besides offering most of Selenium's functionality, its real strength is that it is a "headless browser": it has no graphical interface and exposes an API directly to programs, so its performance and scriptability are much better than Selenium's. The most important capability of both tools is that they can execute the JavaScript on a page. The popular options today are roughly the following:

    • QtWebKit, with known Python and C++ support

    • PhantomJS, with known JavaScript, CoffeeScript, and Python support; it also uses the WebKit engine

    • SlimerJS, with known JavaScript support; it uses the Gecko engine, the same as Firefox, and can also run on top of Firefox

    • CasperJS, with known JavaScript support; a further wrapper on top of the two above

This capability allows them to be combined with crawler frameworks. It looks like a big wave of intelligent crawlers is heading our way ~ -_-|||

3. In practice: capturing all sub-requests of a page

For a simple introductory tutorial, refer to the official documentation or the links at the end of this article. Suppose we have the following requirement: capture and analyze all the sub-requests that the browser issues while loading a page.

In fact, the PhantomJS example netlog.js already implements this, but the official example does not cope well with a poor network: when the page is complex, requests are easily missed. I made a slight change here:

    var page = require('webpage').create(),
        system = require('system'),
        address;

    if (system.args.length === 1) {
        console.log('Usage: netlog.js <some URL>');
        phantom.exit(1);
    } else {
        address = system.args[1];

        // Print the URL of every request the page issues.
        page.onResourceRequested = function (req) {
            // console.log('requested: ' + JSON.stringify(req, undefined, 4));
            console.log(req.url);
        };

        // page.onResourceReceived = function (res) {
        //     console.log('received: ' + JSON.stringify(res, undefined, 4));
        // };

        page.open(address, function (status) {
            if (status !== 'success') {
                console.log('FAIL to load the address');
            }
            // Wait before exiting so that late sub-requests are still logged.
            // NOTE: the delay value is missing from the original listing;
            // 5000 ms is only a placeholder.
            window.setTimeout(function () {
                phantom.exit(1);
            }, 5000);
        });
    }
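
A fixed delay before exiting is only one way to avoid missing late requests. As a minimal sketch of an alternative, assuming the id field that PhantomJS passes to its resource callbacks, a script could track which requests are still in flight and exit only once none remain. The name createRequestTracker and the sample events below are hypothetical and are not part of netlog.js.

```javascript
// Tracks in-flight requests by id so a script can tell when a page has gone quiet.
// createRequestTracker is a hypothetical helper, not a PhantomJS API.
function createRequestTracker() {
    var pending = {}; // id -> URL of requests that have started but not finished
    var seen = [];    // every requested URL, in request order

    return {
        // Intended to be called from page.onResourceRequested(requestData).
        requested: function (requestData) {
            pending[requestData.id] = requestData.url;
            seen.push(requestData.url);
        },
        // Intended to be called with the id of a finished or failed request.
        finished: function (id) {
            delete pending[id];
        },
        pendingCount: function () {
            return Object.keys(pending).length;
        },
        urls: function () {
            return seen.slice();
        }
    };
}

// Stand-alone demonstration with made-up events (no PhantomJS needed).
var tracker = createRequestTracker();
tracker.requested({ id: 1, url: 'http://example.com/' });
tracker.requested({ id: 2, url: 'http://example.com/app.js' });
tracker.finished(1);
console.log(tracker.pendingCount()); // 1
tracker.finished(2);
console.log(tracker.pendingCount()); // 0
```

In a PhantomJS script, requested would be called from page.onResourceRequested, and finished from page.onResourceReceived (when response.stage is 'end') and from page.onResourceError, each time passing the id of the event. A short grace period after the count reaches zero is probably still wise, since scripts on the page may start new requests later.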

Result:

    timeout phantomjs netlog.js http://bj.fang.ooxx.com
    http://bj.fang.ooxx.com/
    http://include.aifcdn.com/aifang/res/2015042706/b/aifang_web_loupan_list_listindex.css
    http://pages.aifcdn.com/prism/performance.js?v=1416480080
    http://pages.aifcdn.com/js/jquery-1.9.1/jquery-1.9.1.min.js
    http://pages.aifcdn.com/js/aa/bb.js
    http://include.aifcdn.com/aifang/res/2015042706/b/aifang_web_loupan_list_listindex.js
    http://tracklog.ooxx.com/referrer_ooxx_pc.js
    http://ifx.fang.ooxx.com/s?p=918&c=14&o=1&st=ajk
    http://ifx.fang.ooxx.com/s?p=2000&c=14&r=0&sr=0&pa=&o=1&t=&st=ajk
    http://pic1.ajkimg.com/display/xinfang/c43a03221d9fd83d6b409f31909ce19a/160x120.jpg
    http://chart.aifcdn.com/average/price/city/?id=14&w=210&h=100&limit=6&date=20150428025144&logo=1
    http://ifx.fang.ooxx.com/s?p=2001&c=14&o=1&st=ajk
    ...

Another official example, netsniff.js, exports the captured network requests in HAR format so that they can be analyzed visually. Interested readers can refer to that example.
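
To give a rough idea of what such an export involves, here is a minimal sketch that turns one request record and its final response record into a simplified HAR-style entry. It is not the code of netsniff.js: toHarEntry, the creator name, and the sample records are hypothetical, and many fields that a complete HAR file requires (cookies, timings, cache, and so on) are left out.

```javascript
// Build one simplified HAR-style entry from a request record and the
// matching final response record. Field names follow HAR 1.2, but this
// sketch omits many fields that a complete HAR file needs.
function toHarEntry(request, endReply) {
    return {
        startedDateTime: request.time.toISOString(),
        time: endReply.time - request.time, // elapsed milliseconds
        request: {
            method: request.method,
            url: request.url
        },
        response: {
            status: endReply.status,
            statusText: endReply.statusText
        }
    };
}

// Made-up sample records, shaped like the data a resource callback might deliver.
var request = {
    id: 1,
    method: 'GET',
    url: 'http://example.com/',
    time: new Date('2015-04-28T02:51:44.000Z')
};
var endReply = {
    id: 1,
    status: 200,
    statusText: 'OK',
    time: new Date('2015-04-28T02:51:44.120Z')
};

var har = {
    log: {
        version: '1.2',
        creator: { name: 'har-sketch', version: '0.0.1' }, // hypothetical creator
        entries: [toHarEntry(request, endReply)]
    }
};
console.log(JSON.stringify(har, null, 2));
```

The resulting JSON can be saved to a file and, once the missing fields are filled in, loaded into a HAR viewer.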

Note:

(1) PhantomJS's page.settings.resourceTimeout seems to control only the timeout of the parent request of the current page, not the timeout of its sub-requests, so when one request on a page blocks, the whole load hangs. Fortunately, if the sub-request is asynchronous, you can interrupt the run and still get the data collected so far:

    timeout 3 phantomjs netlog.js http://bj.fang.ooxx.com/ | grep tracklog
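
Instead of killing the process from the shell, a script could also look for blocked sub-requests itself. The following is only a sketch under stated assumptions: findStalled is a hypothetical helper, the timestamps are made up, and the 3000 ms limit simply mirrors the 3 seconds used in the command above.

```javascript
// Given the start time (in ms) of each in-flight request, return the URLs
// that have been pending for longer than limitMs.
// findStalled is a hypothetical helper, not a PhantomJS API.
function findStalled(startedAt, nowMs, limitMs) {
    return Object.keys(startedAt).filter(function (url) {
        return nowMs - startedAt[url] > limitMs;
    });
}

// Made-up start times: the page itself started at 0 ms, a sub-request at 2500 ms.
var startedAt = {
    'http://bj.fang.ooxx.com/': 0,
    'http://tracklog.ooxx.com/referrer_ooxx_pc.js': 2500
};

// At 4000 ms, only the first request has been pending for more than 3000 ms.
var stalled = findStalled(startedAt, 4000, 3000);
console.log(stalled); // [ 'http://bj.fang.ooxx.com/' ]
```

In a PhantomJS script, the start times could be recorded in page.onResourceRequested and removed in page.onResourceReceived; a periodic check could then print what has been collected so far and call phantom.exit() once a request is considered stalled.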

(2) Although PhantomJS is relatively mature as of 2.0, some of its documentation and API functionality is still lacking. For example, the documentation for evaluateJavaScript is incomplete, and the function seems to have bugs:

    var webpage = require('webpage');
    var page = webpage.create();

    function add(arg1, arg2) {
        console.log(arg1 * arg2);
    }
    add(2, 3);

    page.evaluateJavaScript('function add(arg1, arg2) { console.log(arg1 * arg2); }; add(2, 3);');
    phantom.exit();

Result:

    6
    SyntaxError: Expected token ')'
      phantomjs://webpage.evaluate():1 in evaluateJavaScript
    SyntaxError: Expected token ')'
      phantomjs://webpage.evaluate():1 in evaluateJavaScript
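
One possible explanation for the SyntaxError, offered as an assumption rather than a confirmed diagnosis: the PhantomJS API reference describes evaluateJavaScript as evaluating a function contained in a string, so a string holding a function declaration followed by a call may simply not be in the expected form. The sketch below wraps the statements in an anonymous function expression first. wrapInFunction is a hypothetical helper, and the PhantomJS call at the end runs only when the script is executed by PhantomJS.

```javascript
// Hypothetical helper: wrap a sequence of statements in an anonymous
// function expression, the form evaluateJavaScript is documented to take.
function wrapInFunction(statements) {
    return 'function () { ' + statements + ' }';
}

var body = 'function add(arg1, arg2) { return arg1 * arg2; } return add(2, 3);';
var script = wrapInFunction(body);

// Outside PhantomJS, check that the string really is a callable function
// expression by parenthesising it and invoking it.
var result = new Function('return (' + script + ')();')();
console.log(result); // 6

// Inside PhantomJS, hand the same string to the page context.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.evaluateJavaScript(script);
    phantom.exit();
}
```

If the wrapped form works where the original string failed, the behavior is a matter of documentation rather than a bug in the function itself.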

4. PhantomJS for Python: ghost.py

In fact, ghost.py for Python has no relation to PhantomJS; it is recommended here for readers who are not familiar with JavaScript.

ghost.py can do the same job, and its overall functionality is similar to that of PhantomJS:

    # coding=utf-8
    # Test utf-8 encoding
    from multiprocessing.pool import Pool
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    from ghost import Ghost
    import time


    def requesturl(url):
        resultstr = url + "\n"
        t1 = time.clock()
        ghost = Ghost()
        try:
            page, resources = ghost.open(url, wait=True, timeout=30)
            req_found_flag = 0
            # List every captured resource whose URL contains "tracklog".
            for index, trackreq in enumerate([res.url for res in resources if "tracklog" in res.url]):
                resultstr = resultstr + str(index) + "\t" + trackreq + "\n"
                req_found_flag = 1
            if req_found_flag == 0:
                resultstr = resultstr + "requests not found tracklog's url: " + url + "\n"
        except Exception, e:
            resultstr = resultstr + str(e) + ": " + url + "\n"
        ghost.exit()
        t2 = time.clock()
        resultstr = resultstr + str(t2 - t1) + ": " + url + "\n"
        print resultstr + "-----------------------" + "\n"


    if __name__ == "__main__":
        ts = time.time()
        urls = [
            'http://bj.ooxx.com/test/',
            'http://bj.ooxx.com/',
            'http://bj.ooxx.com/job.shtml',
            'http://bj.ooxx.com/chuzu/',
            'http://bj.ooxx.com/ershoufang/',
            'http://bj.fang.ooxx.com/?from=58_home_top',
            'http://bj.ooxx.com/ershouche/',
            'http://che.ooxx.com/',
            'http://bj.ooxx.com/sale.shtml',
            'http://bj.ooxx.com/dog/',
            'http://bj.ooxx.com/huangye/',
            'http://pic2.ooxx.com/m58/app58/m_static/home.html',
            'http://bangbang.ooxx.com/jc_pc_homepage_3.html',
            'http://jinrong.ooxx.com/k?from=58_index_ss',
            'http://about.ooxx.com/hr/',
        ]
        # Fetch the pages with a pool of 4 worker processes.
        p = Pool(4)
        p.map(requesturl, urls)
        print("Cost time is: {:.2f}s".format(time.time() - ts))

Result:

    C:\Python27\python.exe F:/sourcedemo/test.py
    http://bj.ooxx.com/test/
    requests not found tracklog's url: http://bj.ooxx.com/test/
    0.585803057247: http://bj.ooxx.com/test/
    -----------------------
    http://bj.ooxx.com/
    requests not found tracklog's url: http://bj.ooxx.com/
    0.619204294693: http://bj.ooxx.com/
    -----------------------
    http://bj.ooxx.com/job.shtml
    0	http://tracklog.ooxx.com/referrer4.js
    1	http://tracklog.ooxx.com/referrer4.js
    2	http://tracklog1.ooxx.com/pc/empty.js.gif?fromid=referrer4&site_name=58&tag=pvstatall&referrer=&type=index&post_count=-1&_trackparams=na&version=a&loadtime=376&window_size=614x454&trackurl={'new_uv':'1','new_session':'1','init_refer':'','gtid':'14301578423960.8530509551055729','cate':'9224','area':'1','pagetype':'index','ga_pageview':'/index/zhaopin/job/'}&rand_id=0.2580734915100038
    1.19415236255: http://bj.ooxx.com/job.shtml
    -----------------------
    ......

Although ghost.py's overall functionality is similar to that of PhantomJS, its compatibility falls well short:

(1) Requests are not optimized: when a page references the same resource several times, ghost.py dutifully issues the request every time instead of requesting it only once.

(2) Compatibility with asynchronous JS code and with code wrapped in functions is insufficient, so the resulting requests or executions cannot be captured. Both of the following forms are problematic in Ghost:

    <script src="//tracklog.ooxx.com/referrer_ooxx_pc.js" type="text/javascript" async defer></script>

    <script type="text/javascript">
    readytodo("$", function () {
        $.getScript('http://tracklog.ooxx.com/referrer4.js');
    });
    </script>

(3) Like PhantomJS, Ghost also handles request timeouts poorly, but the problem seems to be more serious in Ghost: no data can be retrieved until the request completes.

That concludes this introduction to PhantomJS. It mainly used a practical example to show what PhantomJS can do; in real web automation testing or crawling work, some of its other features may well come in handy too ~

5. References:

[1] PhantomJS: an open source, WebKit-based server-side JavaScript API

http://www.infoq.com/cn/news/2015/01/phantomjs-webkit-javascript-api

[2] PhantomJS not waiting for "full" page load

http://stackoverflow.com/questions/11340038/phantomjs-not-waiting-for-full-page-load

[3] PhantomJS webpage timeout

http://stackoverflow.com/questions/16854788/phantomjs-webpage-timeout

http://t.cn/RARvSI4

[4] Is there a library that can parse JS?

http://segmentfault.com/q/1010000000533061

[5] Calling PhantomJS from Java to collect pages generated by Ajax loading

http://blog.csdn.net/imlsz/article/details/24325623

[6] Parsing pages that use JS with Selenium and PhantomJS

http://smilejay.com/2013/12/try-phantomjs-with-selenium/

[7] PhantomJS quick start tutorial

http://duchengjiu.iteye.com/blog/2201868

[8] PhantomJS API

http://phantomjs.org/api/

[9] ghost.py

http://carrerasrodrigo.github.io/Ghost.py/

http://jeanphix.me/Ghost.py/
