Executing the PhantomJS API from Selenium and getting the execution results

Last Update: 2017-04-15 Source: Internet

Author: User

New blog address: http://bendawang.site/article/selenium%E5%9C%A8%E7%9B%AE%E6%A0%87%E9%A1%B5%E9%9D%A2%E6%89%A7%E8%A1%8CPHANTOMJS%E7%9A%84API%E5%B9%B6%E8%8E%B7%E5%8F%96%E8%BF%94%E5%9B%9E%E5%80%BC (PS: for the short term, CSDN and the new blog will be updated in sync)

Objective

I recently needed to write a small script that crawls a site's sitemap together with the parameters of each link. Existing crawlers, whatever language they are written in, can hardly capture parameters at all, so I thought the problem through and am writing up a short summary first.

I originally thought a sitemap crawler like this would be simple to write. Only after thinking it over did I see where it gets nasty: the critical part is extracting the parameters, which is far too much trouble... It was only then that I appreciated how powerful AWVS is...

If we want the sitemap of a site and also want to crawl the parameters of each link, the URLs come from roughly the following sources:

1, forms and links (such as href attributes) that already exist in the page, together with their parameters. These are relatively simple, but POST versus GET has to be considered.
2, forms and href links in DOM nodes generated by JS.
3, requests initiated by JS, such as AJAX requests.
4, a click that calls JS to send a request, or a click that generates a form or other DOM node, followed by another click that makes JS send the request. For example, the following code:

    <div>
        <input id="searchTitle" name="searchTitle" value="" type="text">
        <div class="button" onclick="javascript:searchWeb();"></div>
    </div>

5, JS requests triggered after a delay through setTimeout, for example setTimeout("request()", 2000);. I do not have a very good solution for this kind, but there is a preliminary approach, described later.
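
The first source in the list (forms and href links already present in the markup) can be sketched with Python's standard html.parser. This is only a minimal illustration under stated assumptions, not the script the article is working towards: the class name LinkFormParser and the sample markup are hypothetical, and the parser only sees static HTML, so the other four sources still need a browser that executes JS.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, parse_qs


class LinkFormParser(HTMLParser):
    """Collect href links and form definitions from static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # list of (absolute_url, {param: [values]})
        self.forms = []    # list of {"action", "method", "fields"}
        self._form = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            self.links.append((url, parse_qs(urlsplit(url).query)))
        elif tag == "form":
            self._form = {
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                "method": (attrs.get("method") or "get").upper(),
                "fields": [],
            }
            self.forms.append(self._form)
        elif tag in ("input", "textarea", "select") and self._form is not None:
            # Only named controls are submitted with the form.
            if attrs.get("name"):
                self._form["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = None


# Hypothetical sample page
sample = """
<a href="/web/index.php?id=1&page=2">next</a>
<form action="/web/login.php" method="post">
  <input name="username"><input name="password">
</form>
"""
parser = LinkFormParser("http://127.0.0.1:8000/")
parser.feed(sample)
```

After feeding the page, parser.links holds each link with its GET parameters and parser.forms holds each form with its method, so the POST versus GET distinction is kept.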

For now these are the five categories I can think of. There are certainly cases I have not considered, and the actual code has not been written yet; I am recording my thoughts first. If anyone experienced is interested, please do contact me... Orz.

Because the first half of my project is written in Python, I need to solve these five problems in Python as well, so the best choice is inevitably selenium plus phantomjs, although I would actually prefer to write against native phantomjs.
With phantomjs the first and second problems become the same: matching against the rendered page is enough, because phantomjs executes the page's JS for us first.

The third problem is also fairly easy: the native phantomjs API onResourceRequested lets us monitor every request the page sends.

The fourth problem I cannot solve completely with my current ideas. We could use phantomjs to send a click event to every DOM node in the page, but then time becomes a big problem, so the initial idea is to send a click event only to the tags that have an onclick event.
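
The "click only what has an onclick" idea could look roughly like the sketch below. It assumes driver is a Selenium WebDriver and uses only its find_elements(by, value) method with an XPath locator; the function name click_onclick_elements is hypothetical. It does not cover handlers attached through addEventListener, and a click that navigates away will make the remaining elements stale, which the sketch simply skips.

```python
def click_onclick_elements(driver):
    """Click every element that carries an onclick attribute.

    `driver` is assumed to be a Selenium WebDriver; only find_elements()
    and each element's click() method are used. Returns the number of
    clicks that succeeded.
    """
    clicked = 0
    for element in driver.find_elements("xpath", "//*[@onclick]"):
        try:
            element.click()
            clicked += 1
        except Exception:
            # Hidden or stale elements raise; skip them and carry on.
            continue
    return clicked
```

Combined with request monitoring, each successful click gives the page a chance to send the request that the handler would normally trigger.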

The fifth problem is probably the most troublesome. My initial idea was to use the onResourceRequested event together with a timeout, letting each page run for a few seconds. In the end I gave up on it and decided to ignore this case, because waiting a few seconds on every page would make the total running time explode.

These are only my initial thoughts, and many of them are still immature.

Making selenium and PhantomJS work together

I had long known about selenium, but never had a reason to use it and so never studied it; PhantomJS I am somewhat more familiar with.

First, a simple program:

    from selenium import webdriver

    service_args = []
    service_args.append('--load-images=no')          # disable image loading
    service_args.append('--disk-cache=yes')          # enable the disk cache
    service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
    d = webdriver.PhantomJS("phantomjs", service_args=service_args)
    d.get("http://xxxxxxxxxxxxxxxxxxxxx")
    print(d.page_source)
    d.quit()

This is enough to send a GET request.

Problem one: no POST requests?

Perhaps I simply do not know selenium well enough, but I went through the API and could not find any way to send a POST request, which seems absurd. If I am wrong, I hope someone will point it out.
I can think of two ways around this. Here is the first; the second is left for later.
The first is to submit the POST request with the requests library, take the resulting cookies, pass them to add_cookie, and then let the driver send a GET request carrying those cookies.
For example:

    from selenium import webdriver
    import requests

    r = requests.session()
    service_args = []
    service_args.append('--load-images=no')          # disable image loading
    service_args.append('--disk-cache=yes')          # enable the disk cache
    service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
    d = webdriver.PhantomJS("phantomjs", service_args=service_args)

    # Log in with requests, then copy the session cookies into the driver
    data = {"username": "123", "password": "123", "login": "1"}
    result = r.post("http://127.0.0.1:8000/web/login.php", data=data)
    cookies = r.cookies.get_dict()
    for i in cookies:
        d.add_cookie({'name': i, 'value': cookies[i], 'path': '/', 'domain': '127.0.0.1'})

    d.get("http://127.0.0.1:8000/web/index.php")
    print(d.page_source)
    d.quit()

The add_cookie function is also rather tricky: both path and domain have to be set, otherwise it sometimes raises an error.
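
To avoid forgetting path or domain, the conversion can be pulled into a small helper. This is just a sketch: the name to_selenium_cookies and the PHPSESSID cookie are hypothetical, and the input is assumed to be a plain {name: value} dict such as the one returned by requests' cookies.get_dict().

```python
def to_selenium_cookies(cookie_dict, domain, path="/"):
    """Turn a {name: value} mapping into dicts suitable for add_cookie().

    Every cookie gets an explicit path and domain, since leaving them
    out is what sometimes makes add_cookie fail.
    """
    return [
        {"name": name, "value": value, "path": path, "domain": domain}
        for name, value in cookie_dict.items()
    ]


# Hypothetical session cookie
cookies = to_selenium_cookies({"PHPSESSID": "abc123"}, "127.0.0.1")
# With a live driver: for cookie in cookies: d.add_cookie(cookie)
```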

The second way: with native PhantomJS we can easily submit a POST request, for example:

    var webPage = require('webpage');
    var page = webPage.create();
    var settings = {
        operation: "POST",
        headers: {},
        data: "username=123&password=123&login=1"
    };
    page.open('http://127.0.0.1:8000/web/login.php', settings, function (status) {
        // console.log(page.content);
        for (var i = 0; i < page.cookies.length; i++) {
            // The separator was lost in the source; "=" is assumed here
            console.log(page.cookies[i].name + "=" + page.cookies[i].value);
        }
    });

So the idea is to call the PhantomJS API directly from Selenium. I will not post the code here; the next section shows how to write it.

Problem two: how to get the result of a PhantomJS API call in Selenium?

Fortunately Selenium has a get_log function. Suppose I want to monitor every request sent by the page 'http://127.0.0.1:8000/web/index.php'. With native phantomjs this is easy:

    var page = require('webpage').create();
    page.onResourceRequested = function (request) {
        console.log('Request ' + request.url);
    };
    ......................

So we just call the PhantomJS API directly from Selenium, as follows:

    from selenium import webdriver
    import requests

    r = requests.session()
    service_args = []
    service_args.append('--load-images=no')          # disable image loading
    service_args.append('--disk-cache=yes')          # enable the disk cache
    service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
    d = webdriver.PhantomJS("phantomjs", service_args=service_args)

    # Log in with requests, then copy the session cookies into the driver
    data = {"username": "123", "password": "123", "login": "1"}
    result = r.post("http://127.0.0.1:8000/web/login.php", data=data)
    cookies = r.cookies.get_dict()
    for i in cookies:
        d.add_cookie({'name': i, 'value': cookies[i], 'path': '/', 'domain': '127.0.0.1'})

    # Register the PhantomJS-specific command and run the script against the current page
    script = "var page=this;page.onResourceRequested = function (request) {console.log(request.url);};"
    d.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
    d.execute('executePhantomScript', {'script': script, 'args': []})

    d.get("http://127.0.0.1:8000/web/index.php")
    print(d.page_source)
    d.quit()

The code above does execute, but written this way there is no way to get the result back.
This is where the get_log function comes in. The improved version is as follows:

    from selenium import webdriver
    import requests

    r = requests.session()
    service_args = []
    service_args.append('--load-images=no')          # disable image loading
    service_args.append('--disk-cache=yes')          # enable the disk cache
    service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
    d = webdriver.PhantomJS("phantomjs", service_args=service_args)

    # Log in with requests, then copy the session cookies into the driver
    data = {"username": "123", "password": "123", "login": "1"}
    result = r.post("http://127.0.0.1:8000/web/login.php", data=data)
    cookies = r.cookies.get_dict()
    for i in cookies:
        d.add_cookie({'name': i, 'value': cookies[i], 'path': '/', 'domain': '127.0.0.1'})

    # Push each requested URL into the browser log instead of printing it
    script = "var page=this;page.onResourceRequested = function (request) {page.browserLog.push(request.url);};"
    d.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
    d.execute('executePhantomScript', {'script': script, 'args': []})

    d.get("http://127.0.0.1:8000/web/index.php")
    print(d.page_source)
    print(d.get_log('browser'))  # read the collected URLs back in Python
    d.quit()

In the JS script we call page.browserLog.push, and in the Python script we call get_log('browser') to read the entries back; that is how the two sides communicate. There are surely other ways, but I did not find any...
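
Once the URLs are back in Python, the parameters can be pulled out with urllib.parse. The sketch below is an assumption-laden illustration: the function name params_from_log and the sample entries are hypothetical, and because the exact shape of the entries returned by get_log('browser') is not shown here, the code accepts either a raw URL string or a dict with a "message" field.

```python
from urllib.parse import urlsplit, parse_qs


def params_from_log(entries):
    """Map each logged URL (without its query string) to its parameter names."""
    result = {}
    for entry in entries:
        # Assumption: an entry is either the raw URL string or a dict
        # whose "message" field holds it.
        message = entry.get("message", "") if isinstance(entry, dict) else str(entry)
        parts = urlsplit(message.strip())
        if parts.scheme not in ("http", "https"):
            continue  # not a URL, e.g. an ordinary console message
        base = parts.scheme + "://" + parts.netloc + parts.path
        names = parse_qs(parts.query, keep_blank_values=True)
        result.setdefault(base, set()).update(names)
    return result


# Hypothetical log entries
sample_log = [
    {"level": "INFO", "message": "http://127.0.0.1:8000/web/search.php?q=test&page=1"},
    "http://127.0.0.1:8000/web/search.php?q=&sort=asc",
    {"level": "WARNING", "message": "not a url"},
]
found = params_from_log(sample_log)
```

Merging parameter names per path keeps the result small even when the same endpoint is requested many times with different values.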

Postscript

Coming back to the small script for crawling sitemaps and request parameters: the more I think about it, the more troublesome it feels. Several of the problems now have a corresponding solution, good or bad, and today's bit of research into calling phantomjs from Python has given me a little more confidence. Still, integrating everything into one script will be a lot of work. Efficiency is one problem; whether it can really capture everything completely is another. I will take it slowly. I did not expect that something I assumed would be no problem would turn into my biggest headache. Finally, native PhantomJS is really much more comfortable to use than selenium. If I had thought through every step at the very beginning, I probably would not have used Python but Node.js. On that note, I have recently been studying some Node.js penetration and attack techniques and will share them later. I hope the experts can offer some guidance.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
