Executing the PhantomJS API from Selenium and getting the execution results
New blog address: http://bendawang.site/article/selenium%E5%9C%A8%E7%9B%AE%E6%A0%87%E9%A1%B5%E9%9D%A2%E6%89%A7%E8%A1%8CPHANTOMJS%E7%9A%84API%E5%B9%B6%E8%8E%B7%E5%8F%96%E8%BF%94%E5%9B%9E%E5%80%BC (PS: for the short term, CSDN and the new blog will be updated in sync)
Objective
I recently needed to write a small script that crawls a site's sitemap together with the parameters of each link. Existing crawlers, whatever language they are written in, almost never capture parameters, so I thought it over and am writing a simple summary first.
I originally assumed a sitemap crawler like this would be easy to write. Only after thinking it through did I see the hard part: the most critical piece is extracting the parameters, and that is a lot of trouble. Only now do I appreciate how powerful AWVS is.
If we want the sitemap of a site and also the parameters of each link, the URLs come from roughly these sources:
- 1, forms and href links (with their parameters) that already exist in the page. These are relatively simple, but POST versus GET has to be considered.
- 2, forms and href links that JS generates in the DOM.
- 3, requests initiated by JS, such as AJAX requests.
- 4, a click that calls JS to send a request, or a click that generates a form or a DOM node, followed by another click that makes JS send the request. For example, the following code:
<div><input id="searchTitle" name="searchTitle" value="" type="text"> <div class="button" onclick="javascript:searchWeb();"></div> </div>
- 5, JS requests triggered after a delay by the setTimeout function, for example setTimeout("request()", 2000);. I have no really good solution for this kind, only a preliminary approach, described later.
So far these are the five categories I have thought of. There are surely cases I have not considered, and the actual code is not written yet; I am recording my thoughts first. If anyone experienced is interested, please do contact me. Orz.
The first half of my project is written in Python, so these five problems need to be solved in Python as well, and the best choice is then necessarily selenium plus phantomjs. To be honest, I would rather write it in native phantomjs.
With phantomjs, the first and second problems become the same problem: we simply match the forms and links in the page source, because phantomjs executes the page's JS for us first.
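As a rough illustration of that matching step, here is a minimal sketch that pulls links and forms, with their parameter names, out of an HTML string such as the driver's page_source. It uses only the Python 3 standard library; the class name LinkFormExtractor and the function extract are hypothetical names chosen for this example, not something from the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, parse_qsl


class LinkFormExtractor(HTMLParser):
    """Collect <a href> links and <form> definitions from rendered HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # list of (absolute_url, [(param, value), ...])
        self.forms = []    # list of dicts: action, method, fields
        self._form = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
            self.links.append((url, params))
        elif tag == "form":
            self._form = {
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                # A form without a method attribute submits with GET.
                "method": (attrs.get("method") or "get").upper(),
                "fields": [],
            }
            self.forms.append(self._form)
        elif tag in ("input", "select", "textarea") and self._form is not None:
            # Only named controls are submitted with the form.
            if attrs.get("name"):
                self._form["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = None


def extract(html, base_url):
    """Return (links, forms) found in the given HTML string."""
    parser = LinkFormExtractor(base_url)
    parser.feed(html)
    return parser.links, parser.forms
```

Calling extract(d.page_source, d.current_url) after d.get(...) would cover both statically present and JS-generated forms and links, since the page source is read after the scripts have run. The method recorded for each form is what separates the POST cases from the GET cases.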
The third problem is relatively easy: phantomjs's native onResourceRequested API lets us monitor every request sent from the page.
Now the fourth problem. My current ideas probably cannot solve it completely. We could use phantomjs to send a click event to every DOM element on the page, but then time becomes a big problem, so the initial idea is to send a click event only to the tags that have an onclick event.
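A minimal sketch of that idea, assuming a Selenium driver object is passed in: it looks up elements carrying an inline onclick attribute by XPath and clicks each one. The function name click_onclick_elements is hypothetical, and the sketch only sees inline onclick attributes, not handlers attached from script.

```python
def click_onclick_elements(driver):
    """Click every element that has an inline onclick attribute.

    Returns the number of elements that were clicked without an error.
    """
    clicked = 0
    # "xpath" is the locator strategy string behind Selenium's By.XPATH.
    for element in driver.find_elements("xpath", "//*[@onclick]"):
        try:
            element.click()
            clicked += 1
        except Exception:
            # Hidden or stale elements raise; skip them and keep going.
            continue
    return clicked
```

If a click navigates away from the page, the remaining element references go stale and are skipped, so a real crawler would probably reload the page and continue from the next element.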
Then the fifth problem, which should be the most troublesome one. My initial idea was to use the onResourceRequested event together with a timeout, letting the page run for a few seconds. In the end I gave up on that and decided to ignore the problem, because waiting a few seconds on every page would make the total time astronomical.
These are my initial thoughts, and many of them are still immature.
Getting selenium and PhantomJS to work together
I have known about selenium for a long time, but since I never needed it I never studied it. PhantomJS I am slightly more familiar with.
Let's first write a simple program.
from selenium import webdriver

service_args = []
service_args.append('--load-images=no')          # disable image loading
service_args.append('--disk-cache=yes')          # enable the disk cache
service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
d = webdriver.PhantomJS("phantomjs", service_args=service_args)
d.get("http://xxxxxxxxxxxxxxxxxxxxx")
print(d.page_source)
d.quit()
This sends a GET request.
Problem one: no POST requests?
Maybe I just do not know enough, but I went through the API and could not find a way to send a POST request. If I am wrong, please point it out; it really does seem there is no such method, which is maddeningly dumb.
I thought of two ways to solve this. Here is the first; the second comes later.
The first is to submit the POST request with the requests library, take the cookies it returns, hand them to the driver with the add_cookie function, and then let the driver send the GET request carrying those cookies.
A sample follows:
from selenium import webdriver
import requests

r = requests.session()
service_args = []
service_args.append('--load-images=no')          # disable image loading
service_args.append('--disk-cache=yes')          # enable the disk cache
service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
d = webdriver.PhantomJS("phantomjs", service_args=service_args)
# Log in with requests, then copy the session cookies into the driver.
data = {"username": "123", "password": "123", "login": "1"}
result = r.post("http://127.0.0.1:8000/web/login.php", data=data)
cookies = r.cookies.get_dict()
for i in cookies:
    d.add_cookie({'name': i, 'value': cookies[i], 'path': '/', 'domain': '127.0.0.1'})
d.get("http://127.0.0.1:8000/web/index.php")
print(d.page_source)
d.quit()
Note that the add_cookie function is rather finicky: path and domain both have to be set, otherwise it sometimes raises an error.
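To keep path and domain from being forgotten, the conversion can be wrapped in a small helper. This is only a sketch: to_selenium_cookies and fallback_domain are hypothetical names, and it assumes that iterating the requests cookie jar yields cookie objects with name, value, path and domain attributes (the jar is built on the standard library's cookie jar).

```python
def to_selenium_cookies(jar, fallback_domain):
    """Turn cookie objects into dicts that always carry path and domain."""
    result = []
    for c in jar:
        result.append({
            "name": c.name,
            "value": c.value,
            # Fall back to defaults so add_cookie never gets an empty field.
            "path": c.path or "/",
            "domain": c.domain or fallback_domain,
        })
    return result
```

With the session from the sample, the loop would become: for cookie in to_selenium_cookies(r.cookies, "127.0.0.1"): d.add_cookie(cookie).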
Now the second way. With native PhantomJS we can easily submit a POST request, for example:
var webPage = require('webpage');
var page = webPage.create();
var settings = {
  operation: "POST",
  headers: {},
  data: "username=123&password=123&login=1"
};
page.open('http://127.0.0.1:8000/web/login.php', settings, function (status) {
  // console.log(page.content);
  // Print every cookie set by the login response.
  for (var i = 0; i < page.cookies.length; i++) {
    console.log(page.cookies[i].name + "=" + page.cookies[i].value);
  }
});
So the approach is to execute the PhantomJS API directly from Selenium. I will not post the code here; the next section shows how to write it.
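As a hedged sketch of what such a script string might look like, the helper below builds the JavaScript for a form-encoded POST using PhantomJS's page.open(url, method, data, callback) form. The name build_post_script is hypothetical, and whether GhostDriver copes well with a page.open issued this way has not been verified here.

```python
import json
from urllib.parse import urlencode


def build_post_script(url, form):
    """Return JavaScript that opens `url` with a form-encoded POST body."""
    body = urlencode(form)
    # json.dumps produces quoted, escaped string literals usable in JS.
    return (
        "var page = this;"
        "page.open(%s, 'POST', %s, function (status) {});"
        % (json.dumps(url), json.dumps(body))
    )
```

The returned string would be passed as the script when executing PhantomJS code through Selenium, where this refers to the current page.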
Problem two: how to obtain the results of PhantomJS API execution in Selenium?
Fortunately Selenium comes with a get_log function. Suppose I want to monitor every request sent by the page 'http://127.0.0.1:8000/web/index.php'. With native phantomjs this is easy, as follows:
var webPage = require('webpage');
var page = webPage.create();
page.onResourceRequested = function (request) {
  console.log('Request ' + request.url);
};
......................
So we just call the PhantomJS API directly from Selenium, as follows:
from selenium import webdriver
import requests

r = requests.session()
service_args = []
service_args.append('--load-images=no')          # disable image loading
service_args.append('--disk-cache=yes')          # enable the disk cache
service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
d = webdriver.PhantomJS("phantomjs", service_args=service_args)
data = {"username": "123", "password": "123", "login": "1"}
result = r.post("http://127.0.0.1:8000/web/login.php", data=data)
cookies = r.cookies.get_dict()
for i in cookies:
    d.add_cookie({'name': i, 'value': cookies[i], 'path': '/', 'domain': '127.0.0.1'})
# Inside the script, `this` is the current PhantomJS page object.
script = "var page=this;page.onResourceRequested = function (request) {console.log(request.url);};"
# Register the phantom/execute endpoint as an extra driver command.
d.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
d.execute('executePhantomScript', {'script': script, 'args': []})
d.get("http://127.0.0.1:8000/web/index.php")
print(d.page_source)
d.quit()
The code above does execute, but written this way there is no way to get the result back.
This is where the get_log function comes in. The improved version follows:
from selenium import webdriver
import requests

r = requests.session()
service_args = []
service_args.append('--load-images=no')          # disable image loading
service_args.append('--disk-cache=yes')          # enable the disk cache
service_args.append('--ignore-ssl-errors=true')  # ignore HTTPS errors
d = webdriver.PhantomJS("phantomjs", service_args=service_args)
data = {"username": "123", "password": "123", "login": "1"}
result = r.post("http://127.0.0.1:8000/web/login.php", data=data)
cookies = r.cookies.get_dict()
for i in cookies:
    d.add_cookie({'name': i, 'value': cookies[i], 'path': '/', 'domain': '127.0.0.1'})
# Push each requested URL into the browser log instead of the console.
script = "var page=this;page.onResourceRequested = function (request) {page.browserLog.push(request.url);};"
d.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
d.execute('executePhantomScript', {'script': script, 'args': []})
d.get("http://127.0.0.1:8000/web/index.php")
print(d.page_source)
# Read back whatever the script pushed into the browser log.
print(d.get_log('browser'))
d.quit()
In the JS script we call page.browserLog.push, and in the Python script we call get_log('browser') to read the values back; that is how the two sides communicate. There are certainly other ways, but I did not find any... so it goes.
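The two halves of that exchange can be wrapped in small helpers. This is a sketch under assumptions: the function names execute_phantom and logged_messages are hypothetical, and it assumes each entry returned by get_log is a dict with a "message" key, as Selenium log entries normally are.

```python
def execute_phantom(driver, script, args=None):
    """Run `script` through GhostDriver's phantom/execute endpoint."""
    # Register the extra command, the same way the post's code does.
    driver.command_executor._commands["executePhantomScript"] = (
        "POST", "/session/$sessionId/phantom/execute")
    return driver.execute("executePhantomScript",
                          {"script": script, "args": args or []})


def logged_messages(driver):
    """Return the message text of each entry in the browser log."""
    return [entry["message"] for entry in driver.get_log("browser")]
```

A crawler would call execute_phantom once before d.get(...) to install the hook, then logged_messages(d) afterwards to collect the URLs the page requested.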
Postscript
Back to this small script for crawling sitemaps and request parameters: the more I think about it, the more troublesome it feels. Each of the problems has some solution, good or bad, and after today's bit of research into calling phantomjs from Python I do have a little confidence. But integrating everything into one tool will still be a lot of work. Efficiency is one issue; whether it can really capture everything completely is another. I will take it slowly. I did not expect that something I assumed would be no problem would end up being my biggest headache. So it goes. Finally, native PhantomJS really is much more comfortable to use than selenium. Had I thought through every step at the very beginning, I would not have used Python; I would probably have used Node.js. Oh, and I have recently been researching some Node.js penetration and attack techniques, which I will share later. I hope those with more experience can offer some guidance.