This article was first published on my personal blog: http://zmister.com/archives/179.html
Python crawlers, GUI development, penetration testing, machine learning: find them all at http://zmister.com/
When writing crawlers, for reasons of system environment or efficiency, we often use PhantomJS as the browser WebDriver that Selenium drives, rather than using the Chrome or Firefox WebDriver directly, even though the latter are more intuitive.
PhantomJS has many advantages but also quite a few shortcomings. One of them, which hardly seems like a shortcoming at first, is that its browser identifier is "PhantomJS" (is it wrong to have the courage to be yourself? :))
There is nothing wrong with PhantomJS identifying itself, but as more and more sites keep upgrading their anti-crawler technology, PhantomJS has clearly become a target, much as requests has.
As soon as the server backend recognizes the visitor's User-Agent as PhantomJS, it may classify the visit as crawler behavior, which causes the crawler to fail.
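To make the risk concrete, here is a minimal sketch of the kind of check a server could run. The function name looks_like_phantomjs and both sample strings are illustrative assumptions, not taken from any real site's anti-crawler code.

```python
def looks_like_phantomjs(user_agent):
    """Return True if the User-Agent string mentions PhantomJS (case-insensitive)."""
    return "phantomjs" in user_agent.lower()


# Sample User-Agent strings, for illustration only.
default_ua = ("Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 "
              "(KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1")
chrome_ua = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

print(looks_like_phantomjs(default_ua))
print(looks_like_phantomjs(chrome_ua))
```

A real site may combine a check like this with many other signals, so changing the User-Agent removes only the most obvious giveaway.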
Just as we modify the headers in requests to pose as a browser, in Selenium we can change PhantomJS's browser identifier to that of any browser. Let's take a look.
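For comparison, this is roughly what the header trick looks like in a plain HTTP client. The sketch uses the standard library's urllib.request instead of requests so that it runs without extra packages; with requests, the same dictionary would be passed as the headers argument. The URL is a placeholder and no request is actually sent.

```python
from urllib.request import Request

# The Chrome User-Agent string we want to present.
CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

# Build (but do not send) a request carrying the custom header.
req = Request("http://example.com/", headers={"User-Agent": CHROME_UA})

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```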
PhantomJS's browser identifier
First, let's look at PhantomJS's browser identifier.
http://service.spiritsoft.cn/ua.html is a web page that reports the browser identifier (User-Agent) of the browser currently visiting it.
We use Selenium to drive PhantomJS to visit http://service.spiritsoft.cn/ua.html and look at the result returned:
# coding: utf-8
from selenium import webdriver
from bs4 import BeautifulSoup


def defaultPhantomJS():
    # Start PhantomJS with its default settings
    driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs.exe")
    driver.get('http://service.spiritsoft.cn/ua.html')
    source = driver.page_source
    soup = BeautifulSoup(source, 'lxml')
    # The page shows the User-Agent in a table cell with this inline style
    user_agent = soup.find_all('td', attrs={'style': 'height:40px;text-align:center;font-size:16px;font-weight:bolder;color:red;'})
    for u in user_agent:
        print(u.get_text().replace('\n', '').replace(' ', ''))
    # quit() also shuts down the PhantomJS process
    driver.quit()
Clearly, traces of PhantomJS are visible in the identifier.
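If the test page is unavailable, a similar check can be made without a third-party site by asking the browser itself. This is a minimal sketch, assuming driver is any Selenium WebDriver instance. The helper name get_user_agent is made up for this example, and the FakeDriver class exists only so the snippet runs without PhantomJS installed. Note that this reads what the browser reports to page scripts through navigator.userAgent, which normally matches the header it sends.

```python
def get_user_agent(driver):
    """Ask the browser for the User-Agent it reports to page scripts."""
    return driver.execute_script("return navigator.userAgent;")


class FakeDriver(object):
    """Stand-in for a real WebDriver, for demonstration only."""

    def execute_script(self, script):
        # A sample PhantomJS-style string; a real driver would run the script.
        return ("Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 "
                "(KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1")


ua = get_user_agent(FakeDriver())
print(ua)
print("PhantomJS" in ua)
```

With a real driver, replace FakeDriver() with the webdriver.PhantomJS(...) instance.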
Next, we modify PhantomJS's browser identifier.
Disguising PhantomJS as Chrome
First, import a key module, DesiredCapabilities:
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
What is this module for? Let's look at the explanation in its source code (excerpt):
class DesiredCapabilities(object):
    """
    Set of default supported desired capabilities.

    Use this as a starting point for creating a desired capabilities object for
    requesting remote webdrivers for connecting to selenium server or selenium grid.
It describes a set of key-value pairs that encapsulate browser properties, used roughly to configure the WebDriver. We use it to set PhantomJS's User-Agent.
First, convert DesiredCapabilities.PHANTOMJS to a dictionary so that key-value pairs are easy to add:
dcap = dict(DesiredCapabilities.PHANTOMJS)
Then add a key-value pair for the browser identifier:
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
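The two steps above can be wrapped in a small helper. This is a sketch under stated assumptions: the name with_user_agent is hypothetical, and BASE is a stand-in dictionary used instead of DesiredCapabilities.PHANTOMJS so the snippet runs without Selenium installed. Because dict() makes a shallow copy, the base capabilities are left unmodified.

```python
def with_user_agent(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS User-Agent setting added."""
    caps = dict(base_caps)  # shallow copy; base_caps itself is left untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


# Stand-in for DesiredCapabilities.PHANTOMJS (assumed here to be a plain dict).
BASE = {"browserName": "phantomjs", "javascriptEnabled": True}

CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

dcap = with_user_agent(BASE, CHROME_UA)
print(dcap["phantomjs.page.settings.userAgent"])
print("phantomjs.page.settings.userAgent" in BASE)
```

In real use, DesiredCapabilities.PHANTOMJS would be passed in place of BASE, and the returned dictionary would go to the desired_capabilities argument.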
Finally, pass these parameters when instantiating PhantomJS:
driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs.exe", desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
The complete code is as follows:
# coding: utf-8
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def editUserAgent():
    # Copy the default PhantomJS capabilities and override the User-Agent
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
    driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs.exe", desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    driver.get('http://service.spiritsoft.cn/ua.html')
    source = driver.page_source
    soup = BeautifulSoup(source, 'lxml')
    # The page shows the User-Agent in a table cell with this inline style
    user_agent = soup.find_all('td', attrs={'style': 'height:40px;text-align:center;font-size:16px;font-weight:bolder;color:red;'})
    for u in user_agent:
        print(u.get_text().replace('\n', '').replace(' ', ''))
    # quit() also shuts down the PhantomJS process
    driver.quit()


if __name__ == '__main__':
    editUserAgent()
Let's run the code. The output shows that PhantomJS is now identified as the Chrome browser.
Isn't it simple?
This article is from the "Mr. State" blog; please contact the author before reprinting.