@ Overview
The back ends of most major websites have some anti-crawling mechanism, both for data security and to reduce server load. The usual approach is to identify non-browser clients. Selenium does precisely the opposite: it drives a real browser to perform requests and operations, except that the signals come from the Selenium API rather than from the mouse (Selenium is an automated testing tool). Almost anything a real user can do, Selenium can make the browser do, with or without a visible interface, including typing, clicking, sliding, and so on. In the end, whether a request was triggered by a mouse operation in the browser or by the API makes no difference to the server. Hence the saying: life is hard, being a man is harder, and being a man who does back-end development is harder still. Let us begin wreaking havoc on it.
@ Some anecdotes
Earlier, the popular combination was not Selenium + the Chrome browser driver but Selenium + PhantomJS. PhantomJS is a browser without an interface, known in the industry as a headless browser. Because it has no interface to render, it is much faster than a browser with one, which is exactly what crawlers want, so it was widely praised.
Later, Chrome and Firefox launched their own headless modes, which run very smoothly. PhantomJS has since died out, so we will not cover it here.
@ Development environment (based on Ubuntu)
Install Selenium: sudo pip install selenium
If you do not have it, install the Chrome browser (try to update to version 58 or later): http://www.linuxidc.com/Linux/2016-05/131096.htm
Install the Chrome browser driver (note that the latest version is the one ending in 29 rather than 9): https://www.cnblogs.com/Lin-Yi/p/7658001.html
@ Importing packages
# Used for the time.sleep() pauses in the examples
import time
# Import the Selenium browser driver interface
from selenium import webdriver
# To send keyboard key operations, import the Keys class
from selenium.webdriver.common.keys import Keys
# Import the Chrome options
from selenium.webdriver.chrome.options import Options
@ First program: crawl page content and generate a page snapshot
# Note: chrome_options= and find_element_by_id are the Selenium 3 spellings used throughout this article;
# Selenium 4 favors options= and find_element(By.ID, ...) instead
# Create a Chrome browser driver in headless mode (super cool)
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=chrome_options)
# Load the Baidu page
driver.get("http://www.baidu.com/")
# time.sleep(3)
# Get the text content of the element whose id is "wrapper"
data = driver.find_element_by_id("wrapper").text
print(data)
# Print the page title: "Baidu, You Know"
print(driver.title)
# Generate a snapshot of the current page and save it
driver.save_screenshot("Baidu.png")
# Close the browser
driver.quit()
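The commented-out time.sleep(3) above is a fixed pause. As a rough sketch of an alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until and its parameters are hypothetical and not part of Selenium; Selenium ships its own WebDriverWait class for this purpose, which is what you would normally reach for.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Exceptions raised by condition() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            # e.g. the element does not exist yet; try again after a pause
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)
```

With the driver from the program above, a call might look like data = wait_until(lambda: driver.find_element_by_id("wrapper").text), which returns as soon as the element has text instead of always sleeping for three seconds.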
@ Simulate user input and click search, just like a real person
# get() waits until the page is fully loaded before the program continues; in tests a time.sleep(2) is usually added here anyway
driver.get("http://www.baidu.com/")
# id="kw" is the Baidu search input box; type the string "program Ape"
driver.find_element_by_id("kw").send_keys(u"program Ape")
# id="su" is the Baidu search button; click() simulates a click
driver.find_element_by_id("su").click()
time.sleep(3)
# Take a new page snapshot
driver.save_screenshot("Program ape.png")
# Print the rendered page source
print(driver.page_source)
# Get the cookies of the current page
print(driver.get_cookies())
# Ctrl+A: select all the content of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'a')
# Ctrl+X: cut the content of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'x')
# Type new content into the input box
driver.find_element_by_id("kw").send_keys("Belle")
# Simulate pressing the Enter key
driver.find_element_by_id("su").send_keys(Keys.RETURN)
time.sleep(3)
# Clear the content of the input box
driver.find_element_by_id("kw").clear()
# Generate a new page snapshot
driver.save_screenshot("Belle.png")
# Get the current URL
print(driver.current_url)
# Close the browser
driver.quit()
@ Simulate user login
# Load the Weibo login page
driver.get("http://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn%2F&backtitle=%ce%a2%b2%a9&vt=")
time.sleep(3)
# Find the input boxes and type the username and password
driver.find_element_by_id('loginName').send_keys("worio.hainan@163.com")
driver.find_element_by_id('loginPassword').send_keys("Qq94313805")
# Click the login button
driver.find_element_by_id('loginAction').click()
time.sleep(3)
# The snapshot shows that the login succeeded
print(driver.save_screenshot('jietu.png'))
driver.quit()
@ Sign in with cookies
# Load the Zhihu home page and view the snapshot; at this point we are not logged in
driver.get("https://www.zhihu.com")
time.sleep(1)
print(driver.save_screenshot("zhihu_nocookies.png"))
# Log in to Zhihu in a browser and capture the cookies from the packets
zhihu_cookies = {
    # 'aliyungf_tc': 'Aqaaaar4yfoeswaanlfjcvrd4Mkottxu',
    'l_n_c': '1',
    'q_c1': '8572377703ba49138d30d4b9beb30aed|1514626811000|1514626811000',
    'r_cap_id': 'mtc5m2y0oduzmjc0ndmznmfkntazzdbjztq4n2eymtc=|1514626811|a97b2ab0453d6f77c6cdefe903fd649ee8531807',
    'cap_id': 'yjqyztewowm4odlknge1mzkwztk3nmi5zgu0zty2yzm=|1514626811|d423a17b8d165c8d1b570d64bc98c185d5264b9a',
    'l_cap_id': 'Mge0njfjm2qxmzzinge1zwfjnjhhzmvkzwqwyzbkzjy=|1514626811|a1eb9f2b9910285350ba979681ca804bd47f12ca',
    'n_c': '1',
    'd_c0': 'akchpgzg6qypthydpmyphxav-b9_iyyfspc=|1514626811',
    '_xsrf': 'ED7CBC18-03DD-47E9-9885-BBC1c634d10f',
    'capsion_ticket': '2|1:0|10:1514626813|14:capsion_ticket|44:Nwy5y2m0zgjizjflnddmmzlkywe0ymnjnja4mtrhmzy=|6cf7562d6b36288e86afaea5339b31f1dab2921d869ee45fa06d155ea3504fe1',
    '_zap': '3290e12b-64dc-4dae-a910-a32cc6e26590',
    'z_c0': '2|1:0|10:1514626827|4:Z_C0|92:mi4xym4wy0frqufbqufbb0tha2jnynbeq1lbqufcz0fsvk5dnjawv3dcb2xmbehxc1ftcejpenpplwlqss1qnm5kvefr|D89c27ab659ba979a977e612803c2c886ab802adadcf70bcb95dc1951bdfaea5',
    '__utma': '51854390.2087017282.1514626889.1514626889.1514626889.1',
    '__utmb': '51854390.0.10.1514626889',
    '__utmc': '51854390',
    '__utmz': '51854390.1514626889.1.1.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/',
    '__utmv': '51854390.100--|2=registration_date=20150408=1^3=entry_date=20150408=1',
}
# Add all the cookies from the logged-in user to the current session
for k, v in zhihu_cookies.items():
    driver.add_cookie({'domain': '.zhihu.com', 'name': k, 'value': v})
# Visit the home page again and take a snapshot; at this point we are logged in
driver.get("https://www.zhihu.com")
time.sleep(3)
print(driver.save_screenshot("zhihu_cookies.png"))
# Exit the browser
driver.quit()
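Typing the cookie dictionary by hand is error-prone. As a sketch, assuming you copied the raw Cookie request header from the browser's developer tools, the hypothetical helpers below turn it into the dict form used above and then into the dicts passed to driver.add_cookie. The function names and the sample header are illustrative and do not come from the original.

```python
def parse_cookie_header(header):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in header.split(';'):
        part = part.strip()
        if not part or '=' not in part:
            continue
        # Split on the first '=' only: values may themselves contain '='.
        name, value = part.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


def to_selenium_cookies(cookies, domain):
    """Build one dict per cookie in the shape driver.add_cookie() accepts."""
    return [{'domain': domain, 'name': k, 'value': v} for k, v in cookies.items()]


# Made-up sample header, shaped like the cookies shown above
sample = "l_n_c=1; __utmb=51854390.0.10.1514626889; d_c0=abc=|1514626811"
for cookie in to_selenium_cookies(parse_cookie_header(sample), '.zhihu.com'):
    print(cookie)
```

Each printed dict can be handed to driver.add_cookie() in the same loop style as above. This simple split does not handle quoted cookie values.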
@ Simulate scroll bar scrolling (this is hard to implement with conventional crawlers)
# Load the Zhihu home page
driver.get("https://www.zhihu.com")
time.sleep(1)
# Load the local cookies to log in
for k, v in zhihu_cookies.items():
    driver.add_cookie({'domain': '.zhihu.com', 'name': k, 'value': v})
# Visit again, now in the logged-in state
driver.get("https://www.zhihu.com")
time.sleep(3)
# Scroll the page to the bottom; repeat several times
for i in range(3):
    js = "var q=document.documentElement.scrollTop=10000"
    driver.execute_script(js)
    time.sleep(3)
# Take a screenshot and exit; the scroll bar at the side of the page has moved down by many pixels
print(driver.save_screenshot("zhihu_scroll.png"))
driver.quit()
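Scrolling to a fixed offset of 10000 three times may stop short on long pages. A sketch of one alternative, assuming the page grows as content loads: keep scrolling to the bottom until the document height stops changing. scroll_to_bottom is a hypothetical helper; it relies only on driver.execute_script, which the code above already uses.

```python
import time


def scroll_to_bottom(driver, max_rounds=10, pause=3.0):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give Ajax content time to load
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return rounds
```

The max_rounds cap matters on feeds that never end: without it, the loop would keep scrolling for as long as new content arrives.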
@ Load while scrolling
The women's clothing images on the Vipshop (vip.com) home page are loaded asynchronously via Ajax as the page scrolls. Achieving this with conventional packet capture is very cumbersome. With Selenium we only need to simulate the user pulling the scroll bar down a few times, wait a while, and then fetch the rendered page source; after that the images can be crawled just as from a static page. Operations like this are essentially cheating, and are almost impossible to defend against.
# lxml is used to parse the rendered page source
from lxml import etree

# The Vipshop women's clothing image links cannot be obtained directly
# Request the Vipshop page
driver.get("https://category.vip.com/search-3-0-1.html?q=3|30036||&rp=30074|30063&ff=women|0|2|2&adidx=1&f=ad&adp=65001&adid=326630")
time.sleep(3)
# Scroll the browser window gradually so that Ajax loads step by step
# NOTE: the loop bound and the step size were lost from the source text; 10 and 500 below are assumed values
for i in range(1, 10):
    js = "var q=document.body.scrollTop=" + str(500 * i)  # PhantomJS
    js = "var q=document.documentElement.scrollTop=" + str(500 * i)  # Chrome and Firefox
    driver.execute_script(js)
    print('=====================================')
    time.sleep(3)
# Get the page source
html = etree.HTML(driver.page_source)
all_img_list = []
# Get all the images
img_group_list = html.xpath("//img[contains(@id, 'j_pic')]")
# img_group_list = html.xpath("//img[starts-with(@id, 'j_pic')]")
# Regular expression match
# img_group_list = html.xpath(r'//img[re:match(@id, "j_pic*")]', namespaces={"re": "http://exslt.org/regular-expressions"})
# Collect all the image links into a list
for img_group in img_group_list:
    # Use paths relative to the current <img> so each group only yields its own attributes
    img_of_group = img_group.xpath(".//@data-original | .//@data-img-back | .//@data-img-side")
    print(img_of_group)
    all_img_list.append('\n'.join(img_of_group) + '\n')
# Write the collected data to a file
with open('vip.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(all_img_list))
# Exit the browser
driver.quit()
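The extraction step can be tried without a browser. The sketch below uses the standard library's html.parser as a stand-in for lxml (a different technique from the XPath queries above, chosen only so the example has no third-party dependency) and collects the same three attributes from img tags whose id contains 'j_pic'. The class name and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class PicLinkParser(HTMLParser):
    """Collect lazy-load image URLs from <img> tags whose id contains a marker."""

    WANTED = ('data-original', 'data-img-back', 'data-img-side')

    def __init__(self, marker='j_pic'):
        super().__init__()
        self.marker = marker
        self.groups = []  # one list of URLs per matching <img>

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr_map = dict(attrs)
        if self.marker not in (attr_map.get('id') or ''):
            return
        # Keep only the attributes that are present and non-empty
        urls = [attr_map[name] for name in self.WANTED if attr_map.get(name)]
        self.groups.append(urls)


# Made-up markup standing in for driver.page_source
sample_html = '''
<div>
  <img id="j_pic_1" data-original="//a.example/1.jpg" data-img-back="//a.example/1b.jpg">
  <img id="logo" src="//a.example/logo.png">
  <img id="j_pic_2" data-original="//a.example/2.jpg">
</div>
'''

parser = PicLinkParser()
parser.feed(sample_html)
print(parser.groups)
```

In a real run you would feed driver.page_source to the parser after scrolling. For anything beyond flat attribute collection, the XPath approach with lxml shown above is the more expressive tool.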