We usually crawl web resources with Python's third-party library requests or the Scrapy framework, but neither can crawl pages whose content is rendered by JavaScript. For those pages we can combine the web automation testing tool Selenium with the headless browser PhantomJS. A simple crawler is implemented below.
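To see why a plain HTTP fetch is not enough for such pages, the following minimal sketch parses a hypothetical page (the RAW_HTML string and the "content" id are made up for illustration) using only the standard library. The target element is empty in the raw HTML because only a browser would run the script that fills it.

```python
from html.parser import HTMLParser

# Hypothetical page: <div id="content"> is empty in the raw HTML and is
# only filled in when a browser executes the script.
RAW_HTML = """
<html><body>
<div id="content"></div>
<script>document.getElementById('content').textContent = 'loaded by JS';</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div id="content"> in static HTML."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def static_content(html):
    """Return the text of the content div as seen without running any script."""
    parser = DivTextCollector()
    parser.feed(html)
    return "".join(parser.text).strip()


# The script is never executed here, so the div's text is an empty string.
print(repr(static_content(RAW_HTML)))
```

A browser driven by Selenium would execute the script first, so its page source would contain the filled-in text.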
Environment setup
Tools required: Python 3.5, Selenium, PhantomJS
Python 3.5 is already installed on my computer.
Installing Selenium
pip3 install selenium
Installing PhantomJS
Download the PhantomJS build that matches your operating system. When the download finishes, unzip it and put phantomjs.exe into Python's Scripts folder.
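Putting the executable in the Scripts folder works because that folder is normally on PATH. A quick way to check this, sketched here with the standard library only (find_phantomjs is a helper name made up for this example):

```python
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the full path of the PhantomJS executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_phantomjs()
if path is None:
    print("phantomjs was not found on PATH")
else:
    print("phantomjs found at", path)
```

If the executable lives elsewhere, Selenium 3's webdriver.PhantomJS also accepts an executable_path argument pointing at it.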
Using Selenium + PhantomJS to implement a simple crawler
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.baidu.com')  # load the page
data = driver.page_source  # get the rendered page source
driver.save_screenshot('1.png')  # save a screenshot
print(data)
driver.quit()
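Once driver.page_source has been retrieved, it is an ordinary HTML string that can be processed further. As one possible next step, this sketch pulls the link targets out of such a string with the standard library; the sample string stands in for real page source, and extract_links is a name chosen for this example.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(page_source):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkParser()
    parser.feed(page_source)
    return parser.links


# Stand-in for driver.page_source so the sketch runs without a browser.
sample = '<a href="http://a.example/">A</a><a>no href</a><a href="/b">B</a>'
print(extract_links(sample))
```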
Some commonly used Selenium + PhantomJS methods
Setting the User-Agent request header
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
# Set the User-Agent; use whatever browser string you need
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0')
driver = webdriver.PhantomJS(desired_capabilities=dcap)  # pass the capabilities to the browser
driver.get('http://www.baidu.com')  # load the page
data = driver.page_source  # get the rendered page source
driver.save_screenshot('1.png')  # save a screenshot
print(data)
driver.quit()
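If several User-Agent strings are needed, the capability dictionary can be built by a small helper. This is only a sketch: the USER_AGENTS pool and the caps_with_user_agent name are assumptions for illustration, and a plain dict stands in for DesiredCapabilities.PHANTOMJS so the example runs without Selenium.

```python
import random

# Hypothetical pool of User-Agent strings; replace with the ones you need.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0",
]

UA_KEY = "phantomjs.page.settings.userAgent"


def caps_with_user_agent(base_caps, user_agent=None):
    """Return a copy of base_caps with the PhantomJS User-Agent setting applied."""
    caps = dict(base_caps)  # copy so the shared defaults are not mutated
    caps[UA_KEY] = user_agent or random.choice(USER_AGENTS)
    return caps


# Stand-in for DesiredCapabilities.PHANTOMJS.
base = {"browserName": "phantomjs", "javascriptEnabled": True}
caps = caps_with_user_agent(base)
print(caps[UA_KEY])
# With Selenium installed: webdriver.PhantomJS(desired_capabilities=caps)
```

Copying the dictionary first keeps the shared defaults unchanged, which is the same reason the original code wraps DesiredCapabilities.PHANTOMJS in dict().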
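Setting the request timeout
The WebDriver class has three time-related methods (shown with their Java names, followed by the Python equivalents in parentheses):
1. pageLoadTimeout (set_page_load_timeout): sets the timeout for the page to load completely, that is, until rendering has finished and both synchronous and asynchronous scripts have run
2. setScriptTimeout (set_script_timeout): sets the timeout for asynchronous scripts
3. implicitlyWait (implicitly_wait): sets how long the driver keeps looking for an element before giving up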
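In the Python bindings the three settings listed above are applied through set_page_load_timeout, set_script_timeout and implicitly_wait, each taking a number of seconds. The sketch below groups them in one helper; apply_timeouts and the RecordingDriver stand-in are made up for this example so that it runs without a browser, and the default values are arbitrary.

```python
def apply_timeouts(driver, page_load=30, script=10, implicit=5):
    """Apply the three WebDriver timeouts (all values in seconds)."""
    driver.set_page_load_timeout(page_load)  # limit for a full page load
    driver.set_script_timeout(script)        # limit for asynchronous scripts
    driver.implicitly_wait(implicit)         # how long element lookups keep retrying


class RecordingDriver:
    """Stand-in that records calls instead of driving a real browser."""

    def __init__(self):
        self.calls = []

    def set_page_load_timeout(self, seconds):
        self.calls.append(("page_load", seconds))

    def set_script_timeout(self, seconds):
        self.calls.append(("script", seconds))

    def implicitly_wait(self, seconds):
        self.calls.append(("implicit", seconds))


fake = RecordingDriver()
apply_timeouts(fake, page_load=5)
print(fake.calls)
```

With Selenium installed, a real driver such as webdriver.PhantomJS() can be passed in place of the stand-in.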
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)  # set the page-load timeout in seconds
driver.get('http://www.baidu.com')
print(driver.title)
driver.quit()
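When the page-load timeout is exceeded, driver.get raises selenium.common.exceptions.TimeoutException. One way to handle that is to retry a few times. The helper below is a generic sketch (get_with_retry is a name chosen here, not a Selenium function) and uses Python's built-in TimeoutError by default so that it runs without Selenium.

```python
def get_with_retry(load, attempts=3, timeout_error=TimeoutError):
    """Call load() until it succeeds; re-raise the timeout after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return load()
        except timeout_error:
            if attempt == attempts:
                raise
            print("attempt", attempt, "timed out, retrying")


# Stand-in for a page load that times out twice and then succeeds.
state = {"calls": 0}


def flaky_load():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("page load timed out")
    return "loaded"


print(get_with_retry(flaky_load))
```

With Selenium installed, the call would look like get_with_retry(lambda: driver.get(url), timeout_error=TimeoutException), where TimeoutException is imported from selenium.common.exceptions.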
Setting the browser window size
The browser is not launched full-screen, which sometimes affects later operations, so we can maximize the window or set an explicit size.
driver.maximize_window()  # maximize the window
driver.set_window_size(480, 800)  # set the window to 480 wide and 800 high
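Because set_window_size takes a width and a height, it can be combined with save_screenshot to capture the same page at several sizes. The following is a minimal sketch; capture_at_sizes, the file-name pattern and the FakeDriver stand-in are assumptions for illustration, so no browser is needed to run it.

```python
def capture_at_sizes(driver, sizes, prefix="shot"):
    """Resize the window to each (width, height) and save one screenshot per size."""
    saved = []
    for width, height in sizes:
        driver.set_window_size(width, height)
        filename = "{}_{}x{}.png".format(prefix, width, height)
        driver.save_screenshot(filename)
        saved.append(filename)
    return saved


class FakeDriver:
    """Stand-in that remembers the last size instead of resizing a browser."""

    def __init__(self):
        self.size = None
        self.screenshots = []

    def set_window_size(self, width, height):
        self.size = (width, height)

    def save_screenshot(self, filename):
        self.screenshots.append((filename, self.size))
        return True


fake = FakeDriver()
print(capture_at_sizes(fake, [(480, 800), (1366, 768)]))
```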
Locating elements
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)
try:
    driver.get('http://www.baidu.com')
    driver.find_element_by_id('kw')  # locate by id
    driver.find_element_by_class_name('s_ipt')  # locate by class attribute
    driver.find_element_by_name('wd')  # locate by name attribute
    driver.find_element_by_tag_name('input')  # locate by tag name
    driver.find_element_by_css_selector('#kw')  # locate by CSS selector
    driver.find_element_by_xpath("//input[@id='kw']")  # locate by XPath
    driver.find_element_by_link_text('贴吧')  # locate by link text
    print(driver.find_element_by_id('kw').tag_name)  # get the element's tag name
except Exception as e:
    print(e)
driver.quit()
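To check what an attribute-based XPath expression selects without launching a browser, the standard library's ElementTree can be used on a small well-formed snippet. This is only an approximation: ElementTree supports a limited XPath subset and needs the relative ".//" prefix, and the SNIPPET below is a made-up fragment modelled on the ids and names used above, not the real Baidu page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment resembling a search form.
SNIPPET = """
<form>
  <input id="kw" name="wd" class="s_ipt" />
  <input id="su" type="submit" class="s_btn" />
</form>
"""

root = ET.fromstring(SNIPPET)

by_id = root.findall(".//input[@id='kw']")      # like //input[@id='kw']
by_name = root.findall(".//input[@name='wd']")  # like find_element_by_name('wd')
by_tag = root.findall(".//input")               # like find_element_by_tag_name('input')

print(len(by_id), len(by_name), len(by_tag))
print(by_id[0].tag, by_id[0].get("class"))
```

The id and name lookups both select the same single element, while the tag-name lookup matches every input, which is why find_element_by_tag_name returns only the first match on a real page.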
Navigating the browser back and forward
from selenium import webdriver

driver = webdriver.PhantomJS()
try:
    driver.get('http://www.baidu.com')  # visit the Baidu home page
    driver.save_screenshot('1.png')
    driver.get('http://www.sina.com.cn')  # visit the Sina home page
    driver.save_screenshot('2.png')
    driver.back()  # go back to the Baidu home page
    driver.save_screenshot('3.png')
    driver.forward()  # go forward to the Sina home page
    driver.save_screenshot('4.png')
except Exception as e:
    print(e)
driver.quit()
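The effect of back() and forward() can be pictured as moving an index through a list of visited pages. The class below is a simplified model written for this explanation, not how Selenium or PhantomJS implement history; it replays the same sequence of visits as the example above.

```python
class History:
    """Simplified model of a browser's session history."""

    def __init__(self):
        self.pages = []
        self.index = -1

    def get(self, url):
        # A new navigation discards any "forward" entries.
        self.pages = self.pages[: self.index + 1]
        self.pages.append(url)
        self.index += 1

    def back(self):
        if self.index > 0:
            self.index -= 1

    def forward(self):
        if self.index < len(self.pages) - 1:
            self.index += 1

    @property
    def current(self):
        return self.pages[self.index] if self.pages else None


history = History()
history.get("http://www.baidu.com")
history.get("http://www.sina.com.cn")
history.back()
print(history.current)  # the Baidu URL
history.forward()
print(history.current)  # the Sina URL
```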
Python + Selenium + PhantomJS: crawling web pages that load content dynamically