Python + Selenium + PhantomJS: Crawling Web Pages That Load Content Dynamically

Last Update:2017-06-12 Source: Internet

Author: User

In general, we use Python's third-party library requests or the Scrapy framework to crawl resources on the web. Neither executes JavaScript, so pages whose content is rendered by JavaScript cannot be crawled this way. For those pages we can combine the web automation testing tool Selenium with the headless browser PhantomJS. A simple crawler is implemented below.
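To illustrate why a plain HTTP fetch is not enough, here is a minimal sketch that uses only the standard library. The HTML string, the "content" id, and the DivTextParser class are made up for illustration. The point is that a parser which does not execute JavaScript sees only the empty placeholder, not the text the script would insert in a real browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the raw HTML and is only
# filled in by the script when a real browser executes it.
RAW_HTML = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById("content").textContent = "Loaded by JavaScript";
  </script>
</body></html>
"""


class DivTextParser(HTMLParser):
    """Collects the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data.strip()


parser = DivTextParser()
parser.feed(RAW_HTML)
print(repr(parser.text))  # prints '' because the script was never run
```

A browser driven by Selenium would run the script first, so its page_source would contain the inserted text.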

Environment setup

Tools required: Python 3.5, Selenium, PhantomJS

Python 3.5 is already installed on my computer.

Installing Selenium

pip3 install selenium

Installing PhantomJS

Download the PhantomJS build that matches your operating system. After the download completes, unzip it and copy phantomjs.exe into Python's Scripts folder.
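Selenium starts PhantomJS by looking up the executable on PATH, which is why copying it into Python's Scripts folder usually works. The sketch below checks that lookup with the standard library before any browser is started. The helper name find_phantomjs is an assumption for illustration. Note also that webdriver.PhantomJS was deprecated in later Selenium 3 releases and removed in Selenium 4, so the examples in this article assume a Selenium 3.x installation.

```python
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the full path of the PhantomJS executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_phantomjs()
if path is None:
    print("phantomjs not found on PATH; pass executable_path explicitly")
else:
    print("phantomjs found at", path)
```

If the executable lives somewhere else, webdriver.PhantomJS(executable_path=...) accepts its full path instead.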

Implementing a simple crawler with Selenium + PhantomJS

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.baidu.com')  # load the page
data = driver.page_source  # get the rendered page source
driver.save_screenshot('1.png')  # save a screenshot
print(data)
driver.quit()
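If loading the page raises an exception, the script above never reaches driver.quit() and the PhantomJS process keeps running. One possible way to guard against that is sketched below. The function name fetch_rendered_html and its driver_factory parameter are assumptions for illustration, not part of Selenium.

```python
def fetch_rendered_html(url, driver_factory, screenshot_path=None):
    """Load url in a browser created by driver_factory and return the rendered HTML.

    driver_factory is any zero-argument callable that returns a WebDriver-like
    object, for example webdriver.PhantomJS.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        if screenshot_path:
            driver.save_screenshot(screenshot_path)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading the page failed.
        driver.quit()
```

With Selenium 3.x installed, it could be called as fetch_rendered_html('http://www.baidu.com', webdriver.PhantomJS, '1.png').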

Common Selenium + PhantomJS operations

Setting the User-Agent request header

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)  # copy the default PhantomJS capabilities
# set the User-Agent; change the browser string as needed
dcap['phantomjs.page.settings.userAgent'] = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0'
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)  # start the browser with these capabilities
driver.get('http://www.baidu.com')  # load the page
data = driver.page_source  # get the rendered page source
driver.save_screenshot('1.png')  # save a screenshot
print(data)
driver.quit()
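The capability key used above follows the pattern phantomjs.page.settings.NAME, which PhantomJS's WebDriver implementation forwards to the page settings. The sketch below wraps that in a small helper. The helper name build_phantomjs_caps and the choice to also set loadImages are assumptions for illustration; check the PhantomJS documentation for the settings your version supports.

```python
DEFAULT_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) "
    "Gecko/20100101 Firefox/25.0"
)


def build_phantomjs_caps(base_caps, user_agent=DEFAULT_UA, load_images=True):
    """Return a copy of base_caps with PhantomJS page settings applied."""
    caps = dict(base_caps)  # copy so the shared defaults are not mutated
    caps["phantomjs.page.settings.userAgent"] = user_agent
    # Assumed setting: skip image downloads when load_images is False.
    caps["phantomjs.page.settings.loadImages"] = load_images
    return caps
```

The result can be passed as webdriver.PhantomJS(desired_capabilities=build_phantomjs_caps(DesiredCapabilities.PHANTOMJS)).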

Request timeout settings

The WebDriver API has three time-related methods (the Python names are given in parentheses):

1. pageLoadTimeout (set_page_load_timeout) sets the timeout for the page to load completely, that is, until rendering has finished and both synchronous and asynchronous scripts have run

2. setScriptTimeout (set_script_timeout) sets the timeout for asynchronous scripts

3. implicitlyWait (implicitly_wait) sets the implicit wait, the time the driver keeps looking for an element before giving up
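In the Python bindings these three settings are exposed under snake_case names. A minimal sketch that applies them in one place follows. The helper name apply_timeouts is an assumption, and all values are in seconds.

```python
def apply_timeouts(driver, page_load=None, script=None, implicit=None):
    """Apply whichever timeouts (in seconds) are not None to driver."""
    if page_load is not None:
        driver.set_page_load_timeout(page_load)  # pageLoadTimeout
    if script is not None:
        driver.set_script_timeout(script)  # setScriptTimeout
    if implicit is not None:
        driver.implicitly_wait(implicit)  # implicitlyWait
    return driver
```

For example, apply_timeouts(driver, page_load=5, implicit=2) would leave the script timeout at its default.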

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)  # page-load timeout in seconds
driver.get('http://www.baidu.com')
print(driver.title)
driver.quit()
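When the page-load timeout expires, Selenium normally raises TimeoutException from driver.get(), and a crawler will usually want to catch it rather than crash. The sketch below shows one way to do that. The helper name get_title_or_none is an assumption, and the fallback class exists only so the snippet can be imported on a machine without Selenium.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Fallback so this sketch can be imported without Selenium installed.
    class TimeoutException(Exception):
        pass


def get_title_or_none(driver, url, timeout=5):
    """Return the page title, or None if the page did not load within timeout seconds."""
    driver.set_page_load_timeout(timeout)
    try:
        driver.get(url)
    except TimeoutException:
        return None
    return driver.title
```

Returning None lets the caller decide whether to retry, skip the URL, or log the failure.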

Setting the browser window size

The browser does not start maximized, which sometimes interferes with certain operations, so we can maximize the window or give it an explicit size.

driver.maximize_window()  # maximize the window
driver.set_window_size(480, 800)  # set the window to 480 wide and 800 high
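To confirm that a size request took effect, the driver can be asked for its current window size afterwards. A minimal sketch follows; the helper name resize_window is an assumption, and it relies on get_window_size() returning a dictionary with 'width' and 'height' keys.

```python
def resize_window(driver, width, height):
    """Set the window size and return the (width, height) the driver reports back."""
    driver.set_window_size(int(width), int(height))
    size = driver.get_window_size()
    return size["width"], size["height"]
```

Converting to int first means string inputs such as '480' are accepted as well.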

Locating elements

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)
try:
    driver.get('http://www.baidu.com')
    driver.find_element_by_id('kw')  # locate by id
    driver.find_element_by_class_name('s_ipt')  # locate by class attribute
    driver.find_element_by_name('wd')  # locate by name attribute
    driver.find_element_by_tag_name('input')  # locate by tag name
    driver.find_element_by_css_selector('#kw')  # locate by CSS selector
    driver.find_element_by_xpath("//input[@id='kw']")  # locate by XPath
    driver.find_element_by_link_text("贴吧")  # locate by link text (the "Tieba" link)
    print(driver.find_element_by_id('kw').tag_name)  # print the element's tag name
except Exception as e:
    print(e)
driver.quit()
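Each find_element_by_* call above raises an exception when nothing matches. An alternative is the generic find_elements(by, value) form, which returns an empty list instead, so several locators can be tried in turn. The sketch below assumes the Baidu search box attributes used in this article; the names SEARCH_BOX_LOCATORS and find_first are made up for illustration.

```python
# (strategy, value) pairs. The strategy strings are the values of the constants
# in selenium.webdriver.common.by.By, e.g. By.ID == "id".
SEARCH_BOX_LOCATORS = [
    ("id", "kw"),
    ("name", "wd"),
    ("css selector", "input.s_ipt"),
    ("xpath", "//input[@id='kw']"),
]


def find_first(driver, locators):
    """Return the first element matched by any locator, or None if none match."""
    for by, value in locators:
        matches = driver.find_elements(by, value)  # empty list when nothing matches
        if matches:
            return matches[0]
    return None
```

Trying the most specific locator first keeps the common case fast, and the later entries act as fallbacks if the page markup changes.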

Navigating back and forward in the browser

from selenium import webdriver

driver = webdriver.PhantomJS()
try:
    driver.get('http://www.baidu.com')  # visit the Baidu home page
    driver.save_screenshot('1.png')
    driver.get('http://www.sina.com.cn')  # visit the Sina home page
    driver.save_screenshot('2.png')
    driver.back()  # go back to the Baidu home page
    driver.save_screenshot('3.png')
    driver.forward()  # go forward to the Sina home page
    driver.save_screenshot('4.png')
except Exception as e:
    print(e)
driver.quit()
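Screenshots show where the browser ended up, but the same thing can be checked in code by reading driver.current_url after each step. A minimal sketch follows; the helper name trace_navigation is an assumption. Real sites may redirect, so the recorded URLs will not necessarily equal the ones requested.

```python
def trace_navigation(driver, first_url, second_url):
    """Visit two pages, go back, then forward; return current_url after each step."""
    visited = []
    driver.get(first_url)
    visited.append(driver.current_url)
    driver.get(second_url)
    visited.append(driver.current_url)
    driver.back()  # should return to the first page
    visited.append(driver.current_url)
    driver.forward()  # should return to the second page
    visited.append(driver.current_url)
    return visited
```

With the two sites used above, the third entry would be expected to point at Baidu again and the fourth at Sina.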
