Python + Selenium + PhantomJS: crawling web pages that load content dynamically

Source: Internet
Author: User
Tags: xpath

In general, we crawl resources on the web with Python's third-party library requests or the Scrapy framework. Neither of them executes JavaScript, so pages whose content is rendered by JavaScript cannot be crawled this way. For those pages we can combine the web automation testing tool Selenium with the headless (no-interface) browser PhantomJS. A simple crawler is implemented below.
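
To make the limitation concrete, here is a minimal sketch using only the standard library. SAMPLE_HTML is a made-up page (an assumption, not taken from any real site) whose visible text is written by a script. A client that only parses the downloaded HTML sees an empty container, because nothing runs the script.

```python
from html.parser import HTMLParser

# Hypothetical page: <div id="content"> stays empty until the script runs.
SAMPLE_HTML = """
<html><body>
<div id="content"></div>
<script>document.getElementById('content').innerText = 'Hello';</script>
</body></html>
"""


class DivTextParser(HTMLParser):
    """Collects the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script source also arrives here, but only while self.inside is False.
        if self.inside:
            self.text.append(data)


def static_text(html):
    """Return the text a non-rendering client would see in the content div."""
    parser = DivTextParser()
    parser.feed(html)
    return "".join(parser.text).strip()


print(repr(static_text(SAMPLE_HTML)))  # prints ''
```

A real browser engine would execute the script and fill the div, which is the gap Selenium with PhantomJS is meant to close.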

Environment setup

Tools to prepare: Python 3.5, Selenium, PhantomJS

Python 3.5 is already installed on my computer.

Installing Selenium

    pip3 install selenium

Installing PhantomJS

Download the PhantomJS build that matches your operating system. After the download completes, unzip it and copy phantomjs.exe into Python's Scripts folder.
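
A quick way to check the installation is to ask Python whether the executable can be found on PATH. This is a small sketch using the standard library only; the helper name find_executable is made up for illustration, and it assumes the Scripts folder is on PATH.

```python
import shutil


def find_executable(name="phantomjs"):
    """Return the full path of `name` if it is found on PATH, else None."""
    # On Windows, shutil.which also tries the usual extensions such as .exe.
    return shutil.which(name)


path = find_executable()
if path:
    print("PhantomJS found at", path)
else:
    print("PhantomJS is not on PATH")
```

If the executable is not found, moving it to a folder on PATH, or passing its location through the executable_path argument of webdriver.PhantomJS, should let Selenium start it.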

Using Selenium + PhantomJS to implement a simple crawler

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get('http://www.baidu.com')   # load the page
    data = driver.page_source            # get the rendered page source
    driver.save_screenshot('1.png')      # save a screenshot
    print(data)
    driver.quit()
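
If loading the page raises an exception, the script above never reaches driver.quit() and the PhantomJS process may be left running. The sketch below wraps the same steps in try/finally. The names fetch_page_source and main are hypothetical helpers, and Selenium is imported inside main() so that the helper itself works with any driver-like object.

```python
def fetch_page_source(driver, url):
    """Load `url` with an existing driver and return the rendered HTML.

    The driver is always shut down, even when loading fails.
    """
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def main():
    # Imported here so fetch_page_source can be reused without Selenium installed.
    from selenium import webdriver

    html = fetch_page_source(webdriver.PhantomJS(), 'http://www.baidu.com')
    print(len(html))
```

Calling main() runs the crawl against the live site; it is not called automatically here.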

Some common ways of using Selenium + PhantomJS

Setting the User-Agent request header

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # set the User-Agent; use whatever browser string you need
    dcap['phantomjs.page.settings.userAgent'] = (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) '
        'Gecko/20100101 Firefox/25.0'
    )
    driver = webdriver.PhantomJS(desired_capabilities=dcap)  # pass the browser settings
    driver.get('http://www.baidu.com')   # load the page
    data = driver.page_source            # get the rendered page source
    driver.save_screenshot('1.png')      # save a screenshot
    print(data)
    driver.quit()
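
The reason for the dict(...) copy is that the capability defaults are shared, so they should not be modified in place. The sketch below isolates that step in a small helper. PHANTOMJS_DEFAULTS is an assumed inline copy of the defaults so the example runs without Selenium, and build_capabilities is a hypothetical name; in a real script the base would be DesiredCapabilities.PHANTOMJS.

```python
# Assumed stand-in for DesiredCapabilities.PHANTOMJS, inlined so that the
# sketch runs without Selenium installed.
PHANTOMJS_DEFAULTS = {
    "browserName": "phantomjs",
    "version": "",
    "platform": "ANY",
    "javascriptEnabled": True,
}


def build_capabilities(user_agent, base=PHANTOMJS_DEFAULTS):
    """Return a copy of `base` with the PhantomJS User-Agent setting added."""
    caps = dict(base)  # copy, so the shared defaults stay untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


caps = build_capabilities("MyCrawler/1.0")
print(caps["phantomjs.page.settings.userAgent"])  # prints MyCrawler/1.0
```

The returned dictionary can then be passed as desired_capabilities when creating the driver, as in the example above.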
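
Setting request timeouts

There are three time-related methods in the WebDriver class (the names used by the Python bindings are given in parentheses):

1. pageLoadTimeout (set_page_load_timeout) sets the timeout for a page to load completely, that is, rendering has finished and both synchronous and asynchronous scripts have executed

2. setScriptTimeout (set_script_timeout) sets the timeout for asynchronous scripts

3. implicitlyWait (implicitly_wait) sets how long the driver keeps waiting for an element to appear when locating it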

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.set_page_load_timeout(5)  # set the page load timeout, in seconds
    driver.get('http://www.baidu.com')
    print(driver.title)
    driver.quit()
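
The three timeouts listed above can be applied together, and a page load that exceeds its limit can be handled instead of crashing the script. This is a sketch under assumptions: apply_timeouts and load_or_none are hypothetical helper names, the default values are arbitrary, and the exception types to catch are passed in so the code does not depend on Selenium being importable.

```python
def apply_timeouts(driver, page_load=10, script=10, implicit=5):
    """Apply the three WebDriver timeouts (all values in seconds)."""
    driver.set_page_load_timeout(page_load)  # limit for driver.get() to finish
    driver.set_script_timeout(script)        # limit for asynchronous scripts
    driver.implicitly_wait(implicit)         # how long element lookups keep polling
    return driver


def load_or_none(driver, url, timeout_errors):
    """Return the page title, or None if loading `url` raises a timeout error."""
    try:
        driver.get(url)
    except timeout_errors:
        return None
    return driver.title
```

With Selenium installed, timeout_errors would typically be (TimeoutException,), imported from selenium.common.exceptions.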

Setting the browser window size

The browser window is not full-screen when it is launched, which sometimes interferes with our operations, so we can maximize it or set an explicit size.

    driver.maximize_window()          # maximize the window
    driver.set_window_size(480, 800)  # set the window to 480 wide and 800 high
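
Since the width and height are pixel counts, it helps to pass them as integers and to confirm what the browser actually applied. The following is a minimal sketch; set_viewport is a hypothetical helper name, and it relies on get_window_size returning a dictionary with 'width' and 'height' keys.

```python
def set_viewport(driver, width, height):
    """Resize the window and return the (width, height) the driver reports."""
    driver.set_window_size(int(width), int(height))
    size = driver.get_window_size()
    return size["width"], size["height"]
```

For example, set_viewport(driver, 480, 800) would resize an existing driver and return the size it reports afterwards.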

Locating elements

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.set_page_load_timeout(5)
    try:
        driver.get('http://www.baidu.com')
        driver.find_element_by_id('kw')                    # locate by id
        driver.find_element_by_class_name('s_ipt')         # locate by class attribute
        driver.find_element_by_name('wd')                  # locate by name attribute
        driver.find_element_by_tag_name('input')           # locate by tag name
        driver.find_element_by_css_selector('#kw')         # locate by CSS selector
        driver.find_element_by_xpath("//input[@id='kw']")  # locate by XPath
        driver.find_element_by_link_text("贴吧")  # locate by link text
        print(driver.find_element_by_id('kw').tag_name)    # get the tag type
    except Exception as e:
        print(e)
    driver.quit()
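
The same lookups can also be written with the generic find_element(by, value) method, which makes it easy to loop over several locators and record which ones match. This is a sketch, not the article's method: LOCATORS and locate_all are names made up for illustration, and the exception types to treat as "not found" are passed in as a parameter.

```python
# Strategy strings accepted by driver.find_element(by, value). They are the
# values behind selenium.webdriver.common.by.By (By.ID == "id", and so on).
LOCATORS = [
    ("id", "kw"),
    ("class name", "s_ipt"),
    ("name", "wd"),
    ("tag name", "input"),
    ("css selector", "#kw"),
    ("xpath", "//input[@id='kw']"),
]


def locate_all(driver, locators, not_found_errors):
    """Try each locator; map it to the element's tag name, or None if absent."""
    results = {}
    for by, value in locators:
        try:
            results[(by, value)] = driver.find_element(by, value).tag_name
        except not_found_errors:
            results[(by, value)] = None
    return results
```

With Selenium installed, not_found_errors would typically be (NoSuchElementException,), imported from selenium.common.exceptions.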

Navigating the browser back and forward

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    try:
        driver.get('http://www.baidu.com')    # visit the Baidu home page
        driver.save_screenshot('1.png')
        driver.get('http://www.sina.com.cn')  # visit the Sina home page
        driver.save_screenshot('2.png')
        driver.back()                         # go back to the Baidu home page
        driver.save_screenshot('3.png')
        driver.forward()                      # go forward to the Sina home page
        driver.save_screenshot('4.png')
    except Exception as e:
        print(e)
    driver.quit()
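
The same pattern generalizes to any list of pages. The sketch below visits each URL, then steps back through the history, saving a numbered screenshot after every navigation. walk_history and the file name prefix are hypothetical, and the function reads driver.current_url to record where the browser reports being after each step.

```python
def walk_history(driver, urls, prefix="shot"):
    """Visit `urls` in order, then step back through the browser history.

    A screenshot is saved after every navigation. Returns a list of
    (filename, url_reported_by_driver) pairs in the order they were taken.
    """
    taken = []

    def snap():
        name = "{}{}.png".format(prefix, len(taken) + 1)
        driver.save_screenshot(name)
        taken.append((name, driver.current_url))

    for url in urls:
        driver.get(url)
        snap()
    for _ in urls[1:]:
        driver.back()  # one step back per page visited after the first
        snap()
    return taken
```

Called with the two URLs from the example above, this would save three screenshots: Baidu, Sina, and Baidu again after going back.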
