Install and Test the Python Selenium Library for Capturing Dynamic Web Pages

Source: Internet
Author: User
Tags: xpath, xslt, python, web crawler

1. Introduction

The previous article, "Installation of Python 3.5 in Preparation for Writing Web Crawler Programs", built a simple collection program against a small static-website example. Dynamic web pages, however, obtain their data by loading JavaScript at run time, so opening the URL directly with urllib no longer meets the collection needs. Here we use the Selenium library, which lets a browser load the dynamic content for us so that the results can be captured.
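To make the difference concrete, here is a minimal sketch (the URL is only a placeholder for any page that renders its content with JavaScript): urllib returns the initial HTML exactly as the server sent it, while Selenium returns the DOM after the browser has executed the scripts.

# Minimal sketch: raw HTML via urllib vs. rendered DOM via Selenium.
# The URL is a placeholder for any dynamically rendered page.
from urllib import request
from selenium import webdriver

url = "https://www.example.com/"

# urllib only receives the server's initial HTML; JavaScript never runs
raw_html = request.urlopen(url).read().decode('utf-8', 'ignore')

# Selenium drives a real browser, so page_source reflects the DOM
# after the page's JavaScript has executed
driver = webdriver.Firefox()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

# For a dynamic page, rendered_html contains content that raw_html lacks
print(len(raw_html), len(rendered_html))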

In many cases Selenium is paired with PhantomJS to capture dynamic web content (see the case-study articles I published earlier). Driving Firefox or Chrome directly can handle a number of more complex collection situations; for example, on the anti-crawling side, driving an ordinary browser is less likely to be identified as a crawler. This article therefore uses Firefox, and the development difficulty does not increase.

Python version: Python3

2. Installing the Selenium Library

Press Win + R (or right-click the Start button and select Run), type cmd and press Enter to open a Command Prompt window, then enter the command: pip install selenium
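To verify the installation, you can run a quick check from the same window (the selenium package exposes a __version__ attribute):

python -c "import selenium; print(selenium.__version__)"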



3. A Simple Crawler for Amazon Products

3.1 Download the GooSeeker rule extractor module gooseeker.py (https://github.com/FullerHua/gooseeker/tree/master/core) and save it to a directory of your choice, e.g. E:\demo\gooseeker.py

Introducing the GooSeeker rule extractor eliminates the hassle of hand-writing XPath or regular expressions: collection rules are generated automatically through visual annotation and are then loaded and used via the API. For details, see the article on quickly making rules and getting the extractor API, referenced in the Summary below.

The following code uses this API, so it contains no lengthy XPath or regular-expression rules. The API key and crawl rule name in the code can be used directly; they belong to a public test rule.
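Under the hood, a rule is an XSLT document that gets applied to the page DOM with lxml. The following standalone sketch illustrates that mechanism with a trivial hand-written stylesheet; it is an illustration only, not a real GooSeeker rule.

# Illustration only: applying an XSLT rule to an HTML document with lxml,
# the same mechanism GsExtractor uses with rules delivered by the API.
from lxml import etree

doc = etree.HTML("<html><body><h1>Hello</h1></body></html>")
xslt_root = etree.XML(b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <titles><xsl:value-of select="//h1"/></titles>
  </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(xslt_root)
result = transform(doc)
print(str(result))  # prints the extracted XML, e.g. <titles>Hello</titles>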

3.2 Create a file with a .py suffix in the same directory as the extractor module gooseeker.py, for example E:\Demo\second.py, open it in Notepad, and type in the code:

# -*- coding: utf-8 -*-
# Example program using the GsExtractor class:
# drive Firefox via webdriver to collect an Amazon product list.
# The XSLT rule is saved in xslt_bbs.xml; the capture result is saved in result-2.xml.
import os
import time
from lxml import etree
from selenium import webdriver
from gooseeker import GsExtractor

# Drive Firefox
driver = webdriver.Firefox()

# Access and read the web page content
url = "https://www.amazon.cn/b/ref=s9_acss_bw_ct_refTest_ct_1_h?_encoding=UTF8&node=658810051&pf_rd_m=a1aj19psb66tgu&pf_rd_s=merchandised-search-5&pf_rd_r=wjandthe4nfayrr4p95k&pf_rd_t=101&pf_rd_p=289436412&pf_rd_i=658414051"
# Start loading
driver.get(url)
# Wait 2 seconds; increase for dynamic pages that take longer to load
time.sleep(2)
# Get the web page content
content = driver.page_source.encode('utf-8')
# Get the document object
doc = etree.HTML(content)

# Invoke the extractor
bbsExtra = GsExtractor()
# Set the XSLT crawl rule
bbsExtra.setXsltFromAPI("31d24931e043e2d5364d03b8ff9cc77e", "Amazon Books_Test")
# Call the extract method to extract the required content
result = bbsExtra.extract(doc)

# Save the result under the current directory
current_path = os.getcwd()
file_path = current_path + "/result-2.xml"
open(file_path, "wb").write(str(result).encode('utf-8'))
# Print the result
print(str(result).encode('gbk', 'ignore').decode('gbk'))
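The fixed time.sleep(2) is the simplest approach. A more robust alternative is Selenium's explicit wait, which polls until the content actually appears; a sketch follows, in which the CSS selector is a hypothetical placeholder for whatever element your rule targets.

# Sketch: replace the fixed sleep with an explicit wait.
# "ul.s-result-list" is a hypothetical placeholder selector.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "ul.s-result-list"))
)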


3.3 Execute second.py: open a Command Prompt window, change to the directory containing the second.py file, enter the command python second.py, and press Enter.

Note: this example drives Firefox, so Firefox must be installed; if it is not, download and install it from the official Firefox website.
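The original code assumes an older Selenium that drives Firefox directly. If you install a newer Selenium (3.x or later), Firefox is driven through Mozilla's separately downloaded geckodriver executable, which must be on your PATH or passed explicitly; a Selenium 3.x-style sketch, where the path is a hypothetical example location:

# Selenium 3.x style; the path is a hypothetical example location
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r"E:\demo\geckodriver.exe")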


3.4 View the saved result file: go to the directory containing the second.py file and find the XML file named result-2.xml.



4. Summary

When installing Selenium, the installation failed once because of network problems and succeeded on a retry. If you run into repeated timeouts and installation failures, you can try connecting through a VPN and then installing with the pip command again.

The next article, "Quickly Make Rules and Get the Extractor API", will explain how to quickly create a rule for a web page structure and obtain the content you need to collect through the rule's API.

5. History of Document Modification

2016-10-25: v1.0

6. GooSeeker Open Source Code Download

GooSeeker open-source Python web crawler: GitHub source
