Python crawler accumulation (1) -------- use of selenium + python + PhantomJS

Source: Internet
Author: User
Tags selenium python selenium grid

Recently, for a requirement at work, I could not find the address of the JavaScript request that returns the data, so I used selenium to crawl the information. Link: python crawler practice (I) -------- China crop germplasm Information Network

1. Introduction to Selenium

What is Selenium? In a word, it is an automated testing tool. It supports mainstream browsers with a graphical interface, including Chrome, Safari, and Firefox. Once the matching browser driver is installed, you can conveniently test web pages in that browser. Selenium has bindings for several languages, such as Java, C#, and Ruby. Is there one for Python? Of course! To install it, run pip install selenium in cmd.

2. Why do crawlers use selenium?

Ordinary websites can be crawled with scrapy, requests, and beautifulsoup, but some information only appears after JavaScript has run. With selenium, basically everything you can see in the browser can be crawled. I ran into this problem while learning, so I am recording it here for later reference.
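One quick way to decide whether a page needs a browser at all is to check whether the text you want already appears in the raw HTML. The following is only a sketch of that idea: `needs_js_rendering` and both sample strings are hypothetical names and data made up for illustration, not part of any library.

```python
def needs_js_rendering(raw_html, expected_text):
    """Return True when the expected text is absent from the raw HTML.

    If the server response already contains the text, requests or scrapy
    are enough. If it does not, the text is probably filled in by
    JavaScript, and a real browser driven by selenium is needed.
    """
    return expected_text not in raw_html


# Hypothetical server response: the price is inserted later by a script.
static_html = "<html><body><div id='price'></div><script src='app.js'></script></body></html>"
# Hypothetical DOM after a browser has executed the script.
rendered_html = "<html><body><div id='price'>420</div></body></html>"

print(needs_js_rendering(static_html, "420"))    # True
print(needs_js_rendering(rendered_html, "420"))  # False
```

In practice `raw_html` would be the body returned by a plain HTTP request, and `expected_text` a value you can see in the browser.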

webdriver is a module in selenium:

    from selenium import webdriver
    driver = webdriver.PhantomJS()
    driver.get('website')

PhantomJS can be replaced with Chrome, Firefox, or Ie. PhantomJS, however, is a headless browser: it does not open a browser window, and it runs more efficiently. While debugging, you can switch to Chrome to see what is happening, and then switch back to PhantomJS.
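The switch between a visible browser for debugging and PhantomJS for normal runs can be put behind one flag. This is a minimal sketch: `pick_browser_name` and `make_driver` are hypothetical helper names, and it assumes a selenium version that still ships `webdriver.PhantomJS` (it was removed in Selenium 4).

```python
def pick_browser_name(debug):
    """Choose the webdriver class name: Chrome when debugging, PhantomJS otherwise."""
    return "Chrome" if debug else "PhantomJS"


def make_driver(debug=False):
    """Create a driver for the chosen browser.

    selenium is imported here rather than at module level, so the
    name-selection logic above can be used without selenium installed.
    """
    from selenium import webdriver
    driver_class = getattr(webdriver, pick_browser_name(debug))
    return driver_class()


print(pick_browser_name(debug=True))
print(pick_browser_name(debug=False))
```

With this in place, `driver = make_driver(debug=True)` opens a Chrome window while you work out the locators, and `make_driver()` runs the same script without a window.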

3. Introduction to PhantomJS

PhantomJS is a headless WebKit browser with a JavaScript API. It uses QtWebKit as its browser core and uses WebKit to compile, interpret, and execute JavaScript code, so it can do anything a WebKit-based browser can do. It is not only an invisible browser: it also provides CSS selectors, supports web standards, DOM manipulation, JSON, HTML5, Canvas, SVG, and so on, and it offers file I/O operations so that you can read and write files on the operating system. PhantomJS is widely used, for example for headless front-end automated testing (combined with Jasmine), network monitoring, and web page screenshots.

Official PhantomJS address: http://phantomjs.org/

Official PhantomJS API: http://phantomjs.org/api/

Official PhantomJS examples: http://phantomjs.org/examples/

PhantomJS GitHub: https://github.com/ariya/phantomjs/

4. Install PhantomJS

On Windows 7, move the downloaded phantomjs.exe file into the Scripts folder of your Python installation. (Download phantomjs-2.1.1-windows.zip, link: http://pan.baidu.com/s/1c8HeBc password: 2zm4)
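Moving phantomjs.exe into the Scripts folder works because that folder is normally on PATH. A small check like the one below can confirm that the executable is actually found. It is a sketch that assumes Python 3 (for `shutil.which`); `find_phantomjs` is a hypothetical helper name.

```python
import shutil


def find_phantomjs(executable="phantomjs"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_phantomjs()
if path is None:
    print("phantomjs was not found on PATH")
else:
    print("phantomjs found at", path)
```

If the executable lives somewhere else, selenium versions that include the PhantomJS driver also accept its location directly, for example `webdriver.PhantomJS(executable_path=path)`.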

Small test:

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get("http://hotel.qunar.com/")
    data = driver.title
    print(data)

    # output
    # [] Hotel Reservation, hotel query-Qunar.com

5. Hands-on practice

    # -*- coding: UTF-8 -*-
    from selenium import webdriver
    import time
    import win32api
    import re
    import win32con

    browser = webdriver.PhantomJS()
    '''PhantomJS screenshots cover the page scrolled to the bottom; Chrome's do not'''

    browser.get("http://flight.qunar.com/")  # open the Qunar website
    a = browser.get_screenshot_as_file("E:/Python27/test2.jpg")  # take a screenshot

    browser.find_element_by_id("searchTypeRnd").click()  # select round trip
    browser.find_element_by_xpath('//*[@id="dfsForm"]/div[2]/div[1]/div/input').clear()  # first clear the default value from the input box
    browser.find_element_by_xpath('//*[@id="dfsForm"]/div[2]/div[1]/div/input').send_keys(u"Beijing")  # enter the departure city

    '''For win32api, refer to the relevant manual. The following are keyboard operations.'''
    time.sleep(0.5)
    win32api.keybd_event(108, 0, 0, 0)  # press the Enter key (same key code as the release call below)
    # press a key: win32api.keybd_event(key code, 0, 0, 0)
    win32api.keybd_event(108, 0, win32con.KEYEVENTF_KEYUP, 0)  # release the key
    # release a key: win32api.keybd_event(key code, 0, win32con.KEYEVENTF_KEYUP, 0)

    browser.find_element_by_xpath('//*[@id="dfsForm"]/div[2]/div[2]/div/input').clear()
    browser.find_element_by_xpath('//*[@id="dfsForm"]/div[2]/div[2]/div/input').send_keys(u"Shanghai")  # enter the destination city
    time.sleep(0.5)
    win32api.keybd_event(108, 0, 0, 0)  # press the Enter key
    win32api.keybd_event(108, 0, win32con.KEYEVENTF_KEYUP, 0)  # release the key

    browser.find_element_by_xpath('//*[@id="fromDate"]').clear()
    browser.find_element_by_xpath('//*[@id="fromDate"]').send_keys("")  # enter the departure date
    # browser.find_element_by_xpath('//*[@id="fromDate"]').click()
    browser.find_element_by_xpath('//*[@id="toDate"]').clear()
    browser.find_element_by_xpath('//*[@id="toDate"]').send_keys("")  # enter the return date
    # browser.find_element_by_xpath('//*[@id="toDate"]').click()

    '''Method 2: set the locations and date directly'''
    # browser.find_element_by_name("name").send_keys("Beijing (BJS)")  # set the value
    # browser.find_element_by_name("pass").send_keys("Shanghai (SHA)")  # set the value
    # browser.find_element_by_id("txtAirplaneTime1").send_keys("2016-12-19")  # set the value

    browser.find_element_by_xpath('//*[@id="dfsForm"]/div[4]/button').click()  # click the button to submit the form
    browser.maximize_window()  # maximize the window

    '''Save the current web page'''
    print(browser.current_url)  # current URL
    # browser.get("http://www.ly.com/FlightQuery.aspx")  # the cookie is kept in the browser object, so pages that need authentication can be opened directly
    data = browser.page_source.encode("UTF-8", "ignore")
    fh = open("E:/python27/qun.html", "wb")
    fh.write(data)
    fh.close()
    data2 = browser.page_source
    # print data2
    a = browser.get_screenshot_as_file("E:/Python27/test.jpg")
    # print(browser.page_source)

    '''The data can be extracted from the saved page later'''

    browser.quit()

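The "save current webpage" step in the script above can be wrapped in a small helper so that the file is always closed and always written as UTF-8. This is a sketch only: `save_page`, `load_page`, the sample HTML, and the temporary file name are all assumptions made for illustration, and in a real run the string would come from `browser.page_source`.

```python
import io
import os
import tempfile


def save_page(html, path):
    """Write a page source string to disk as UTF-8; the file is closed automatically."""
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(html)


def load_page(path):
    """Read a saved page back as a text string."""
    with io.open(path, "r", encoding="utf-8") as fh:
        return fh.read()


# Hypothetical page source standing in for browser.page_source.
sample = u"<html><head><title>Qunar \u53bb\u54ea\u513f</title></head><body></body></html>"
target = os.path.join(tempfile.gettempdir(), "qunar_sample.html")

save_page(sample, target)
print(load_page(target) == sample)  # True
```

Saving the rendered page first means the extraction step can be developed and rerun against the file without reopening the browser each time.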
6. Recommended learning material: the selenium + webdriver + python series on the wormhole blog

Link:

Selenium is an automated testing tool. Selenium 1.0 comprises Selenium RC, Selenium IDE, Selenium Grid, and Selenium Core.

WebDriver is an automated testing framework (or a set of standardized APIs) that originated at Google.

WebDriver and Selenium each have their own advantages. Both teams thought that merging would make them even better, so:

Selenium 2.0 = Selenium RC + WebDriver

Selenium can be used from multiple languages: C#, Java, Python, Ruby, and others.

Environment setup:

Selenium + python automated test environment setup

Translation: selenium webdriver (python)

Easy automation series directory

Easy automation --- selenium-webdriver (python) (1)

Start our first script:

  • Get familiar with selenium's python code style
  • time.sleep(): add a pause
  • print: output information
Easy automation --- selenium-webdriver (python) (2)
  • Print URL
  • Maximize the browser
  • Set the fixed width and height of the browser
  • Control browser forward and backward
Easy automation --- selenium-webdriver (python) (3)

Simple object locating:

  • id
  • name
  • class name
  • link text
  • partial link text
  • tag name
  • xpath
  • css selector
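The eight locator strategies listed above correspond to the strings that WebDriver's generic `find_element(by, value)` call accepts. The sketch below wraps that call. `LOCATOR_STRATEGIES`, `locate`, and `FakeDriver` are hypothetical names, and `FakeDriver` only stands in for a real browser so the example runs without one.

```python
# Strategy names mapped to the locator strings understood by find_element.
LOCATOR_STRATEGIES = {
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "xpath": "xpath",
    "css_selector": "css selector",
}


def locate(driver, strategy, value):
    """Find one element with the named strategy through driver.find_element."""
    if strategy not in LOCATOR_STRATEGIES:
        raise ValueError("unknown locator strategy: %r" % strategy)
    return driver.find_element(LOCATOR_STRATEGIES[strategy], value)


class FakeDriver(object):
    """Stand-in for a webdriver instance: echoes the arguments it receives."""

    def find_element(self, by, value):
        return (by, value)


print(locate(FakeDriver(), "id", "searchTypeRnd"))
print(locate(FakeDriver(), "xpath", '//*[@id="fromDate"]'))
```

With a real driver, `locate(browser, "id", "searchTypeRnd")` would return the element itself instead of a tuple.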
Easy automation --- selenium-webdriver (python) (4)
  • Locate a group of elements
Easy automation --- selenium-webdriver (python) (5)
  • Hierarchical Positioning
Easy automation --- selenium-webdriver (python) (6)

Operating on objects:

  • click: click the object
  • send_keys: simulate key input on the object
  • clear: clear the object's content, if there is any

Other common WebElement methods:

  • text: get the text of the element
  • submit: submit the form
  • get_attribute: get the value of an attribute
Easy automation --- selenium-webdriver (python) (7)

Locating inside nested frames or windows:

  • switch_to_frame()
  • switch_to_window()

Smart wait:

  • implicitly_wait()
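A "smart wait" replaces a fixed `time.sleep()` with polling: keep checking until the condition holds or a time limit runs out. The function below is a hand-rolled illustration of that polling idea, not Selenium's own implementation. `wait_until` and the demo condition are hypothetical names.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once the timeout has
    passed without the condition becoming true.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Demo condition: becomes true on the third check, like an element
# that only appears after some JavaScript has finished.
calls = {"count": 0}


def ready_on_third_call():
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until(ready_on_third_call, timeout=2.0, interval=0.01))  # True
print(calls["count"])  # 3
```

The benefit over a fixed sleep is that the script continues as soon as the page is ready and only waits the full time when something is wrong.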
Easy automation --- selenium-webdriver (python) (8)

Calling JavaScript:

execute_script(script, *args)

Easy automation --- selenium-webdriver (python) (9)
  • Upload files
Easy automation --- selenium-webdriver (python) (10)
  • Process the drop-down list
  • switch_to_alert()
  • accept()
Easy automation --- selenium-webdriver (python) (11)
  • Control the scroll bar to the bottom
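Scrolling to the bottom is usually done by handing a line of JavaScript to `execute_script`. The sketch below shows one common form. `scroll_to_bottom` and `RecordingDriver` are hypothetical names, and `RecordingDriver` only records the script it is given so the example runs without a browser.

```python
# JavaScript that scrolls the window to the bottom of the page.
SCROLL_TO_BOTTOM_JS = "window.scrollTo(0, document.body.scrollHeight);"


def scroll_to_bottom(driver):
    """Ask the browser to scroll to the bottom of the current page."""
    return driver.execute_script(SCROLL_TO_BOTTOM_JS)


class RecordingDriver(object):
    """Stand-in for a webdriver instance: remembers the scripts it receives."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script, *args):
        self.scripts.append(script)


fake = RecordingDriver()
scroll_to_bottom(fake)
print(fake.scripts)
```

With a real driver the call is simply `scroll_to_bottom(browser)`, which is useful on pages that load more content as you scroll.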
Easy automation --- selenium-webdriver (python) (12)
  • Keyboard key usage
  • Keyboard key combination usage
  • Runtime error when send_keys() is given Chinese input
Selenium-webdriver (python) (13) -- cookie handling
  • driver.get_cookies(): get cookie information
  • add_cookie(cookie_dict): add session information to the cookies
  • delete_cookie(name): delete a specific cookie
  • delete_all_cookies(): delete all cookies
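`get_cookies()` returns a list of dictionaries, each with at least a `name` and a `value` key. A frequent use in crawling is to log in with the browser and then reuse those cookies in plain HTTP requests. The sketch below shows the conversion. `cookies_to_dict`, `cookies_to_header`, and the sample cookies are hypothetical and made up for illustration.

```python
def cookies_to_dict(cookies):
    """Turn the list returned by get_cookies() into a simple name -> value dict."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookies_to_header(cookies):
    """Build the value of an HTTP Cookie header from the same list."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


# Hypothetical data in the shape that driver.get_cookies() returns.
sample_cookies = [
    {"name": "session", "value": "abc123", "domain": "example.com"},
    {"name": "lang", "value": "en", "domain": "example.com"},
]

print(cookies_to_dict(sample_cookies))
print(cookies_to_header(sample_cookies))  # session=abc123; lang=en
```

Going the other way, a dictionary with `name` and `value` keys is the shape that `add_cookie(cookie_dict)` expects.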
Selenium-webdriver (python) (14) -- webdriver principles
  • Analysis of how webdriver works

Selenium-webdriver (python) (15) -- mouse events
  • context_click(): right-click
  • double_click(): double-click
  • drag_and_drop(): drag
Selenium-webdriver (python) (16) -- unittest framework
  • Analysis of the unittest testing framework

Author: Jin Xiao
Source: http://www.cnblogs.com/jinxiao-pu
The copyright of this article is shared by the author and the blog. You are welcome to repost it, but unless the author agrees otherwise you must keep this statement and provide a link to the original article on the article page.

Reference: http://www.cnblogs.com/zzhzhao/p/5380376.html

http://www.cnblogs.com/BigFishFly/p/6380024.html
