Python Crawler Development, Part 1: Dynamic HTML, Selenium, PhantomJS

Last Update:2018-08-12 Source: Internet

Author: User

Tags tag name wrapper xpath


JavaScript

JavaScript is the most commonly used client-side scripting language on the Web. It can collect user tracking data, submit forms without reloading the page, embed multimedia files in a page, and even run browser games.

It appears in a page's source code inside <script> tags, for example:

<script type="text/javascript" src="https://statics.huxiu.com/w/mini/static_2015/js/sea.js?v=201601150944"></script>

jQuery

jQuery is a very common library: it is used by about 70% of the most popular sites (about 2 million) and about 30% of all other sites (about 200 million). A site that uses jQuery loads the library somewhere in its source code, for example:

<script type="text/javascript" src="https://statics.huxiu.com/w/mini/static_2015/js/jquery-1.11.1.min.js?v=201512181512"></script>

If you see jQuery on a website, take extra care when collecting data from it. jQuery can dynamically create HTML content that only appears after the JavaScript code has run. If you fetch the page in the traditional way, you get only the content that exists before the JavaScript executes.
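To make the difference concrete, here is a minimal Python 3 sketch that uses only the standard library. The HTML string and the id "news" are invented for illustration. A parser working on the raw source sees an empty container, because the script that would fill it never runs.

```python
from html.parser import HTMLParser

# Hypothetical page source: the <div> is filled in by JavaScript at run time.
RAW_HTML = """
<html><body>
<div id="news"></div>
<script>document.getElementById("news").innerHTML = "<p>Loaded by JS</p>";</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collects the text found inside <div id="news"> in the raw source."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "news":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data.strip())


def static_div_text(html):
    """Return the text a non-JavaScript parser finds inside the news <div>."""
    parser = DivTextCollector()
    parser.feed(html)
    return "".join(parser.text)


# The string "Loaded by JS" exists only inside the script, not inside the div.
print(repr(static_div_text(RAW_HTML)))
```

A browser that executes the script would show the paragraph inside the div; the static parse does not.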

Ajax

So far, the only way we have communicated with a web server is by sending an HTTP request for a new page. If a site submits a form or fetches information from the server without refreshing the page, the site is using Ajax.

Ajax is not a language but a group of techniques used together to accomplish a task (much like web data collection itself). The name stands for Asynchronous JavaScript and XML. With Ajax, a site can send information to and receive information from the web server without making a separate page request.

DHTML

Like Ajax, dynamic HTML (DHTML) is a collection of techniques that serve a common purpose. DHTML uses a client-side language to change a page's HTML elements (the HTML, the CSS, or both). For example, a button may appear only after the user moves the mouse, the background color may change on each click, or an Ajax request may load new content into the page. Whether a page counts as DHTML comes down to whether JavaScript is used to control its HTML and CSS elements.

Pages that use Ajax or DHTML to change or load content can be collected in several ways, but with Python there are two practical approaches:

①. Extract the content directly from the JavaScript code (time-consuming and laborious).

②. Use a third-party Python library to run the JavaScript and capture the page exactly as it appears in a browser (this is the approach that works in practice).
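Approach ① is sometimes manageable when the data sits in the page as a JSON literal. The following is a minimal sketch under that assumption: the page source, the variable name pageData, and the sample values are all invented for illustration.

```python
import json
import re

# Hypothetical page source with data embedded in a script tag.
PAGE_SOURCE = """
<script type="text/javascript">
var pageData = {"title": "Great Wall", "views": 1024};
</script>
"""


def extract_page_data(source):
    """Return the JSON object assigned to `pageData`, or None if absent."""
    # Non-greedy match up to the first "};" after the assignment.
    match = re.search(r"var\s+pageData\s*=\s*(\{.*?\});", source, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_page_data(PAGE_SOURCE)
print(data["title"], data["views"])
```

The regular expression only handles a flat object like this one. Real pages nest data, build it at run time, or fetch it later, which is why this route tends to be laborious.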

Selenium

Selenium drives a browser according to our instructions: it can load pages automatically, extract the required data, take screenshots, and check whether certain actions have occurred on a site.

The Selenium library is listed at https://pypi.python.org/simple/selenium and can be installed with the third-party package manager pip: pip install selenium

Selenium official reference documentation: http://selenium-python.readthedocs.io/index.html

PhantomJS

PhantomJS is a WebKit-based headless ("no interface") browser. It loads a site into memory and executes the JavaScript on the page, and because it does not render a graphical interface it runs more efficiently than a full browser.

Combining Selenium with PhantomJS gives you a crawler that handles JavaScript, cookies, headers, and anything else a real user's browser would need to do.

PhantomJS download: http://phantomjs.org/download.html. PhantomJS is a fully functional (though headless) browser rather than a Python library, so it is not installed the way Python libraries are. Selenium can, however, call PhantomJS directly.

PhantomJS official reference documentation: http://phantomjs.org/documentation

Basics

The Selenium library contains an API called WebDriver. WebDriver is a bit like a browser that can load a website, but it can also be used like BeautifulSoup or other selector objects to find page elements, interact with them (send text, click, and so on), and perform other actions that drive a web crawler.

# Test code (originally run under IPython 2)

# Import webdriver
from selenium import webdriver

# Import the Keys class in order to send keyboard keys
from selenium.webdriver.common.keys import Keys

# Create the browser object from the PhantomJS executable found via the environment variable
driver = webdriver.PhantomJS()

# If the PhantomJS location is not set in the environment variable
# driver = webdriver.PhantomJS(executable_path="./phantomjs")

# get() waits until the page has fully loaded before the program continues;
# tests usually add time.sleep(2) here
driver.get("http://www.baidu.com/")

# Get the text content of the element whose id is "wrapper"
data = driver.find_element_by_id("wrapper").text

# Print the data
print(data)

# Print the page title, "Baidu, you know"
print(driver.title)

# Take a snapshot of the current page and save it
driver.save_screenshot("baidu.png")

# id="kw" is the Baidu search input box; enter the string "Great Wall"
driver.find_element_by_id("kw").send_keys(u"Great Wall")

# id="su" is the Baidu search button; click() simulates a click
driver.find_element_by_id("su").click()

# Take a snapshot of the new page
driver.save_screenshot("Great Wall.png")

# Print the page source after rendering
print(driver.page_source)

# Get the cookies of the current page
print(driver.get_cookies())

# Ctrl+A: select everything in the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'a')

# Ctrl+X: cut the contents of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'x')

# Type new content into the input box
driver.find_element_by_id("kw").send_keys("itcast")

# Simulate the Enter key
driver.find_element_by_id("su").send_keys(Keys.RETURN)

# Clear the contents of the input box
driver.find_element_by_id("kw").clear()

# Take a snapshot of the new page
driver.save_screenshot("itcast.png")

# Get the current URL
print(driver.current_url)

# Close the current page; if only one page is open, this closes the browser
# driver.close()

# Quit the browser
driver.quit()

Page actions

Selenium's WebDriver provides many ways to find elements. Suppose the page contains the following form input box:

<input type="text" name="user-name" id="passwd-id" />

Then:

# Get the element by its id
element = driver.find_element_by_id("passwd-id")

# Get the element by its name
element = driver.find_element_by_name("user-name")

# Get the elements by tag name (returns a list)
element = driver.find_elements_by_tag_name("input")

# Or match with XPath
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
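To see what these lookups select without starting a browser, the input box can be parsed with the standard library. This is only an illustration of the XPath expression, not Selenium code: xml.etree.ElementTree supports a subset of XPath, and the wrapping form element is added here so that the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# The form input from the text, wrapped in a root element (an assumption
# made here so the fragment is well-formed XML).
FORM = '<form><input type="text" name="user-name" id="passwd-id" /></form>'

root = ET.fromstring(FORM)

# ".//" searches all descendants; "[@attr='value']" filters on an attribute.
by_id = root.find(".//input[@id='passwd-id']")
by_name = root.find(".//input[@name='user-name']")
by_tag = root.findall(".//input")

print(by_id.attrib["name"])
print(by_id is by_name, len(by_tag))
```

All three lookups reach the same element, which mirrors how the id, name, tag name, and XPath strategies can each locate the input box.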

Locating UI elements (WebElements)

The following APIs select elements. The find_element_by_* form returns a single element; the find_elements_by_* form returns a list of all matching elements:

find_element_by_id
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

By ID

<div id="coolestWidgetEvah">...</div>

Implementation:

element = driver.find_element_by_id("coolestWidgetEvah")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
element = driver.find_element(by=By.ID, value="coolestWidgetEvah")

By Class Name

<div class="cheese"><span>Cheddar</span></div><div class="cheese"><span>Gouda</span></div>

Implementation:

cheeses = driver.find_elements_by_class_name("cheese")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheeses = driver.find_elements(By.CLASS_NAME, "cheese")

By Tag Name

<iframe src="..."></iframe>

Implementation:

frame = driver.find_element_by_tag_name("iframe")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
frame = driver.find_element(By.TAG_NAME, "iframe")

By Name

<input name="cheese" type="text"/>

Implementation:

cheese = driver.find_element_by_name("cheese")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.NAME, "cheese")

By Link Text

<a href="http://www.google.com/search?q=cheese">cheese</a>

Implementation:

cheese = driver.find_element_by_link_text("cheese")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.LINK_TEXT, "cheese")

By Partial Link Text

<a href="http://www.google.com/search?q=cheese">search for cheese</a>

Implementation:

cheese = driver.find_element_by_partial_link_text("cheese")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.PARTIAL_LINK_TEXT, "cheese")

By CSS

<div id="food"><span class="dairy">milk</span><span class="dairy aged">cheese</span></div>

Implementation:

cheese = driver.find_element_by_css_selector("#food span.dairy.aged")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.CSS_SELECTOR, "#food span.dairy.aged")

By XPath

<input type="text" name="example" /><input type="text" name="other" />

Implementation:

inputs = driver.find_elements_by_xpath("//input")
------------------------ or -------------------------
from selenium.webdriver.common.by import By
inputs = driver.find_elements(By.XPATH, "//input")
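One way to use the generic find_element(by, value) form is to keep every locator in a single table, so that a page change means editing one place. The sketch below is only a suggestion: the table, its keys, and the two helper functions are hypothetical. The strategy strings are written out literally so the snippet has no imports; they are the values of By.ID, By.CSS_SELECTOR, and By.XPATH in Selenium.

```python
# Locator strategies as plain strings; in Selenium these are the values of
# By.ID, By.CSS_SELECTOR and By.XPATH respectively.
LOCATORS = {
    "widget": ("id", "coolestWidgetEvah"),
    "aged_cheese": ("css selector", "#food span.dairy.aged"),
    "all_inputs": ("xpath", "//input"),
}


def find_one(driver, name):
    """Look up a single element by its entry in LOCATORS."""
    by, value = LOCATORS[name]
    return driver.find_element(by, value)


def find_all(driver, name):
    """Look up every matching element by its entry in LOCATORS."""
    by, value = LOCATORS[name]
    return driver.find_elements(by, value)
```

With a real driver, find_one(driver, "aged_cheese") would be equivalent to the By.CSS_SELECTOR call shown above.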

Mouse Action Chain

To simulate mouse actions on the page, import the ActionChains class:

# Import the ActionChains class
from selenium.webdriver import ActionChains

# Move the mouse to the position of ac
ac = driver.find_element_by_xpath('element')
ActionChains(driver).move_to_element(ac).perform()

# Click at the position of ac
ac = driver.find_element_by_xpath("elementA")
ActionChains(driver).move_to_element(ac).click(ac).perform()

# Double-click at the position of ac
ac = driver.find_element_by_xpath("elementB")
ActionChains(driver).move_to_element(ac).double_click(ac).perform()

# Right-click at the position of ac
ac = driver.find_element_by_xpath("elementC")
ActionChains(driver).move_to_element(ac).context_click(ac).perform()

# Press and hold the left button at the position of ac
ac = driver.find_element_by_xpath('elementF')
ActionChains(driver).move_to_element(ac).click_and_hold(ac).perform()

# Drag ac1 to the position of ac2
ac1 = driver.find_element_by_xpath('elementD')
ac2 = driver.find_element_by_xpath('elementE')
ActionChains(driver).drag_and_drop(ac1, ac2).perform()

Fill out the form

When you encounter a <select></select> drop-down box, simply clicking an option in it does not always work.

<select id="status" class="form-control valid" onchange="" name="status">
    <option value=""></option>
    <option value="0">Not audited</option>
    <option value="1">Preliminary review passed</option>
    <option value="2">Second review passed</option>
    <option value="3">Audit failed</option>
</select>

Selenium provides a dedicated Select class for handling drop-down boxes:

# Import the Select class
from selenium.webdriver.support.ui import Select

# Find the element whose name is "status"
select = Select(driver.find_element_by_name('status'))

select.select_by_index(1)
select.select_by_value("0")
select.select_by_visible_text(u"Not audited")

These are the three ways to choose an option in a drop-down box: by index, by value, or by visible text. Note:

index starts from 0

value is the value attribute of the option tag, not the text displayed in the drop-down box

visible_text is the text inside the option tag, which is what the drop-down box displays

To deselect everything, use select.deselect_all() (this applies to multi-select boxes only).
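The relationship between index, value, and visible text is easy to mix up, so here is a small standard-library sketch that lists all three for each option. It does not use Selenium; it parses a shortened copy of the drop-down box (the HTML here is an abbreviated, assumed version of the example) to show that index 1, value "0", and the text "Not audited" all refer to the same option.

```python
from html.parser import HTMLParser

# Abbreviated copy of the drop-down box, assumed for illustration.
SELECT_HTML = """
<select id="status" name="status">
    <option value=""></option>
    <option value="0">Not audited</option>
    <option value="1">Preliminary review passed</option>
</select>
"""


class OptionCollector(HTMLParser):
    """Collects (index, value, visible text) for every <option>."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False
        self._value = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._in_option:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._in_option:
            index = len(self.options)  # index counts from 0
            text = "".join(self._text).strip()
            self.options.append((index, self._value, text))
            self._in_option = False


collector = OptionCollector()
collector.feed(SELECT_HTML)
for option in collector.options:
    print(option)
```

The second tuple printed is (1, '0', 'Not audited'), which is why select_by_index(1), select_by_value("0"), and select_by_visible_text with that text pick the same entry.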

Pop-up window processing

When an action triggers an event and the page shows a pop-up prompt, use the following to handle the prompt or read its message:

alert = driver.switch_to.alert
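Once the alert object is in hand, its text attribute holds the message, accept() clicks OK, and dismiss() clicks Cancel. A minimal sketch follows; the helper name is hypothetical and it assumes an alert is actually open (otherwise Selenium raises NoAlertPresentException).

```python
def read_and_accept_alert(driver):
    """Return the message of the current alert, then click its OK button."""
    alert = driver.switch_to.alert
    message = alert.text
    alert.accept()
    return message
```

Calling alert.dismiss() in place of alert.accept() would cancel the prompt.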

Page switching

A browser will often have several windows open, so we need a way to switch between them. Here is how to switch windows:

driver.switch_to.window("this is window name")

You can also use the window_handles attribute to get the handle of every window. For example:

for handle in driver.window_handles:
    driver.switch_to.window(handle)
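Window handles are opaque strings, so in practice the loop usually checks something about each window before stopping. A minimal sketch, assuming the target window can be recognized by its title (the helper name and that criterion are choices made here, not part of Selenium):

```python
def switch_to_window_titled(driver, title):
    """Switch to the first window whose title equals `title`.

    Returns the handle on success. If no window matches, switches back to
    the window that was active at the start and returns None.
    """
    original = driver.current_window_handle
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        if driver.title == title:
            return handle
    driver.switch_to.window(original)
    return None
```

Restoring the original window on failure keeps later lookups from running against an unexpected page.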

Page forward and backward

To use the browser's forward and back functions:

driver.forward()     # go forward
driver.back()        # go back

Cookies

To get each cookie value for the page, use the following:

for cookie in driver.get_cookies():
    print("%s -> %s" % (cookie['name'], cookie['value']))

To delete cookies, use the following:

# By name
driver.delete_cookie("CookieName")

# All
driver.delete_all_cookies()
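Because get_cookies() returns a list of dictionaries with 'name' and 'value' keys, it is easy to flatten into a plain mapping. A minimal sketch (the helper name is hypothetical):

```python
def cookies_as_dict(driver):
    """Map cookie names to values for all cookies visible to the current page."""
    return {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}
```

A mapping like this can then be handed to an ordinary HTTP client, for example through the cookies parameter of the requests library, so that later requests reuse the browser's session without rendering every page.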

Page wait

More and more web pages use Ajax, so a program cannot know exactly when an element has finished loading. If the page takes longer than expected to produce a DOM element and your code uses the WebElement right away, the lookup fails with an exception (NoSuchElementException in the Python bindings).

Such timing problems make elements hard to locate and raise the chance of an ElementNotVisibleException. Selenium therefore provides two kinds of waits: implicit and explicit.

An implicit wait waits up to a fixed amount of time; an explicit wait specifies a condition and continues only once that condition holds.

1. Explicit wait

An explicit wait specifies a condition and a maximum wait time. If the element has not been found within that time, an exception is thrown.

from selenium import webdriver
from selenium.webdriver.common.by import By
# WebDriverWait is responsible for the polling loop
from selenium.webdriver.support.ui import WebDriverWait
# expected_conditions provides the conditions to wait for
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.xxxxx.com/loading")
try:
    # Keep polling the page until id="myDynamicElement" appears
    # (WebDriverWait requires a maximum wait time; 10 seconds is used here)
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

If you do not pass a polling interval, the program checks every 0.5 s by default whether the element has been generated; if the element already exists, the call returns immediately.
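The polling behavior can be pictured with a few lines of plain Python. This is a simplified model written for illustration, not Selenium's implementation: the function name is hypothetical, it raises the built-in TimeoutError, and it does not swallow lookup exceptions the way WebDriverWait does.

```python
import time


def wait_until(condition, timeout, poll_frequency=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # the condition already holds: return immediately
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_frequency)


# Usage with a stand-in condition that succeeds on the third call.
attempts = []


def element_appears():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None


result = wait_until(element_appears, timeout=2, poll_frequency=0.01)
print(result, len(attempts))
```

The stand-in condition plays the role of an expected condition such as presence_of_element_located, which returns the element once it exists.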

Here are some built-in wait conditions that you can call directly instead of writing your own:

element_to_be_clickable (the element is displayed and enabled)
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present

2. Implicit wait

An implicit wait is simpler: you just set a wait time, in seconds.

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # seconds (the method requires a value; 10 is used here)
driver.get("http://www.xxxxx.com/loading")
myDynamicElement = driver.find_element_by_id("myDynamicElement")

If it is not set, the default wait time is 0.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
