Python crawler: Simulate browser behavior with selenium


A few days ago, a reader asked me a crawler question: when crawling the images under the hot feed on the Baidu Tieba homepage, he always got fewer images than what is visible on the page. He had already worked out the rough cause: the rest of the images are loaded dynamically. His question was how to crawl that dynamically loaded part.

Analysis

His code is fairly simple and consists of the following steps: using the requests and BeautifulSoup libraries, open the Baidu Tieba homepage, find the img tags under the element with id new_list, and finally save the images those img tags point to.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
data = requests.get("https://tieba.baidu.com/index.html", headers=headers)
html = BeautifulSoup(data.text, 'lxml')
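
The saving step is not shown above; a minimal sketch of it, assuming the images sit under the #new_list element, expose their URL in the src attribute, and are written to an images directory (all assumptions for illustration), could look like this:

import os

os.makedirs('images', exist_ok=True)
for i, img in enumerate(html.select('#new_list img')):
    src = img.get('src')          # assumed: the image URL is in the src attribute
    if not src:
        continue
    resp = requests.get(src, headers=headers)
    with open(os.path.join('images', '%d.jpg' % i), 'wb') as f:
        f.write(resp.content)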

As mentioned earlier, some of the images are loaded dynamically, so first we have to figure out how that loading happens. Open the Baidu Tieba homepage in a browser and scroll down: when you reach the bottom, the scroll bar shortens and jumps back up a little. That is the visible sign of new elements being appended to the HTML document's DOM. Dynamic loading is simply an Ajax request, and Ajax is essentially an XMLHttpRequest (XHR for short). In Google Chrome, we can monitor XHR requests in the Network panel of the Developer Tools.

When the page is first opened, none of the XHR requests are related to the images we want to crawl.

The 1st time the scroll bar is scrolled to the bottom, the request returns hot-feed items 20-40, which contain images to crawl.

The 2nd time the scroll bar is scrolled to the bottom, the request returns hot-feed items 40-60, which also contain images to crawl, and the response carries has_more:false, indicating there is no more data.

The 3rd time the scroll bar is scrolled to the bottom, no further XHR request is sent.

Solution

From the analysis above, we know that a crawler using only BeautifulSoup can fetch just the images in the first 1-20 hot-feed items. To crawl all of the images in the hot feed, we need to simulate scrolling in the browser so that the page triggers the XHR requests that load more items.

In Python, you can use the selenium library to emulate browser behavior. Selenium is an automated testing framework that can simulate all kinds of browser actions; here we use it to open the Baidu Tieba homepage and to scroll the page down to the bottom.

Installation
pip install selenium
Download Browser Driver
    • Firefox driver: https://github.com/mozilla/geckodriver/releases

    • Google Chrome driver: http://chromedriver.storage.googleapis.com/index.html?path=2.33/

    • Opera driver: https://github.com/operasoftware/operachromiumdriver/releases

From the addresses above, download the driver that matches the browser installed on your computer and its version, or download the drivers from my GitHub project (address: https://github.com/Sesshoumaru/attachments/tree/master/selenium%20webdriver). After downloading and unpacking, add the directory containing the driver to the system environment variables. You can also put the downloaded driver into the lib directory of your Python installation, because that directory already exists in the environment variables (that is what I do).

Simulating browser behavior with Python code

To use selenium, first define a browser object; it is created from the driver of the specific browser installed on your computer. Here is an example using Firefox:

from selenium import webdriver

browser = webdriver.Firefox()
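
If you use Chrome instead, the object is created the same way from the Chrome driver; a small sketch (the headless option is optional, and the options= keyword assumes a reasonably recent Selenium version):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')            # optional: run without opening a visible window
browser = webdriver.Chrome(options=options)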

Then open the Tieba homepage in the simulated browser:

browser.get("https://tieba.baidu.com/index.html")

Then simulate scrolling the scroll bar down to the bottom:

import time

for i in range(15):
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(1)
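
The fixed count of 15 scrolls is only an estimate. As an alternative sketch, you can keep scrolling until the page height stops growing (the one-second pause is an assumption about how long the Ajax response takes to render):

import time

last_height = browser.execute_script('return document.body.scrollHeight')
while True:
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(1)                  # wait for the XHR response to render
    new_height = browser.execute_script('return document.body.scrollHeight')
    if new_height == last_height:  # nothing new was appended, stop scrolling
        break
    last_height = new_height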

Finally, use BeautifulSoup again to parse the image tags:

html = BeautifulSoup(browser.page_source, "lxml")
imgs = html.select("#new_list li img")
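
Putting the pieces together, a complete sketch of the fixed crawler might look like the following (the images output directory, the 15-scroll loop, and reading the image URL from the src attribute are assumptions for illustration):

import os
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://tieba.baidu.com/index.html")

# scroll to the bottom repeatedly so the page issues its XHR requests
for i in range(15):
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(1)

# parse the fully loaded page, then release the browser
html = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()

os.makedirs('images', exist_ok=True)
for i, img in enumerate(html.select('#new_list li img')):
    src = img.get('src')          # assumed: the image URL is in the src attribute
    if not src:
        continue
    resp = requests.get(src)
    with open(os.path.join('images', '%d.jpg' % i), 'wb') as f:
        f.write(resp.content)
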
A few notes.
    • The browser and the browser driver must both be installed, and the driver must match the browser: if you use Google Chrome to simulate page behavior, download the Chrome driver; if you use Firefox, download the Firefox driver.
    • The directory containing the browser driver must be added to the environment variables, or the driver path must be specified explicitly when defining the browser object (see the sketch below).
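
For the second note, a minimal sketch of pointing at the driver file directly when creating the browser object (the path is just a placeholder; this is the Selenium 3 style, while Selenium 4 wraps the path in a Service object instead):

from selenium import webdriver

# placeholder path; replace with the actual location of geckodriver
browser = webdriver.Firefox(executable_path='/path/to/geckodriver')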

More Selenium usage

Find elements
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://tieba.baidu.com/index.html")

new_list = browser.find_element_by_id('new_list')
user_name = browser.find_element_by_name('user_name')
active = browser.find_element_by_class_name('active')
p = browser.find_element_by_tag_name('p')

# find_element_by_name                find a single element by name
# find_element_by_xpath               find a single element by XPath
# find_element_by_link_text           find a single element by link text
# find_element_by_partial_link_text   find a single element by partial link text
# find_element_by_tag_name            find a single element by tag name
# find_element_by_class_name          find a single element by class name
# find_element_by_css_selector        find a single element by CSS selector
# find_elements_by_name               find multiple elements by name
# find_elements_by_xpath              find multiple elements by XPath
# find_elements_by_link_text          find multiple elements by link text
# find_elements_by_partial_link_text  find multiple elements by partial link text
# find_elements_by_tag_name           find multiple elements by tag name
# find_elements_by_class_name         find multiple elements by class name
# find_elements_by_css_selector       find multiple elements by CSS selector
Get element Information
btn_more = browser.find_element_by_id('btn_more')
print(btn_more.get_attribute('class'))  # get the class attribute
print(btn_more.get_attribute('href'))   # get the href attribute
print(btn_more.text)                    # get the text value
Element interaction Operations
from selenium.webdriver.common.by import By

btn_more = browser.find_element_by_id('btn_more')
btn_more.click()                          # simulate a click, e.g. to click "load more"

input_q = browser.find_element(By.ID, 'q')
input_q.clear()                           # clear the input
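
Typing into an input works the same way; a small sketch, assuming the page has a search box with id q:

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

input_q = browser.find_element(By.ID, 'q')   # assumed: an input with id q exists
input_q.clear()                              # clear any existing text
input_q.send_keys('python')                  # type a query
input_q.send_keys(Keys.ENTER)                # press Enter to submit
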
Execute JavaScript
# execute a JavaScript script
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')
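
execute_script can also return a value to Python when the script uses a return statement, which is handy for checks such as reading the current page height:

# read a value back from the page
height = browser.execute_script('return document.body.scrollHeight')
print(height)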
