Talking about python crawlers Using Selenium to simulate browser behavior, pythonselenium

Source: Internet
Author: User

Talking about python crawlers Using Selenium to simulate browser behavior, pythonselenium

A reader asked me a crawler question a few days ago, that is, when I climbed to the popular dynamic pictures on the Baidu Post Bar homepage, The crawled pictures were always incomplete, it is less visible than the homepage. The reason is that the image is dynamically loaded. The problem is how to crawl these Dynamically Loaded Images.

Analysis

His code is relatively simple. The main steps are as follows: Use the BeautifulSoup library, open the home address of Baidu post bar, and then parse the img tag under the id new_list tag, finally, save the image of the img tag.

headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}data=requests.get("https://tieba.baidu.com/index.html",headers=headers)html=BeautifulSoup(data.text,'lxml')

As mentioned above, some images are dynamically loaded. First, we need to figure out how these images are dynamically loaded. Open the homepage of Baidu Post Bar in the browser. You can see that when you scroll down the scroll bar, the scroll bar is shortened and a distance is moved up. This phenomenon is also a manifestation of the dynamic addition of DOM elements to html documents. Dynamic Data loading is nothing more than ajax requests, while ajax is essentially an XMLHttpRequest request (xhr ). In Google browser, we can monitor xhr requests through the network panel of the developer tool.

The xhr request when the homepage is opened. The request here is irrelevant to the image to be crawled.

The scroll bar scroll down 1st times to the bottom. Here, the requested hot news is 20-40, including the image to be crawled.

The scroll bar scroll down 2nd times to the bottom. Here, the request is for the 40-60th hot news, including the image to be crawled. And the returned has_more: false indicates that there is no more data.

The scroll bar scroll down 3rd times to the bottom, no xhr request.

Solution

According to the above analysis, we have learned that when we use BeautifulSoup for crawling, we can only crawl 1-20 images in hot news. In order to crawl the complete picture in the popular dynamic, we need to simulate the scroll bar of the browser to let the webpage trigger xhr requests more popular dynamics.

In python, if you need to simulate browser behavior, you can use the selenium library. The selenium library is an automated testing framework that can be used to simulate various behaviors of the browser. Here we use it to simulate the browser to open the homepage of the Baidu Post Bar and simulate the scroll down to the bottom of the page.

Install

pip install selenium

Download the browser driver

Firefox browser driver, which is: https://github.com/mozilla/geckodriver/releases

Google browser driver, which is: http://chromedriver.storage.googleapis.com/index.html? Route = 2.33/

Operabrowser driver, which is: https://github.com/operasoftware/operachromiumdriver/releases

Download the driver file from the above address, or download the above drivers from my github Project (Address: https://github.com/Sesshoumaru/attachments/tree/master/Selenium%20WebDriver) in comparison to the browser installed on your computer and the corresponding version ). After downloading and decompressing the package, add the directory to the environment variable of the system. Of course, you can also put the downloaded driver in the lib directory of the python installation directory, because it already exists in the environment variable (I am doing this ).

Simulate browser behavior using python code

To use selenium, you first need to define a specific browser object. Here we will define the specific browser installed on your computer and the driver of the installed browser. The following uses Firefox as an example:

from selenium import webdriverbrowser = webdriver.Firefox()

Open the homepage again:

browser.get(https://tieba.baidu.com/index.html)

Then simulate scroll to the bottom

for i in range(1, 5): browser.execute_script('window.scrollTo(0, document.body.scrollHeight)') time.sleep(1)

Finally, use BeautifulSoup to parse the Image Tag:

html = BeautifulSoup(browser.page_source, "lxml")imgs = html.select("#new_list li img")

Notes

The browser and browser driver must be installed and must be configured

That is, if you use Google Chrome to simulate web page behavior, you need to download the Google Chrome driver;
If you use Firefox to simulate web page behavior, you need to download Firefox driver

The directory where the browser driver is located should be in the environment variable, or the path of the driver should be specified when the browser is defined.

Selenium

Search Element

From selenium import webdriverbrowser = webdriver. firefox () browser. get ("https://tieba.baidu.com/index.html") new_list = browser. find_element_by_id ('new _ list') user_name = browser. find_element_by_name ('user _ name') active = browser. find_element_by_class_name ('active') p = browser. find_element_by_tag_name ('P ') # find_element_by_name search for a single element by name # find_element_by_xpath search for a single element by xpath # locate a single element by link # find_element_by_tag_name search for a single element by Tag name search for a single element by class name # locate search for a single element by selecting a weapon through css # find_elements_by_name search for Multiple Elements by name # find_elements_by_xpath search for Multiple Elements by xpath # find_elements_by_link_text search for Multiple Elements by link # Seek search for multiple elements through some links # find_elements_by_tag_name search for Multiple Elements by Tag name # find_elements_by_class_name search for Multiple Elements by class name # find_elements_by_css_selector search for multiple elements through css selection weapons

Get Element Information

Btn_more = browser. find_element_by_id ('btn _ more') print (btn_more.get_attribute ('class') # Get the attribute print (btn_more.get_attribute ('href ') # Get the print (btn_more.text) # Get the text value

Element Interaction

Btn_more = browser. find_element_by_id ('btn _ more') btn_more.click () # simulate a click to load more input_search = browser. find_element (. ID, 'q') input_search.clear () # clear input

Execute JavaScript

# Execute the javascriptscript browser.exe cute_script ('window. scrollTo (0, document.body.scrollheight+'browser.exe cute_script ('alert ("To Bottom ")')

The above is all the content of this article. I hope it will be helpful for your learning and support for helping customers.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.