Python crawler 3: getting the "Inspect Element" source (downloading pictures from a Baidu Tieba thread)




Test environment: Python 2.7 + BeautifulSoup 4.4.1 + Selenium 2.48.0



Test URL: http://tieba.baidu.com/p/2827883128



The goal is to download all the images on this page, more than 160 in total. The work can be divided into the following steps:



1. Get the source code of the webpage.



The source code fetched directly with urllib2 or requests does not match the pictures actually shown on the page, although Chrome's Inspect Element feature does show the corresponding pictures. My guess is that the page loads its images asynchronously through Ajax. So I used selenium + chromedriver instead: once both are installed, you can get the source code you need.
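One way to confirm the mismatch is to count, in the statically fetched HTML, the elements that Inspect Element shows around each picture. The following is only a minimal sketch, written for Python 3 with the standard library (in the Python 2.7 test environment, urllib2 would take the place of urllib.request). The class name is the one used in the code at the end of this article; `ClassCounter`, `count_class_elements` and `fetch_static_html` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains all wanted classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_class_elements(html, wanted):
    """Return how many elements in html carry every class listed in wanted."""
    parser = ClassCounter(wanted)
    parser.feed(html)
    return parser.count


def fetch_static_html(url):
    """Plain HTTP fetch: no JavaScript runs, so script-loaded content is absent."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

If `count_class_elements(fetch_static_html(url), "ag_ele_a ag_ele_a_v")` returns far fewer elements than the browser shows, that supports the guess that the pictures are added by script after the initial load.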



2. Analyze the source code obtained, find the actual addresses of the pictures, and download them. The process is similar to "Python crawler 2: downloading files". Previously I analyzed the source code directly with regular expressions; I recommend learning BeautifulSoup (I am picking it up as I go), which is more convenient.
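The extraction and download step might look like the sketch below. It is not the author's code from "Python crawler 2": it uses Python 3 and the standard-library HTML parser as a stand-in for BeautifulSoup, and it assumes that every picture appears as a plain `<img src="...">` tag in the rendered source, which may not hold for this page. The helper names and the `images` folder name are hypothetical.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs, de-duplicated, keeping first-seen order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    seen, urls = set(), []
    for src in collector.srcs:
        absolute = urljoin(base_url, src)  # resolves relative paths
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


def local_name(url, index):
    """Build a file name from the URL path, falling back to a numbered name."""
    name = os.path.basename(urlparse(url).path)
    return name or "image_%03d.jpg" % index


def download_all(urls, folder="images"):
    """Download every URL into folder; this part needs network access."""
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(urls):
        urlretrieve(url, os.path.join(folder, local_name(url, i)))
```

With the rendered source in hand, the call would be `download_all(extract_image_urls(page_source, url))`.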



When the program was actually run, only 40 images were obtained. The reason is that only 40 pictures are present once the page has loaded. To get all the pictures, you have to scroll down manually with the mouse wheel before the page finishes loading, so that the browser keeps sending Ajax requests to the server for the additional images. This method works and gets all the pictures.



But doing this by hand is too crude! Here are my guesses: 1. Could one analyze the JS part of the source code, extract all the code that sends the later Ajax requests to the server, and send them all at once to get every image address? 2. Could JS or selenium simulate the mouse-wheel action while the page is loading, and so achieve the same effect? I actually tried method 2, but because my skills are limited and I am not familiar with JS, it did not succeed. The code is attached:


 
 
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real Chrome instance load the page so that its scripts run.
driver = webdriver.Chrome()
url = "http://tieba.baidu.com/p/2827883128"
driver.get(url)
try:
    # Unsuccessful attempt to scroll down from the script (method 2 above):
    # driver.implicitly_wait(20)
    # driver.find_element_by_id("ag_main_bottombar")
    # js="var q=document.body.scrollTop=10000"
    # driver.execute_script(js)

    # Parse the rendered source and collect the picture elements.
    sourcePage = driver.page_source
    soup = BeautifulSoup(sourcePage, "lxml")
    images = soup.find_all(class_="ag_ele_a ag_ele_a_v")
    print(len(images))
    for image in images:
        print(image)
finally:
    # Always close the browser, even if parsing fails.
    driver.quit()
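For method 2, one commonly used pattern is to scroll to the bottom repeatedly and stop once the number of picture elements no longer grows. The sketch below is untested against this page and is not a confirmed fix for the failed attempt above. `scroll_until_stable` and its parameters are hypothetical names; the only WebDriver call it relies on is `execute_script`, and the fixed pause is a crude assumption about how long the Ajax responses take.

```python
import time

# Scrolls the window to the current bottom of the document.
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"


def scroll_until_stable(driver, count_items, max_rounds=30, pause=1.0, patience=2):
    """Scroll down until count_items(driver) stops growing.

    Stops after `patience` consecutive rounds without growth, or after
    `max_rounds` scrolls, and returns the largest count seen.
    """
    best = count_items(driver)
    idle = 0
    for _ in range(max_rounds):
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # give pending requests time to finish
        current = count_items(driver)
        if current > best:
            best, idle = current, 0
        else:
            idle += 1
            if idle >= patience:
                break
    return best
```

With the Selenium 2.48 environment used here, the counter could be `lambda d: len(d.find_elements_by_css_selector("a.ag_ele_a.ag_ele_a_v"))`, assuming the picture elements are `<a>` tags; newer Selenium releases use `d.find_elements(By.CSS_SELECTOR, ...)` instead. The call would go after `driver.get(url)` and before reading `driver.page_source`.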



