One: Preface
Beep, please swipe your card. Yesterday I came across a nice picture-sharing site, Huaban ("petals"), and the image quality there is quite good, so I used Selenium + XPath to crawl its girl-picture boards and save them to my computer, sorted into folders named after each board. The homepage is dynamically loaded, and if you want more content you can simulate scrolling down so that more picture resources appear. I had done that before, but because the crawler is not very fast I only grabbed 19 boards, more than 500 pictures in total, which I was already quite satisfied with.
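The pull-down itself is not shown in the code below; a minimal sketch of how the scrolling could be simulated with Selenium (scroll_to_load is a hypothetical helper, and the scroll count and pause are arbitrary assumptions) might look like this:

import time

def scroll_to_load(browser, times=5, pause=2):
    # Scroll to the bottom of the page repeatedly so the dynamically
    # loaded homepage keeps appending more boards and pictures.
    for _ in range(times):
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)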
First, a look at the result:
Paste_image.png
Paste_image.png
Two: Operating Environment
IDE: PyCharm
Python 3.6
lxml 3.7.2
Selenium 3.4.0
Requests 2.12.4
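All of these can be installed with pip, for example pip install selenium lxml requests, plus geckodriver for Firefox or the PhantomJS binary, depending on which browser you simulate.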
Three: Example Analysis
1. My initial idea for this crawler was: enter this page, get the URL of every image board, and then go into each board to grab all of its pictures. (as shown)
Paste_image.png
Paste_image.png
2. But the pictures crawled this way are only 236x354 in resolution, which is not good enough. By then it was already past 1:30 at night, so the next day I made another version: on top of the first one, it enters the page behind each thumbnail and then crawls the high-definition image, like the one below.
Paste_image.png
Four: The Code
1. The first step is to import the modules the crawler needs.
__author__ = 'cloth cluck_rieuse'

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
import requests
import lxml.html
import os
2. Next, configure the WebDriver, i.e. which browser to simulate. You can use Firefox to watch the simulated browsing process, or use the headless browser PhantomJS to fetch resources faster. ['--load-images=false', '--disk-cache=true'] means that while browsing is being simulated, images are not loaded and the disk cache is used, so the crawler runs faster. WebDriverWait sets the maximum time to wait for the browser to load to 10 seconds, and set_window_size sets the size of the simulated browser window; on some sites, if the size is not right, some resources will not be loaded.
# service_args = ['--load-images=false', '--disk-cache=true']
# browser = webdriver.PhantomJS(service_args=service_args)
browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.set_window_size(1400, 900)
3. parser(url, param) is the function used to parse a web page. It is used several times later, so writing it as a function keeps the code neat and orderly. It takes two parameters: one is the URL, the other is the element to wait for explicitly, which can be a section of the page, a button, an image, and so on.
def parser(url, param):
    browser.get(url)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, param)))
    html = browser.page_source
    doc = lxml.html.fromstring(html)
    return doc
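For example, loading the homepage and reading the board names (the URL, selector and XPath are the same ones used in the next step) looks like this:

doc = parser('http://huaban.com/boards/favorite/beauty/', '#waterfall')
name = doc.xpath('//*[@id="waterfall"]/div/a[1]/div[2]/h3/text()')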
4. The following code parses the main page and gets the URL and name of each board. Use XPath to locate the boards on the page; open the browser's web developer mode to work the expressions out. The board name is needed later to create a folder on the computer, so it is grabbed on this page as well. I ran into a problem here: some names do not conform to the file-naming rules and have to be cleaned up; in my case it was a '*' that caused trouble.
def get_main_url():
    print('Opening the homepage and looking for links...')
    try:
        doc = parser('http://huaban.com/boards/favorite/beauty/', '#waterfall')
        name = doc.xpath('//*[@id="waterfall"]/div/a[1]/div[2]/h3/text()')
        u = doc.xpath('//*[@id="waterfall"]/div/a[1]/@href')
        for item, fileName in zip(u, name):
            main_url = 'http://huaban.com' + item
            print('Found main link: ' + main_url)
            if '*' in fileName:
                fileName = fileName.replace('*', '')
            download(main_url, fileName)
    except Exception as e:
        print(e)
Paste_image.png
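My script only handles the '*' case; a more general helper (hypothetical, not part of the original code) that strips every character Windows forbids in file names could be used instead:

import re

def safe_name(name):
    # Remove characters that Windows does not allow in file or folder names.
    return re.sub(r'[\\/:*?"<>|]', '', name).strip()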
5. We now have the board pages and the board names, so the next step is to analyze each board page. A board page only shows thumbnails, and we do not want these low-resolution pictures, so we enter the page of each thumbnail and parse it to get the URL of the real high-definition image. There is a tricky point here: within a single board, different pictures are stored in different DOM structures, so I use two XPath expressions:
img_url = doc.xpath('//*[@id="baidu_image_holder"]/a/img/@src')
img_url2 = doc.xpath('//*[@id="baidu_image_holder"]/img/@src')
This covers the image addresses in both DOM formats, and then the two address lists are merged: img_url += img_url2
Create a folder locally; filename = 'image\\{}\\'.format(fileName) + str(i) + '.jpg' means the pictures are saved under an image directory next to the crawler code, inside the sub-folder named after the board obtained earlier.
def download(main_url, fileName):
    print('------- ready to download -------')
    try:
        doc = parser(main_url, '#waterfall')
        if not os.path.exists('image\\' + fileName):
            print('Creating folder...')
            os.makedirs('image\\' + fileName)
        link = doc.xpath('//*[@id="waterfall"]/div/a/@href')
        # print(link)
        i = 0
        for item in link:
            i += 1
            minor_url = 'http://huaban.com' + item
            doc = parser(minor_url, '#pin_view_page')
            img_url = doc.xpath('//*[@id="baidu_image_holder"]/a/img/@src')
            img_url2 = doc.xpath('//*[@id="baidu_image_holder"]/img/@src')
            img_url += img_url2
            try:
                url = 'http:' + str(img_url[0])
                print('Downloading picture ' + str(i) + ', address: ' + url)
                r = requests.get(url)
                filename = 'image\\{}\\'.format(fileName) + str(i) + '.jpg'
                with open(filename, 'wb') as fo:
                    fo.write(r.content)
            except Exception:
                print('Error!')
    except Exception:
        print('Wrong!')


if __name__ == '__main__':
    get_main_url()
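The back-slash paths above assume Windows. A more portable variant (an alternative sketch, not what the original code does; fileName and i refer to the same variables used inside download()) would build the paths with os.path.join:

folder = os.path.join('image', fileName)
os.makedirs(folder, exist_ok=True)  # exist_ok replaces the explicit os.path.exists() check
filename = os.path.join(folder, str(i) + '.jpg')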
Five: Summary
This crawler was more practice with Selenium and XPath. I also ran into quite a few problems while analyzing the pages, and only continued practice will reduce them. Of course, the 500+ girl pictures grabbed this time are quite pleasing to the eye.