[Python crawler] Targeted crawling of massive high-quality images from the Hupu basketball forum with Selenium


Preface:

 
As someone who has watched basketball since childhood, I often visit basketball forums such as Hupu. These forums are full of beautiful pictures: NBA teams, CBA stars, gossip, sneakers, pretty girls and so on. Saving them one by one with right-click "Save as" is really painful, so as a programmer, write a program instead!
Therefore, I used Python + Selenium + regular expressions + urllib to crawl the images in bulk.
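Before the full program, here is a minimal sketch of that pipeline (Python 2; the PhantomJS path, the photo-page URL, and the XPath are the ones that appear in the full program further below):

# -*- coding: utf-8 -*-
# Minimal pipeline: PhantomJS renders the page, an XPath query finds
# the <img> node, and urllib downloads the file.
import os
import urllib
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path="G:\\phantomjs-1.9.1-windows\\phantomjs.exe")
driver.get("http://photo.hupu.com/nba/p29556-1.html")
elem = driver.find_element_by_xpath("//div[@class='pic_bg']/div/img")  # the picture node
pic_url = elem.get_attribute("src")
urllib.urlretrieve(pic_url, os.path.basename(pic_url))  # save under the image's own name
driver.quit()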
I have already written quite a few articles on Python crawlers, covering Sina blogs, Wikipedia Infobox, Baidu Baike, and youxun.com images, as well as the Selenium installation process. For details, see my two columns:
Python Learning Series
Python crawler - Selenium + PhantomJS + CasperJS


Running effect:


The two screenshots show the running effect: the first is the gallery crawled from Hupu with the tag "Spurs" (马刺), the second the gallery with the tag "Chen Lu" (陈露). Each folder name corresponds to one page topic, and the images in each folder are complete.
http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA
http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2

Source code:

 

# -*- coding: utf-8 -*-
"""
Crawling pictures by Selenium and urllib
url: Hupu Spurs   http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA
url: Hupu Chen Lu http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2
Created on
@author: eastmount CSDN
"""

import time
import re
import os
import sys
import urllib
import shutil
import datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# Open PhantomJS
driver = webdriver.PhantomJS(executable_path="G:\\phantomjs-1.9.1-windows\\phantomjs.exe")
#driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)

# Download one picture by urllib
def loadPicture(pic_url, pic_path):
    pic_name = os.path.basename(pic_url)   # strip the path, keep the picture name
    pic_name = pic_name.replace('*', '')   # erase '*' to prevent IOError: invalid mode ('wb') or filename
    urllib.urlretrieve(pic_url, pic_path + pic_name)


# Crawl the pictures of one album, page by page
def getScript(elem_url, path, nums):
    try:
        # The links look like http://photo.hupu.com/nba/p29556-1.html, so just
        # splice http://..../p29556-<number>.html instead of clicking "Next"
        count = 1
        t = elem_url.find(r'.html')
        while (count <= nums):
            html_url = elem_url[:t] + '-' + str(count) + '.html'
            #print html_url
            '''
            driver_pic.get(html_url)
            elem = driver_pic.find_element_by_xpath("//div[@class='pic_bg']/div/img")
            url = elem.get_attribute("src")
            '''
            # Use a regular expression to get the 3rd <div></div>, then extract the picture url
            content = urllib.urlopen(html_url).read()
            start = content.find(r'<div class="flTab">')
            end = content.find(r'<div class="comMark" style>')
            content = content[start:end]
            div_pat = r'<div.*?>(.*?)<\/div>'
            div_m = re.findall(div_pat, content, re.S|re.M)
            #print div_m[2]
            link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", div_m[2])
            #print link_list
            url = link_list[0]   # only one url is linked
            loadPicture(url, path)
            count = count + 1
    except Exception, e:
        print 'Error:', e
    finally:
        print 'Download ' + str(count) + ' pictures\n'


# Crawl the album urls and titles on the index page
def getTitle(url):
    try:
        # Crawl the urls and titles
        count = 0
        print 'Function getTitle(key, url)'
        driver.get(url)
        wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='piclist3']"))
        print 'Title: ' + driver.title + '\n'

        # thumbnail url (unused here)  number of pictures  title (folder name) -- note the order
        elem_url = driver.find_elements_by_xpath("//a[@class='ku']/img")
        elem_num = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dd[1]")
        elem_title = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dt/a")
        for url in elem_url:
            pic_url = url.get_attribute("src")
            html_url = elem_title[count].get_attribute("href")
            print elem_title[count].text
            print html_url
            print pic_url
            print elem_num[count].text

            # Create a folder for each album
            path = "E:\\Picture_HP\\" + elem_title[count].text + "\\"
            m = re.findall(r'(\w*[0-9]+)\w*', elem_num[count].text)   # number of pictures
            nums = int(m[0])
            count = count + 1
            if os.path.isfile(path):         # delete file
                os.remove(path)
            elif os.path.isdir(path):        # delete dir
                shutil.rmtree(path, True)
            os.makedirs(path)                # create the directory
            getScript(html_url, path, nums)  # visit the pages
    except Exception, e:
        print 'Error:', e
    finally:
        print 'Find ' + str(count) + ' pages with key\n'


# Entry function
def main():
    # Create the base folder
    basePathDirectory = "E:\\Picture_HP"
    if not os.path.exists(basePathDirectory):
        os.makedirs(basePathDirectory)

    # Input the key for search  str => unicode => utf-8
    key = raw_input("Please input a key: ").decode(sys.stdin.encoding)
    print 'The key is: ' + key

    # Set url list  Sum: 1-2 pages
    print 'Ready to start the Download!!!\n\n'
    starttime = datetime.datetime.now()
    num = 1
    while num <= 1:
        #url = 'http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2?p=2&o=1'
        url = 'http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA'
        print str(num) + ' page', 'url: ' + url
        # determine whether the title contains the key
        getTitle(url)
        time.sleep(2)
        num = num + 1
    else:
        print 'Download Over!!!'

    # Get the running time
    endtime = datetime.datetime.now()
    print 'The Running time: ', (endtime - starttime).seconds

main()

 

Code parsing: 


The main steps of the source program are as follows:
1. In the entry function main(), create the Picture_HP image folder on drive E and set the gallery URL. The original idea was to type in a tag and build the URL from it, since the URLs look like this:
http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA
But parsing the Chinese in the URL kept going wrong, so the URL is assigned directly in the code instead, which does not affect the overall flow. You may also notice that the while loop condition in the code is num <= 1, so it runs only once; just assign the URL of the gallery you want to download. The paging links on Hupu look like the one below, so all pages can be fetched cyclically by URL splicing (a short sketch follows):
http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2?p=2&o=1
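A minimal sketch of that splicing (Python 2, matching the program above; total_pages is a hypothetical value here, in practice it would be read from the page):

# Build the paged tag urls by splicing instead of clicking "Next".
base_url = 'http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2'
total_pages = 2  # hypothetical value, for illustration only
for p in range(1, total_pages + 1):
    page_url = base_url + '?p=' + str(p) + '&o=1'
    print page_url  # each page_url would then be passed to getTitle(page_url)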

2. Call the getTitle(url) function, which analyzes the HTML DOM structure through Selenium and PhantomJS. The find_elements_by_xpath function obtains the source image URLs, the gallery titles, and the number of pictures in each gallery. A condensed sketch of this step follows.
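This is what that step boils down to (a Python 2 sketch using the same XPath expressions, PhantomJS path, and tag URL as the full program above):

# -*- coding: utf-8 -*-
# Condensed sketch of getTitle(url): the XPath queries return parallel
# lists (picture counts, album links), iterated by index.
import selenium.webdriver.support.ui as ui
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path="G:\\phantomjs-1.9.1-windows\\phantomjs.exe")
wait = ui.WebDriverWait(driver, 10)
driver.get('http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA')
wait.until(lambda d: d.find_element_by_xpath("//div[@class='piclist3']"))
elem_num = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dd[1]")
elem_title = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dt/a")
for i in range(len(elem_title)):
    print elem_title[i].text                   # album title -> folder name
    print elem_title[i].get_attribute("href")  # album page url -> getScript()
    print elem_num[i].text                     # picture count -> loop bound in getScript()
driver.quit()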

Summary:
This article showed how to crawl Hupu galleries with Selenium and Python. As crawlers go, the content is fairly basic: the downloaded "Chen Lu" images match the 34 albums and 902 pictures reported by the website. After switching to regular-expression extraction, the total running time is estimated at about 3 minutes, which is quite fast ~ Of course, Hupu has many other tags, and the football section should work much the same; you only need to change the URL to download another gallery, which is very convenient.
Recently I have been studying Python crawling more broadly and want to learn distributed crawlers and Docker. I hope to have the chance to write about depth-first and breadth-first crawling later. Of course, if you are a beginner at crawlers or at Python, these practical examples should be helpful to you ~
Finally, I hope friends here can take something away. If there are any errors or shortcomings, please bear with me ~ Recently I have been studying hard and hope to become a university teacher. I know how much I still have to learn, and I stay optimistic, humble, and low-key.
(By: Eastmount http://blog.csdn.net/eastmount)
