Writing a simple wallpaper-site crawler in Python
Target Website: http://www.netbian.com/
Objective: download the first page of wallpapers in each category.
I. Analyze the website and write the code:
(P.S. The full source code is at the end of the article.)
1. Fetch the home page and cut out the block of HTML that holds the category menu, so that the category URLs and titles can then be matched precisely.
    # coding=gbk
    # Objective: download wallpapers (large images) from each category
    __author__ = 'cqc'
    import urllib2
    import urllib
    import re
    import os

    # Create the wallpaper download folder
    path = 'd:\\other side wallpaper'
    if not os.path.isdir(path):
        os.makedirs(path)
    # Category titles
    big_title = []

    # Open the home page
    url = 'http://www.netbian.com/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0'}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)

    # Extract the HTML of the home page's category menu
    pat_menu = re.compile('<ul class="menu">(.*?)</a></div>', re.S)
    code_menu = re.search(pat_menu, response.read())
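The listing above targets Python 2 (`urllib2`). For readers on Python 3, the same step might look roughly like the sketch below. It assumes the menu markup has the shape matched by the pattern above and that the site serves GBK-encoded pages; `build_request`, `fetch_html`, and `extract_menu` are hypothetical helper names, and `SAMPLE_HTML` is made-up markup so the example runs without touching the network.

```python
import re
import urllib.request

BASE_URL = 'http://www.netbian.com/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) '
                  'Gecko/20100101 Firefox/22.0'
}

def build_request(url=BASE_URL):
    """Return a Request carrying the browser-like User-Agent header."""
    return urllib.request.Request(url, headers=HEADERS)

def fetch_html(url=BASE_URL, encoding='gbk'):
    """Download a page and decode it (assumes the site serves GBK)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode(encoding, errors='replace')

# Assumed shape of the menu markup; the real page may differ.
PAT_MENU = re.compile(r'<ul class="menu">(.*?)</a></div>', re.S)

def extract_menu(html):
    """Return the menu block, or None when the pattern does not match."""
    match = PAT_MENU.search(html)
    return match.group(1) if match else None

# Made-up sample so the sketch runs offline (fetch_html is not called here).
SAMPLE_HTML = (
    '<div><ul class="menu">'
    '<a href="/fengjing/" title="Scenery">Scenery</a>'
    '<a href="/dongman/" title="Anime">Anime</a></div>'
)
menu_block = extract_menu(SAMPLE_HTML)
print(menu_block)
```

Returning `None` when nothing matches lets the caller stop early instead of failing later on `.group(1)` if the site layout changes.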
2. Match the title and link of each category.
    # Category titles
    pat_menu_title = re.compile('<a href=".*?" title="(.*?)">', re.S)
    menu_title = re.findall(pat_menu_title, code_menu.group(1))
    for a_item in menu_title:
        big_title.append(a_item)
        print a_item

    # Category links
    pat_menu_link = re.compile('<a href="(.*?)" title=".*?">', re.S)
    menu_link = re.findall(pat_menu_link, code_menu.group(1))
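The two patterns above scan the same menu block twice. As an alternative, a single pattern with two capture groups returns each link together with its title, which keeps the two lists aligned by construction. The sketch below is written for Python 3 and runs on made-up markup (`menu_html`), which is only assumed to resemble the real menu.

```python
import re

# Made-up menu markup, assumed to resemble the block captured in step 1.
menu_html = (
    '<a href="/fengjing/" title="Scenery">Scenery</a>'
    '<a href="/dongman/" title="Anime">Anime</a>'
)

# Two capture groups: findall then yields (link, title) tuples.
PAT_MENU_ITEM = re.compile(r'<a href="(.*?)" title="(.*?)">', re.S)

categories = PAT_MENU_ITEM.findall(menu_html)
menu_link = [link for link, _ in categories]
big_title = [title for _, title in categories]

for link, title in categories:
    print(title, '->', link)
```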
3. Visit each crawled category page and collect the titles and links of all wallpapers on it.
    # Enter each category
    j = 0
    for b_item in menu_link:
        url_menu = 'http://www.netbian.com/' + b_item
        request_son = urllib2.Request(url_menu, headers=headers)
        response_son = urllib2.urlopen(request_son)
        # Get the title and link of every image in the category

        # Get the wallpaper titles
        title_son = []
        pat_title_son = re.compile(' ', re.S)  # pattern text is missing in the source
        res_title = re.findall(pat_title_son, response_son.read())
        for c_item in res_title:
            title_son.append(c_item)

        # Isolate the HTML of the wallpaper list
        pat_code_son = re.compile('<ul>(.*?)</ul>', re.S)
        middle_pattern = urllib2.Request(url_menu, headers=headers)
        middle_response = urllib2.urlopen(middle_pattern)
        res_code_son = re.search(pat_code_son, middle_response.read())

        # Get the wallpaper links, used next to build the large-image page links
        pat_link_son = re.compile('<li><a href="(.*?)" target="_blank">
        # (the listing is cut off here in the source)
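The listing above requests each category page twice, presumably because `read()` consumes the response, so a second `read()` on the same object returns nothing. One way around the extra request is to read the body once into a variable and run every pattern over that text. The Python 3 sketch below uses `io.BytesIO` as a stand-in for the HTTP response; the markup and all three patterns are hypothetical, since the title pattern does not survive in the listing.

```python
import io
import re

# A response body can be read only once; BytesIO behaves the same way
# and stands in for the HTTP response so the sketch runs offline.
fake_response = io.BytesIO(
    b'<ul><li><a href="/desk/9805.htm" target="_blank">'
    b'<img src="a.jpg" alt="Lake" /></a></li>'
    b'<li><a href="/desk/9806.htm" target="_blank">'
    b'<img src="b.jpg" alt="Forest" /></a></li></ul>'
)

html = fake_response.read().decode('utf-8')   # read once, keep the text
leftover = fake_response.read()               # a second read is empty

# Hypothetical patterns written for the made-up markup above.
PAT_TITLE = re.compile(r'alt="(.*?)"', re.S)
PAT_LIST = re.compile(r'<ul>(.*?)</ul>', re.S)
PAT_LINK = re.compile(r'<li><a href="(.*?)" target="_blank">', re.S)

title_son = PAT_TITLE.findall(html)
list_block = PAT_LIST.search(html)
res_link = PAT_LINK.findall(list_block.group(1)) if list_block else []

print(leftover, title_son, res_link)
```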
4. From the links crawled in the previous step, construct the link to the actual 1080p wallpaper page.
Clicking a wallpaper title only opens its detail page; you still have to click the Download button there to reach the 1080p wallpaper page. For convenience, we build the 1080p link directly.
Example: http://www.netbian.com/desk/9805.htm
Corresponding 1080p URL: http://www.netbian.com/desk/9805-1920x1080.htm
Code:
        i = 0
        # Show progress
        print big_title[j]
        for d_item in res_link:
            # Build the link to the large-image page
            if d_item == 'http://www.mmmwu.com/':
                pass
            else:
                new_link = 'http://www.netbian.com/' + d_item[:-4] + '-1920x1080.htm'
                print new_link
(P.S. The first item in the 'beauty' category links to another website, so I skip it for simplicity.)
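The URL rule from this step can also be wrapped in a small function. The sketch below (Python 3) uses `urljoin`, so the result is the same whether or not the crawled link starts with a slash, and it skips any link that points to another host instead of comparing against one hard-coded URL. `to_1080p_url` is a hypothetical helper name, not part of the original script.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'http://www.netbian.com/'

def to_1080p_url(href, base=BASE_URL):
    """Map a detail-page link to its 1920x1080 page, or None if off-site."""
    full = urljoin(base, href)           # handles hrefs with or without '/'
    if urlparse(full).netloc != urlparse(base).netloc:
        return None                      # e.g. the external 'beauty' link
    if not full.endswith('.htm'):
        return None
    return full[:-len('.htm')] + '-1920x1080.htm'

print(to_1080p_url('/desk/9805.htm'))
print(to_1080p_url('http://www.mmmwu.com/'))
```

With the example from this step, the first call yields `http://www.netbian.com/desk/9805-1920x1080.htm`, and the external link yields `None`.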
5. Open the 1080p wallpaper page and download the image.
                request_real = urllib2.Request(new_link, headers=headers)
                response_real = urllib2.urlopen(request_real)
                pat_real = re.compile(' </td></tr>')  # start of the pattern is missing in the source

                link_real = re.search(pat_real, response_real.read())
                # Skip VIP wallpapers
                if link_real:
                    fina_link = link_real.group(1)
                    # Create the download directory
                    path_final = 'd:\\other side wallpaper\\' + big_title[j] + '\\'
                    if not os.path.isdir(path_final):
                        os.makedirs(path_final)
                    path_pic = path_final + title_son[i] + '.jpg'
                    f = open(path_pic, 'wb')
                    data = urllib.urlopen(fina_link)
                    f.write(data.read())
                    f.close()
                    if not data:
                        print "Download Failed."
            i += 1
        print 'One menu download OK.'
        j += 1
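In the listing above, the `if not data` check runs only after the file has already been written, and wallpaper titles are used as file names unchanged. A slightly more defensive variant of the save step is sketched below for Python 3, under the assumption that the image bytes have already been downloaded. `safe_name` and `save_wallpaper` are hypothetical helpers, and the demonstration writes fake bytes to a temporary folder rather than to `d:\`.

```python
import os
import re
import tempfile

def safe_name(title):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'

def save_wallpaper(root, category, title, data):
    """Write image bytes to root/category/title.jpg; return the path or None."""
    if not data:                      # check before creating an empty file
        print('Download failed:', title)
        return None
    folder = os.path.join(root, safe_name(category))
    os.makedirs(folder, exist_ok=True)
    path_pic = os.path.join(folder, safe_name(title) + '.jpg')
    with open(path_pic, 'wb') as f:   # closed even if write() raises
        f.write(data)
    return path_pic

# Offline demonstration with fake bytes in a temporary folder.
demo_root = tempfile.mkdtemp()
saved = save_wallpaper(demo_root, 'Scenery', 'Lake: sunrise?', b'\xff\xd8fake')
skipped = save_wallpaper(demo_root, 'Scenery', 'Empty', b'')
print(saved, skipped)
```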
6. Download complete.
II. Full source code.
    # coding=gbk
    # Objective: download wallpapers (large images) from each category
    __author__ = 'cqc'
    import urllib2
    import urllib
    import re
    import os

    # Create the wallpaper download folder
    path = 'd:\\other side wallpaper'
    if not os.path.isdir(path):
        os.makedirs(path)
    # Category titles
    big_title = []

    # Open the home page
    url = 'http://www.netbian.com/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0'}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)

    # Extract the HTML of the home page's category menu
    pat_menu = re.compile('<ul class="menu">(.*?)</a></div>', re.S)
    code_menu = re.search(pat_menu, response.read())

    # Category titles
    pat_menu_title = re.compile('<a href=".*?" title="(.*?)">', re.S)
    menu_title = re.findall(pat_menu_title, code_menu.group(1))
    for a_item in menu_title:
        big_title.append(a_item)
        print a_item

    # Category links
    pat_menu_link = re.compile('<a href="(.*?)" title=".*?">', re.S)
    menu_link = re.findall(pat_menu_link, code_menu.group(1))

    # Enter each category
    j = 0
    for b_item in menu_link:
        url_menu = 'http://www.netbian.com/' + b_item
        request_son = urllib2.Request(url_menu, headers=headers)
        response_son = urllib2.urlopen(request_son)
        # Get the title and link of every image in the category

        # Get the wallpaper titles
        title_son = []
        pat_title_son = re.compile(' ', re.S)  # pattern text is missing in the source
        res_title = re.findall(pat_title_son, response_son.read())
        for c_item in res_title:
            title_son.append(c_item)

        # Isolate the HTML of the wallpaper list
        pat_code_son = re.compile('<ul>(.*?)</ul>', re.S)
        middle_pattern = urllib2.Request(url_menu, headers=headers)
        middle_response = urllib2.urlopen(middle_pattern)
        res_code_son = re.search(pat_code_son, middle_response.read())

        # Get the wallpaper links, used next to build the large-image page links
        pat_link_son = re.compile('<li><a href="(.*?)" target="_blank">
        # (the rest of this pattern and the line that defines res_link are missing in the source)

        i = 0
        # Show progress
        print big_title[j]
        for d_item in res_link:
            # Build the link to the large-image page
            if d_item == 'http://www.mmmwu.com/':
                pass
            else:
                new_link = 'http://www.netbian.com/' + d_item[:-4] + '-1920x1080.htm'
                print new_link
                request_real = urllib2.Request(new_link, headers=headers)
                response_real = urllib2.urlopen(request_real)
                pat_real = re.compile(' </td></tr>')  # start of the pattern is missing in the source

                link_real = re.search(pat_real, response_real.read())
                # Skip VIP wallpapers
                if link_real:
                    fina_link = link_real.group(1)
                    # Create the download directory
                    path_final = 'd:\\other side wallpaper\\' + big_title[j] + '\\'
                    if not os.path.isdir(path_final):
                        os.makedirs(path_final)
                    path_pic = path_final + title_son[i] + '.jpg'
                    f = open(path_pic, 'wb')
                    data = urllib.urlopen(fina_link)
                    f.write(data.read())
                    f.close()
                    if not data:
                        print "Download Failed."
            i += 1
        print 'One menu download OK.'
        j += 1
I'm still a beginner at writing crawlers, so any guidance is appreciated.