Compile a simple crawler for a wallpaper website using python and a python Crawler

Source: Internet
Author: User


Target Website: http://www.netbian.com/

Objective: to obtain the first page of wallpapers in each category

I. Analyze the website and write the code:

(P.S.: the full source code is at the end of the article.)

1. Grab the part of the homepage source that contains the category menu, then match out each URL and title.

```python
# coding=gbk
# Objective: download wallpapers (large images) from every category
__author__ = 'cqc'
import urllib2
import urllib
import re
import os

# create the wallpaper download folder
path = 'd:\\other side wallpaper'
if not os.path.isdir(path):
    os.makedirs(path)
# category titles
big_title = []

# open the home page
url = 'http://www.netbian.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0'}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)

# extract the source of the home page menu
pat_menu = re.compile('<ul class="menu">(.*?)</a></div>', re.S)
code_menu = re.search(pat_menu, response.read())
```

2. Match each category's title and link.

```python
# category titles
pat_menu_title = re.compile('<a href=".*?" title="(.*?)">', re.S)
menu_title = re.findall(pat_menu_title, code_menu.group(1))
for a_item in menu_title:
    big_title.append(a_item)
    print a_item

# category links
pat_menu_link = re.compile('<a href="(.*?)" title=".*?">', re.S)
menu_link = re.findall(pat_menu_link, code_menu.group(1))
```

As shown in the screenshot (image not preserved).

3. Visit each crawled category and obtain the titles and links of all wallpapers under it.

```python
# enter each category
j = 0
for b_item in menu_link:
    url_menu = 'http://www.netbian.com/' + b_item
    request_son = urllib2.Request(url_menu, headers=headers)
    response_son = urllib2.urlopen(request_son)
    # obtain the title and link of every wallpaper in this category

    # wallpaper titles
    # (the original pattern was lost in extraction; matching the alt text
    # of each thumbnail is a reconstruction and may need adjusting)
    title_son = []
    pat_title_son = re.compile('alt="(.*?)"', re.S)
    res_title = re.findall(pat_title_son, response_son.read())
    for c_item in res_title:
        title_son.append(c_item)

    # narrow down to the thumbnail list
    pat_code_son = re.compile('<ul>(.*?)</ul>', re.S)
    middle_pattern = urllib2.Request(url_menu, headers=headers)
    middle_response = urllib2.urlopen(middle_pattern)
    res_code_son = re.search(pat_code_son, middle_response.read())

    # wallpaper links, later combined into the large-image page links
    pat_link_son = re.compile('<li><a href="(.*?)" target="_blank">', re.S)
    res_link = re.findall(pat_link_son, res_code_son.group(1))
```

As shown in the screenshot (image not preserved).

4. From the links crawled in the previous step, build the real 1080p wallpaper links.

Clicking a title only lands us on a preview page (screenshot not preserved); we would still have to click the Download button to reach the 1080p wallpaper. For convenience, we synthesize the 1080p link directly.

Example: http://www.netbian.com/desk/9805.htm

Corresponding 1080p URL: http://www.netbian.com/desk/9805-1920x1080.htm
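The rewrite is pure string surgery, so it can be isolated in a tiny helper (the function name is my own, not from the original code):

```python
def hd_link(rel_path):
    """Turn a relative wallpaper link such as 'desk/9805.htm' into the
    1920x1080 detail-page URL by replacing the '.htm' suffix."""
    # rel_path[:-4] strips the trailing '.htm'
    return 'http://www.netbian.com/' + rel_path[:-4] + '-1920x1080.htm'
```

Calling `hd_link('desk/9805.htm')` yields exactly the 1080p URL shown above.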

Code:

```python
    i = 0
    # show progress
    print big_title[j]
    for d_item in res_link:
        # build the page link for the large image
        if d_item == 'http://www.mmmwu.com/':
            pass
        else:
            new_link = 'http://www.netbian.com/' + d_item[:-4] + '-1920x1080.htm'
            print new_link
```

(P.S.: the first title in the 'beauty' category links to another website, so I skip it for simplicity.)
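As a side note, instead of hard-coding that single external URL, a slightly more robust filter (my own variation, not part of the original code) skips every absolute link, since site-relative hrefs never start with a scheme:

```python
def is_internal(href):
    """Site-relative links look like 'desk/9805.htm'; anything that
    already starts with 'http' points off-site and should be skipped."""
    return not href.startswith('http')

# example hrefs; the first mimics the external 'beauty' entry
links = ['http://www.mmmwu.com/', 'desk/9805.htm', 'desk/9806.htm']
internal = [h for h in links if is_internal(h)]
```

This would keep working even if the site added more external entries later.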

5. Go to the 1080p wallpaper link and download the wallpaper.

```python
            request_real = urllib2.Request(new_link, headers=headers)
            response_real = urllib2.urlopen(request_real)
            # the original pattern was partially lost in extraction; this
            # reconstruction grabs the image URL from the download table
            pat_real = re.compile('<img src="(.*?)".*?</td></tr>', re.S)

            link_real = re.search(pat_real, response_real.read())
            # skip VIP wallpapers
            if link_real:
                fina_link = link_real.group(1)
                # create the per-category download directory
                path_final = 'd:\\other side wallpaper\\' + big_title[j] + '\\'
                if not os.path.isdir(path_final):
                    os.makedirs(path_final)
                path_pic = path_final + title_son[i] + '.jpg'
                f = open(path_pic, 'wb')
                data = urllib.urlopen(fina_link)
                f.write(data.read())
                f.close()
                if not data:
                    print "Download failed."
            i += 1
    print 'one menu download OK.'
    j += 1
```

6. Download complete.

II. Full source code.

```python
# coding=gbk
# Objective: download wallpapers (large images) from every category
__author__ = 'cqc'
import urllib2
import urllib
import re
import os

# create the wallpaper download folder
path = 'd:\\other side wallpaper'
if not os.path.isdir(path):
    os.makedirs(path)
# category titles
big_title = []

# open the home page
url = 'http://www.netbian.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0'}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)

# extract the source of the home page menu
pat_menu = re.compile('<ul class="menu">(.*?)</a></div>', re.S)
code_menu = re.search(pat_menu, response.read())

# category titles
pat_menu_title = re.compile('<a href=".*?" title="(.*?)">', re.S)
menu_title = re.findall(pat_menu_title, code_menu.group(1))
for a_item in menu_title:
    big_title.append(a_item)
    print a_item

# category links
pat_menu_link = re.compile('<a href="(.*?)" title=".*?">', re.S)
menu_link = re.findall(pat_menu_link, code_menu.group(1))

# enter each category
j = 0
for b_item in menu_link:
    url_menu = 'http://www.netbian.com/' + b_item
    request_son = urllib2.Request(url_menu, headers=headers)
    response_son = urllib2.urlopen(request_son)
    # obtain the title and link of every wallpaper in this category

    # wallpaper titles
    # (the original pattern was lost in extraction; matching the alt text
    # of each thumbnail is a reconstruction and may need adjusting)
    title_son = []
    pat_title_son = re.compile('alt="(.*?)"', re.S)
    res_title = re.findall(pat_title_son, response_son.read())
    for c_item in res_title:
        title_son.append(c_item)

    # narrow down to the thumbnail list
    pat_code_son = re.compile('<ul>(.*?)</ul>', re.S)
    middle_pattern = urllib2.Request(url_menu, headers=headers)
    middle_response = urllib2.urlopen(middle_pattern)
    res_code_son = re.search(pat_code_son, middle_response.read())

    # wallpaper links, later combined into the large-image page links
    pat_link_son = re.compile('<li><a href="(.*?)" target="_blank">', re.S)
    res_link = re.findall(pat_link_son, res_code_son.group(1))

    i = 0
    # show progress
    print big_title[j]
    for d_item in res_link:
        # build the page link for the large image,
        # skipping the one external link in the 'beauty' category
        if d_item == 'http://www.mmmwu.com/':
            pass
        else:
            new_link = 'http://www.netbian.com/' + d_item[:-4] + '-1920x1080.htm'
            print new_link

            request_real = urllib2.Request(new_link, headers=headers)
            response_real = urllib2.urlopen(request_real)
            # the original pattern was partially lost in extraction; this
            # reconstruction grabs the image URL from the download table
            pat_real = re.compile('<img src="(.*?)".*?</td></tr>', re.S)

            link_real = re.search(pat_real, response_real.read())
            # skip VIP wallpapers
            if link_real:
                fina_link = link_real.group(1)
                # create the per-category download directory
                path_final = 'd:\\other side wallpaper\\' + big_title[j] + '\\'
                if not os.path.isdir(path_final):
                    os.makedirs(path_final)
                path_pic = path_final + title_son[i] + '.jpg'
                f = open(path_pic, 'wb')
                data = urllib.urlopen(fina_link)
                f.write(data.read())
                f.close()
                if not data:
                    print "Download failed."
            i += 1
    print 'one menu download OK.'
    j += 1
```
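Note that urllib2 exists only in Python 2. On Python 3 the fetch would go through `urllib.request`, while the regex-matching steps stay the same. Here is a minimal sketch that exercises the menu regex against a small inline HTML sample instead of the live site (the sample markup and its hrefs/titles are invented for illustration and may not match the real page exactly):

```python
import re

# sample markup shaped like the menu block the crawler targets
# (invented for illustration)
sample = ('<ul class="menu"><li><a href="fengjing/" title="Scenery">Scenery'
          '</a></li><li><a href="dongman/" title="Anime">Anime</a></li></ul>'
          '</a></div>')

# same patterns as the crawler uses on the home page
pat_menu = re.compile('<ul class="menu">(.*?)</a></div>', re.S)
code_menu = re.search(pat_menu, sample)

pat_title = re.compile('<a href=".*?" title="(.*?)">', re.S)
titles = re.findall(pat_title, code_menu.group(1))

# on Python 3 the live fetch would look like:
# from urllib.request import Request, urlopen
# html = urlopen(Request(url, headers=headers)).read().decode('gbk')
```

With the sample above, `titles` comes out as `['Scenery', 'Anime']`, which mirrors what `menu_title` holds after the real crawl.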

I'm a beginner at crawlers; thanks in advance for any guidance!
