How to Write a Python Crawler to Capture the Douban Top 100 Movies and User Avatars

This article shows how to write Python crawlers that capture the Douban Top 100 movies and user avatars, using Python's urllib and urllib2 modules.
I. Analyze the Douban top page and build the program structure
1. First open the page http://movie.douban.com/top250?start=0, that is, the first page of the top list.
Page through the list and note that the links covering the top 100 are:

http://movie.douban.com/top250?start=0
http://movie.douban.com/top250?start=25
http://movie.douban.com/top250?start=50
http://movie.douban.com/top250?start=75

2. Inspecting the page source, the movie titles sit in <span class="title"> elements, for example:

<span class="title">The Shawshank Redemption (Chinese title)</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>

Because the English alias titles and other descriptions share the same markup, a regular expression will also capture that noise, so further filtering is required.
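The filtering described above can be sketched as follows. This assumes the titles sit in <span class="title"> elements and that the English alias is prefixed with &nbsp;/&nbsp; (assumptions based on the page structure described above, using a hardcoded sample so no network access is needed):

```python
import re

# A hypothetical snippet mimicking the relevant part of the Douban page source
sample_html = (
    '<span class="title">The Shawshank Redemption</span>'
    '<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>'
)

title_re = re.compile(r'<span class="title">(.+?)</span>')
# Keep only primary titles; alias spans contain the '&nbsp;' entity
titles = [t for t in title_re.findall(sample_html) if '&nbsp;' not in t]
```

Running this on the sample leaves only the primary title in `titles`.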

Based on the above information, this program mainly consists of the following three steps:

  • Construct a url address pool
  • Capture the top 100 movie names
  • Print the output in sequence

II. Write the code step by step

1. Construct the url address pool. The code is as follows:

import urllib2
import re

# ---------- Build the url address pool ----------
pre_url = 'http://movie.douban.com/top250?start='
top_urls = []
# Top 100 movies, 25 per page, so 4 pages, starting from 0
for num in range(4):
    top_urls.append(pre_url + str(num * 25))
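The loop above can also be written as a list comprehension, which is the more idiomatic Python form:

```python
pre_url = 'http://movie.douban.com/top250?start='
# One url per page of 25 movies: start=0, 25, 50, 75
top_urls = [pre_url + str(num * 25) for num in range(4)]
```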

2. Capture the top 100 movie names:

# ---------- Capture the top 100 movie names ----------
top_content = []
# Titles sit in <span class="title"> elements; the English aliases
# begin with the '&nbsp;' entity and must be filtered out
top_tag = re.compile(r'<span class="title">(.+?)</span>')
for url in top_urls:
    content = urllib2.urlopen(url).read()
    pre_content = re.findall(top_tag, content)
    # Filter out the alias entries to obtain the final list
    for item in pre_content:
        if item.find('&nbsp;') == -1:
            top_content.append(item)

3. Print the output:

top_num = 1
for item in top_content:
    print 'Top' + str(top_num) + '  ' + item
    top_num += 1

III. Full code
I am a beginner in Python, so the code below is not especially pythonic or optimized.
I also prefer to use few functions in simple scripts, so as not to hide the logic.
Comments and suggestions are welcome. The code is as follows:

# coding=utf-8
'''
Automatically capture the Douban top 100 movies.
@pre_url   url prefix, here http://movie.douban.com/top250?start=
@top_urls  url address pool
@top_tag   regular expression for the movie title
'''
import urllib2
import re

pre_url = 'http://movie.douban.com/top250?start='
top_urls = []
# Titles sit in <span class="title"> elements on the Douban page
top_tag = re.compile(r'<span class="title">(.+?)</span>')
top_num = 1

# ---------- Build the url address pool ----------
# Top 100 movies, 25 per page, so 4 pages, starting from 0
for num in range(4):
    top_urls.append(pre_url + str(num * 25))

# ---------- Capture the top 100 movie names and print them ----------
for url in top_urls:
    content = urllib2.urlopen(url).read()
    pre_content = re.findall(top_tag, content)
    # Filter out the '&nbsp;' English-alias entries and print the rest
    for item in pre_content:
        if item.find('&nbsp;') == -1:
            print 'Top' + str(top_num) + '  ' + item
            top_num += 1
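For readers on Python 3, where urllib2 was merged into urllib.request, a rough equivalent sketch follows. The <span class="title"> pattern and the '&nbsp;' filter are assumptions about the page markup as described above, and crawl() is illustrative only, since the live site may reject requests without browser headers:

```python
import re
import urllib.request

PRE_URL = 'http://movie.douban.com/top250?start='
# Assumption: titles sit in <span class="title"> elements
TITLE_RE = re.compile(r'<span class="title">(.+?)</span>')


def page_urls(pages=4):
    # 25 movies per page; 4 pages cover the top 100
    return [PRE_URL + str(n * 25) for n in range(pages)]


def parse_titles(html):
    # Drop the '&nbsp;/' English-alias spans, keeping only primary titles
    return [t for t in TITLE_RE.findall(html) if '&nbsp;' not in t]


def crawl():
    rank = 1
    for url in page_urls():
        html = urllib.request.urlopen(url).read().decode('utf-8')
        for title in parse_titles(html):
            print('Top%d  %s' % (rank, title))
            rank += 1
```

Separating url construction and parsing from the network call makes both pieces testable without touching the site.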

Capture user avatars

The following script is written for Python 3, where urllib2 became urllib.request. It walks a Douban group's discussion pages, collects topic links, and downloads the images posted in each topic:

import urllib.request
import re
import time

# Fetch the html of a single page
def getHtml2(url2):
    html2 = urllib.request.urlopen(url2).read().decode('utf-8')
    return html2

# Extract the topic list from a discussion page
def gettopic(html2):
    reg2 = r'http://www.douban.com/group/topic/\d+'
    topiclist = re.findall(reg2, html2)
    x = 0  # limit the number of downloads
    for topicurl in topiclist:
        x += 1
    return topicurl

# Download the images of one topic to a local folder
def download(topic_page):
    reg3 = r'http://img3.douban.com/view/group_topic/large/public/.+\.jpg'
    imglist = re.findall(reg3, topic_page)
    i = 1
    download_img = None
    for imgurl in imglist:
        # Use the image id as the file name
        img_numlist = re.findall(r'p\d{7}', imgurl)
        for img_num in img_numlist:
            download_img = urllib.request.urlretrieve(
                imgurl, 'd:\\python\\code\\girls\\%s.jpg' % img_num)
            time.sleep(1)
            i += 1
            print(imgurl)
    return download_img

# Main loop
page_end = int(input('Enter the page number at the end: '))
num_end = page_end * 25
num = 0
page_num = 1
while num <= num_end:
    html2 = getHtml2('http://www.douban.com/group/kaopulove/discussion?start=%d' % num)
    topicurl = gettopic(html2)
    topic_page = getHtml2(topicurl)
    download_img = download(topic_page)
    num = page_num * 25
    page_num += 1
else:
    print('Collection completed!')
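The regex p\d{7} in download() assumes Douban image ids look like p1234567. Pulling the filename derivation into a small helper makes that part testable offline (the folder name here is purely illustrative):

```python
import re

# Assumption: Douban image urls embed an id of the form p + 7 digits
IMG_ID_RE = re.compile(r'p\d{7}')


def img_filename(imgurl, folder='girls'):
    """Derive a local file name from the image id embedded in the url."""
    match = IMG_ID_RE.search(imgurl)
    if match is None:
        return None  # no recognizable id in this url
    return '%s/%s.jpg' % (folder, match.group())
```

urlretrieve would then be called with this derived path instead of an inline format string.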
