How to Write a Python Crawler to Capture the Douban Top 100 Movies and User Avatars

This article shows how to write Python crawlers that capture the Douban Top 100 movies and user avatars, using Python's urllib and urllib2 modules.
I. Analyze the Douban top page and build the program structure
1. First open the page http://movie.douban.com/top250?start=0, that is, the first page of the top list.
Page through the list and note that the links covering the top 100 are:

http://movie.douban.com/top250?start=0
http://movie.douban.com/top250?start=25
http://movie.douban.com/top250?start=50
http://movie.douban.com/top250?start=75

2. Inspecting the page source, the movie titles sit in <span class="title"> elements, for example:

<span class="title">The Shawshank Redemption (Chinese title)</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>

Because the English alias titles and other descriptions share the same markup, a regular expression will also capture that noise, so further filtering is required.
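The filtering described above can be sketched as follows. This assumes the titles sit in <span class="title"> elements and that the English alias is prefixed with &nbsp;/&nbsp; (assumptions based on the page structure described above, using a hardcoded sample so no network access is needed):

```python
import re

# A hypothetical snippet mimicking the relevant part of the Douban page source
sample_html = (
    '<span class="title">The Shawshank Redemption</span>'
    '<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>'
)

title_re = re.compile(r'<span class="title">(.+?)</span>')
# Keep only primary titles; alias spans contain the '&nbsp;' entity
titles = [t for t in title_re.findall(sample_html) if '&nbsp;' not in t]
```

Running this on the sample leaves only the primary title in `titles`.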

Based on the above information, this program mainly consists of the following three steps:

  • Construct a url address pool
  • Capture the top 100 movie names
  • Print the output in sequence

II. Write the code step by step

1. Construct the url address pool. The code is as follows:

import urllib2
import re

# ---------- Build the url address pool ----------
pre_url = 'http://movie.douban.com/top250?start='
top_urls = []
# Top 100 movies, 25 per page, so 4 pages, starting from 0
for num in range(4):
    top_urls.append(pre_url + str(num * 25))
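The loop above can also be written as a list comprehension, which is the more idiomatic Python form:

```python
pre_url = 'http://movie.douban.com/top250?start='
# One url per page of 25 movies: start=0, 25, 50, 75
top_urls = [pre_url + str(num * 25) for num in range(4)]
```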

2. Capture the top 100 movie names:

# ---------- Capture the top 100 movie names ----------
top_content = []
# Titles sit in <span class="title"> elements; the English aliases
# begin with the '&nbsp;' entity and must be filtered out
top_tag = re.compile(r'<span class="title">(.+?)</span>')
for url in top_urls:
    content = urllib2.urlopen(url).read()
    pre_content = re.findall(top_tag, content)
    # Filter out the alias entries to obtain the final list
    for item in pre_content:
        if item.find('&nbsp;') == -1:
            top_content.append(item)

3. Print the output:

top_num = 1
for item in top_content:
    print 'Top' + str(top_num) + '  ' + item
    top_num += 1

III. Full code
I am a beginner in Python, so the code below is not especially pythonic or optimized.
I also prefer to use few functions in simple scripts, so as not to hide the logic.
Comments and suggestions are welcome. The code is as follows:

# coding=utf-8
'''
Automatically capture the Douban top 100 movies.
@pre_url   url prefix, here http://movie.douban.com/top250?start=
@top_urls  url address pool
@top_tag   regular expression for the movie title
'''
import urllib2
import re

pre_url = 'http://movie.douban.com/top250?start='
top_urls = []
# Titles sit in <span class="title"> elements on the Douban page
top_tag = re.compile(r'<span class="title">(.+?)</span>')
top_num = 1

# ---------- Build the url address pool ----------
# Top 100 movies, 25 per page, so 4 pages, starting from 0
for num in range(4):
    top_urls.append(pre_url + str(num * 25))

# ---------- Capture the top 100 movie names and print them ----------
for url in top_urls:
    content = urllib2.urlopen(url).read()
    pre_content = re.findall(top_tag, content)
    # Filter out the '&nbsp;' English-alias entries and print the rest
    for item in pre_content:
        if item.find('&nbsp;') == -1:
            print 'Top' + str(top_num) + '  ' + item
            top_num += 1
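For readers on Python 3, where urllib2 was merged into urllib.request, a rough equivalent sketch follows. The <span class="title"> pattern and the '&nbsp;' filter are assumptions about the page markup as described above, and crawl() is illustrative only, since the live site may reject requests without browser headers:

```python
import re
import urllib.request

PRE_URL = 'http://movie.douban.com/top250?start='
# Assumption: titles sit in <span class="title"> elements
TITLE_RE = re.compile(r'<span class="title">(.+?)</span>')


def page_urls(pages=4):
    # 25 movies per page; 4 pages cover the top 100
    return [PRE_URL + str(n * 25) for n in range(pages)]


def parse_titles(html):
    # Drop the '&nbsp;/' English-alias spans, keeping only primary titles
    return [t for t in TITLE_RE.findall(html) if '&nbsp;' not in t]


def crawl():
    rank = 1
    for url in page_urls():
        html = urllib.request.urlopen(url).read().decode('utf-8')
        for title in parse_titles(html):
            print('Top%d  %s' % (rank, title))
            rank += 1
```

Separating url construction and parsing from the network call makes both pieces testable without touching the site.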

Capture user avatars

The following script is written for Python 3, where urllib2 became urllib.request. It walks a Douban group's discussion pages, collects topic links, and downloads the images posted in each topic:

import urllib.request
import re
import time

# Fetch the html of a single page
def getHtml2(url2):
    html2 = urllib.request.urlopen(url2).read().decode('utf-8')
    return html2

# Extract the topic list from a discussion page
def gettopic(html2):
    reg2 = r'http://www.douban.com/group/topic/\d+'
    topiclist = re.findall(reg2, html2)
    x = 0  # limit the number of downloads
    for topicurl in topiclist:
        x += 1
    return topicurl

# Download the images of one topic to a local folder
def download(topic_page):
    reg3 = r'http://img3.douban.com/view/group_topic/large/public/.+\.jpg'
    imglist = re.findall(reg3, topic_page)
    i = 1
    download_img = None
    for imgurl in imglist:
        # Use the image id as the file name
        img_numlist = re.findall(r'p\d{7}', imgurl)
        for img_num in img_numlist:
            download_img = urllib.request.urlretrieve(
                imgurl, 'd:\\python\\code\\girls\\%s.jpg' % img_num)
            time.sleep(1)
            i += 1
            print(imgurl)
    return download_img

# Main loop
page_end = int(input('Enter the page number at the end: '))
num_end = page_end * 25
num = 0
page_num = 1
while num <= num_end:
    html2 = getHtml2('http://www.douban.com/group/kaopulove/discussion?start=%d' % num)
    topicurl = gettopic(html2)
    topic_page = getHtml2(topicurl)
    download_img = download(topic_page)
    num = page_num * 25
    page_num += 1
else:
    print('Collection completed!')
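The regex p\d{7} in download() assumes Douban image ids look like p1234567. Pulling the filename derivation into a small helper makes that part testable offline (the folder name here is purely illustrative):

```python
import re

# Assumption: Douban image urls embed an id of the form p + 7 digits
IMG_ID_RE = re.compile(r'p\d{7}')


def img_filename(imgurl, folder='girls'):
    """Derive a local file name from the image id embedded in the url."""
    match = IMG_ID_RE.search(imgurl)
    if match is None:
        return None  # no recognizable id in this url
    return '%s/%s.jpg' % (folder, match.group())
```

urlretrieve would then be called with this derived path instead of an inline format string.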
