[Python] Obtaining a list of users who have watched a movie, in batches, from Douban Movies

Source: Internet
Author: User

Preface

Since a later experiment requires a large amount of movie data from Douban users, I decided to collect active Douban movie users from a movie's "Douban members who have watched this movie" page.

Link Analysis

This is the "Douban members who have watched this movie" page for The Imitation Game: http://movie.douban.com/subject/10463953/collections.

One page shows 20 Douban users who have watched the movie. When the next page is clicked, the URL changes to: http://movie.douban.com/subject/10463953/collections?start=20.

Therefore, each request for the next page simply increases the index after "start" by 20.

We can therefore set base_url = 'http://movie.douban.com/subject/10463953/collections?start=' and loop i over range(0, 200, 20), building url = base_url + str(i) on each iteration.

The maximum value of i is set to 180 because, after testing, Douban only provides the most recent 200 users who have watched a movie.
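The paging scheme above can be sketched like this (Python 3 syntax here for illustration, using the movie ID from the link analysis):

```python
# A minimal sketch of the paging scheme described above.
base_url = 'http://movie.douban.com/subject/10463953/collections?start='

# start takes the values 0, 20, ..., 180 -- ten pages of 20 users each,
# since Douban only serves the most recent 200 watchers.
urls = [base_url + str(i) for i in range(0, 200, 20)]
```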

Read webpage

During access I set up an HTTP proxy, and to keep the request frequency low enough that Douban does not block my IP address, I call time.sleep(5) after reading each webpage to wait 5 seconds. While the program is running, you can go do something else.
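In Python 3 the same proxy-plus-delay setup looks roughly like this. This is a hedged sketch, not the article's code: the helper names make_opener and polite_read are my own, and the proxy address is the local one the article assumes.

```python
import time
import urllib.request

PROXY = '127.0.0.1:8087'  # assumed local HTTP proxy, as in the article

def make_opener(proxy=PROXY):
    """Build an opener that routes HTTP traffic through the proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(handler)

def polite_read(url, opener, delay=5):
    """Fetch one page, then pause so Douban does not block the IP."""
    html = opener.open(url).read()
    time.sleep(delay)
    return html
```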

Web page parsing

This time, we use the BeautifulSoup library to parse the HTML.
Each user's information in the HTML looks like this:

<table width="100%" class=""> <tr> <td width="80" valign="top"> <a href="http://movie.douban.com/people/46770381/"> </a> </td> <td valign="top"> <div class="pl2">

 

First, initialize the parser with the downloaded HTML: soup = BeautifulSoup(html). The only information we need is the user ID and the link to the user's movie homepage, so the truly useful part is this code:

<td width="80" valign="top"> <a href="http://movie.douban.com/people/46770381/"> </a> </td>

Therefore, in the Python code, td_tags = soup.findAll('td', width='80', valign='top') finds all <td width="80" valign="top"> tags.

With td = td_tags[0] and a = td.a, you get:

<a href="http://movie.douban.com/people/46770381/"> </a>

Calling link = a.get('href') returns the href attribute, which is the link to the user's movie homepage. The user ID can then be extracted by searching the string.
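The string search in that last step can be sketched as follows, using the same slicing the complete code applies to each href:

```python
# Recover the user ID from the homepage link format shown above.
link = 'http://movie.douban.com/people/46770381/'
i_start = link.find('people/')
user_id = link[i_start + 7:-1]  # skip the 7-char 'people/' prefix, drop the trailing '/'
print(user_id)  # -> 46770381
```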

Complete code
# coding=utf-8
# Get user IDs
#
# URL pattern: http://movie.douban.com/subject/26289144/collections?start=0
#              http://movie.douban.com/subject/26289144/collections?start=20

from BeautifulSoup import BeautifulSoup
import codecs
import time
import urllib2

baseUrl = 'http://movie.douban.com/subject/25895276/collections?start='

proxyInfo = '127.0.0.1:8087'
proxySupport = urllib2.ProxyHandler({'http': proxyInfo})
opener = urllib2.build_opener(proxySupport)
urllib2.install_opener(opener)


# Save user info (id, homepage link) to a file
def saveUserInfo(idList, linkList):
    if len(idList) != len(linkList):
        print 'error: len(idList) != len(linkList)!'
        return
    writeFile = codecs.open('UserIdList3.txt', 'a', 'utf-8')
    size = len(idList)
    for i in range(size):
        writeFile.write(idList[i] + '\t' + linkList[i] + '\n')
    writeFile.close()


# Parse user IDs and links from the given HTML text
def parseHtmlUserId(html):
    idList = []    # returned id list
    linkList = []  # returned link list

    soup = BeautifulSoup(html)
    # <td width="80" valign="top">
    #   <a href="http://movie.douban.com/people/liaaaar/"> </a>
    # </td>
    td_tags = soup.findAll('td', width='80', valign='top')
    i = 0
    for td in td_tags:
        # Only the first 20 entries are users who have watched the movie;
        # later entries are users who want to watch it, so discard them
        if i == 20:
            break
        a = td.a
        link = a.get('href')
        i_start = link.find('people/')
        id = link[i_start + 7:-1]
        idList.append(id)
        linkList.append(link)
        i += 1
    return (idList, linkList)


# Return the webpage content for the given start number
def getHtml(num):
    url = baseUrl + str(num)
    page = urllib2.urlopen(url)
    html = page.read()
    return html


def launch():
    # Specify a start number: a multiple of 20
    ques = raw_input('start from number? (multiples of 20) ')
    startNum = int(ques)
    if startNum % 20 != 0:
        print 'input number error!'
        return
    for i in range(startNum, 200, 20):
        print 'loading page %d...' % (i + 1)
        html = getHtml(i)
        (curIdList, curLinkList) = parseHtmlUserId(html)
        saveUserInfo(curIdList, curLinkList)
        print 'Sleeping.'
        time.sleep(5)

 
