[Python] Obtaining a list of users who have watched a movie, in batches, from Douban Movies

Source: Internet
Author: User

Preface

Since a later experiment requires a large amount of movie data from Douban users, I decided to collect active Douban movie users from a movie's "Douban members who have watched this movie" page.

Link Analysis

This is the "Douban members who have watched this movie" page for The Imitation Game: http://movie.douban.com/subject/10463953/collections.

One page shows 20 Douban users who have watched the movie. When the next page is clicked, the URL changes to: http://movie.douban.com/subject/10463953/collections?start=20.

Therefore, each request for the next page simply increases the index after "start" by 20.

We can therefore set base_url = 'http://movie.douban.com/subject/10463953/collections?start=' and loop i over range(0, 200, 20), building url = base_url + str(i) on each iteration.

The maximum value of i is set to 180 because, after testing, Douban only provides the most recent 200 users who have watched a movie.
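The paging scheme above can be sketched like this (Python 3 syntax here for illustration, using the movie ID from the link analysis):

```python
# A minimal sketch of the paging scheme described above.
base_url = 'http://movie.douban.com/subject/10463953/collections?start='

# start takes the values 0, 20, ..., 180 -- ten pages of 20 users each,
# since Douban only serves the most recent 200 watchers.
urls = [base_url + str(i) for i in range(0, 200, 20)]
```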

Read webpage

During access I set up an HTTP proxy, and to keep the request frequency low enough that Douban does not block my IP address, I call time.sleep(5) after reading each webpage to wait 5 seconds. While the program is running, you can go do something else.
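In Python 3 the same proxy-plus-delay setup looks roughly like this. This is a hedged sketch, not the article's code: the helper names make_opener and polite_read are my own, and the proxy address is the local one the article assumes.

```python
import time
import urllib.request

PROXY = '127.0.0.1:8087'  # assumed local HTTP proxy, as in the article

def make_opener(proxy=PROXY):
    """Build an opener that routes HTTP traffic through the proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(handler)

def polite_read(url, opener, delay=5):
    """Fetch one page, then pause so Douban does not block the IP."""
    html = opener.open(url).read()
    time.sleep(delay)
    return html
```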

Web page parsing

This time, we use the BeautifulSoup library to parse the HTML.
Each user's information in the HTML looks like this:

<table width="100%" class=""> <tr> <td width="80" valign="top"> <a href="http://movie.douban.com/people/46770381/"> </a> </td> <td valign="top"> <div class="pl2">

 

First, initialize the parser with the downloaded HTML: soup = BeautifulSoup(html). The only information we need is the user ID and the link to the user's movie homepage, so the truly useful part is this code:

<td width="80" valign="top"> <a href="http://movie.douban.com/people/46770381/"> </a> </td>

Therefore, in the Python code, td_tags = soup.findAll('td', width='80', valign='top') finds all <td width="80" valign="top"> tags.

With td = td_tags[0] and a = td.a, you get:

<a href="http://movie.douban.com/people/46770381/"> </a>

Calling link = a.get('href') returns the href attribute, which is the link to the user's movie homepage. The user ID can then be extracted by searching the string.
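The string search in that last step can be sketched as follows, using the same slicing the complete code applies to each href:

```python
# Recover the user ID from the homepage link format shown above.
link = 'http://movie.douban.com/people/46770381/'
i_start = link.find('people/')
user_id = link[i_start + 7:-1]  # skip the 7-char 'people/' prefix, drop the trailing '/'
print(user_id)  # -> 46770381
```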

Complete code
# coding=utf-8
# Get user IDs
#
# URL pattern: http://movie.douban.com/subject/26289144/collections?start=0
#              http://movie.douban.com/subject/26289144/collections?start=20

from BeautifulSoup import BeautifulSoup
import codecs
import time
import urllib2

baseUrl = 'http://movie.douban.com/subject/25895276/collections?start='

proxyInfo = '127.0.0.1:8087'
proxySupport = urllib2.ProxyHandler({'http': proxyInfo})
opener = urllib2.build_opener(proxySupport)
urllib2.install_opener(opener)


# Save user info (id, homepage link) to a file
def saveUserInfo(idList, linkList):
    if len(idList) != len(linkList):
        print 'error: len(idList) != len(linkList)!'
        return
    writeFile = codecs.open('UserIdList3.txt', 'a', 'utf-8')
    size = len(idList)
    for i in range(size):
        writeFile.write(idList[i] + '\t' + linkList[i] + '\n')
    writeFile.close()


# Parse user IDs and links from the given HTML text
def parseHtmlUserId(html):
    idList = []    # returned id list
    linkList = []  # returned link list

    soup = BeautifulSoup(html)
    # <td width="80" valign="top">
    #   <a href="http://movie.douban.com/people/liaaaar/"> </a>
    # </td>
    td_tags = soup.findAll('td', width='80', valign='top')
    i = 0
    for td in td_tags:
        # Only the first 20 entries are users who have watched the movie;
        # later entries are users who want to watch it, so discard them
        if i == 20:
            break
        a = td.a
        link = a.get('href')
        i_start = link.find('people/')
        id = link[i_start + 7:-1]
        idList.append(id)
        linkList.append(link)
        i += 1
    return (idList, linkList)


# Return the webpage content for the given start number
def getHtml(num):
    url = baseUrl + str(num)
    page = urllib2.urlopen(url)
    html = page.read()
    return html


def launch():
    # Specify a start number: a multiple of 20
    ques = raw_input('start from number? (multiples of 20) ')
    startNum = int(ques)
    if startNum % 20 != 0:
        print 'input number error!'
        return
    for i in range(startNum, 200, 20):
        print 'loading page %d...' % (i + 1)
        html = getHtml(i)
        (curIdList, curLinkList) = parseHtmlUserId(html)
        saveUserInfo(curIdList, curLinkList)
        print 'Sleeping.'
        time.sleep(5)

 
