Random record (Douban station crawling)



Disclaimer: Copyright retained. Reprinting is welcome, but please do not use this for commercial purposes. Contact: feixiaoxing@163.com

Anyone who likes watching movies is probably familiar with Douban. Before choosing a movie, most of us check its rating on Douban first. If the rating is too low, the film is usually not worth watching; if the rating is high, it should not be missed. Most Douban scores are fairly accurate, although malicious vote manipulation cannot be ruled out.
For a long time I have wanted to sweep up all the well-rated films in one pass, but with no ready-made tool at hand, I had to write one myself. There are plenty of scripting languages now, and Python or Go can both handle this kind of task: a few lines of code can do a lot of work.
To accomplish the task, we can break it down into several sub-problems: how to download a web page, how to extract the movie title, how to extract the score, and how to crawl from one page to the next.
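The download step can be sketched up front. The scripts in this post use Python 2's urllib2; the sketch below is a Python 3 equivalent using urllib.request, with the same subject id and browser User-Agent string the scripts use (building the Request does not touch the network until urlopen is called):

```python
import urllib.request

def build_request(subject_id, site='movie'):
    # Build a Request for a Douban subject page, attaching a browser
    # User-Agent so the server does not reject the script outright.
    url = 'http://%s.douban.com/subject/%s' % (site, subject_id)
    req = urllib.request.Request(url)
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 '
                   '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
    return req

req = build_request('25788662')
print(req.full_url)  # http://movie.douban.com/subject/25788662
```

Passing site='book', 'music', and so on covers the other crawlers below, since only the URL prefix changes.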

First, here is the Douban movie crawler I wrote, shown below:

#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # return True if this subject id is already queued
    for i in range(0, len(page)):
        if data == page[i]:
            return True
    return False

if __name__ == '__main__':
    num = 0
    page.append('25788662')
    while num < len(page):
        # sleep between requests
        time.sleep(2)

        # produce url address
        url = 'http://movie.douban.com/subject/' + page[num]
        num += 1

        # get web data, pretending to be a browser
        req = urllib2.Request(str(url))
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            continue
        except urllib2.URLError, e:
            continue
        webdata = request.read()

        # get title
        find = re.search(r'<title>\n(.*?)\(.*?\)\n</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')

        # get score
        find = re.search(r'<strong class=.*? property=.*?>(\d\.\d)', webdata)
        if find is None:
            continue
        score = find.group(1)

        # print info about the film
        print '%s %s %s' % (url, title, score)

        # queue every new subject id found in the page
        find = re.findall(r'http://movie.douban.com/subject/(\d{7,8})', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])
(1) Downloading a web page is simple; urllib2 handles it;
(2) The title and score are extracted with regular expressions;
(3) The crawl order is a simple queue: newly found subject ids are appended to a list and processed in order, which is easy to implement;
(4) Many sites block clients that crawl continuously, so we use add_header to disguise the script as a browser;
(5) Do not crawl too aggressively; sleep for a few seconds between requests;
(6) Handle every exception that can occur during crawling, otherwise the crawler will not run for long;
(7) Look for patterns in the site's URLs; for example, Douban movie pages all follow the structure movie.douban.com/subject/*;
(8) Experiment boldly and fix mistakes as you go; few things work on the first try.
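Point (2) can be exercised offline. Below is a minimal Python 3 sketch of the two extraction regexes from the movie script; the sample HTML is fabricated to match the shape the regexes expect, since the real page layout may well have changed:

```python
import re

def extract_title_and_score(webdata):
    # Apply the same two regexes the movie crawler uses.
    m = re.search(r'<title>\n(.*?)\(.*?\)\n</title>', webdata)
    if m is None:
        return None
    title = m.group(1).strip()
    m = re.search(r'<strong class=.*? property=.*?>(\d\.\d)', webdata)
    if m is None:
        return None
    return title, m.group(1)

# fabricated snippet shaped like the page the regexes expect
sample = ('<title>\nInception (Movie)\n</title>'
          '<strong class="ll rating_num" property="v:average">9.3</strong>')
print(extract_title_and_score(sample))  # ('Inception', '9.3')
```

Returning None on a failed match plays the same role as the continue statements in the crawler loop.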

The above only crawls movies; with a simple modification it becomes a book crawler:

#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # return True if this subject id is already queued
    for i in range(0, len(page)):
        if data == page[i]:
            return True
    return False

if __name__ == '__main__':
    num = 0
    page.append('25843109')
    while num < len(page):
        # sleep between requests
        time.sleep(2)

        # produce url address
        url = 'http://book.douban.com/subject/' + page[num]
        num += 1

        # get web data, pretending to be a browser
        req = urllib2.Request(str(url))
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            continue
        except urllib2.URLError, e:
            continue
        webdata = request.read()

        # get title
        find = re.search(r'<title>(.*?)\(.*?\)</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')

        # get score
        find = re.search(r'<strong class=.*? property=.*?>\n.*?(\d\.\d)', webdata)
        if find is None:
            continue
        score = find.group(1)

        # print info about the book
        print '%s %s %s' % (url, title, score)

        # queue every new subject id found in the page
        find = re.findall(r'http://book.douban.com/subject/(\d{7,8})', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])

Music can be crawled the same way:

#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # return True if this subject id is already queued
    for i in range(0, len(page)):
        if data == page[i]:
            return True
    return False

if __name__ == '__main__':
    num = 0
    page.append('25720661')
    while num < len(page):
        # sleep between requests
        time.sleep(2)

        # produce url address
        url = 'http://music.douban.com/subject/' + page[num]
        num += 1

        # get web data, pretending to be a browser
        req = urllib2.Request(str(url))
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            continue
        except urllib2.URLError, e:
            continue
        webdata = request.read()

        # get title
        find = re.search(r'<title>\n(.*?)\(.*?\)\n</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')

        # get score
        find = re.search(r'<strong class=.*? property=.*?>(\d\.\d)', webdata)
        if find is None:
            continue
        score = find.group(1)

        # print info about the album
        print '%s %s %s' % (url, title, score)

        # queue every new subject id found in the page
        find = re.findall(r'http://music.douban.com/subject/(\d{7,8})', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])

Finally, crawling interest groups is the most fun part: you run into a lot of amusing groups.

#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # return True if this group id is already queued
    for i in range(0, len(page)):
        if data == page[i]:
            return True
    return False

if __name__ == '__main__':
    num = 0
    page.append('angel')
    while num < len(page):
        # sleep between requests
        time.sleep(2)

        # produce url address
        url = 'http://www.douban.com/group/' + page[num]
        num += 1

        # get web data, pretending to be a browser
        req = urllib2.Request(str(url))
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            continue
        except urllib2.URLError, e:
            continue
        webdata = request.read()

        # get title
        find = re.search(r'<title>\n(.*?)\n</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')

        # print info about the group
        print '%s %s' % (url, title)

        # queue every new group id found in the page
        find = re.findall(r'http://www.douban.com/group/([\w\d]+?)/', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])

Finally, when you crawl Douban, please try to be civilized about it. Crawl too aggressively and the server will ban you.
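On that last point, one way to keep the crawl civilized is to centralize the delay instead of scattering time.sleep calls. The Throttle class below is a hypothetical helper of my own, not part of the scripts above; its clock and sleep functions are injectable so the logic can be verified without waiting:

```python
import time

class Throttle(object):
    # Keep at least min_interval seconds between consecutive requests,
    # sleeping only for the remainder if some time has already passed.
    def __init__(self, min_interval=2.0, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None and now - self.last < self.min_interval:
            self.sleep(self.min_interval - (now - self.last))
        self.last = self.clock()
```

Calling wait() before each urlopen keeps the spacing consistent even when page processing itself takes a variable amount of time.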


