"Disclaimer: This article is copyrighted. Reposting is welcome, but please do not use it for commercial purposes. Contact: feixiaoxing@163.com"
Anyone who likes watching movies is no stranger to Douban. Before picking a movie, most of us check its Douban rating first. If the rating is too low, the film is hardly worth watching; if it is high, it should not be missed. Douban scores are mostly accurate, although malicious vote-brushing cannot be ruled out.
For a long time I have wanted to sweep through all the good films in one go, but with no ready-made tool I had to write my own. There are plenty of scripting languages nowadays; Python or Go can both do this kind of job, and a few lines of code go a long way.
To accomplish this task, we can break it into several sub-problems: how to download a web page, how to extract the movie title, how to extract the score, how to crawl from page to page, and so on.
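The skeleton of such a crawler, a worklist of page ids plus a check against revisiting, can be sketched in modern Python 3. The `fetch` parameter and the fake pages below are invented for illustration, so the traversal logic can run without touching the network:

```python
import re
from collections import deque

def crawl(start_id, fetch, limit=10):
    """Walk Douban-style numeric subject ids front to back.

    `fetch` is a caller-supplied function mapping an id to page HTML,
    so the traversal can be exercised without any network access.
    """
    seen = {start_id}
    queue = deque([start_id])
    visited = []
    while queue and len(visited) < limit:
        page_id = queue.popleft()
        visited.append(page_id)
        html = fetch(page_id)
        # Queue every 7-8 digit subject id linked from this page.
        for link in re.findall(r'/subject/(\d{7,8})', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Tiny in-memory "site" standing in for real Douban pages.
fake_site = {
    '25788662': '<a href="/subject/1234567">x</a><a href="/subject/7654321">y</a>',
    '1234567': '<a href="/subject/25788662">back</a>',
    '7654321': '',
}
print(crawl('25788662', fake_site.get))  # ['25788662', '1234567', '7654321']
```

The `seen` set plays the same role as the membership check in the real script below, just with O(1) lookups.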
First, here is the Douban movie crawler I wrote myself, shown below:
```python
#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # True if this subject id has been queued already
    return data in page

if __name__ == '__main__':
    num = 0
    page.append('25788662')
    while num < len(page):
        # sleep between requests
        time.sleep(2)
        # produce the url address
        url = 'http://movie.douban.com/subject/' + page[num]
        num += 1
        # get web data, pretending to be a browser
        req = urllib2.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 '
                       '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError:
            # HTTPError is a subclass of URLError, so catch it first
            continue
        except urllib2.URLError:
            continue
        webdata = request.read()
        # get title
        find = re.search(r'<title>\n(.*?)\(.*?\)\n</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')
        # get score
        find = re.search(r'<strong class=.*? property=.*?>(\d\.\d)', webdata)
        if find is None:
            continue
        score = find.group(1)
        # print info about the film
        print '%s %s %s' % (url, title, score)
        # queue every linked subject id we have not seen yet
        find = re.findall(r'http://movie.douban.com/subject/(\d{7,8})', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])
```

A few notes on the script:

(1) Downloading a web page is simple; urllib2 solves it almost by formula;
(2) Extracting the title and the score is done mainly with regular expressions;
(3) For the crawling strategy, newly discovered page ids are simply appended to a list that is scanned front to back; it is simple and easy to implement;
(4) Many sites block sustained automated crawling, so we use add_header to disguise the script as a browser;
(5) Do not crawl too aggressively; it is usual to sleep a few seconds between requests;
(6) Every exception that can occur while crawling should be handled, or the script will not run for long;
(7) Actively look for patterns in the site's URLs; Douban movie pages all follow the structure movie.douban.com/subject/*;
(8) Experiment boldly and fix mistakes as you go; few things work on the first try.
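Point (2) can be illustrated offline with a small Python 3 sketch. The HTML fragment below is invented for illustration, shaped like a Douban movie page rather than copied from the live site, and the patterns are the same ones the script uses:

```python
import re

# Minimal fragment shaped like a Douban movie page (invented markup).
webdata = ('<title>\n'
           'The Shawshank Redemption (1994)\n'
           '</title>\n'
           '<strong class="ll rating_num" property="v:average">9.7</strong>')

# Title: the text before the parenthesized year inside <title>.
title = re.search(r'<title>\n(.*?)\(.*?\)\n</title>', webdata).group(1).strip()
# Score: a single digit, dot, digit after the <strong ... property=...> tag.
score = re.search(r'<strong class=.*? property=.*?>(\d\.\d)', webdata).group(1)
print(title, score)  # The Shawshank Redemption 9.7
```

Note how the lazy `.*?` quantifiers keep each match from running past the first closing delimiter.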
The above only crawls movies; a simple modification turns it into a book query:
```python
#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # True if this subject id has been queued already
    return data in page

if __name__ == '__main__':
    num = 0
    page.append('25843109')
    while num < len(page):
        # sleep between requests
        time.sleep(2)
        # produce the url address
        url = 'http://book.douban.com/subject/' + page[num]
        num += 1
        # get web data, pretending to be a browser
        req = urllib2.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 '
                       '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError:
            continue
        except urllib2.URLError:
            continue
        webdata = request.read()
        # get title (book titles sit on one line)
        find = re.search(r'<title>(.*?)\(.*?\)</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')
        # get score (the rating sits on the line after the tag)
        find = re.search(r'<strong class=.*? property=.*?>\n.*?(\d\.\d)', webdata)
        if find is None:
            continue
        score = find.group(1)
        # print info about the book
        print '%s %s %s' % (url, title, score)
        # queue every linked subject id we have not seen yet
        find = re.findall(r'http://book.douban.com/subject/(\d{7,8})', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])
```
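The only real changes for books are the base URL and the regexes; in particular, the score pattern has to step over a newline because the rating is on the line after the `<strong>` tag. A Python 3 sketch with an invented fragment shows the difference:

```python
import re

# Invented fragment shaped like a Douban book page: the rating sits
# on the line after the <strong ...> tag.
webdata = ('<strong class="ll rating_nums" property="v:average">\n'
           '        8.9</strong>')

# The \n and lazy .*? let the pattern cross the line break.
score = re.search(r'<strong class=.*? property=.*?>\n.*?(\d\.\d)', webdata).group(1)
print(score)  # 8.9
```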
The same goes for crawling music:
```python
#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # True if this subject id has been queued already
    return data in page

if __name__ == '__main__':
    num = 0
    page.append('25720661')
    while num < len(page):
        # sleep between requests
        time.sleep(2)
        # produce the url address
        url = 'http://music.douban.com/subject/' + page[num]
        num += 1
        # get web data, pretending to be a browser
        req = urllib2.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 '
                       '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError:
            continue
        except urllib2.URLError:
            continue
        webdata = request.read()
        # get title
        find = re.search(r'<title>\n(.*?)\(.*?\)\n</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')
        # get score
        find = re.search(r'<strong class=.*? property=.*?>(\d\.\d)', webdata)
        if find is None:
            continue
        score = find.group(1)
        # print info about the album
        print '%s %s %s' % (url, title, score)
        # queue every linked subject id we have not seen yet
        find = re.findall(r'http://music.douban.com/subject/(\d{7,8})', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])
```
Finally comes crawling interest groups, which is the most fun part; you can discover a lot of amusing groups this way:
```python
#!/usr/bin/python
# encoding=utf-8
import re
import time
import urllib2

page = []

def check_num_exist(data):
    # True if this group name has been queued already
    return data in page

if __name__ == '__main__':
    num = 0
    page.append('angel')
    while num < len(page):
        # sleep between requests
        time.sleep(2)
        # produce the url address
        url = 'http://www.douban.com/group/' + page[num]
        num += 1
        # get web data, pretending to be a browser
        req = urllib2.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 '
                       '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        try:
            request = urllib2.urlopen(req)
        except urllib2.HTTPError:
            continue
        except urllib2.URLError:
            continue
        webdata = request.read()
        # get title
        find = re.search(r'<title>\n(.*?)\n</title>', webdata)
        if find is None:
            continue
        title = find.group(1).strip().decode('utf-8')
        # print info about the group (group pages have no score)
        print '%s %s' % (url, title)
        # queue every linked group name we have not seen yet
        find = re.findall(r'http://www.douban.com/group/([\w\d]+?)/', webdata)
        for i in range(0, len(find)):
            if not check_num_exist(find[i]):
                page.append(find[i])
```
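Group pages differ from the subject crawlers in one interesting way: they are keyed by a name rather than a numeric id, so the link-harvesting regex captures a word segment instead of digits. A Python 3 sketch with an invented fragment:

```python
import re

# Invented fragment with Douban-style group links.
webdata = ('<a href="http://www.douban.com/group/angel/">angel</a>'
           '<a href="http://www.douban.com/group/travel/">travel</a>')

# Capture the name segment between /group/ and the trailing slash.
names = re.findall(r'http://www.douban.com/group/([\w\d]+?)/', webdata)
print(names)  # ['angel', 'travel']
```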
Finally, I hope everyone crawls Douban politely. Hit the site too hard and the server will ban you.
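One polite-crawling habit beyond a fixed sleep is to add random jitter, so requests do not arrive on an exact metronome. A small Python 3 sketch; the interval values are illustrative, not Douban's actual limits:

```python
import time
import random

def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for base seconds plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Tiny values here purely so the demo returns quickly.
d = polite_sleep(base=0.01, jitter=0.01)
print(0.01 <= d <= 0.02)  # True
```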