Today I used a Python crawler to automatically scrape jokes from Qiushibaike (the "Embarrassing Encyclopedia"). Because the site does not require a login, crawling it is relatively simple. The program prints one joke each time you press Enter. The code is based on http://cuiqingcai.com/990.html, but that blogger's code seemed to have some problems, so I made some changes and it now runs successfully. The code is as follows:
# -*- coding:utf-8 -*-
__author__ = 'Jz'
import urllib2
import re

# Qiushibaike crawler (Python 2)
class QSBK:
    # Initialize
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'
        self.headers = {'User-Agent': self.user_agent}
        # Each element of joke is the list of jokes from one page
        self.joke = []
        # Whether to continue running
        self.enable = False

    # Fetch one page and return its HTML as a unicode string, or None on failure
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            request = urllib2.Request(url = url, headers = self.headers)
            response = urllib2.urlopen(request)
            pageContent = response.read().decode('utf-8')
            return pageContent
        except urllib2.URLError as e:
            if hasattr(e, 'reason'):
                print 'Failed to crawl jokes, reason:', e.reason
                return None

    # Return the text-only jokes of one page as [publisher, content, likes] lists
    def getJokeList(self, pageIndex):
        pageContent = self.getPage(pageIndex)
        if not pageContent:
            print 'Failed to get jokes...'
            return None
        # The third group is used to determine whether the joke has a picture
        pattern = re.compile(r'<div.*?class="author">.*?<a.*?>.*?\n(.*?)\n</a>.*?</div>.*?<div class="content">\n\n(.*?)\n<!--.*?-->.*?</div>' +
                             r'(.*?)class="stats">.*?<span.*?class="stats-vote"><i.*?class="number">(.*?)</i>'
                             , re.S)
        jokes = re.findall(pattern, pageContent)
        pageJokes = []
        # Filter out the jokes with pictures
        for joke in jokes:
            hasImg = re.search('img', joke[2])
            # joke[0] is the publisher, joke[1] the joke content, joke[3] the number of likes
            if not hasImg:
                pageJokes.append([joke[0].strip(), joke[1].strip(), joke[3].strip()])
        return pageJokes

    def loadPage(self):
        if self.enable == True:
            # Load a new page if fewer than two unread pages are buffered
            if len(self.joke) < 2:
                pageJokes = self.getJokeList(self.pageIndex)
                if pageJokes:
                    self.joke.append(pageJokes)
                    self.pageIndex += 1

    # Print one joke each time Enter is pressed
    def getOneJoke(self, pageJokes, page):
        jokes = pageJokes
        for joke in jokes:
            userInput = raw_input('Press Enter to continue or Q/q to quit: ')
            self.loadPage()
            if userInput == 'Q' or userInput == 'q':
                self.enable = False
                print 'Quitting crawler...'
                return
            # The u prefix is required here, see note 3 below the code
            print u'joke content: %s\npage %d\tpublisher: %s\tlikes: %s' % (joke[1], page, joke[0], joke[2])

    def start(self):
        print 'Grabbing jokes from Qiushibaike, press Enter to view a new joke, press Q/q to quit...'
        self.enable = True
        self.loadPage()
        page = 0
        while self.enable:
            if len(self.joke) > 0:
                pageJokes = self.joke[0]
                page += 1
                # Delete the page that has been read
                del self.joke[0]
                self.getOneJoke(pageJokes, page)

spider = QSBK()
spider.start()
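The two central steps of the listing, sending a browser-like User-Agent header and using a capture group to drop jokes that carry a picture, can be tried offline. The following is only a sketch under stated assumptions: it uses Python 3's urllib.request instead of the Python 2 urllib2 module used above, the names build_request and extract_text_jokes are hypothetical, and both the sample HTML and the simplified pattern are made up for illustration and do not reproduce the site's real markup.

```python
import re
from urllib.request import Request

# Same browser-like User-Agent string as in the listing above.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'


def build_request(page_index):
    """Build (but do not send) the request for one page of hot jokes."""
    url = 'http://www.qiushibaike.com/hot/page/' + str(page_index)
    return Request(url, headers={'User-Agent': USER_AGENT})


# Simplified, assumed pattern with four groups:
# publisher, joke content, the markup that may hold an <img>, like count.
PATTERN = re.compile(
    r'<div class="author">(.*?)</div>.*?'
    r'<div class="content">(.*?)</div>'
    r'(.*?)<div class="stats">.*?<i class="number">(.*?)</i>',
    re.S)

# Invented sample markup: one text-only joke and one joke with a picture.
SAMPLE_HTML = '''
<div class="author">alice</div>
<div class="content">A text-only joke.</div>
<div class="stats"><i class="number">12</i></div>
<div class="author">bob</div>
<div class="content">A joke with a picture.</div>
<div class="thumb"><img src="pic.jpg"></div>
<div class="stats"><i class="number">34</i></div>
'''


def extract_text_jokes(html):
    """Return [publisher, content, likes] for every joke without a picture."""
    jokes = []
    for author, content, between, votes in PATTERN.findall(html):
        # The third group holds whatever sits between content and stats;
        # if it mentions 'img', the joke has a picture and is skipped.
        if re.search('img', between):
            continue
        jokes.append([author.strip(), content.strip(), votes.strip()])
    return jokes


if __name__ == '__main__':
    request = build_request(1)
    print(request.full_url)
    print(extract_text_jokes(SAMPLE_HTML))
```

Building the request without sending it keeps the sketch independent of the network and of the site's current layout, which may well differ from what the pattern in the listing expects.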
Comments are included in the code. There are a few points to note:
1. A User-Agent header must be added to the request to disguise the crawler as a browser; otherwise the page content cannot be fetched.
2. When writing the regular expression, you also need to capture the content that shows whether a joke has an accompanying picture (the third capture group of the pattern in getJokeList).
3. When formatting the output in the getOneJoke function, the format string must be prefixed with u; otherwise the following error is reported:
Traceback (most recent call last):
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 84, in <module>
    spider.start()
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 81, in start
    self.getOneJoke(pageJokes, page)
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 68, in getOneJoke
    print 'joke content: %s\npage %d\tpublisher: %s\tlikes: %s' % (joke[1], page, joke[0], joke[2])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)
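This is because joke[0] and the other fields are unicode objects (the page was decoded from UTF-8), while an unprefixed string literal in Python 2 is a byte string. When the two are combined with %, Python 2 implicitly decodes the byte string with its default ASCII codec, which fails on non-ASCII bytes such as the 0xe7 in the traceback. Prefixing the format string with u makes it a unicode object as well, so no implicit decoding is needed.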
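The failing step can be reproduced explicitly. The sketch below runs on both Python 2 and Python 3 and performs by hand the ASCII decode that Python 2 attempts implicitly; the byte string raw and the helper try_ascii_decode are hypothetical names chosen for illustration, and the sample character is an arbitrary one whose UTF-8 encoding starts with 0xe7.

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes of the single character U+7B11; the first byte is 0xe7.
raw = b'\xe7\xac\x91'


def try_ascii_decode(data):
    """Return the UnicodeDecodeError raised by an ASCII decode, or None."""
    try:
        data.decode('ascii')
        return None
    except UnicodeDecodeError as err:
        return err


# Decoding with the ASCII codec fails on the very first byte.
err = try_ascii_decode(raw)

# Decoding with the right codec works, and formatting then stays
# entirely in unicode, which is what the u prefix achieves.
text = raw.decode('utf-8')
formatted = u'joke content: %s' % text

if __name__ == '__main__':
    print(err)
    print(formatted == u'joke content: \u7b11')
```

On Python 3 string literals are already unicode, so the u prefix is accepted but no longer needed there.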
Python crawler: scraping jokes from Qiushibaike