Python crawler: scraping jokes from Qiushibaike (the "Embarrassing Encyclopedia")

Source: Internet
Author: User

Today I used a Python crawler to automatically scrape jokes from Qiushibaike (the "Embarrassing Encyclopedia"). Because the site does not require a login, crawling it is relatively simple. The program prints one joke each time Enter is pressed. The code is based on http://cuiqingcai.com/990.html, but the blogger's code seemed to have some problems, so I made some changes and it now runs successfully. The code is as follows:

# -*- coding: utf-8 -*-
__author__ = 'Jz'
import urllib2
import re

# Qiushibaike crawler (Python 2)
class QSBK:
    # Initialize
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'
        self.headers = {'User-Agent': self.user_agent}
        # Each element of self.joke holds the jokes of one page
        self.joke = []
        # Whether the crawler should keep running
        self.enable = False

    # Fetch the HTML of one page, or return None on failure
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            request = urllib2.Request(url=url, headers=self.headers)
            response = urllib2.urlopen(request)
            pageContent = response.read().decode('utf-8')
            return pageContent
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'Failed to fetch jokes, reason:', e.reason
                return None

    # Extract the text-only jokes of one page
    def getJokeList(self, pageIndex):
        pageContent = self.getPage(pageIndex)
        if not pageContent:
            print 'Failed to get jokes...'
            return None
        # The third group is used to determine whether the joke has a picture
        pattern = re.compile(r'<div.*?class="author">.*?<a.*?>.*?\n(.*?)\n</a>.*?</div>.*?<div class="content">\n\n(.*?)\n<!--.*?-->.*?</div>' +
                             r'(.*?)class="stats">.*?<span.*?class="stats-vote"><i.*?class="number">(.*?)</i>',
                             re.S)
        jokes = re.findall(pattern, pageContent)
        pageJokes = []
        # Filter out the jokes with pictures
        for joke in jokes:
            hasImg = re.search('img', joke[2])
            # joke[0] is the publisher, joke[1] the joke content, joke[3] the number of likes
            if not hasImg:
                pageJokes.append([joke[0].strip(), joke[1].strip(), joke[3].strip()])
        return pageJokes

    def loadPage(self):
        if self.enable:
            # Load a new page if fewer than two unread pages are buffered
            if len(self.joke) < 2:
                pageJokes = self.getJokeList(self.pageIndex)
                if pageJokes:
                    self.joke.append(pageJokes)
                    self.pageIndex += 1

    # Print one joke each time Enter is pressed
    def getOneJoke(self, pageJokes, page):
        jokes = pageJokes
        for joke in jokes:
            userInput = raw_input('Press Enter to continue, or Q/q to quit: ')
            self.loadPage()
            if userInput == 'Q' or userInput == 'q':
                self.enable = False
                print 'Quitting crawler...'
                return
            # The u prefix is required here, see note 3 below
            print u'Joke content: %s\nPage %d\tPublisher: %s\tLikes: %s' % (joke[1], page, joke[0], joke[2])

    def start(self):
        print 'Fetching jokes from Qiushibaike. Press Enter to view a new joke, Q/q to quit...'
        self.enable = True
        self.loadPage()
        page = 0
        while self.enable:
            if len(self.joke) > 0:
                pageJokes = self.joke[0]
                page += 1
                # Delete the page that has been read
                del self.joke[0]
                self.getOneJoke(pageJokes, page)

spider = QSBK()
spider.start()
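The two core steps of the program, building a request with a browser-like User-Agent (getPage) and extracting only the text jokes with a regular expression (getJokeList), can be tried in isolation. The following is a minimal sketch under several assumptions: it is written for Python 3 (urllib.request instead of urllib2), it never sends the request, the pattern is a simplified stand-in for the one above, and SAMPLE_HTML is made-up markup rather than the real page. The names build_request, extract_text_jokes, JOKE_PATTERN and SAMPLE_HTML are hypothetical.

```python
import re
import urllib.request

# URL prefix and User-Agent string as used in the article's code.
BASE_URL = 'http://www.qiushibaike.com/hot/page/'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'


def build_request(page_index):
    """Build (but do not send) a request carrying a browser-like User-Agent."""
    return urllib.request.Request(BASE_URL + str(page_index),
                                  headers={'User-Agent': USER_AGENT})


# Simplified pattern (assumption): author, content, whatever sits between the
# content and the stats block, and the vote count.
JOKE_PATTERN = re.compile(
    r'<div class="author">(.*?)</div>.*?'
    r'<div class="content">(.*?)</div>'
    r'(.*?)<div class="stats">.*?<i class="number">(.*?)</i>',
    re.S)


def extract_text_jokes(html):
    """Return [author, content, votes] for every joke without a picture."""
    result = []
    for author, content, between, votes in JOKE_PATTERN.findall(html):
        if re.search('img', between):
            continue  # the third group contains an image, so skip this joke
        result.append([author.strip(), content.strip(), votes.strip()])
    return result


# Made-up markup that only mimics the structure the pattern expects.
SAMPLE_HTML = '''
<div class="author"> alice </div>
<div class="content"> first joke </div>
<div class="stats"><i class="number">12</i></div>
<div class="author"> bob </div>
<div class="content"> second joke </div>
<div class="thumb"><img src="x.jpg"></div>
<div class="stats"><i class="number">34</i></div>
'''

request = build_request(1)
text_jokes = extract_text_jokes(SAMPLE_HTML)
print(request.full_url)
print(text_jokes)
```

With this sample, only the first joke is kept, because the second one has an img tag between its content and its stats block. Capturing that in-between text as its own group is what makes the picture check possible without a second pass over the page.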

The code is commented. A few points are worth noting:

1. A User-Agent header must be added so that the request looks like it comes from a browser; otherwise the page content cannot be fetched.

2. When writing the regular expression, you also need to capture the content that shows whether a joke has an accompanying picture (the third group in the code), so that jokes with pictures can be filtered out.

3. When formatting the output in the getOneJoke function, the format string must be prefixed with u; otherwise the following error is raised:

Traceback (most recent call last):
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 84, in <module>
    spider.start()
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 81, in start
    self.getOneJoke(pageJokes, page)
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 68, in getOneJoke
    print 'Joke content: %s\nPage %d\tPublisher: %s\tLikes: %s' % (joke[1], page, joke[0], joke[2])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)

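This happens because joke[0] and the other fields are unicode objects (the page content was decoded from UTF-8), while a plain string literal in Python 2 is a byte string. When the two are combined with %, Python 2 implicitly decodes the byte string with the default ASCII codec, which fails as soon as the literal contains a non-ASCII byte (0xe7 in the traceback). Prefixing the literal with u makes it a unicode string as well, so no implicit decoding takes place.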
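The implicit conversion can be reproduced explicitly. The sketch below assumes Python 3, where bytes and text never mix silently, so the ASCII decode that Python 2 performs behind the scenes has to be written out. The sample format string and the helper decode_as_ascii are made up for illustration and are not the article's exact string.

```python
# UTF-8 bytes of a format string containing Chinese text. This stands in for
# a Python 2 byte-string literal without the u prefix (hypothetical sample).
byte_format = '内容:%s'.encode('utf-8')


def decode_as_ascii(raw):
    """Mimic Python 2's implicit ASCII decode; return the text or the error."""
    try:
        return raw.decode('ascii')
    except UnicodeDecodeError as err:
        return err


# The implicit step fails because the bytes are not ASCII.
outcome = decode_as_ascii(byte_format)
print(type(outcome).__name__, outcome)

# Decoding with the real encoding works; in Python 2 a u'' literal
# achieves the same thing by making the format string unicode up front.
text_format = byte_format.decode('utf-8')
line = text_format % '笑话'
print(line)
```

The first call yields a UnicodeDecodeError of the same kind as in the traceback, while the second path formats without error.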
