Python crawler: scraping jokes from Qiushibaike (the "Embarrassing Encyclopedia")

Last Update: 2015-10-19 Source: Internet

Author: User

Today I used a Python crawler to automatically scrape jokes from Qiushibaike (the "Embarrassing Encyclopedia"). Because the site does not require a login, crawling it is relatively simple. The program prints one joke each time you press Enter. The code is based on http://cuiqingcai.com/990.html, but that blogger's code seemed to have some problems, so I made a few changes until it ran successfully. The code targets Python 2 (it uses urllib2, raw_input, and the print statement) and is shown below:

# -*- coding: utf-8 -*-
__author__ = 'Jz'
import urllib2
import re


# Qiushibaike crawler
class QSBK:
    # Initialize
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'
        self.headers = {'User-Agent': self.user_agent}
        # Each element of joke holds the jokes of one page
        self.joke = []
        # Whether to keep running
        self.enable = False

    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            request = urllib2.Request(url=url, headers=self.headers)
            response = urllib2.urlopen(request)
            pageContent = response.read().decode('utf-8')
            return pageContent
        except urllib2.URLError as e:
            if hasattr(e, 'reason'):
                print 'Failed to fetch jokes, reason:', e.reason
            return None

    def getJokeList(self, pageIndex):
        pageContent = self.getPage(pageIndex)
        if not pageContent:
            print 'Failed to get jokes...'
            return None
        # The third group is used to determine whether the joke has a picture
        pattern = re.compile(r'<div.*?class="author">.*?<a.*?>.*?\n(.*?)\n</a>.*?</div>.*?<div class="content">\n\n(.*?)\n<!--.*?-->.*?</div>' +
                             r'(.*?)class="stats">.*?<span.*?class="stats-vote"><i.*?class="number">(.*?)</i>',
                             re.S)
        jokes = re.findall(pattern, pageContent)
        pageJokes = []
        # Filter out the jokes with pictures
        for joke in jokes:
            hasImg = re.search('img', joke[2])
            # joke[0] is the publisher, joke[1] the joke content, joke[3] the number of likes
            if not hasImg:
                pageJokes.append([joke[0].strip(), joke[1].strip(), joke[3].strip()])
        return pageJokes

    def loadPage(self):
        if self.enable:
            # Load a new page if fewer than two unread pages are buffered
            if len(self.joke) < 2:
                pageJokes = self.getJokeList(self.pageIndex)
                if pageJokes:
                    self.joke.append(pageJokes)
                    self.pageIndex += 1

    # Print one joke each time Enter is pressed
    def getOneJoke(self, pageJokes, page):
        jokes = pageJokes
        for joke in jokes:
            userInput = raw_input('Press Enter to continue, or Q/q to quit: ')
            self.loadPage()
            if userInput == 'Q' or userInput == 'q':
                self.enable = False
                print 'Quitting crawler...'
                return
            print u'Joke content: %s\nPage %d\tPublisher: %s\tLikes: %s' % (joke[1], page, joke[0], joke[2])

    def start(self):
        print 'Fetching jokes from Qiushibaike. Press Enter to view a new joke, or Q/q to quit...'
        self.enable = True
        self.loadPage()
        page = 0
        while self.enable:
            if len(self.joke) > 0:
                pageJokes = self.joke[0]
                page += 1
                # Delete the page that has been read
                del self.joke[0]
                self.getOneJoke(pageJokes, page)
            else:
                # Nothing buffered: retry once, and stop instead of spinning if the fetch fails
                self.loadPage()
                if not self.joke:
                    self.enable = False


spider = QSBK()
spider.start()
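The request-building and parsing steps of the listing can be tried offline. The following is a minimal sketch, assuming Python 3 (so urllib.request stands in for urllib2) and a made-up SAMPLE_HTML fragment that only imitates the markup the regular expression expects. The names build_request, extract_jokes, and SAMPLE_HTML are illustrative and do not appear in the original code; nothing is sent over the network.

```python
import re
import urllib.request

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'

# Same pattern as the listing. Groups: author, content,
# markup between content and stats (used for the image check), likes.
JOKE_PATTERN = re.compile(
    r'<div.*?class="author">.*?<a.*?>.*?\n(.*?)\n</a>.*?</div>.*?'
    r'<div class="content">\n\n(.*?)\n<!--.*?-->.*?</div>'
    r'(.*?)class="stats">.*?<span.*?class="stats-vote">'
    r'<i.*?class="number">(.*?)</i>',
    re.S)

# Hypothetical page fragment: the second joke carries an image.
SAMPLE_HTML = "\n".join([
    '<div class="author">',
    '<a href="/users/1">',
    'alice',
    '</a>',
    '</div>',
    '<div class="content">',
    '',
    'First joke text',
    '<!--1445212800-->',
    '</div>',
    '<div class="stats">',
    '<span class="stats-vote"><i class="number">120</i></span>',
    '</div>',
    '<div class="author">',
    '<a href="/users/2">',
    'bob',
    '</a>',
    '</div>',
    '<div class="content">',
    '',
    'Second joke text',
    '<!--1445212900-->',
    '</div>',
    '<div class="thumb"><img src="pic.jpg"/></div>',
    '<div class="stats">',
    '<span class="stats-vote"><i class="number">45</i></span>',
    '</div>',
])


def build_request(page_index):
    """Build (but do not send) the request for one page of hot jokes."""
    url = 'http://www.qiushibaike.com/hot/page/' + str(page_index)
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def extract_jokes(html):
    """Return [author, content, likes] for every joke without an image."""
    jokes = []
    for author, content, extra, likes in JOKE_PATTERN.findall(html):
        if 'img' not in extra:
            jokes.append([author.strip(), content.strip(), likes.strip()])
    return jokes


request = build_request(1)
print(request.full_url)
print(extract_jokes(SAMPLE_HTML))
```

In this sample only the first joke survives the filter, because the third group of the second match contains the img tag. The real site's markup may differ from the sample, so the pattern would need to be checked against a live page before being relied on.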

The code is commented, and a few points are worth noting:

1. A User-Agent header must be sent to disguise the request as coming from a browser; otherwise the page content cannot be fetched.

2. The regular expression has to capture the markup that follows the joke content (the third group), so that the code can check whether the joke has an accompanying picture.

3. When formatting the output in the getOneJoke function, the format string must be prefixed with u; otherwise the following error is reported:

Traceback (most recent call last):
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 84, in <module>
    spider.start()
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 81, in start
    self.getOneJoke(pageJokes, page)
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 68, in getOneJoke
    print 'Joke content: %s\nPage %d\tPublisher: %s\tLikes: %s' % (joke[1], page, joke[0], joke[2])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)

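This happens because getPage decodes the page with decode('utf-8'), so joke[0] and the other fields are unicode objects. When a plain byte string is formatted with unicode arguments, Python 2 first converts the byte string to unicode using its default ASCII codec, and that conversion fails on the non-ASCII (Chinese) characters in the original format string. Prefixing the format string with u makes it a unicode object as well, so no implicit conversion is needed.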
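The implicit conversion described above can be reproduced explicitly. This is a small sketch that runs under Python 2 or 3; the Chinese format string and the author name are illustrative stand-ins (assumptions, since the article's original string is not shown), and implicit_ascii_decode is a hypothetical helper that mimics what Python 2 attempts behind the scenes.

```python
# Illustrative stand-in for a format string containing Chinese text,
# stored as UTF-8 bytes the way a Python 2 str literal would be.
fmt_bytes = u'\u6bb5\u5b50: %s'.encode('utf-8')
# Text decoded from the page, like joke[0] in the listing.
author = u'\u5f20\u4e09'


def implicit_ascii_decode(raw):
    """Mimic the conversion Python 2 attempts before mixing str and unicode."""
    return raw.decode('ascii')


try:
    implicit_ascii_decode(fmt_bytes)
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print('implicit conversion fails: %s' % err)

# Starting from text instead (what the u prefix does in Python 2)
# avoids the ASCII conversion, and formatting succeeds.
line = fmt_bytes.decode('utf-8') % author
```

An alternative in Python 2 would be to keep byte strings throughout and encode each field before formatting, but using unicode for all text, as the article does, is usually simpler.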

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
