I am continuing to work through the expert's blog series; the original post is here: http://cuiqingcai.com/990.html
Key takeaways:
1. str.strip() removes leading and trailing whitespace characters from a string.
2. Use response.read().decode('utf-8', 'ignore'). The 'ignore' argument skips illegal bytes; without it the call kept raising decoding errors.
3. In Python 3.x, raw_input has been replaced by input.
4. It is best to write the code in an editor such as Notepad++, which displays it clearly and makes mistakes easy to find, especially indentation errors and stray Chinese punctuation.
5. .*? is a common combination; the trailing ? makes the match non-greedy.
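Points 1, 2 and 5 can be checked in a few lines. This is only a minimal sketch: the sample strings and the stray b"\xff" byte are made up for illustration and do not come from the real site.

```python
import re

# Point 1: strip() removes leading and trailing whitespace only.
raw_author = "\n  some_user \t"
author = raw_author.strip()

# Point 2: the trailing b"\xff" is deliberately invalid UTF-8 (made-up data).
raw_bytes = "joke text".encode("utf-8") + b"\xff"

# A strict decode raises UnicodeDecodeError on the invalid byte...
try:
    raw_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...while 'ignore' silently drops it.
decoded = raw_bytes.decode("utf-8", "ignore")

# Point 5: greedy .* runs to the last </b>; non-greedy .*? stops at the first.
html = "<b>first</b><b>second</b>"
greedy = re.findall("<b>(.*)</b>", html)
lazy = re.findall("<b>(.*?)</b>", html)

print(repr(author))
print(strict_failed, repr(decoded))
print(greedy)
print(lazy)
```

Note that 'ignore' discards data without warning, so it suits a scraper like this one better than code where every byte matters.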
The Qiushibaike (Embarrassing Encyclopedia) crawler, implemented in Python 3.4.3, is as follows (that is, the expert's code copied over, with the 2.x parts changed):
import urllib.request
import urllib.error
import urllib.parse
import re
import time


# Qiushibaike crawler class
class QSBK:
    # Initialization method: define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36'
        self.headers = {'User-Agent': self.user_agent}
        # Holds the jokes; each element is the list of jokes from one page
        self.stories = []
        # Flag that keeps the program running
        self.enable = False

    # Take a page index and return that page's HTML
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            request = urllib.request.Request(url, headers=self.headers)
            response = urllib.request.urlopen(request)
            # 'ignore' skips illegal bytes; without it decoding usually fails
            pageCode = response.read().decode('utf-8', 'ignore')
            return pageCode
        except urllib.error.URLError as e:
            if hasattr(e, "reason"):
                print("Failed to connect to Qiushibaike, reason:", e.reason)
            return None

    # Take a page index and return that page's jokes, excluding those with pictures
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print("Page load failed....")
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?(.*?)</a>.*?<div.*?' +
                             'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Stores the jokes of one page
        pageStories = []
        for item in items:
            haveImg = re.search("img", item[3])
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR, "\n", item[1])
                # strip() removes surrounding whitespace
                pageStories.append([item[0].strip(), text.strip(), item[4].strip()])
        return pageStories

    # Load a page, extract its contents and add them to the list
    def loadPage(self):
        # Load a new page if fewer than 2 unread pages remain
        if self.enable:
            if len(self.stories) < 2:
                # Get a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store the page's jokes in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    # Increase the page index so the next page is read next time
                    self.pageIndex += 1

    # Print one joke each time Enter is pressed
    def getOneStory(self, pageStories, page):
        # Walk through the jokes of one page
        for story in pageStories:
            # Wait for user input
            input_v = input()
            # On every Enter, check whether a new page needs to be loaded
            self.loadPage()
            # If Q is entered, the program ends
            if input_v == "Q":
                self.enable = False
                return
            print("Page %d\tAuthor: %s\tLikes: %s\n%s" % (page, story[0], story[2], story[1]))

    # Start method
    def start(self):
        print("Reading Qiushibaike: press Enter to view a new joke, Q to quit")
        # Set the flag to True so the program can run
        self.enable = True
        # Load one page of content first
        self.loadPage()
        # Local variable tracking which page is currently being read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of jokes from the global list
                pageStories = self.stories[0]
                # Increase the number of the page currently being read
                nowPage += 1
                # Delete the element that has just been taken out
                del self.stories[0]
                # Output the jokes on that page
                self.getOneStory(pageStories, nowPage)


spider = QSBK()
spider.start()
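The extraction step in getPageItems can be tried offline, without touching the network. The sketch below is not the code above. SAMPLE_HTML is invented markup that only loosely imitates what the pattern expects, PATTERN is a simplified stand-in for the real expression, and extract_stories is a hypothetical helper name.

```python
import re

# Invented sample markup (an assumption, not copied from the real site).
# The second entry carries an <img>, so it should be filtered out.
SAMPLE_HTML = '''
<div class="author">
<a href="/users/1">alice</a>
</div>
<div class="content">
first line<br/>second line
<!--1441234567-->
</div>
<div class="stats"><i class="number">42</i></div>
<div class="author">
<a href="/users/2">bob</a>
</div>
<div class="content">
picture joke
<!--1441234999-->
</div>
<div class="thumb"><img src="pic.jpg"/></div>
<div class="stats"><i class="number">7</i></div>
'''

# Simplified stand-in pattern with the same five groups:
# author, content, comment, markup after the content div, vote count.
# re.S lets "." match newlines, so one match can span several lines.
PATTERN = re.compile(
    r'<div class="author">.*?<a.*?>(.*?)</a>.*?'
    r'<div class="content">(.*?)<!--(.*?)-->.*?</div>(.*?)'
    r'<div class="stats.*?class="number">(.*?)</i>',
    re.S,
)


def extract_stories(html):
    """Return [author, text, votes] for every entry without a picture."""
    stories = []
    for author, content, _comment, extra, votes in PATTERN.findall(html):
        # Skip entries whose trailing markup contains an image.
        if re.search("img", extra):
            continue
        text = content.replace("<br/>", "\n")
        stories.append([author.strip(), text.strip(), votes.strip()])
    return stories


stories = extract_stories(SAMPLE_HTML)
print(stories)
```

Testing the pattern against a saved snippet like this makes it easier to see which group captures what before running the whole crawler against the live page, whose markup may have changed since the code was written.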
"Python" transcription of the Great God's Embarrassing encyclopedia code