```python
import urllib.request
import re

# 1. Get the home page source code
# 2. Get the chapter hyperlinks
# 3. Get the chapter content
# 4. Download the novel

# camelCase naming; comments explain how to get the content of a novel
def getNovelContent():
    # Get the source code; urlopen returns an HTTP response object
    html = urllib.request.urlopen('http://www.quanshuwang.com/book/0/269/')
    html = html.read()
    # print(html)
    # Set the encoding
    html = html.decode('GBK')
    # Get the hyperlinks, which look like:
    # <li><a href="http://www.quanshuwang.com/book/0/269/78850.html" title="Chapter 1 ..., 2741 characters">Chapter 1 ...</a></li>
    # In the regex, .*? matches lazily; the parenthesized (.*?) groups capture the parts we want
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    # Compiling the pattern is optional, but it improves efficiency when reused
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    # print(urls)
    for i in urls:
        # print(i[0])
        novel_url = i[0]
        novel_title = i[1]
        chapt = urllib.request.urlopen(novel_url).read()
        chapt_html = chapt.decode('GBK')
        reg = r'</script>(.*?)<script type="text/javascript">'
        # re.S enables multi-line matching (. also matches newlines)
        reg = re.compile(reg, re.S)
        chapt_content = re.findall(reg, chapt_html)
        # print(chapt_content[0])
        # Strip the useless markup; note findall returns a list, so index it first.
        # <br/> is a line break, &nbsp; is a space
        chapt_content = chapt_content[0].replace('<br/>', '')
        # print(type(chapt_content))
        chapt_content = chapt_content.replace('&nbsp;', '')
        # Now a string rather than a list, so no more indexing
        # print(chapt_content)
        # Download; you can print a progress hint
        print("Saving %s" % novel_title)
        # 'w' is text write mode; 'wb' is binary mode, generally used for images and video.
        # Without an explicit path the file is saved next to the .py script;
        # you could also save as .doc or another format.
        f = open('{}.txt'.format(novel_title), 'w')
        f.write(chapt_content)
        f.close()

getNovelContent()
```
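To see how the lazy `.*?` wildcard and the capture groups pull out the URL and the chapter title, here is a small self-contained sketch run against a hypothetical snippet of the chapter-list HTML (the real page layout may differ):

```python
import re

# Hypothetical sample of one chapter-list entry; illustrative only
sample = ('<li><a href="http://www.quanshuwang.com/book/0/269/78850.html" '
          'title="Chapter 1, 2741 characters">Chapter 1</a></li>')

# .*? matches as few characters as possible; each (.*?) group captures
# one field: group 1 the chapter URL, group 2 the visible link text
reg = re.compile(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>')
matches = re.findall(reg, sample)
print(matches)
# With two groups, findall returns a list of (url, title) tuples
```

Because the pattern contains two groups, `re.findall` returns tuples, which is why the crawler indexes `i[0]` and `i[1]` in its loop.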
The same code in minimalist form, without comments:
```python
import urllib.request
import re

def getNovelContent():
    html = urllib.request.urlopen('http://www.quanshuwang.com/book/0/269/')
    html = html.read()
    html = html.decode('GBK')
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    for i in urls:
        novel_url = i[0]
        novel_title = i[1]
        chapt = urllib.request.urlopen(novel_url).read()
        chapt_html = chapt.decode('GBK')
        reg = r'</script>(.*?)<script type="text/javascript">'
        reg = re.compile(reg, re.S)
        chapt_content = re.findall(reg, chapt_html)
        chapt_content = chapt_content[0].replace('<br/>', '')
        chapt_content = chapt_content.replace('&nbsp;', '')
        print("Saving %s" % novel_title)
        f = open('{}.txt'.format(novel_title), 'w')
        f.write(chapt_content)
        f.close()

getNovelContent()
```
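The `re.S` flag the chapter-extraction step relies on is easy to demonstrate in isolation: without it, `.` does not match newlines, so a pattern spanning multiple lines finds nothing. A minimal sketch (the string is a stand-in for a real chapter page):

```python
import re

# Stand-in for a chapter page: the text sits between </script> and a
# <script> tag and spans more than one line
page = '</script>line one\nline two<script type="text/javascript">'
pattern = r'</script>(.*?)<script type="text/javascript">'

without_flag = re.findall(pattern, page)        # '.' stops at the newline
with_flag = re.findall(pattern, page, re.S)     # '.' also matches '\n'
print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```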
"Crawlers" use Urllib.request to crawl novels.