What is a web crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically retrieves information from the World Wide Web according to a set of rules. Other, less common names include ant, automatic indexer, emulator, and worm.
Environment: Python 3.6 + Windows
Development tools: use whichever editor or IDE you prefer.
Modules:
import urllib.request
import re
Main ideas:
1 Get the source code of the novel's home page
2 Get the chapter hyperlinks
3 Get the source code of each chapter page
4 Get the content of the novel
5 Download: write each chapter to a file
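Steps 1 and 2 can be tried offline before touching the network. The following is a minimal sketch, not the author's code: the sample HTML, the example.com URLs, and the function name extract_chapter_links are all hypothetical stand-ins, and the real page layout may differ. It shows that urlopen(...).read() yields bytes that must be decoded, and that re.findall with two capture groups returns a list of (url, title) tuples.

```python
import re

# Hypothetical sample standing in for the bytes that
# urllib.request.urlopen(...).read() would return for the index page.
SAMPLE_INDEX = (
    '<ul>'
    '<li><a href="http://example.com/book/1.html" '
    'title="Chapter 1, 1200 words">Chapter 1</a></li>'
    '<li><a href="http://example.com/book/2.html" '
    'title="Chapter 2, 1500 words">Chapter 2</a></li>'
    '</ul>'
).encode("gbk")

# Two capture groups: the link target and the visible link text.
# The title attribute is matched but not captured.
LINK_RE = re.compile(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>')


def extract_chapter_links(raw_bytes, encoding="gbk"):
    """Decode the page bytes and return a list of (url, title) tuples."""
    html = raw_bytes.decode(encoding)
    return LINK_RE.findall(html)


links = extract_chapter_links(SAMPLE_INDEX)
for url, title in links:
    print(url, title)
```

Because the pattern has two groups, each element of the result is a tuple, which is why the full script later reads url[0] for the address and url[1] for the title.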
The Python code, with explanatory comments:
import urllib.request
import re

# 1 Get the home page source code
# 2 Get the chapter hyperlinks
# 3 Get the source code of each chapter page
# 4 Get the content of the novel
# 5 Download: file operations


# camelCase naming
# Get the content of a novel
def getNovertContent():
    # urlopen() returns a response object; read() gives the page as bytes
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/0/269").read()
    # The site is served in GBK, so decode the bytes to a str
    html = html.decode("gbk")
    # Parentheses mark capture groups: findall returns only the captured parts
    # .*? matches any characters, as few as possible
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    # Compile the pattern once for efficiency
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    # print(urls)
    # A list of tuples:
    # [(http://www.quanshuwang.com/book/0/269/78850.html, Chapter 1 ...),
    #  (http://www.quanshuwang.com/book/0/269/78854.html, Chapter 2 ...)]
    for url in urls:
        # The URL of the chapter
        novel_url = url[0]
        # The chapter title
        novel_title = url[1]

        chapt = urllib.request.urlopen(novel_url).read()
        chapt_html = chapt.decode("gbk")
        # The r prefix makes a raw string: r"\d" is the same as "\\d"
        reg = r'</script>(.*?)<script type="text/javascript">'
        # re.S lets . match newlines, so the match can span several lines
        reg = re.compile(reg, re.S)
        chapt_content = re.findall(reg, chapt_html)
        # print(chapt_content)
        # A list: ["    Erlengzi lay with his eyes wide open ...<br/>"]
        if not chapt_content:
            # Skip pages where the pattern finds nothing
            continue

        # replace(old, new): first argument is the text to replace,
        # second is its replacement. This strips the paragraph indentation
        # (shown as spaces in this listing; on the live page it may be
        # &nbsp; entities, in which case replace those instead).
        chapt_content = chapt_content[0].replace(" ", "")
        # print(chapt_content)  now a str: Erlengzi lay with his eyes wide open ...<br/>
        chapt_content = chapt_content.replace("<br/>", "")

        print("Saving %s" % novel_title)
        # 'w' is text write mode; 'wb' would write bytes
        # f = open("{}.txt".format(novel_title), 'w')
        # f.write(chapt_content)

        # The with block closes the file automatically, so no f.close() is needed
        with open("{}.txt".format(novel_title), 'w', encoding="utf-8") as f:
            f.write(chapt_content)


getNovertContent()
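Steps 4 and 5 can also be checked without the network. This is a rough sketch under stated assumptions rather than the author's method: the chapter fragment, the style5/style6 script bodies, and the helper names extract_chapter_text and safe_filename are hypothetical, and the sketch assumes the indentation arrives as &nbsp; entities. It illustrates why re.S is needed when the body spans several lines, and one way to keep a chapter title from producing an invalid Windows file name.

```python
import re

# Hypothetical chapter page fragment; the real page layout may differ.
SAMPLE_CHAPTER = (
    '<script type="text/javascript">style5();</script>'
    '&nbsp;&nbsp;&nbsp;&nbsp;First line of the chapter.<br/>\n'
    '&nbsp;&nbsp;&nbsp;&nbsp;Second line of the chapter.<br/>\n'
    '<script type="text/javascript">style6();</script>'
)

# re.S (DOTALL) lets "." match newlines, so the lazy group can
# span the whole multi-line chapter body.
BODY_RE = re.compile(r'</script>(.*?)<script type="text/javascript">', re.S)


def extract_chapter_text(chapter_html):
    """Return the cleaned chapter text, or None if the pattern finds nothing."""
    match = BODY_RE.search(chapter_html)
    if match is None:
        return None
    text = match.group(1)
    # Assumption: indentation is made of &nbsp; entities and
    # line breaks are <br/> tags.
    text = text.replace("&nbsp;", "").replace("<br/>", "")
    return text.strip()


def safe_filename(title):
    """Replace characters that Windows forbids in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title) + ".txt"


text = extract_chapter_text(SAMPLE_CHAPTER)
print(text)
print(safe_filename("Chapter 1: What?"))
```

Returning None instead of indexing the result directly avoids an IndexError on pages where the pattern does not match, and the same pattern compiled without re.S finds nothing in this sample because the body contains newlines.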
Result:
Python crawls the novels on Quanshuwang (quanshuwang.com), so you can read them for free.