Crawling Quanshuwang novels with Python: read novels for free

Source: Internet
Author: User

What is a web crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls World Wide Web information according to certain rules. Other, less frequently used names include ant, auto-indexer, emulator, and worm.

Environment: Python3.6+windows

Development tools: use whichever you like; whatever makes you happy works!

Module:

import urllib.request
import re

Main ideas:

    • 1. Get the home page source code

    • 2. Get the chapter hyperlinks

    • 3. Get each chapter page's source code

    • 4. Extract the novel content

    • 5. Download: file operations
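Step 2 above hinges on a regular expression with lazy (.*?) groups. The sketch below runs it against a small made-up HTML fragment shaped like the site's chapter list (the URLs and titles are hypothetical stand-ins for the real page source):

```python
import re

# Made-up fragment in the shape of the chapter list the crawler targets.
html = '''
<li><a href="http://www.quanshuwang.com/book/0/269/78850.html" title="Chapter I">Chapter I</a></li>
<li><a href="http://www.quanshuwang.com/book/0/269/78854.html" title="Chapter II">Chapter II</a></li>
'''

# Lazy (.*?) groups capture the href and the link text separately,
# without spilling across list items the way a greedy .* would.
reg = re.compile(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>')
urls = reg.findall(html)
print(urls)
# each entry is a (chapter_url, chapter_title) tuple
```

Because the pattern has two capture groups, findall returns a list of 2-tuples, which is exactly the shape the main loop unpacks.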

The Python code, explained:
import urllib.request
import re

# 1. Get the home page source code
# 2. Get the chapter hyperlinks
# 3. Get each chapter page's source code
# 4. Extract the novel content
# 5. Download: file operations

# camelCase ("hump") naming
# get the content of the novel
def getNovelContent():
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/0/269").read()
    html = html.decode("gbk")
    # the parentheses mark capture groups; they are not matched literally
    # in regular expressions, the lazy .*? matches as little as possible
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    # compiling the pattern once increases efficiency
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    # print(urls)
    # a list of tuples, e.g.:
    # [(http://www.quanshuwang.com/book/0/269/78850.html, Chapter I, Hillside Village),
    #  (http://www.quanshuwang.com/book/0/269/78854.html, Chapter II, Green Cattle Town)]
    for url in urls:
        # the URL address of the chapter
        novel_url = url[0]
        # the chapter title
        novel_title = url[1]

        chapt = urllib.request.urlopen(novel_url).read()
        chapt_html = chapt.decode("gbk")
        # r"..." is a raw string, so "\\d" can be written r"\d"
        reg = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">'
        # re.S lets . match newlines, for multi-line matching
        reg = re.compile(reg, re.S)
        chapt_content = re.findall(reg, chapt_html)
        # print(chapt_content)
        # a one-element list like
        # ["&nbsp;&nbsp;&nbsp;&nbsp;Erlengzi opened his eyes wide, staring straight at the thatch and mud<br/>"]

        # replace(old, new): the first argument is the string to replace, the second the replacement
        chapt_content = chapt_content[0].replace("&nbsp;&nbsp;&nbsp;&nbsp;", "")
        # print(chapt_content)  # Erlengzi opened his eyes wide, staring straight at the thatch and mud<br/>
        chapt_content = chapt_content.replace("<br/>", "")

        print("Saving %s" % novel_title)
        # 'w' is text write mode; 'wb' would be binary
        # f = open("{}.txt".format(novel_title), 'w')
        # f.write(chapt_content)
        with open("{}.txt".format(novel_title), 'w') as f:
            f.write(chapt_content)
        # the with block closes the file, so f.close() is not needed

getNovelContent()
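The re.S flag the code passes to re.compile matters because chapter text spans several lines, and by default . does not match a newline. A self-contained illustration, using a made-up page fragment in place of a real chapter page:

```python
import re

# Made-up fragment shaped like a chapter page: the body text sits
# between a closing and an opening <script> tag and spans two lines.
page = ('</script>&nbsp;&nbsp;&nbsp;&nbsp;Line one.<br/>\n'
        'Line two.<script type="text/javascript">')

pattern = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">'

# Without re.S the lazy group cannot cross the newline, so nothing matches.
without_s = re.findall(pattern, page)
# With re.S, . also matches newlines and the whole body is captured.
with_s = re.findall(pattern, page, re.S)

print(without_s)  # []
print(with_s)     # ['Line one.<br/>\nLine two.']
```

Without the flag the crawler would silently skip any chapter whose text wraps onto more than one line, which is essentially all of them.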

Operation result: one .txt file is saved per chapter, named after the chapter title.
