Capture novels using Python

Source: Internet
Author: User

Preface
This script was written on a Mac to capture novels from the web. It takes only a short piece of Python code.


Code

    # coding=utf-8
    # Python 2 script: it relies on urllib2 and the print statement.
    import re
    import urllib2
    import chardet
    import sys
    from bs4 import BeautifulSoup
    import codecs


    class Spider():
        def __init__(self):
            # NOTE: the HTML tag portions of both patterns are missing from
            # this listing. aTag needs two groups: group(1) is the chapter
            # URL and group(2) is the chapter title (see run below).
            self.aTag = re.compile("]*?>(.+?)")
            self.contentTag = re.compile("(.+?)", re.I | re.S)

        def getHtml(self, url):
            # Send a browser-like User-Agent and return the raw page bytes.
            headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
            req = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(req)
            html = response.read()
            return html
            # Alternatives left in the original listing (never reached):
            # soup = BeautifulSoup(html.decode("GB18030", "ignore"))
            # return soup.findAll("a")
            # return soup.prettify()
            # typeEncode = sys.getfilesystemencoding()
            # infoencode = chardet.detect(html).get('encoding', 'utf-8')
            # return html.decode('gb18030', 'ignore').encode("UTF-8")
            # return html.decode('gb18030', 'ignore').encode(sys.getfilesystemencoding())

        def run(self):
            bookurl = "http://www.44pq.com/read/13567.html"
            bookname = "the only magician on Earth"
            text = []
            matchs = self.aTag.finditer(self.getHtml(bookurl))
            alist = list(matchs)
            total = len(alist)
            print "total {0}".format(total)
            i = 0
            for m in alist:
                i += 1
                text.append(m.group(2).decode("gb18030"))    # chapter title
                text.append(self.getContent(m.group(1)))     # chapter body
                # Write after every chapter, then empty the buffer.
                self.writeFile(bookname, "\n".join(text))
                del text[:]
                print "{0}/{1}".format(i, total)
            self.writeFile(bookname, "\n".join(text))
            print "done!"

        def writeFile(self, filename, text):
            # Append so that earlier chapters are kept.
            f = open(filename + ".txt", "a")
            f.write(text)
            f.close()

        def getContent(self, url):
            c = self.getHtml(url)
            c = self.contentTag.search(c).group(1)
            c = re.sub("<[^>]+?>", "", c)    # strip remaining HTML tags
            c = c.replace("nbsp;", "").replace("&", "")
            return c.decode("gb18030")


    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        spider = Spider()
        spider.run()


The CSDN editor does not reliably preserve code formatting. In the code above, note these two lines:

    self.writeFile(bookname, "\n".join(text))
    del text[:]

Both lines belong to the body of the for loop. They must be indented under the for statement, not aligned with the for keyword.



The unused imports above can be deleted. The example novel is "The only magician on Earth". aTag is a regular expression that matches every chapter link on the novel's directory page, and contentTag is a regular expression that matches the body text on a chapter page.
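The two patterns can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the original patterns did not survive in the listing, so A_TAG and CONTENT_TAG are hypothetical stand-ins, the sample HTML is made up, and the sketch targets Python 3 rather than the Python 2 of the script. Turning br tags into newlines is also an addition that the original does not make.

```python
import re

# Hypothetical patterns: illustrative stand-ins, not the author's
# regular expressions, which are missing from the listing.
A_TAG = re.compile(r'<a[^>]*?href="([^"]+)"[^>]*?>(.+?)</a>', re.I)
CONTENT_TAG = re.compile(r'<div id="content">(.+?)</div>', re.I | re.S)

# Made-up pages standing in for the directory page and one chapter page.
DIRECTORY_HTML = '''
<ul>
<li><a href="/read/13567/1.html">Chapter 1</a></li>
<li><a href="/read/13567/2.html">Chapter 2</a></li>
</ul>
'''
CHAPTER_HTML = '<div id="content">&nbsp;&nbsp;First line.<br/>Second line.</div>'


def list_chapters(html):
    """Return (url, title) pairs, mirroring m.group(1) and m.group(2)."""
    return [(m.group(1), m.group(2)) for m in A_TAG.finditer(html)]


def extract_body(html):
    """Return the chapter text with tags and &nbsp; entities removed."""
    body = CONTENT_TAG.search(html).group(1)
    body = re.sub(r"<br\s*/?>", "\n", body)   # keep line breaks (an addition)
    body = re.sub(r"<[^>]+?>", "", body)      # drop any remaining tags
    return body.replace("&nbsp;", "")


chapters = list_chapters(DIRECTORY_HTML)
body = extract_body(CHAPTER_HTML)
print(chapters)
print(body)
```

For a real site, both patterns would have to be adapted to the markup of its directory and chapter pages.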

Note that this code writes to the file once for every chapter it captures, which keeps memory usage low:

    self.writeFile(bookname, "\n".join(text))
    del text[:]


If you prefer, you can capture N chapters and then write them to the file in one go. This only requires a simple conditional check. How to balance memory usage against the number of file writes is up to you.
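One way to write every N chapters is sketched below. It is only a sketch under stated assumptions: save_in_batches, write_file and batch_size are hypothetical names that do not appear in the original script, the chapter texts are made up, and the code targets Python 3. Unlike the original, each write ends with a newline so that consecutive batches do not run together.

```python
import os
import tempfile


def write_file(path, text):
    """Append text to path, similar to writeFile in the script."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


def save_in_batches(chapters, path, batch_size=3):
    """Append chapters to path, writing once per batch_size chapters.

    Returns the number of write calls made.
    """
    buffer = []
    writes = 0
    for count, chapter in enumerate(chapters, start=1):
        buffer.append(chapter)
        if count % batch_size == 0:      # the simple conditional check
            write_file(path, "\n".join(buffer) + "\n")
            del buffer[:]
            writes += 1
    if buffer:                           # write the final partial batch
        write_file(path, "\n".join(buffer) + "\n")
        writes += 1
    return writes


# Made-up chapter texts; a real run would fetch them one at a time.
demo_chapters = ["chapter {0}".format(n) for n in range(1, 8)]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
demo_writes = save_in_batches(demo_chapters, demo_path, batch_size=3)
print(demo_writes)
```

A larger batch_size means fewer writes but more text held in memory at once. Setting it to 1 reproduces the write-per-chapter behavior described above.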




