Preface: crawling a novel with Python
This script was written on a Mac to crawl a novel, using a short piece of Python.
Code
# coding=utf-8
# Python 2 script (uses urllib2 and the print statement).
import re
import urllib2
import chardet
import sys
from bs4 import BeautifulSoup
import codecs


class Spider():
    def __init__(self):
        # Matches each chapter link on the table-of-contents page:
        # group 1 is the chapter URL, group 2 is the chapter title.
        self.atag = re.compile("<a href=\"(http://www.44pq.com/read/[0-9]+?_[0-9]+?.html)\"[^>]*?>(.+?)</a>")
        # Matches the body text of a chapter page.
        self.contenttag = re.compile("<div class=\"readercontent\" id=\"content\">(.+?)</div>", re.I | re.S)

    def gethtml(self, url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        # Return the raw bytes; callers decode them as GB18030.
        return html
        # Everything below is unreachable and kept only for reference.
        # soup = BeautifulSoup(html.decode("GB18030", "ignore"))
        # return soup.findAll("a")
        # return soup.prettify()
        # typeEncode = sys.getfilesystemencoding()
        # infoencode = chardet.detect(html).get('encoding', 'utf-8')
        # return html.decode('GB18030', 'ignore').encode("utf-8")
        # return html.decode('GB18030', 'ignore').encode(sys.getfilesystemencoding())

    def run(self):
        bookurl = "http://www.44pq.com/read/13567.html"
        bookname = "The only magician on Earth"
        text = []
        matchs = self.atag.finditer(self.gethtml(bookurl))
        alist = list(matchs)
        total = len(alist)
        print "total {0}".format(total)
        i = 0
        for m in alist:
            i += 1
            text.append(m.group(2).decode("GB18030"))
            text.append(self.getcontent(m.group(1)))
            # Write this chapter out and empty the buffer (both inside the loop).
            self.writefile(bookname, "\n".join(text))
            del text[:]
            print "{0}/{1}".format(i, total)
        self.writefile(bookname, "\n".join(text))
        print "done!"

    def writefile(self, filename, text):
        # Append mode, so each chapter is added to the end of the file.
        f = open(filename + ".txt", "a")
        f.write(text)
        f.close()

    def getcontent(self, url):
        c = self.gethtml(url)
        c = self.contenttag.search(c).group(1)
        # Strip any remaining HTML tags and leftover entity fragments.
        c = re.sub("<[^>]+?>", "", c)
        c = c.replace("nbsp;", "").replace("&", "")
        return c.decode("GB18030")


if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    spider = Spider()
    spider.run()
A note on formatting: the CSDN editor can mangle indentation. In the code above, these two lines:
    self.writefile(bookname, "\n".join(text))
    del text[:]
belong inside the for loop. They must be indented under it, not aligned with the for keyword.
The unused imports above can be removed. The example uses the novel "The Only Magician on Earth". atag is a regular expression that matches every chapter link on the novel's table-of-contents page, and contenttag is a regular expression that matches the body text of a chapter.
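To see what the two patterns capture, here is a minimal sketch that runs them against a small hand-written HTML fragment. The fragments, the chapter titles, and the names sample_index and sample_chapter are invented for illustration; the real pages on the site may be laid out differently. Unlike the script itself, the sketch is written so that it also runs on Python 3.

```python
import re

# The two patterns from the Spider class above.
atag = re.compile(
    "<a href=\"(http://www.44pq.com/read/[0-9]+?_[0-9]+?.html)\"[^>]*?>(.+?)</a>")
contenttag = re.compile(
    "<div class=\"readercontent\" id=\"content\">(.+?)</div>", re.I | re.S)

# Hypothetical table-of-contents fragment (not copied from the real site).
sample_index = (
    '<a href="http://www.44pq.com/read/13_567.html" title="c1">Chapter 1</a>'
    '<a href="http://www.44pq.com/read/13_568.html">Chapter 2</a>'
    '<a href="http://www.44pq.com/about.html">About</a>'
)

# Hypothetical chapter page fragment.
sample_chapter = (
    '<div class="readerContent" id="content">Line one<br/>\nLine two</div>'
)

# Group 1 is the chapter URL, group 2 is the chapter title.
chapters = [(m.group(1), m.group(2)) for m in atag.finditer(sample_index)]

# Extract the body, then strip leftover tags the same way getcontent does.
body = contenttag.search(sample_chapter).group(1)
body = re.sub("<[^>]+?>", "", body)

print(chapters)
print(body)
```

Only the two links under /read/ with a number_number.html name are captured; the "About" link is skipped. The re.I flag is what lets contenttag match readerContent regardless of case, and re.S lets the body span several lines.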
Note that this code fetches one chapter at a time and writes it to the file immediately, which avoids holding the whole novel in memory:
    self.writefile(bookname, "\n".join(text))
    del text[:]
If you prefer, you can fetch N chapters and write them to the file in one go; it only takes a simple conditional check. How much memory to use versus how many file writes to make is a trade-off that everyone will weigh differently.
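The batching idea can be sketched roughly as follows. This is only an illustration under assumptions: write_in_batches, batch_size, and the in-memory list of (title, body) pairs are hypothetical names standing in for the fetch loop in run, and the example writes to a temporary file rather than the novel's real output file. It is written so that it also runs on Python 3.

```python
import io
import os
import tempfile


def write_in_batches(chapters, path, batch_size=10):
    """Append chapters to path, writing once every batch_size chapters.

    Returns the number of write operations performed.
    """
    text = []
    writes = 0
    for i, (title, body) in enumerate(chapters, 1):
        text.append(title)
        text.append(body)
        # The simple conditional: only write once N chapters are buffered.
        if i % batch_size == 0:
            with io.open(path, "a", encoding="utf-8") as f:
                # Trailing newline keeps consecutive batches on separate lines.
                f.write(u"\n".join(text) + u"\n")
            writes += 1
            del text[:]
    # Write whatever is left over after the last full batch.
    if text:
        with io.open(path, "a", encoding="utf-8") as f:
            f.write(u"\n".join(text) + u"\n")
        writes += 1
    return writes


# Hypothetical data: 25 short chapters.
demo = [(u"Chapter {0}".format(n), u"body {0}".format(n)) for n in range(1, 26)]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
demo_writes = write_in_batches(demo, demo_path, batch_size=10)
print(demo_writes)
```

A larger batch_size means fewer file writes but more text held in memory at once; batch_size=1 behaves like the chapter-by-chapter approach used in the script.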
Copyright notice: this is an original article by the blog's author and may not be reproduced without permission.