Crawling a Novel with Python

Source: Internet
Author: User

Preface
This script was written on a Mac to crawl a novel from the web. It targets Python 2, as its use of urllib2 and the print statement shows.



Code

# coding=utf-8
import re
import urllib2
import chardet
import sys
from bs4 import BeautifulSoup
import codecs


class Spider():
    def __init__(self):
        # Matches each chapter link (URL, title) on the table-of-contents page.
        self.atag = re.compile(r'<a href="(http://www\.44pq\.com/read/[0-9]+?_[0-9]+?\.html)"[^>]*?>(.+?)</a>')
        # Matches the body text of a chapter page.
        self.contenttag = re.compile(r'<div class="readercontent" id="content">(.+?)</div>', re.I | re.S)

    def gethtml(self, url):
        # Send a browser-like User-Agent and return the raw (GB18030-encoded) bytes.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        return html
        # Alternatives that were tried and left disabled:
        # soup = BeautifulSoup(html.decode("GB18030", "ignore"))
        # return soup.findAll("a")
        # return soup.prettify()
        # typeEncode = sys.getfilesystemencoding()
        # infoencode = chardet.detect(html).get('encoding', 'utf-8')
        # return html.decode('GB18030', 'ignore').encode("utf-8")
        # return html.decode('GB18030', 'ignore').encode(sys.getfilesystemencoding())

    def run(self):
        bookurl = "http://www.44pq.com/read/13567.html"
        bookname = "The only magician on Earth"
        text = []
        matchs = self.atag.finditer(self.gethtml(bookurl))
        alist = list(matchs)
        total = len(alist)
        print "total: {0}".format(total)
        i = 0
        for m in alist:
            i += 1
            text.append(m.group(2).decode("GB18030"))   # chapter title
            text.append(self.getcontent(m.group(1)))    # chapter body
            # Write each chapter as soon as it is fetched, then empty the buffer.
            self.writefile(bookname, "\n".join(text))
            del text[:]
            print "{0}/{1}".format(i, total)
        self.writefile(bookname, "\n".join(text))
        print "done!"

    def writefile(self, filename, text):
        # Append mode, so every call adds to the same file.
        f = open(filename + ".txt", "a")
        f.write(text)
        f.close()

    def getcontent(self, url):
        c = self.gethtml(url)
        c = self.contenttag.search(c).group(1)
        c = re.sub("<[^>]+?>", "", c)                     # strip inner HTML tags
        c = c.replace("nbsp;", "").replace("&", "")      # drop &nbsp; entities
        return c.decode("GB18030")


if __name__ == '__main__':
    # Python 2 workaround so unicode text can be written without explicit encoding.
    reload(sys)
    sys.setdefaultencoding('utf-8')
    spider = Spider()
    spider.run()


A note on formatting: because of a problem in the CSDN editor, the indentation of the code above may not display correctly. These two lines:

self.writefile(bookname, "\n".join(text))
del text[:]

belong inside the for loop. They must be indented under it, not aligned with the for keyword.



The unnecessary imports above can be removed. The example uses the novel "The Only Magician on Earth". atag is a regular expression that matches every chapter link in the novel's table of contents, and contenttag is a regular expression that matches the body text of a chapter.
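As a minimal sketch of how these two patterns behave, the following Python 3 snippet applies them to a small hand-written HTML sample. The sample markup, the chapter titles, and the names ATAG, CONTENTTAG, list_chapters, and extract_body are assumptions made for illustration; the real pages on the site may be structured differently.

```python
import re

# Chapter-link pattern modeled on the article's atag (assumed form).
ATAG = re.compile(
    r'<a href="(http://www\.44pq\.com/read/[0-9]+?_[0-9]+?\.html)"[^>]*?>(.+?)</a>'
)
# Chapter-body pattern modeled on the article's contenttag (assumed form).
CONTENTTAG = re.compile(
    r'<div class="readercontent" id="content">(.+?)</div>', re.I | re.S
)


def list_chapters(index_html):
    """Return (url, title) pairs for every chapter link found."""
    return [(m.group(1), m.group(2)) for m in ATAG.finditer(index_html)]


def extract_body(chapter_html):
    """Return the chapter text without inner tags or &nbsp; entities, or None."""
    m = CONTENTTAG.search(chapter_html)
    if m is None:
        return None
    body = re.sub(r"<[^>]+?>", "", m.group(1))
    return body.replace("&nbsp;", "")


# Hypothetical sample markup, not copied from the real site.
SAMPLE_INDEX = (
    '<a href="http://www.44pq.com/read/13567_1.html" title="x">Chapter 1</a>'
    '<a href="http://www.44pq.com/about.html">About</a>'
    '<a href="http://www.44pq.com/read/13567_2.html">Chapter 2</a>'
)
SAMPLE_CHAPTER = (
    '<div class="readerContent" id="content">'
    "&nbsp;&nbsp;First line.<br/>Second line.</div>"
)

chapters = list_chapters(SAMPLE_INDEX)
body = extract_body(SAMPLE_CHAPTER)
```

Only links whose URL has the form read/<digits>_<digits>.html are treated as chapters, so the "About" link in the sample is skipped. Because re.I is set, the class name matches regardless of letter case.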

Note that this code fetches one chapter at a time and writes it to the file immediately, which avoids excessive memory usage:

self.writefile(bookname, "\n".join(text))
del text[:]


If you need to, you can also fetch N chapters and write them to the file in one go; it only takes a simple extra condition. How much memory to use versus how many file writes to perform is a trade-off that everyone weighs differently.
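One way to sketch that logic, in Python 3, is to buffer chapters in a list and append them to the file whenever the buffer reaches N entries, with a final flush for any remainder. The function name write_in_batches, the batch_size parameter, and the sample chapter strings are hypothetical. Fetching is left out, and chapters are passed in as already-decoded strings.

```python
import os
import tempfile


def write_in_batches(chapters, path, batch_size=10):
    """Append chapters to path, writing once per batch_size chapters.

    Returns the number of write operations performed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    buffer = []
    writes = 0
    with open(path, "a", encoding="utf-8") as f:
        for chapter in chapters:
            buffer.append(chapter)
            if len(buffer) >= batch_size:
                # Buffer is full: write it out and empty it.
                f.write("\n".join(buffer) + "\n")
                del buffer[:]
                writes += 1
        if buffer:
            # Flush whatever is left after the last full batch.
            f.write("\n".join(buffer) + "\n")
            writes += 1
    return writes


# Usage with made-up chapter text and a temporary file.
demo_chapters = ["chapter {0}".format(i) for i in range(1, 8)]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
demo_writes = write_in_batches(demo_chapters, demo_path, batch_size=3)
with open(demo_path, encoding="utf-8") as f:
    demo_lines = f.read().splitlines()
```

With batch_size=1 this behaves like the per-chapter writes in the original script. Larger values hold more text in memory in exchange for fewer writes. Unlike the original, each batch here ends with a newline so that consecutive batches do not run together on one line.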





Copyright notice: This is an original article by the blog author and may not be reproduced without consent.
