Scraping a Novel with Python

Last updated: 2015-08-06. Source: Internet

Preface
This script was written on a Mac to scrape a novel. In Python it takes only a few lines of code.

Code

# coding=utf-8
# Python 2 script (urllib2, print statement).
import re
import urllib2
import chardet
import sys
from bs4 import BeautifulSoup
import codecs


class Spider():
    def __init__(self):
        # Matches each chapter link on the table-of-contents page: group 1 is the URL, group 2 the title.
        self.aTag = re.compile("<a href=\"(http://www.44pq.com/read/[0-9]+?_[0-9]+?.html)\"[^>]*?>(.+?)</a>")
        # Matches the chapter body inside the reader <div>.
        self.contentTag = re.compile("<div class=\"readerContent\" id=\"content\">(.+?)</div>", re.I | re.S)

    def getHtml(self, url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        # Return the raw bytes; callers decode them from GB18030.
        return html
        # Alternatives that were tried and left disabled (unreachable after the return above):
        #soup=BeautifulSoup(html.decode("GB18030","ignore"))
        #return soup.findAll("a")
        #return soup.prettify()
        #typeEncode = sys.getfilesystemencoding()
        #infoencode = chardet.detect(html).get('encoding','utf-8')
        #return html.decode('GB18030','ignore').encode("utf-8")
        #return html.decode('GB18030','ignore').encode(sys.getfilesystemencoding())

    def Run(self):
        bookurl = "http://www.44pq.com/read/13567.html"
        bookname = "地球上唯一的魔法師"
        text = []
        matchs = self.aTag.finditer(self.getHtml(bookurl))
        alist = list(matchs)
        total = len(alist)
        print "total {0}".format(total)
        i = 0
        for m in alist:
            i += 1
            text.append(m.group(2).decode("gb18030"))
            text.append(self.getContent(m.group(1)))
            # Write this chapter out and clear the buffer before fetching the next one.
            self.writeFile(bookname, "\n\n".join(text))
            del text[:]
            print "{0}/{1}".format(i, total)
        self.writeFile(bookname, "\n\n".join(text))
        print "done!"

    def writeFile(self, filename, text):
        # Append mode, so each call adds to the end of the same file.
        f = open(filename + ".txt", "a")
        f.write(text)
        f.close()

    def getContent(self, url):
        c = self.getHtml(url)
        c = self.contentTag.search(c).group(1)
        # Strip HTML tags, then remove "&nbsp;" entities in two steps.
        c = re.sub("<[^>]+?>", "", c)
        c = c.replace("nbsp;", "").replace("&", "")
        return c.decode("gb18030")


if __name__ == '__main__':
    # Python 2 only: make utf-8 the default encoding for implicit str/unicode conversion.
    reload(sys)
    sys.setdefaultencoding('utf-8')
    spider = Spider()
    spider.Run()

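A note on indentation, in case the CSDN editor's formatting garbles the listing. In the code above, the two lines

self.writeFile(bookname,"\n\n".join(text))
del text[:]

belong inside the for loop. They should not be aligned with the for keyword.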

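Unnecessary imports in the code above can be removed. The example novel is 地球上唯一的魔法師. aTag is the regular expression that matches every chapter link on the novel's table-of-contents page, and contentTag is the regular expression that matches the body text of a chapter.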
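To see what the two patterns capture, here is a minimal sketch that runs them against hand-written markup. It is written for Python 3 (the script itself is Python 2) and works on already-decoded text, whereas the real pages arrive as GB18030 bytes. The sample HTML and the helper names list_chapters and extract_text are invented for illustration; the actual markup on the site may differ.

```python
import re

# The same two patterns as in the script, written as raw strings.
A_TAG = re.compile(
    r'<a href="(http://www.44pq.com/read/[0-9]+?_[0-9]+?.html)"[^>]*?>(.+?)</a>'
)
CONTENT_TAG = re.compile(
    r'<div class="readerContent" id="content">(.+?)</div>', re.I | re.S
)

# Hypothetical markup, made up for this example only.
SAMPLE_INDEX = (
    '<li><a href="http://www.44pq.com/read/13567_1.html" title="x">Chapter 1</a></li>'
    '<li><a href="http://www.44pq.com/read/13567_2.html">Chapter 2</a></li>'
    '<li><a href="http://www.44pq.com/about.html">About</a></li>'
)
SAMPLE_CHAPTER = (
    '<div class="readerContent" id="content">'
    '&nbsp;&nbsp;First line.<br />\n&nbsp;&nbsp;Second line.'
    '</div>'
)


def list_chapters(index_html):
    """Return (url, title) pairs for every chapter link on the index page."""
    return [(m.group(1), m.group(2)) for m in A_TAG.finditer(index_html)]


def extract_text(chapter_html):
    """Pull the chapter body out of the page, then strip tags and &nbsp;."""
    body = CONTENT_TAG.search(chapter_html).group(1)
    body = re.sub(r"<[^>]+?>", "", body)
    return body.replace("&nbsp;", "")


chapters = list_chapters(SAMPLE_INDEX)
text = extract_text(SAMPLE_CHAPTER)
```

Only links of the form read/<digits>_<digits>.html are picked up, so the "About" link in the sample is skipped. Group 1 of aTag is the chapter URL and group 2 is the chapter title, which is how Run uses them.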

Note that this code writes to the file once for every chapter it scrapes, so that memory use does not grow too large:

self.writeFile(bookname,"\n\n".join(text))
del text[:]

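If needed, you can instead write to the file once every N chapters by adding a simple check. How much memory to use versus how many file writes to make is a trade-off that everyone weighs differently.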
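One possible shape for that check is sketched below, in Python 3 rather than the Python 2 of the script. The class name ChapterWriter and the names batch_size and flush_count are hypothetical and do not appear in the original code. The demo writes placeholder chapters to a temporary file instead of scraping anything.

```python
import os
import tempfile


class ChapterWriter:
    """Buffer chapters in memory and append them to a file every batch_size chapters."""

    def __init__(self, path, batch_size=10):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []      # chapters collected but not yet written
        self.flush_count = 0  # how many times the file has been written

    def add(self, title, body):
        """Queue one chapter; write the batch out once it is full."""
        self.buffer.append(title + "\n\n" + body)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Append all buffered chapters to the file and clear the buffer."""
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.write("\n\n".join(self.buffer) + "\n\n")
        del self.buffer[:]
        self.flush_count += 1


# Demo with placeholder chapters written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "book.txt")
writer = ChapterWriter(path, batch_size=2)
for n in range(1, 6):
    writer.add("Chapter {0}".format(n), "text of chapter {0}".format(n))
writer.flush()  # write whatever is left over after the loop
```

With batch_size=1 this behaves like the per-chapter write in the script. A larger value means fewer file writes, at the cost of holding more chapters in memory at once.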

Copyright notice: this is an original blog post and may not be reproduced without permission.
