Many websites serialize novels or split them into chapters for online reading. Downloading every chapter by hand and assembling them into one tidy text file takes a lot of effort. Fortunately, a Python script can do all of this work automatically. The two scripts below demonstrate the approach on the historical novel "Those Things in the Ming Dynasty - History Should Be Well Written", which is serialized on a Sina blog.
The first script, getlink.py, collects the link to each chapter. Open the blog "" and click "all my articles" to list every article the author has published. The articles whose titles match the pattern ".?Long.?Those things in the Ming Dynasty-history should be well written\[(\d*)-(\d*)\]" are the chapters of the novel. All getlink.py has to do is save these chapters and their links to links.dat for later use.
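The title matching described above can be tried in isolation. The following is a minimal sketch, not the author's code: it is written for Python 3, the helper name `chapter_key` and the sample titles are made up for illustration, and the pattern is a loosened English stand-in for the real title text. It shows how the two numbers in the trailing `[start-end]` part become a zero-padded key.

```python
import re

# Hypothetical stand-in pattern: only the "[start-end]" suffix matters here.
TITLE_RE = re.compile(
    r'.?Long.?\s*Those things of the Ming Dynasty.*?\[(\d+)-(\d+)\]'
)

def chapter_key(title):
    """Return a key such as '0001-0010', or None if the title is not a chapter."""
    match = TITLE_RE.search(title)
    if match is None:
        return None
    # Zero-padding makes plain string sorting agree with numeric order.
    return '%04d-%04d' % (int(match.group(1)), int(match.group(2)))

# Made-up sample titles, listed out of order on purpose.
titles = [
    '[Long] Those things of the Ming Dynasty-history should be well-written [11-20]',
    'A short note to readers',
    '[Long] Those things of the Ming Dynasty-history should be well-written [1-10]',
]

keys = sorted(k for k in map(chapter_key, titles) if k)
print(keys)
```

Because the keys are zero-padded to four digits, sorting them as strings puts the chapters in reading order, which is what the download script relies on when it iterates over `sorted(chapters)`.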
getlink.py
# -*- coding: UTF-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '000000'  # the author's blog ID

# Read chapters and corresponding links
# chapters maps a key built with '%04d-%04d' to a link such as '/u/49861fd5010003ii'
chapters = {}
for i in range(1, 100):
    filehandle = urllib.urlopen('http://blog.sina.com.cn/sns/service.php?m=aList&uid=%s&sort_id=0&page=%d' % (uid, i))
    myDoc = parse(filehandle)
    myRss = myDoc.getElementsByTagName("rss")[0]
    items = myRss.getElementsByTagName("item")
    for item in items:
        title = item.getElementsByTagName("title")[0].childNodes[0].data
        link = item.getElementsByTagName("link")[0].childNodes[0].data
        # The title text in the pattern is shown in translation
        match = re.search(ur'.?Long.?Those things in the Ming Dynasty-history should be well written\[(\d*)-(\d*)\]', title)
        if match:
            # print match.group(1), ":", match.group(2)
            chapters['%04d-%04d' % (int(match.group(1)), int(match.group(2)))] = link

# Save chapters to the file for backup
output = open('links.dat', 'wb+')
pickle.dump(chapters, output)
output.close()
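To see the feed-parsing and pickling steps without touching the network, here is a small Python 3 sketch under stated assumptions: the sample feed is invented and only mirrors the `rss`, `item`, `title`, and `link` elements that the script reads, and `read_items` is a hypothetical helper name. The real feed may contain more structure than this.

```python
import pickle
from xml.dom.minidom import parseString

# Invented sample feed; only the elements read by the script are included.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <item><title>Chapter [1-10]</title><link>/u/aaa001</link></item>
    <item><title>Chapter [11-20]</title><link>/u/aaa002</link></item>
  </channel>
</rss>"""

def read_items(xml_bytes):
    """Return (title, link) pairs for every <item> in the feed."""
    doc = parseString(xml_bytes)
    rss = doc.getElementsByTagName("rss")[0]
    pairs = []
    for item in rss.getElementsByTagName("item"):
        # The text of an element is the data of its first child node.
        title = item.getElementsByTagName("title")[0].childNodes[0].data
        link = item.getElementsByTagName("link")[0].childNodes[0].data
        pairs.append((title, link))
    return pairs

items = read_items(SAMPLE_FEED)

# Round-trip the mapping through pickle, as links.dat does on disk.
blob = pickle.dumps(dict(items))
restored = pickle.loads(blob)
print(restored)
```

Pickling to bytes with `pickle.dumps` stands in for writing links.dat; the dictionary that comes back from `pickle.loads` is equal to the one that was saved.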
The second script, bookdownload.py, downloads each chapter and assembles the full text. Line 19 of this script (the re.search call) extracts the actual content from the downloaded page, discarding advertisements, scripts, and other markup. By analyzing the HTML source of each article, we find that everything we want is located in <div id="articletextxxxxxxxxxxxxxxxxxx">...</div>, where xxxxxxxxxxxxxxxxxx is the article's link with the leading "/u/" removed. A regular expression can therefore extract the actual content of the article.
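The extraction idea can be sketched on a tiny page as follows. This is an illustrative Python 3 sketch rather than the script itself: the page content and the function name `extract_article` are made up, and `re.escape` is an addition so that the identifier is always treated as literal text.

```python
import re

def extract_article(html, link):
    """Return the inner HTML of the article <div>, or None if it is absent."""
    article_id = link[3:]  # '/u/49861fd5010003ii' -> '49861fd5010003ii'
    pattern = (r'(?si)<div id="articletext' + re.escape(article_id)
               + r'".*?>(.*?)</div>')
    match = re.search(pattern, html)
    return match.group(1) if match else None

# Made-up page with an advertisement and a script around the article.
page = ('<html><body><div class="ad">Buy now</div>\n'
        '<div id="articletext49861fd5010003ii" class="body">\n'
        'First paragraph.<br/>Second paragraph.\n'
        '</div><script>track()</script></body></html>')

body = extract_article(page, '/u/49861fd5010003ii')
print(body)
```

The non-greedy `(.*?)` stops at the first `</div>` after the opening tag, so this approach assumes the article body contains no nested `<div>` elements.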
bookdownload.py
# -*- coding: UTF-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '000000'  # the author's blog ID

# Read chapters and corresponding links
chapters = {}  # maps a '%04d-%04d' chapter key to a link such as '/u/49861fd5010003ii'
links = open('links.dat', 'rb+')  # links.dat is generated by getlink.py
chapters = pickle.load(links)  # read chapter and link information from links.dat into chapters

book = open('mingthings.txt', 'w+')  # mingthings.txt is the final full text to be generated
for chapter in sorted(chapters):
    print chapter  # output the chapter currently being processed
    webpage = urllib.urlopen('http://blog.sina.com.cn' + chapters[chapter]).read().decode('utf-8')

    # s: dot matches newline; i: case-insensitive; m: ^ and $ match at line breaks
    match = re.search(ur'(?siLu).*<div id="articletext' + chapters[chapter][3:] + '".*?>(.*?)</div>.*', webpage)
    if match:
        text = match.group(1)  # obtain the content of each chapter

        # tidy up the content of each chapter
        text = re.sub(ur'(?sLu)<(.*?)>', '', text)        # remove HTML tags
        text = re.sub(ur'(?sLu)(&nbsp;)+', '', text)      # remove &nbsp; entities
        text = re.sub(ur"(?Lum)^( +)", "", text)          # remove leading spaces
        text = re.sub(ur'(?Lum)^(\s+)', '', text)         # remove leading whitespace and blank lines
        # put each chapter title on its own line (title text shown in translation)
        text = re.sub(ur'(?siLu)(.?Long.?Those things in the Ming Dynasty-history should be well written\[\d*-\d*\])', r'\r\n\1\r\n', text)
        text = re.sub(ur'(?Lum)^(.*)$', ur'  \1', text)   # indent each line

        book.write(text.encode('gbk', 'ignore') + "\r\n")
        book.flush()

book.close()
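The tidying substitutions can be tried on a small fragment. The following Python 3 sketch is a simplified illustration under assumptions: the function name `clean_chapter` and the sample fragment are made up, and only the tag, entity, and leading-whitespace steps are shown.

```python
import re

def clean_chapter(html_fragment):
    """Strip tags, &nbsp; entities, and leading whitespace from a fragment."""
    text = re.sub(r'(?s)<.*?>', '', html_fragment)  # remove HTML tags
    text = re.sub(r'(&nbsp;)+', '', text)           # remove &nbsp; entities
    # With (?m), ^ matches at each line start; \s+ also swallows blank lines.
    text = re.sub(r'(?m)^\s+', '', text)
    return text

# Made-up fragment with tags, entities, a blank line, and indentation.
fragment = '<p>&nbsp;&nbsp;First line.</p>\n\n<p>   Second line.<br/></p>\n'

cleaned = clean_chapter(fragment)
print(repr(cleaned))
```

Since `\s` matches newlines as well as spaces, the last substitution removes empty lines together with indentation, which leaves one paragraph per line.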
Run the two scripts in order each day, and the latest full text of "Those Things in the Ming Dynasty - History Should Be Well Written" will be on your hard disk.
In short, regular expressions are a powerful tool for downloading serialized online novels.