Writing a Python script that automatically downloads online novels

Last Update: 2018-12-05 Source: Internet

Author: User

Many websites serialize novels or split them into chapters for online reading. Downloading every chapter and assembling the pieces into one well-formed text file by hand takes a lot of effort. Fortunately, a Python script can do all of this work automatically. The two scripts below demonstrate the approach with the historical novel "Those things in the Ming Dynasty-history should be well written", which is serialized on a Sina blog.

The first script, getlink.py, collects the link to each chapter. Open the blog "" and click "all my articles" to list every article the author has published. The articles whose titles match the pattern ".?Long.?Those things in the Ming Dynasty-history should be well written\[(\d*)-(\d*)\]" are the installments of the novel. All getlink.py has to do is save these chapters and their links to links.dat for later use.
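The title matching described above can be sketched on its own. This is only an illustration in Python 3 syntax (the scripts in this article are written for Python 2). The sample titles and the helper name `chapter_key` are made up for the example, and the title text is the English rendering used in this article, not the wording that appears on the blog itself.

```python
import re

# English rendering of the title pattern; the titles on the real blog differ.
# \d+ is used instead of \d* so that int() never receives an empty string.
TITLE_PATTERN = re.compile(
    r'.?Long.?Those things in the Ming Dynasty-history should be well written'
    r'\[(\d+)-(\d+)\]'
)

def chapter_key(title):
    """Return a sortable key of the form 'NNNN-NNNN', or None for other posts."""
    match = TITLE_PATTERN.search(title)
    if match is None:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    return '%04d-%04d' % (first, last)

# Hypothetical titles, for illustration only.
titles = [
    '(Long)Those things in the Ming Dynasty-history should be well written[21-40]',
    'A note to readers',
    '(Long)Those things in the Ming Dynasty-history should be well written[1-20]',
]

# Keep only the posts that are installments and put them in reading order.
keys = sorted(k for k in map(chapter_key, titles) if k is not None)
print(keys)
```

Zero-padding both numbers to four digits means that sorting the keys as plain strings also puts the chapters in numeric order, which is what a later `sorted(chapters)` call relies on.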
getlink.py

# -*- coding: UTF-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '000000'  # user ID of the blog's author

# Read the chapters and their links
chapters = {}  # maps a key of the form 'NNNN-NNNN' to a link such as '/u/49861fd5010003ii'
for i in range(1, 100):
    filehandle = urllib.urlopen('http://blog.sina.com.cn/sns/service.php?m=aList&uid=%s&sort_id=0&page=%d' % (uid, i))
    myDoc = parse(filehandle)
    myRss = myDoc.getElementsByTagName("rss")[0]
    items = myRss.getElementsByTagName("item")
    for item in items:
        title = item.getElementsByTagName("title")[0].childNodes[0].data
        link = item.getElementsByTagName("link")[0].childNodes[0].data
        match = re.search(ur'.?Long.?Those things in the Ming Dynasty-history should be well written\[(\d*)-(\d*)\]', title)
        if match:
            # print match.group(1), ":", match.group(2)
            chapters['%04d-%04d' % (int(match.group(1)), int(match.group(2)))] = link

# Save the chapters to a file for later use
output = open('links.dat', 'wb+')
pickle.dump(chapters, output)
output.close()
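The hand-off between the two scripts is a pickled dictionary. The round trip can be sketched as follows, in Python 3 syntax. The key '0001-0020' is a made-up example of the 'NNNN-NNNN' form, and the file is written to a temporary directory here rather than to the working directory.

```python
import os
import pickle
import tempfile

# Hypothetical chapter table; the link is the example value from the script's comment.
chapters = {'0001-0020': '/u/49861fd5010003ii'}

# Write to a scratch directory so the sketch does not touch a real links.dat.
path = os.path.join(tempfile.mkdtemp(), 'links.dat')

# Pickle files are binary, so open them in 'wb' / 'rb' mode.
with open(path, 'wb') as output:
    pickle.dump(chapters, output)

with open(path, 'rb') as links:
    restored = pickle.load(links)

print(restored == chapters)
```

Using `with` closes the file even if an error occurs part-way through, which the explicit `close()` call in the script does not guarantee.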

The second script, bookdownload.py, downloads every chapter and assembles them into a single text file. The re.search call on line 19 of this script extracts the actual content from each downloaded page, leaving out advertisements, scripts, and other clutter. Analyzing the HTML source of each article shows that everything we want is inside <div id="articletextxxxxxxxxxxxxxx">...</div>, where xxxxxxxxxxxxxx is the article's link with the leading "/u/" removed. A regular expression can therefore pull out the actual content of the article.
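The extraction step can be tried in isolation on a small HTML string. The sketch below uses Python 3 syntax. The function name `extract_article` and the sample page are assumptions made for the example, and the real pages may be structured differently.

```python
import re

def extract_article(webpage, link):
    """Return the inner HTML of the article <div>, or None if it is not found."""
    article_id = link[3:]  # drop the leading '/u/' from the link
    # (?s) lets '.' match newlines; (?i) ignores case in the tag names.
    pattern = (r'(?si)<div id="articletext' + re.escape(article_id)
               + r'"[^>]*>(.*?)</div>')
    match = re.search(pattern, webpage)
    return match.group(1) if match else None

# Hypothetical page, for illustration only.
sample_page = (
    '<html><body><div class="ad">Advertisement</div>\n'
    '<div id="articletext49861fd5010003ii" class="body">\n'
    '<p>First paragraph.</p>\n<p>Second paragraph.</p>\n'
    '</div><script>track()</script></body></html>'
)

content = extract_article(sample_page, '/u/49861fd5010003ii')
print(content)
```

The non-greedy `(.*?)` stops at the first `</div>`, so this approach only works when the article body contains no nested `<div>` elements. An HTML parser would be the more robust choice for pages where that does not hold.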

bookdownload.py

# -*- coding: UTF-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '000000'  # user ID of the blog's author

# Read the chapters and their links
chapters = {}  # maps a key of the form 'NNNN-NNNN' to a link such as '/u/49861fd5010003ii'
links = open('links.dat', 'rb+')  # links.dat is generated by getlink.py
chapters = pickle.load(links)  # read the chapter and link information from links.dat into chapters

book = open('mingthings.txt', 'w+')  # mingthings.txt is the final full text to be generated
for chapter in sorted(chapters):
    print chapter  # show which section is currently being processed
    webpage = urllib.urlopen('http://blog.sina.com.cn' + chapters[chapter]).read().decode('utf-8')

    # s: dot matches newline; i: case-insensitive; m: ^ and $ match at line breaks
    match = re.search(ur'(?siLu).*<div id="articletext' + chapters[chapter][3:] + '".*?>(.*?)</div>.*', webpage)
    if match:
        text = match.group(1)  # the content of this section

        # tidy up the content of each section
        text = re.sub(ur'(?sLu)<(.*?)>', '', text)  # remove HTML tags
        text = re.sub(ur'(?sLu)(&nbsp;)+', '', text)  # remove runs of &nbsp;
        text = re.sub(ur"(?Lum)^( +)", "", text)  # remove leading spaces
        text = re.sub(ur'(?Lum)^(\s+)', '', text)  # remove blank lines and other leading whitespace
        text = re.sub(ur'(?siLu)(.?Long.?Those things in the Ming Dynasty-history should be well written\[\d*])', r'\r\n\1\r\n', text)  # put each chapter heading on its own line
        text = re.sub(ur'(?Lum)^(.*)$', ur'  \1', text)  # indent every line (indent width is an assumption; the source is garbled here)

        book.write(text.encode('gbk', 'ignore') + "\r\n")
        book.flush()

book.close()
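The clean-up substitutions can be tried on a small fragment to see what each one does. This is a minimal sketch in Python 3 syntax that mirrors the first few substitutions only. The function name `clean_chapter` and the sample fragment are assumptions for the example.

```python
import re

def clean_chapter(html_fragment):
    """Strip tags, &nbsp; entities, leading whitespace and blank lines."""
    text = re.sub(r'(?s)<.*?>', '', html_fragment)  # remove HTML tags
    text = re.sub(r'(&nbsp;)+', '', text)           # remove runs of &nbsp;
    # With (?m), ^ matches at every line start; \s+ then swallows leading
    # spaces and any empty lines that follow a line break.
    text = re.sub(r'(?m)^\s+', '', text)
    return text

# Hypothetical fragment, for illustration only.
fragment = ('<p>&nbsp;&nbsp;First paragraph.</p>\n\n'
            '<p>  Second <b>paragraph</b>.</p>\n')

cleaned = clean_chapter(fragment)
print(cleaned)
```

The order of the substitutions matters: the tags have to go first, because leading whitespace that sits behind a `<p>` tag is not at the start of a line until the tag has been removed.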

Run the two scripts in order once a day, and the latest full text of "Those things in the Ming Dynasty-history should be well written" will always be on your hard disk.
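Running the scripts in order can itself be scripted. The sketch below is one possible approach, in Python 3 syntax, and is not part of the original scripts. The helper name `run_in_order` is made up, and the `interpreter` argument is an assumption: since the two scripts are written for Python 2, it would need to point at a Python 2 interpreter.

```python
import subprocess
import sys

def run_in_order(scripts, interpreter=sys.executable):
    """Run each script to completion before starting the next one.

    check=True raises CalledProcessError on a non-zero exit status, so a
    failure in an earlier script prevents the later ones from running.
    """
    for script in scripts:
        subprocess.run([interpreter, script], check=True)

# Intended use (not executed here, because the scripts need network access):
# run_in_order(['getlink.py', 'bookdownload.py'])
```

Stopping on the first failure matters here because bookdownload.py reads the links.dat file that getlink.py writes. A scheduler such as cron or Windows Task Scheduler could then call this runner once a day.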

In short, regular expressions are a powerful tool for extracting and tidying up the chapters of a long online novel.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
