Writing a Python script to automatically download online novels

Source: Internet
Author: User
Many websites serialize novels or split them into chapters for online reading. Downloading every chapter by hand and assembling them into one well-formatted text file takes a lot of effort. Fortunately, a Python script can do all of this work automatically. The two scripts below demonstrate the approach on the historical novel "Those Things of the Ming Dynasty - History Should Be Well Written", serialized on a Sina blog.

The first script, getlink.py, collects the link for each chapter. Open the blog "" and click "all my articles"; the page lists every article the author has published. The long installments are the ones whose titles match the pattern ".?Long.?Those things of the Ming Dynasty-history should be well-written\[(\d*)-(\d*)\]". All getlink.py has to do is save these chapters and their links to links.dat for later use.
getlink.py

# -*- coding: utf-8 -*-
import pickle
import re
import urllib.request
from xml.dom.minidom import parse

uid = '000000'  # blog ID of the novel's author

# Chapter titles end in "[start-end]". The English wording here stands in for
# the title text; use the wording that actually appears on the blog.
title_re = re.compile(r'.?Long.?Those things in the Ming Dynasty-history should be well written\[(\d*)-(\d*)\]')

# Read chapters and their links. chapters maps a zero-padded chapter range to
# a relative link, for example chapters['0001-0010'] = '/u/49861fd5010003ii'
chapters = {}
for i in range(1, 100):
    url = 'http://blog.sina.com.cn/sns/service.php?m=aList&uid=%s&sort_id=0&page=%d' % (uid, i)
    filehandle = urllib.request.urlopen(url)
    myDoc = parse(filehandle)
    myRss = myDoc.getElementsByTagName("rss")[0]
    items = myRss.getElementsByTagName("item")
    for item in items:
        title = item.getElementsByTagName("title")[0].childNodes[0].data
        link = item.getElementsByTagName("link")[0].childNodes[0].data
        match = title_re.search(title)
        if match:
            # print(match.group(1), ":", match.group(2))
            chapters['%04d-%04d' % (int(match.group(1)), int(match.group(2)))] = link

# Save chapters to a file for later use
with open('links.dat', 'wb') as output:
    pickle.dump(chapters, output)
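The title matching and the links.dat round trip can be tried without touching the network. The following is only a minimal sketch: the shortened pattern, the sample titles, and the sample links are all made up for illustration, and the pickle round trip happens in memory rather than through a file.

```python
import pickle
import re

# Hypothetical, shortened stand-in for the real title pattern.
TITLE_RE = re.compile(r'Ming Dynasty\[(\d+)-(\d+)\]')


def chapter_key(title):
    """Return a zero-padded 'start-end' key for a chapter title, or None."""
    match = TITLE_RE.search(title)
    if match is None:
        return None
    return '%04d-%04d' % (int(match.group(1)), int(match.group(2)))


# Made-up (title, link) pairs standing in for the parsed feed items.
sample_items = [
    ('Long: Ming Dynasty[11-20]', '/u/sample0002'),
    ('An unrelated post', '/u/sample0099'),
    ('Long: Ming Dynasty[1-10]', '/u/sample0001'),
]

chapters = {}
for title, link in sample_items:
    key = chapter_key(title)
    if key is not None:
        chapters[key] = link

# Round-trip through pickle in memory, the way links.dat does on disk.
restored = pickle.loads(pickle.dumps(chapters))
print(sorted(restored))  # ['0001-0010', '0011-0020']
```

Zero-padding the chapter numbers is what lets a plain sorted() call put the installments in reading order, since the keys are compared as strings.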

The second script, bookdownload.py, reads links.dat, downloads each chapter, and assembles the full text. Its re.search call extracts the actual content from the downloaded page, leaving out advertisements, scripts, and other clutter. Analysis of each article's HTML source shows that everything we want sits inside <div id="articletextxxxxxxxxxxxxxxxx">...</div>, where xxxxxxxxxxxxxxxx is the ID taken from the article's link. A regular expression can therefore pull out the actual content of the article.
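As a small illustration of that extraction step, the sketch below runs the same kind of regular expression against a made-up page. The helper name extract_article and the sample HTML are assumptions for the example; only the example link and the div id prefix come from the article.

```python
import re


def extract_article(html, article_id):
    """Return the inner HTML of the article <div>, or None if it is absent."""
    pattern = (r'(?si)<div\s+id="articletext' + re.escape(article_id)
               + r'"[^>]*>(.*?)</div>')
    match = re.search(pattern, html)
    return match.group(1) if match else None


link = '/u/49861fd5010003ii'   # example link quoted in the scripts
article_id = link[3:]          # drop the leading '/u/'

# Made-up page with an advertisement and a script around the article body.
sample_page = (
    '<html><body><div class="ad">Buy now</div>\n'
    '<div id="articleText49861fd5010003ii" class="body">\n'
    '<p>Chapter text</p>\n'
    '</div><script>track()</script></body></html>'
)

body = extract_article(sample_page, article_id)
print(repr(body))  # '\n<p>Chapter text</p>\n'
```

Because the capture group is non-greedy, it stops at the first closing </div>. That is fine while the article body contains no nested div elements; if it does, an HTML parser would be a more robust choice than a regular expression.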

bookdownload.py

# -*- coding: utf-8 -*-
import pickle
import re
import urllib.request

# Read chapters and their links from links.dat, which getlink.py generates.
# chapters maps a chapter range to a relative link such as '/u/49861fd5010003ii'
with open('links.dat', 'rb') as links:
    chapters = pickle.load(links)

# mingthings.txt is the final full text; newline='' keeps '\r\n' exactly as written
book = open('mingthings.txt', 'w', encoding='gbk', errors='ignore', newline='')
for chapter in sorted(chapters):
    print(chapter)  # show the chapter currently being processed
    webpage = urllib.request.urlopen('http://blog.sina.com.cn' + chapters[chapter]).read().decode('utf-8')

    # s: dot matches newline; i: case insensitive; m: ^ and $ match at line breaks
    match = re.search(r'(?si)<div id="articletext' + re.escape(chapters[chapter][3:]) + r'".*?>(.*?)</div>', webpage)
    if match:
        text = match.group(1)  # content of this chapter

        # Tidy up the content of the chapter
        text = re.sub(r'(?s)<(.*?)>', '', text)     # strip HTML tags
        text = re.sub(r'(?s)(&nbsp;)+', '', text)   # strip &nbsp; entities
        text = re.sub(r'(?m)^( +)', '', text)       # strip leading spaces
        text = re.sub(r'(?m)^(\s+)', '', text)      # strip other leading whitespace
        # Put each installment title on its own line. The English wording
        # stands in for the title text used on the blog.
        text = re.sub(r'(?si)(.?Long.?Those things in the Ming Dynasty-history should be well written\[\d*-\d*\])', r'\r\n\1\r\n', text)
        text = re.sub(r'(?m)^(.*)$', r'  \1', text)  # indent every line (indent width is an assumption)

        book.write(text + '\r\n')
        book.flush()

book.close()
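The clean-up substitutions are easier to follow on a tiny input. This is a simplified sketch of the tag, entity, and whitespace steps only; the function name clean_chapter and the sample fragment are made up for the example.

```python
import re


def clean_chapter(fragment):
    """Strip tags, &nbsp; entities, and leading whitespace from a fragment."""
    text = re.sub(r'(?s)<.*?>', '', fragment)   # drop every HTML tag
    text = re.sub(r'(&nbsp;)+', '', text)       # drop runs of &nbsp;
    # In multiline mode ^ matches after each newline, and \s+ also consumes
    # newlines, so this removes indentation and blank lines in one pass.
    text = re.sub(r'(?m)^\s+', '', text)
    return text


sample = '<p>&nbsp;&nbsp;First line</p>\n\n  <p>Second <b>line</b></p>\n'
print(repr(clean_chapter(sample)))  # 'First line\nSecond line\n'
```

The order matters: tags and entities are removed first so that whitespace hidden behind them is exposed to the final leading-whitespace pass.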

Run the two scripts above in order each day, and the latest full text of "Those Things of the Ming Dynasty - History Should Be Well Written" will be on your hard disk.

In short, regular expressions are a powerful tool for extracting and tidying the chapters of long online novels.
