Recently, short on dramas to watch, my boss binge-watched a web series on iQiyi adapted from Ding Mo's novel of the same name, "Beauty for Stuffing". Two seasons are out, and although the plot is full of holes, the boss watched them happily, and after finishing season two asked me for the novel so she could read the ending directly in the original book...
A quick search turned up only online-reading sites; downloading required signing up for an account, which was too much trouble, so I decided to write a crawler with Node.js instead. Here are my notes.
Work flow
- Get the list of chapter URLs (request resources with the `http` module)
- Fetch the page source for each URL in the list (pages may have encoding problems; decode with the `iconv-lite` module)
- Parse the source and extract the novel's content (`cheerio` module)
- Save the content to a markdown file, adding light retouching and chapter headings (write files with `fs`, request resources synchronously with `sync-request`)
- Convert the markdown to PDF (using Pandoc or Chrome's print function)
Get URLs
Starting from the novel's table-of-contents page, get the URLs of all the chapters and store them in a JSON array.
- First, fetch the page source with `http.get()`
- The fetched source printed garbled Chinese; inspecting the page showed `charset=gbk`, so the response needs transcoding
- Use the `iconv-lite` module for the transcoding. Once the Chinese displays correctly, start parsing the source for the needed URLs. To make parsing easier, introduce the `cheerio` module, which can be understood as a jQuery that runs on the server; its usage is so close to jQuery's that anyone familiar with jQuery can pick it up quickly
- Load the source into `cheerio`. Analyzing the markup shows that all the chapter information sits in `<a>` tags wrapped in a `<div>`. Use `cheerio` to select the qualifying `<a>` tags, traverse them, and collect each chapter's title and URL into an object, then push the objects into an array (the stored `href` values are incomplete, so they need to be padded into full URLs)
- Serialize the array of objects and write it to a `list.json` file
```javascript
var http = require("http")
var fs = require("fs")
var cheerio = require("cheerio")
var iconv = require("iconv-lite")
var url = 'http://www.17fa.com/files/article/html/90/90747/index.html'
http.get(url, function(res) { // request the resource
    var chunks = []
    res.on('data', function(chunk) {
        chunks.push(chunk)
    })
    res.on('end', function() {
        var html = iconv.decode(Buffer.concat(chunks), 'gb2312') // transcoding
        var $ = cheerio.load(html, {
            decodeEntities: false
        })
        var content = $("tbody")
        var links = []
        $('div').children('a').each(function(i, elem) {
            var link = new Object()
            link.title = $(this).text()
            link.link = 'http://www.17fa.com/files/article/html/90/90747/' + $(this).attr('href') // pad the URL
            if (i > 5) {
                links.push(link)
            }
        })
        fs.writeFile("list.json", JSON.stringify(links), function(err) {
            if (!err) {
                console.log("write file succeeded")
            }
        })
    })
}).on('error', function() {
    console.log("web access error")
})
```
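As an aside, the URL padding done with string concatenation above can also be handled by Node's built-in WHATWG `URL` class, which resolves a relative `href` against the page it came from. A small sketch (the `href` value is just an example):

```javascript
// Resolve a relative chapter href against the index page URL.
// Node's built-in URL class handles the path resolution.
const base = 'http://www.17fa.com/files/article/html/90/90747/index.html'
const href = '16548771.html' // example relative href from an <a> tag

const absolute = new URL(href, base).href
console.log(absolute)
// http://www.17fa.com/files/article/html/90/90747/16548771.html
```

This avoids hard-coding the directory prefix a second time.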
A sample of the resulting list:
```json
[{
    "title": "3 Forensic Division White",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548771.html"
}, {
    "title": "4 The 1st Dream",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548772.html"
}, {
    "title": "5 Police Officer Han Shen",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548773.html"
}, {
    "title": "6 Initial Battle",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548774.html"
}]
```
Get Data
With the URL list in hand, the next step is very mechanical: traverse the list, request each resource, get the source, parse it, extract the novel text, and write it to a file. However, because all the chapters end up in a single file and the chapter order must be guaranteed, the file writes need to be synchronous. In fact, while coding I ended up making all of the operations synchronous.
Get the source
Read the `list.json` file to get the URL list, then traverse the list to fetch each resource. Because the chapter order must be preserved, introduce the `sync-request` module to request the resources synchronously, transcoding each response as usual.
```javascript
var http = require("http")
var fs = require("fs")
var cheerio = require("cheerio")
var iconv = require("iconv-lite")
var request = require('sync-request')
var urlList = JSON.parse(fs.readFileSync('list.json', 'utf8'))

function getContent(chapter) {
    var res = request('GET', chapter.link)
    var html = iconv.decode(res.body, 'gb2312') // get the source
}

for (let i = 0; i < urlList.length; i++) {
    getContent(urlList[i])
}
```
Analyze the source, get the novel
The novel content is again extracted with the `cheerio` module. To keep the reading experience clean, the HTML tags are stripped from the content before the write operation.
```javascript
function getContent(chapter) {
    var res = request('GET', chapter.link)
    var html = iconv.decode(res.body, 'gb2312')
    var $ = cheerio.load(html, {
        decodeEntities: false
    })
    var content = ($("div#r1c").text()).replace(/&nbsp;/g, '')
}
```
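A note on the `replace` call: because `cheerio` is loaded with `decodeEntities: false`, non-breaking-space entities survive into the extracted text as the literal characters `&nbsp;`, and the global regex strips them out. A standalone sketch with a made-up sample string:

```javascript
// With decodeEntities disabled, "&nbsp;" stays in the text literally
// and can be removed with a global regex replace.
const raw = '&nbsp;&nbsp;He looked up at the night sky.&nbsp;'
const clean = raw.replace(/&nbsp;/g, '')
console.log(clean)
// He looked up at the night sky.
```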
Save the novel
The write operations also need to be synchronous, so use the synchronous write function `fs.writeFileSync()` and the synchronous append function `fs.appendFileSync()`. The first chapter is written with the write function; every later chapter is appended. To improve the reading experience, a heading is added before each chapter.
You can also add a `[TOC]` before the content as a navigation link.
```javascript
var http = require("http")
var fs = require("fs")
var cheerio = require("cheerio")
var iconv = require("iconv-lite")
var path = require('path')
var urlList = JSON.parse(fs.readFileSync('list.json', 'utf8'))

function getContent(chapter) {
    console.log(chapter.link)
    http.get(chapter.link, function(res) {
        var chunks = []
        res.on('data', function(chunk) {
            chunks.push(chunk)
        })
        res.on('end', function() {
            var html = iconv.decode(Buffer.concat(chunks), 'gb2312')
            var $ = cheerio.load(html, {
                decodeEntities: false
            })
            var content = ($("div#r1c").text()).replace(/&nbsp;/g, '')
            if (fs.existsSync('Beauty for Stuffing.md')) {
                fs.appendFileSync('Beauty for Stuffing.md', '### ' + chapter.title)
                fs.appendFileSync('Beauty for Stuffing.md', content)
            } else {
                fs.writeFileSync('Beauty for Stuffing.md', '### ' + chapter.title)
                fs.appendFileSync('Beauty for Stuffing.md', content)
            }
        })
    }).on('error', function() {
        console.log("crawl " + chapter.link + " link error!")
    })
}

for (let i = 0; i < urlList.length; i++) {
    console.log(urlList[i])
    getContent(urlList[i])
}
```
Markdown to PDF
I saved the novel in a markdown file. To improve the reading experience, the markdown file can be converted to PDF. I currently prefer two approaches: Chrome's print function and Pandoc conversion.
Chrome Print
Sublime Text has a plugin, Markdown Preview, which previews markdown in Chrome via the `Alt + M` shortcut. On the Chrome page, choose Print, adjust the parameters, and save as PDF. Simple, crude, and it won my heart.
Printing effect:
Pandoc Conversion
Pandoc is a very powerful file-format conversion tool that can turn a markdown file into many other formats. I fought with it for a long time tonight on Windows 10 because it kept failing to find pdflatex; I will write a dedicated summary about Pandoc later.
The PDF has been sent to the boss, who is reading it now.
About Python, Node, and crawlers
For a long time I had wanted to use Python, wanted to write a crawler, and above all wanted to write a crawler in Python; it even became something of an obsession. As my knowledge grew more complete, the obsession gradually faded. Less "thinking about it" and more doing: practice is the test of truth.