A simple novel crawler in Node.js


Recently, being short of dramas to watch, the boss has been following an iQIYI web series adapted from Ding Mo's novel of the same name, "Beauty for Stuffing". Two seasons are out, and although the whole show is full of flaws, the boss watched it happily, and after the second season asked me for the novel so she could read the ending in the original book...

A quick search turned up only online-reading sites; downloading requires registering an account, which is a hassle. Writing a crawler sounded like fun anyway, so I wrote one with Node. Here are my notes.

Work flow

  • Get the list of chapter URLs (resource requests via the http module)
  • Fetch the page source for each URL in the list (pages may have encoding problems; the iconv-lite module handles transcoding)
  • Parse the source and extract the novel text (the cheerio module)
  • Save the text to a Markdown file, adding some light formatting and chapter headings (file writes with fs; synchronous requests with the sync-request module)
  • Convert the Markdown to PDF (using Pandoc or Chrome's print dialog)

Get URLs

From the novel's table-of-contents page, collect the URLs of all chapters and store them in a JSON array.

  • First use http.get() to fetch the page source
  • Printing the fetched source showed garbled Chinese; inspection revealed charset=gbk, so the bytes need transcoding
  • Use the iconv-lite module to transcode. Once the Chinese displays correctly, start parsing the source for the required URLs. To make parsing easier, introduce the cheerio module; cheerio can be understood as jQuery running on the server, with an almost identical API, so anyone familiar with jQuery will pick it up quickly
  • Load the source into cheerio. Analyzing the page shows that all the chapter links are <a> tags inside a div. Select the matching <a> tags with cheerio, traverse them, take each chapter title and URL, and save them as objects in an array (the hrefs stored in the links are relative, so they need to be padded into full URLs)
  • Serialize the array of objects and write it to a list.json file
```javascript
var http = require("http")
var fs = require("fs")
var cheerio = require("cheerio")
var iconv = require("iconv-lite")
var url = 'http://www.17fa.com/files/article/html/90/90747/index.html'
http.get(url, function (res) { // resource request
    var chunks = []
    res.on('data', function (chunk) {
        chunks.push(chunk)
    })
    res.on('end', function () {
        var html = iconv.decode(Buffer.concat(chunks), 'gb2312') // transcoding
        var $ = cheerio.load(html, { decodeEntities: false })
        var links = []
        $('div').children('a').each(function (i, elem) {
            var link = new Object()
            link.title = $(this).text()
            link.link = 'http://www.17fa.com/files/article/html/90/90747/' + $(this).attr('href') // pad the URL
            if (i > 5) {
                links.push(link)
            }
        })
        fs.writeFile("list.json", JSON.stringify(links), function (err) {
            if (!err) {
                console.log('write file succeeded')
            }
        })
    })
}).on('error', function () {
    console.log("web access error")
})
```
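As an aside, on recent Node builds (those with full ICU, the default since Node 13) the GBK transcoding step can also be done without iconv-lite, using the built-in TextDecoder. A minimal sketch with hard-coded GBK bytes:

```javascript
// GBK decoding without iconv-lite: Node's built-in TextDecoder accepts the
// 'gbk' label when the runtime ships full ICU (the default since Node 13).
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]); // "中文" encoded as GBK
const text = new TextDecoder('gbk').decode(gbkBytes);
console.log(text); // 中文
```

For a crawler this would replace the iconv.decode() call on the concatenated response chunks.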

Sample of the resulting list:

```json
[{
    "title": "3 Forensic Division White",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548771.html"
}, {
    "title": "4 1st Dream",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548772.html"
}, {
    "title": "5 Police Officer Han Shen",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548773.html"
}, {
    "title": "6 Initial Battle",
    "link": "http://www.17fa.com/files/article/html/90/90747/16548774.html"
}]
```

Get Data

With the URL list in hand, the next steps are mechanical: traverse the list, request each resource, get the source, parse it, extract the novel text, and write it to a file. However, because all the chapters end up in a single file and their order must be preserved, the file writes need to be synchronous; in fact, while coding I made all of the operations synchronous.

Get the source

Read the list.json file to get the URL list, then traverse it to fetch each resource. To guarantee chapter order, the sync-request module is introduced to request resources synchronously; the responses are transcoded as before.

```javascript
var http = require("http")
var fs = require("fs")
var cheerio = require("cheerio")
var iconv = require("iconv-lite")
var request = require('sync-request')
var urlList = JSON.parse(fs.readFileSync('list.json', 'utf8'))

function getContent(chapter) {
    var res = request('GET', chapter.link)
    var html = iconv.decode(res.body, 'gb2312') // get the source
}

for (let i = 0; i < urlList.length; i++) {
    getContent(urlList[i])
}
```

Parse the source, extract the novel

The novel text is again extracted through the cheerio module. To avoid spoiling the reading experience, leftover HTML artifacts are stripped from the content before the write operation.

```javascript
function getContent(chapter) {
    var res = request('GET', chapter.link)
    var html = iconv.decode(res.body, 'gb2312')
    var $ = cheerio.load(html, {
        decodeEntities: false
    })
    var content = $("div#r1c").text().replace(/&nbsp;/g, '') // strip non-breaking-space entities
}
```

Save a novel

Write operations also require synchronization, so the synchronous write function fs.writeFileSync() and the synchronous append function fs.appendFileSync() are used: the first write uses the write function, and all later content is appended. To improve the reading experience, each chapter is preceded by its title as a heading.

You can also prepend a [TOC] marker to the content as navigation links.

```javascript
var http = require("http")
var fs = require("fs")
var cheerio = require("cheerio")
var iconv = require("iconv-lite")
var path = require('path')
var urlList = JSON.parse(fs.readFileSync('list.json', 'utf8'))

function getContent(chapter) {
    console.log(chapter.link)
    http.get(chapter.link, function (res) {
        var chunks = []
        res.on('data', function (chunk) {
            chunks.push(chunk)
        })
        res.on('end', function () {
            var html = iconv.decode(Buffer.concat(chunks), 'gb2312')
            var $ = cheerio.load(html, {
                decodeEntities: false
            })
            var content = $("div#r1c").text().replace(/&nbsp;/g, '')
            if (fs.existsSync('Beauty for Stuffing.md')) {
                fs.appendFileSync('Beauty for Stuffing.md', '### ' + chapter.title)
                fs.appendFileSync('Beauty for Stuffing.md', content)
            } else {
                fs.writeFileSync('Beauty for Stuffing.md', '### ' + chapter.title)
                fs.appendFileSync('Beauty for Stuffing.md', content)
            }
        })
    }).on('error', function () {
        console.log("crawl " + chapter.link + " link error!")
    })
}

for (let i = 0; i < urlList.length; i++) {
    console.log(urlList[i])
    getContent(urlList[i])
}
```

Markdown to PDF

The novel is saved as a Markdown file. To improve the reading experience, it can be converted to a PDF. I currently favor two approaches: Chrome's print function and Pandoc conversion.

Chrome Print

Sublime Text has a plug-in, Markdown Preview, that opens a Markdown preview in Chrome via the Alt + M shortcut. On the Chrome page, choose Print, adjust the parameters, and select Save as PDF. Simple, crude, and it won my heart.

Printing effect:

Pandoc Conversion
Pandoc is a very powerful file-format conversion tool that can turn a Markdown file into many formats. Tonight I wrestled with it on Windows 10 for a long time; it kept failing to find pdflatex. I'll write a dedicated summary about Pandoc later.
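For reference, once a LaTeX engine is installed, a typical invocation looks like the following (pandoc 2+ syntax; Chinese text needs xelatex plus a CJK font rather than pdflatex, which may well be why the pdflatex route kept failing). The font name is an assumption and must match one installed on your system:

```shell
# xelatex + a CJK main font are needed for Chinese text; pdflatex cannot handle it.
# "Noto Sans CJK SC" is an assumed font name -- substitute one you actually have.
pandoc novel.md -o novel.pdf --pdf-engine=xelatex -V CJKmainfont="Noto Sans CJK SC"
```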

The PDF has been sent to the boss, who is reading it now.

On Python, Node, and crawlers

For a long time I really wanted to use Python, really wanted to write a crawler, and even more wanted to write a crawler in Python; it had become something of an obsession. As my knowledge broadened, the obsession gradually faded. Less "thinking about it", more doing: practice is the test of truth.

That is the entire content of this article; I hope it helps you in your learning.
