Nodejs cheerio module extracts HTML page content

Table of Contents
    • 1. Nodejs cheerio module extracts HTML page content
      • 1.1. Find the target element
      • 1.2. Beautify text output
      • 1.3. Extract the answer text
      • 1.4. Final code

This article walks through an example of using the cheerio module to extract specific content from an HTML file, describing the steps, the APIs involved, and the other modules used. cheerio is a jQuery-like module with a similar API: it parses a web page into a DOM, selects elements with selectors, and gets or sets element attributes.

Here are the pages we want to parse:

The goal is to extract all the questions and answers from task1 to task5 and save them as text. The expected results look like this. Here is the question text:

Task 1: You'll be given minutes to read the text for the first time and then choose an appropriate answer for each of the following questions. 1. What's the passage mainly? A. How to learn online successfully. B. How to set up a learning goal. C. The future of online learning. D. The benefits of online learning. 2. Charles fruitlessly applied for job after job because of the following reasons EXCEPT ________. A. He lacked in qualifications. B. He had no special training. C. He is too old and can't walk. D. He wasn't even able to do office work. 3. Weather has great __________ on our health. A. Communicative. B. Effective. C. Student-centered. D. Teacher-centered.

Here is the answer text:

Task 1: 1. D 2. C 3. C 4. D 5. A

Note: The answers are stored in the web page, but they are not displayed on it.

1.1. Find the target element

The overall idea for extracting the question text: first find all the elements that contain the questions, then get the contents of those elements. Chrome's DevTools (or Firefox's Firebug) shows that the target elements are all the sibling nodes that follow the hr element. cheerio's nextAll function meets this requirement: it returns all the siblings after the current node. The code is as follows:

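var fs = require('fs');
var cheerio = require('cheerio');

// Read the page and parse it into a cheerio object.
var myhtml = fs.readFileSync("a.html");
var $ = cheerio.load(myhtml);

// Find the hr element, then take every sibling that follows it.
var t = $('html').find('hr');
var t2 = t.nextAll();
t2.each(function (i, elem) {
    // getcontent($(this)); // used from section 1.2 onward, once getcontent is defined
    console.log($(this).text());
});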

The page is first read from disk and passed to cheerio.load, which returns a cheerio object (similar to a jQuery object). The find function then locates the hr element by selector, and nextAll returns all the siblings that follow it. Finally, inside each, the text function prints the content of every element that contains the questions.

The output is garbled because the page is encoded in GBK, which Node's fs module cannot decode on its own. Decode the file contents with iconv-lite first. The modified code is as follows:

var fs = require('fs');
var cheerio = require('cheerio');
var iconv = require('iconv-lite');

// Read the raw bytes, decode them from GBK, then parse.
var myhtml = fs.readFileSync("a.html");
var $ = cheerio.load(iconv.decode(myhtml, 'GBK'));

var t = $('html').find('hr');
var t2 = t.nextAll();
t2.each(function (i, elem) {
    // getcontent($(this)); // used from section 1.2 onward, once getcontent is defined
    console.log($(this).text());
});
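To see why the decoding step matters, here is a small sketch that uses only Node.js built-ins rather than iconv-lite. It decodes the GBK bytes of the two characters "中文" once as UTF-8 and once as GBK. The use of TextDecoder with the 'gbk' label is an assumption: it only works on Node.js builds that ship full ICU, so the call is guarded.

```javascript
// GBK bytes for the two characters "中文" (0xD6D0 and 0xCEC4).
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Decoding GBK bytes as UTF-8 yields U+FFFD replacement characters,
// which is what garbled output looks like.
const garbled = gbkBytes.toString('utf8');
console.log(JSON.stringify(garbled));

// Assumption: this Node.js build includes full ICU, so TextDecoder accepts 'gbk'.
// If it does not, the constructor throws and iconv-lite is needed instead.
let decoded = null;
try {
    decoded = new TextDecoder('gbk').decode(gbkBytes);
} catch (err) {
    console.log('GBK is not available in TextDecoder here: ' + err.message);
}
console.log(decoded);
```

In the article's script, iconv.decode does the same job for the whole file before the text reaches cheerio.load.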

The output of the modified script is as follows:

           Task 1: You'll be given ten minutes to read the text for the first time and then                         choose an appropriate answer for each of the              following questions.  1.      What does the "true gratitude" mean?      A.        A-to-life.      B.        A joyous occasion.      C.            A much deeper level of gratitude.      D.               The improvement of the quality of life.   2.      Who have so many things to be grateful for?      A.        A successful man.      B.        A miserable person.      C.                A good-tempered man.      D.All of us.           3. In the sentence "expressing love and gratitude satisfies we deep sense of purpose", the purpose includes all of the following EXCEPT ________.                 A. Spiritual health.               B. Emotional health.                            C. Physical health.                         D. Social health.  4.                      What kind of gift was suitable for mother on Mother's Day according to the author?              A. Carnation.               B. Lily.                            C. Accessories.                   D. Greeting cards.  5. If a friend does you a favor, you can do all of the following EXCEPT ________.                   A. Treat him in a restaurant.             B. Buy ice-cream for him.                     C. Just say "thank you".          D. Ask him to help next time.

The output above has extra spaces and newline characters and looks messy, but at least the content is correct. Running the same check on the HTML files for task2 to task5 also produces the correct content, which shows that the method is feasible. Next, we can focus on the messy formatting.

1.2. Beautify text output

The main problem is the extra spaces and line breaks. One idea is to trim the content of every node (including text nodes), that is, remove all leading and trailing whitespace, and to add a newline character for each br element. This simulates how the browser renders the HTML document (the page displays correctly in the browser, so following the same rules should give the same result). To implement this, we need all the child nodes of an element. cheerio's contents function returns all the children of an element, including text nodes. The string trim function then removes the leading and trailing whitespace. Because children can have children of their own, the function is recursive. The code is as follows:

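// rst is a global string that accumulates the result; set it to '' before the first call.
function getcontent(node) {
    var a = node.contents();
    if (a.length == 0) {
        // Leaf node: a br becomes a newline, anything else contributes its trimmed text.
        if (node.is('br')) {
            rst += '\n';
        } else {
            rst += node.text().trim();
        }
    } else {
        // Process every child recursively.
        node.contents().each(function (i, elem) {
            getcontent($(this));
        });
        // p and tr elements end the current line.
        if (node.is('p') || node.is('tr')) {
            rst += '\n';
        }
    }
}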

The getcontent function collects the text content of an element. It takes an element as input and calls itself recursively. It first calls contents to get all the child nodes. If there are none, the node is a leaf: if it is a br element, a newline character is appended to the result; otherwise text is called to get the node's text, which is trimmed and appended. If there are child nodes, each one is processed recursively, and if the current element is a p or tr element, a line break is appended to the result afterwards. rst is a global variable that holds the result text; set it to an empty string before calling the function. After this processing, the result is as follows:

Task 1:You would be given ten minutes to read the text for the first time and thenchoose an appropriate answer for each of the following questions.1. What does the "true gratitude" mean? A.A. B.A joyous occasion. C.A much deeper level of gratitude. D.The improvement of the quality of life.2. Who have so many things to be grateful for? A.A successful man. B.A miserable person. C.A good-tempered man. D.All of us.3.In the sentence "expressing love and gratitude satisfies we deep sense of purpose", the purpose includes all of the following EXCEPT ________. A.Spiritual health. B.Emotional health. C.Physical health. D.Social health.4. What kind of gift was suitable for mother on Mother's Day according to the author? A.Carnation.B.Lily.C.Accessories.D.Greeting cards.5. If a friend does you a favor, you can do all of the following EXCEPT ________. A.Treat him in a restaurant. B.Buy ice-cream for him. C.Just say "thank you". D.Ask him to help next time.

It looks much cleaner. The question text has been extracted successfully; next comes the answer text.
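The recursion used in this section can also be sketched without cheerio, on a plain object tree. This is a simplified illustration rather than the article's code: the node shape ({ name, data, children }) and the function name collectText are hypothetical, and the function returns its result instead of appending to a global variable.

```javascript
// Hypothetical minimal node shape: text nodes carry `data`,
// element nodes carry `name` and (optionally) `children`.
function collectText(node) {
    const children = node.children || [];
    if (children.length === 0) {
        // Leaf: a br becomes a newline, anything else contributes trimmed text.
        return node.name === 'br' ? '\n' : (node.data || '').trim();
    }
    // Inner node: concatenate the results of all children.
    let out = children.map((child) => collectText(child)).join('');
    // p and tr elements end the current line.
    if (node.name === 'p' || node.name === 'tr') {
        out += '\n';
    }
    return out;
}

// A tiny tree with the kind of stray whitespace seen in the raw output.
const tree = {
    name: 'div',
    children: [
        { name: 'p', children: [{ data: '  1. First question?  ' }] },
        {
            name: 'p',
            children: [
                { data: '\n   A. Yes.  ' },
                { name: 'br' },
                { data: '   B. No.\n' },
            ],
        },
    ],
};

const cleaned = collectText(tree);
console.log(JSON.stringify(cleaned));
```

Returning the string keeps the sketch free of shared state; the article's version reaches the same kind of result by accumulating into rst.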

1.3. Extract the answer text

Searching the HTML source for the answers shows that they are stored in a script element, as follows:

<script language="JavaScript">
  var standardanswer = new Array()
  standardanswer = ["C", "d", "d", "d", "d"]
</script>

The answers to the multiple-choice questions are stored in an array named standardanswer. To produce the answer text, get the code inside the script element, evaluate it with eval to obtain the array, and then generate the answer text from it. The code is as follows:

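var t = $('html').find('script');
var A = undefined;
t.each(function (i, elem) {
    var text = $(this).text();
    // Only the script that mentions standardanswer is of interest.
    if (text.match('standardanswer')) {
        // eval returns the value of the last statement, which is the array.
        var a = eval(text);
        console.log("standardanswer: " + a);
        A = a;
    }
});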

The variable A holds the array of answers. Each script's text is checked for 'standardanswer' to decide whether it is the target code. That code is then passed to eval, which returns the array ["C", "d", "d", "d", "d"]. With the array of answers in hand, generating the answer text is straightforward.
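As an alternative to eval, the array can be pulled out of the script text with a regular expression and parsed as JSON, so that no code from the page is executed. The sketch below is not the article's method. The helper names parseAnswers and formatAnswers are hypothetical, and the shape of the script text, the double-quoted array literal, the upper-casing, and the "Task 1: 1. C 2. D" layout are all assumptions.

```javascript
// Script text shaped like the one found in the page (exact shape is an assumption).
const scriptText =
    'var standardanswer = new Array()\n' +
    'standardanswer = ["C", "d", "d", "d", "d"]';

// Hypothetical helper: find `standardanswer = [ ... ]` and parse the literal as JSON.
// Returns null when the text contains no such assignment.
function parseAnswers(text) {
    const match = /standardanswer\s*=\s*(\[[^\]]*\])/i.exec(text);
    return match ? JSON.parse(match[1]) : null;
}

// Hypothetical helper: build one line of answer text for a task.
function formatAnswers(taskNumber, answers) {
    const items = answers.map(function (answer, i) {
        return (i + 1) + '. ' + answer.toUpperCase();
    });
    return 'Task ' + taskNumber + ': ' + items.join(' ');
}

const answers = parseAnswers(scriptText);
const answerLine = formatAnswers(1, answers);
console.log(answerLine);
```

JSON.parse only accepts double-quoted strings, so a page that used single quotes in the array would need a different parsing step.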

1.4. Final code

You can view the final code here. The final code also fixes some minor problems: for example, the question text for task4 contained superfluous text, and task4's answers were shown inside the question text, leaving no blanks to fill in. The whole analysis and coding process took about 3 hours. The file a.js generates the question text and b.js generates the answer text. The two files share a lot of code (b.js is a modified copy of a.js). This code solves a one-off problem and is not reusable (reusability was not considered while writing it). But most importantly, it solves the problem and it works. It doesn't need to be better than that! In the end, the program was used to process dozens of files and generated the question text and the answer text correctly.

Author: astropeak

Created: 2016-12-18 Sun 16:36
