Learning Node.js: using cheerio to scrape web page data

Source: Internet
Author: User

I plan to build an open-course website. Since I have no data for it yet, I decided to scrape some from the NetEase open-course website. I had been reading about Node.js for a while, and it is well suited to this kind of task, so I decided to use it to capture the data. The key question is how to extract the desired data once a page has been downloaded. I found that cheerio makes parsing HTML very convenient; it feels like using jQuery in a browser.

Install cheerio with the following command:

    npm install cheerio

First, here is some JavaScript that downloads the content of any web page. Save it as curl.js; it exports a download function.

    var http = require("http");

    // Utility function that downloads a URL and invokes
    // callback with the data.
    function download(url, callback) {
      http.get(url, function (res) {
        var data = "";
        res.on("data", function (chunk) {
          data += chunk;
        });
        res.on("end", function () {
          callback(data);
        });
      }).on("error", function () {
        callback(null);
      });
    }

    exports.download = download;

Next, use cheerio to parse the HTML and find the desired data. Let's analyze the page first. We want to capture the video download addresses. The HTML for one of the download buttons looks like this:

    <a class="downbtn" href="http://mov.bn.netease.com/mobilev/2013/1/F/G/S8KTEF7FG.mp4" id="M8KTEKR84" target="_blank"></a>

To get the href attribute, select the element and read the attribute (this returns the value for the first matching element):

    $(".downbtn").attr("href");

In practice, write the following code in index.js:

    var cheerio = require("cheerio");
    var server = require("./curl");

    var url = "http://v.163.com/special/opencourse/englishs1.html";

    server.download(url, function (data) {
      if (data) {
        // console.log(data);
        var $ = cheerio.load(data);
        $(".downbtn").each(function (i, e) {
          console.log($(e).attr("href"));
        });
        console.log("done");
      } else {
        console.log("error");
      }
    });

Run node index.js, and all the video addresses on the page are printed to the console.
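
The curl.js helper above appends raw chunks to a string and reports every failure as null, so the caller cannot tell a network error from a 404. As a minimal sketch of a sturdier variant, the version below picks the http or https module from the URL scheme, rejects on non-200 responses, and decodes the body as UTF-8. The clientFor helper and the Promise-based download are illustrative names, not part of the original article, and the UTF-8 decoding is an assumption that may not hold for every page.

```javascript
const http = require("http");
const https = require("https");

// Hypothetical helper: choose the client module from the URL scheme.
function clientFor(url) {
  return new URL(url).protocol === "https:" ? https : http;
}

// Promise-based variant of download(): resolves with the page body,
// rejects on network errors and on any status code other than 200.
function download(url) {
  return new Promise(function (resolve, reject) {
    clientFor(url)
      .get(url, function (res) {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error("Unexpected status " + res.statusCode + " for " + url));
          return;
        }
        // Assumption: the page is UTF-8. setEncoding also keeps multi-byte
        // characters intact when they are split across chunks.
        res.setEncoding("utf8");
        let data = "";
        res.on("data", function (chunk) {
          data += chunk;
        });
        res.on("end", function () {
          resolve(data);
        });
      })
      .on("error", reject);
  });
}

exports.download = download;
```

With this variant, the scraping script would call download(url).then(...) and handle failures in .catch(...), where the error message says what went wrong.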

Appendix: cheerio API reference

Markup example we will use:

    <ul id="fruits">
      <li class="apple">Apple</li>
      <li class="orange">Orange</li>
      <li class="pear">Pear</li>
    </ul>

This is the HTML markup used in all of the API examples.

Loading

First you need to load the HTML. In jQuery this step is implicit, since jQuery operates on the one, baked-in DOM. With cheerio, we need to pass in the HTML document.

This is the preferred method:

    var cheerio = require('cheerio'),
        $ = cheerio.load('<ul id="fruits">...</ul>');

Alternatively, load the HTML by passing the string as the context:

    $ = require('cheerio');
    $('ul', '<ul id="fruits">...</ul>');

Or as the root:

    $ = require('cheerio');
    $('li', 'ul', '<ul id="fruits">...</ul>');

You can also pass an extra object to .load() if you need to change any of the default parsing options:

    $ = cheerio.load('<ul id="fruits">...</ul>', {
      ignoreWhitespace: true,
      xmlMode: true
    });

These parsing options are taken directly from htmlparser, so any option that is valid in htmlparser also works in cheerio. The defaults in the version documented here are:

    {
      ignoreWhitespace: false,
      xmlMode: false,
      lowerCaseTags: false
    }

Selectors

cheerio's selector implementation is nearly identical to jQuery's, so the API is very similar.

$(selector, [context], [root])

The selector searches within the context scope, and the context searches within the root scope. selector and context can each be a string, a DOM element, an array of DOM elements, or a cheerio object. root is typically the HTML document string.

    $('.apple', '#fruits').text()
    //=> Apple

    $('ul .pear').attr('class')
    //=> pear

    $('li[class=orange]').html()
    //=> Orange

Attributes

Methods for getting and modifying attributes. In the examples below, the //=> comment after a chained .html() call shows the resulting element for illustration; .html() itself returns the element's inner HTML.

.attr(name, value)

Gets or sets attributes. When getting, only the attribute value of the first matched element is returned. If you set an attribute's value to null, the attribute is removed. You can also pass a key-value map or a function.

    $('ul').attr('id')
    //=> fruits

    $('.apple').attr('id', 'favorite').html()
    //=> <li class="apple" id="favorite">Apple</li>

.val([value])

Gets or sets the value of input, select, and textarea elements. Note: support for passing a map or a function has not been added yet.

    $('input[type="text"]').val()
    //=> input_text

    $('input[type="text"]').val('test').html()
    //=> <input type="text" value="test"/>

.removeAttr(name)

Removes an attribute by name.

    $('.pear').removeAttr('class').html()
    //=> <li>Pear</li>

.hasClass(className)

Checks whether any of the matched elements has the given class name.

    $('.pear').hasClass('pear')
    //=> true

    $('.apple').hasClass('fruit')
    //=> false

    $('li').hasClass('pear')
    //=> true

.addClass(className)

Adds one or more classes to all of the matched elements. A function can also be passed.

    $('.pear').addClass('fruit').html()
    //=> <li class="pear fruit">Pear</li>

    $('.apple').addClass('fruit red').html()
    //=> <li class="apple fruit red">Apple</li>

.removeClass([className])

Removes one or more space-separated classes from the selected elements. If no className is given, all classes are removed. A function can also be passed.

    $('.pear').removeClass('pear').html()
    //=> <li class="">Pear</li>

    $('.apple').addClass('red').removeClass().html()
    //=> <li class="">Apple</li>

.toggleClass(className, [switch])

Adds or removes classes depending on whether each class is already present, or on the value of the switch argument.

    $('.apple.green').toggleClass('fruit green red').html()
    //=> <li class="apple fruit red">Apple</li>

    $('.apple.green').toggleClass('fruit green red', true).html()
    //=> <li class="apple green fruit red">Apple</li>

.is(selector)
.is(element)
.is(selection)
.is(function(index))

Returns true if any of the elements match the selector, element, or selection. If a predicate function is used, it is executed in the context of each selected element, so this refers to the current element.

Traversing

.find(selector)

Gets the descendants of the matched elements, filtered by the selector.

    $('#fruits').find('li').length
    //=> 3

.parent([selector])

Gets the parent of each matched element, optionally filtered by a selector.

    $('.pear').parent().attr('id')
    //=> fruits

.parents([selector])

Gets the set of ancestors of the matched elements, optionally filtered by a selector.

    $('.orange').parents().length
    //=> 2

    $('.orange').parents('#fruits').length
    //=> 1

.closest([selector])

For each element in the set, gets the first element that matches the selector, testing the element itself and then moving up through its ancestors.

    $('.orange').closest()
    //=> []

    $('.orange').closest('.apple')
    //=> []

    $('.orange').closest('li')
    //=> [<li class="orange">Orange</li>]

    $('.orange').closest('#fruits')
    //=> [<ul id="fruits"> ... </ul>]

.next()

Gets the next sibling of the first selected element.

    $('.apple').next().hasClass('orange')
    //=> true

.nextAll()

Gets all the siblings that follow the element.

    $('.apple').nextAll()
    //=> [<li class="orange">Orange</li>, <li class="pear">Pear</li>]

.prev()

Gets the previous sibling of the first selected element.

    $('.orange').prev().hasClass('apple')
    //=> true

.prevAll()

Gets all the siblings that precede the element.

    $('.pear').prevAll()
    //=> [<li class="orange">Orange</li>, <li class="apple">Apple</li>]

.slice(start, [end])

Gets the elements within the given range of indices.

    $('li').slice(1).eq(0).text()
    //=> 'Orange'

    $('li').slice(1, 2).length
    //=> 1

.siblings(selector)

Gets the siblings of the selected element, excluding the element itself.

    $('.pear').siblings().length
    //=> 2

    $('.pear').siblings('.orange').length
    //=> 1

.children(selector)

Gets the children of the selected element.

    $('#fruits').children().length
    //=> 3

    $('#fruits').children('.pear').text()
    //=> Pear

.each(function(index, element))

Iterates over a cheerio object, executing a function for each matched element.
When the callback is fired, the function runs in the context of the DOM element, so this refers to the current element, which is equivalent to the function parameter element. To break out of the loop early, return false.

    var fruits = [];

    $('li').each(function (i, elem) {
      fruits[i] = $(this).text();
    });

    fruits.join(', ');
    //=> Apple, Orange, Pear

.map(function(index, element))

Iterates over a cheerio object, executing a function for each matched element. In the version documented here, map returns an array of the iteration results. The function runs in the context of the DOM element, so this refers to the current element, which is equivalent to the function parameter element.

    $('li').map(function (i, el) {
      // this === el
      return $(this).attr('class');
    }).join(', ');
    //=> apple, orange, pear

Note that in current cheerio releases .map() returns a cheerio object, as in jQuery, so call .get() on the result before .join().

.filter(selector)
.filter(function(index))

Iterates over a cheerio object, reducing the set to the elements that match the selector or pass the function's test. If the function form is used, the function is executed in the context of each selected element, so this refers to the current element.

Selector:

    $('li').filter('.orange').attr('class');
    //=> orange

Function:

    $('li').filter(function (i, el) {
      // this === el
      return $(this).attr('class') === 'orange';
    }).attr('class')
    //=> orange

.first()

Selects the first element of a cheerio object.

    $('#fruits').children().first().text()
    //=> Apple

.last()

Selects the last element of a cheerio object.

    $('#fruits').children().last().text()
    //=> Pear

.eq(i)

Reduces the set of matched elements to the one at the given index. Use .eq(-i) to count backwards from the last selected element.

    $('li').eq(0).text()
    //=> Apple

    $('li').eq(-1).text()
    //=> Pear

Manipulation

Methods for changing the DOM structure.

.append(content, [content, ...])

Inserts content as the last child of each selected element.

    $('ul').append('<li class="plum">Plum</li>')
    $.html()
    //=>  <ul id="fruits">
    //      <li class="apple">Apple</li>
    //      <li class="orange">Orange</li>
    //      <li class="pear">Pear</li>
    //      <li class="plum">Plum</li>
    //    </ul>

.prepend(content, [content, ...])

Inserts content as the first child of each selected element.

    $('ul').prepend('<li class="plum">Plum</li>')
    $.html()
    //=>  <ul id="fruits">
    //      <li class="plum">Plum</li>
    //      <li class="apple">Apple</li>
    //      <li class="orange">Orange</li>
    //      <li class="pear">Pear</li>
    //    </ul>

.after(content, [content, ...])

Inserts content after each matched element.

    $('.apple').after('<li class="plum">Plum</li>')
    $.html()
    //=>  <ul id="fruits">
    //      <li class="apple">Apple</li>
    //      <li class="plum">Plum</li>
    //      <li class="orange">Orange</li>
    //      <li class="pear">Pear</li>
    //    </ul>

.before(content, [content, ...])

Inserts content before each matched element.

    $('.apple').before('<li class="plum">Plum</li>')
    $.html()
    //=>  <ul id="fruits">
    //      <li class="plum">Plum</li>
    //      <li class="apple">Apple</li>
    //      <li class="orange">Orange</li>
    //      <li class="pear">Pear</li>
    //    </ul>

.remove([selector])

Removes the matched elements and all of their children from the DOM. The selector filters the set of elements to be removed.

    $('.pear').remove()
    $.html()
    //=>  <ul id="fruits">
    //      <li class="apple">Apple</li>
    //      <li class="orange">Orange</li>
    //    </ul>

.replaceWith(content)

Replaces the matched elements with content.

    var plum = $('<li class="plum">Plum</li>')
    $('.pear').replaceWith(plum)
    $.html()
    //=>  <ul id="fruits">
    //      <li class="apple">Apple</li>
    //      <li class="orange">Orange</li>
    //      <li class="plum">Plum</li>
    //    </ul>

.empty()

Empties an element, removing all of its children.

    $('ul').empty()
    $.html()
    //=> <ul id="fruits"></ul>

.html([htmlString])

Gets the HTML content string of the first selected element. If htmlString is given, it replaces the content of each selected element.

    $('.orange').html()
    //=> Orange

    $('#fruits').html('<li class="mango">Mango</li>').html()
    //=> <li class="mango">Mango</li>

.text([textString])

Gets the combined text content of the elements in the set, including their descendants. If textString is given, it replaces the text content of each selected element.

    $('.orange').text()
    //=> Orange

    $('ul').text()
    //=>  Apple
    //    Orange
    //    Pear

Rendering

When you are ready to render the document, use the html utility function:

    $.html()
    //=>  <ul id="fruits">
    //      <li class="apple">Apple</li>
    //      <li class="orange">Orange</li>
    //      <li class="pear">Pear</li>
    //    </ul>

If you want the outerHTML of a selection, use $.html(selector):

    $.html('.pear')
    //=> <li class="pear">Pear</li>

By default, html will leave some tags open. Sometimes you may instead want to render a valid XML document. For example, you might parse the following XML snippet:

    $ = cheerio.load('<media:thumbnail url="http://www.foo.com/keyframe.jpg" width="75" height="50" time="12:05:01.123"/>');

... and later want to render it as XML. To do this, use the xml utility function:

    $.xml()
    //=> <media:thumbnail url="http://www.foo.com/keyframe.jpg" width="75" height="50" time="12:05:01.123"/>

Miscellaneous

DOM element methods that do not fit anywhere else.

.toArray()

Retrieves all the DOM elements in the set as an array.

    $('li').toArray()
    //=> [ {...}, {...}, {...} ]

.clone()

Clones the cheerio object.

    var moreFruit = $('#fruits').clone()

Utilities

$.root

Sometimes you need to work with the top-level root element. To query it, use $.root().

    $.root().append('<ul id="vegetables"></ul>').html();
    //=> <ul id="fruits">...</ul><ul id="vegetables"></ul>

$.contains(container, contained)

Checks whether the contained element is a descendant of the container element.

$.parseHTML(data [, context] [, keepScripts])

Parses a string into an array of DOM nodes. The context argument has no meaning for cheerio, but it is kept for API compatibility.
