Code for Node.js crawlers to capture data

Source: Internet
Author: User

When cheerio parses a page into a DOM and you read content back out of it:

1. If the .text() method is used, there are no HTML entity encoding problems.

2. If the .html() method is used, entity-encoded characters appear in many cases (mostly for non-English text). In that case, you need to decode them.

Because the data has to be stored, all of it needs to be converted back to plain characters first.

Most of the encoded characters have the format &#(x)?\w+ followed by a semicolon.

So use a regular expression to convert them.
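To make the pattern concrete, here is a small sketch of what its two capture groups pick up. The sample string is a made-up input (not from the original post) containing one decimal and one hexadecimal entity.

```javascript
// Pattern from the text: an optional "x" marks hexadecimal, then the digits.
const entityPattern = /&#(x)?(\w+);/g;

// Hypothetical sample: U+4E2D as a decimal entity, U+6587 as a hexadecimal one.
const sample = "&#20013; and &#x6587;";

const matches = [...sample.matchAll(entityPattern)].map((m) => ({
  entity: m[0],
  radix: m[1] ? 16 : 10, // group 1 is "x" for hexadecimal, undefined otherwise
  digits: m[2],          // group 2 is the number itself, still as text
}));

console.log(matches);
```

The radix chosen from group 1 is what later gets passed to parseInt when the digits are turned into a character code.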

    var body = .... // the data returned by the request, or the output of .html()

    // If necessary, convert to standard Unicode first
    // (add this step when the returned data contains many \u escapes)
    body = unescape(body.replace(/\\u/g, "%u"));

    // Decode the entity characters.
    // If x is present, the number is hexadecimal. $1 indicates whether x matched,
    // $2 is the content captured by the second group, parsed in base 16 or base 10 accordingly.
    body = body.replace(/&#(x)?(\w+);/g, function ($, $1, $2) {
        return String.fromCharCode(parseInt($2, $1 ? 16 : 10));
    });
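The same two steps can also be sketched as standalone functions. This is a variation on the code above under a few assumptions, not the author's exact method. The function names decodeUnicodeEscapes and decodeNumericEntities are hypothetical. The pattern is tightened so that only real decimal or hexadecimal digits match. String.fromCodePoint replaces String.fromCharCode so that characters above U+FFFF survive. The deprecated unescape call is replaced with a direct \uXXXX substitution.

```javascript
// Turn literal "\uXXXX" sequences (backslash, "u", four hex digits) into characters.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// Decode numeric character references such as "&#20013;" or "&#x4E2D;".
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-fA-F]+)|(\d+));/gi, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave the text untouched if the number is not a valid Unicode code point.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

// Hypothetical input mixing decimal, hexadecimal, and \u forms.
const decoded = decodeNumericEntities(
  decodeUnicodeEscapes("&#20013;&#x6587; \\u0041 &#x1F600;")
);
console.log(decoded);
```

Named entities such as &amp; are not handled here. For those, one of the existing conversion libraries is likely a better fit than a hand-written pattern.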

OK ~

Of course, there are many other conversion implementations on the Internet that can be used instead.

Postscript:

When crawling web page data, I usually end up using the cheerio module. It is as convenient and fast as jQuery.

(However, some functions are unsupported or take a different form. For example, jQuery's jQuery('.myclass').prop('outerHTML') is written in cheerio as jQuery.html('.myclass'), where jQuery is the function returned by cheerio.load. See http://www.mgenware.com/blog?p=2514)
