Code for Node.js crawlers to capture data.
When cheerio parses the captured page into a DOM:
1. If the .text() method is used, there are no HTML entity encoding problems.
2. If the .html() method is used, entity encoding shows up in many cases (most of them involving non-English text), and you may need to unescape it.
This matters because the data needs to be stored, and all of it has to be decoded first.
The code is as follows:
The captured data contains many entity-encoded characters, most of them in the &#(x)?\w+; format, so a regular expression is used to convert them.
var body = ...; // the data returned by the request, or the result of .html() above

// Convert \u escapes to the standard %u format first, if necessary
// (i.e. when the returned data contains many \u sequences)
body = unescape(body.replace(/\\u/g, "%u"));

// Unescape the entity characters. If "x" is present the entity is hexadecimal:
// $1 captures the optional "x", $2 captures the code point digits,
// which are parsed base 16 or base 10 accordingly.
body = body.replace(/&#(x)?(\w+);/g, function ($, $1, $2) {
    return String.fromCharCode(parseInt($2, $1 ? 16 : 10));
});
OK ~
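For reference, here is the same conversion wrapped in a small helper so it can be tried on a sample string (the function name decodeEntities is mine, not from the original code):

```javascript
// Hypothetical helper wrapping the two replace() steps shown above.
function decodeEntities(body) {
    // turn literal \uXXXX escapes into %uXXXX so unescape() can decode them
    body = unescape(body.replace(/\\u/g, "%u"));
    // decode &#NNNN; (decimal) and &#xNNNN; (hexadecimal) numeric entities
    return body.replace(/&#(x)?(\w+);/g, function ($, $1, $2) {
        return String.fromCharCode(parseInt($2, $1 ? 16 : 10));
    });
}

console.log(decodeEntities("&#x4e2d;&#25991;")); // prints "中文"
console.log(decodeEntities("title: \\u4e2d"));   // prints "title: 中"
```

Note that unescape() is deprecated but still available globally in Node.js; for code points outside the Basic Multilingual Plane, String.fromCodePoint would be needed instead of String.fromCharCode.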
Of course, there are also many other conversion implementations on the Internet that can be used instead.
Postscript:
When crawling web page data, the cheerio module is what often ends up being used; it is as convenient and fast as jQuery.
(But some functions are not supported, or take a different form. For example, jQuery's jQuery('.myclass').prop('outerHTML') is written in cheerio as $.html('.myclass'). See http://www.mgenware.com/blog?p=2514)