Handling garbled data in Node.js crawlers
I. Non-UTF-8 page processing
1. Background
Windows-1251 encoding
For example, the Russian site https://vk.com/cciinniikk
Unfortunately, it turned out to use this encoding.
This article mainly covers the problems of converting between Windows-1251 (cp1251) and UTF-8. Other encodings such as GBK are left aside for now.
2. Solution
1.
Use native JavaScript for the conversion
I have not found a complete solution this way yet.
For the opposite direction, UTF-8 to Windows-1251, this approach works: http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript
var DMap = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111, 112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127, 1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154, 1049: 201, 1045: 197, 1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164, 166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205, 176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187, 1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211, 1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217, 1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149, 1071: 223, 1072: 224, 8482: 153, 1073: 225, 8240: 137, 1118: 162, 1074: 226, 1110: 179, 8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159, 1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151, 1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 
142, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134, 1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249, 1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252, 1038: 161, 8220: 147, 1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143, 1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190};

// Convert a Unicode string to a string whose char codes are Windows-1251 byte values.
function UnicodeToWin1251(s) {
    var L = [];
    for (var i = 0; i < s.length; i++) {
        var ord = s.charCodeAt(i);
        if (!(ord in DMap))
            throw "Character " + s.charAt(i) + " isn't supported by win1251!";
        L.push(String.fromCharCode(DMap[ord]));
    }
    return L.join('');
}
This is a good method. DMap stores the mapping from Unicode code points to Windows-1251 byte values.
So my plan was to invert this mapping to convert in the other direction.
However, charCodeAt only works on Unicode strings. How do I get at the code values of text in other encodings? Since I am using Node.js, I looked for a module that does this.
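The idea of inverting the table can be sketched as follows. In Node.js the raw bytes can be read from a Buffer rather than through charCodeAt, and the inverted table maps each byte back to a code point. The names partialMap, invertMap and win1251ToUnicode are made up for this illustration, and partialMap holds only the six DMap entries needed for the sample word, not the full table.

```javascript
// Excerpt of the Unicode -> Windows-1251 table (only the letters of "Привет").
// Assumption: in real use this would be the full DMap from the snippet above.
const partialMap = { 1055: 207, 1088: 240, 1080: 232, 1074: 226, 1077: 229, 1090: 242 };

// Build the inverse table: Windows-1251 byte -> Unicode code point.
function invertMap(map) {
  const inverse = {};
  for (const [unicode, byte] of Object.entries(map)) {
    inverse[byte] = Number(unicode);
  }
  return inverse;
}

// Decode a Buffer of Windows-1251 bytes into a JavaScript (Unicode) string.
function win1251ToUnicode(buf, inverse) {
  let out = '';
  for (const byte of buf) {
    if (byte < 128) {
      // ASCII range is identical in both encodings.
      out += String.fromCharCode(byte);
    } else if (byte in inverse) {
      out += String.fromCharCode(inverse[byte]);
    } else {
      throw new Error('Byte ' + byte + ' is not in the table');
    }
  }
  return out;
}

const inverse = invertMap(partialMap);
const decoded = win1251ToUnicode(Buffer.from([207, 240, 232, 226, 229, 242]), inverse);
console.log(decoded); // prints "Привет"
```

Maintaining such a table by hand is error-prone, which is one reason to prefer an existing module.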
2.
Install and use the Node.js module iconv-lite; for usage instructions see https://www.npmjs.com/package/iconv-lite
Use it as follows.
var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;

// Convert from Windows-1251 to UTF-8.
// str1 should be the data returned by http.get or request. The request must
// pass the encoding: 'binary' option in addition to the basic parameters;
// otherwise the result may be wrong.
// For example, str1 would be the garbled body text received from the server.

// Convert the obtained data to a Buffer, remembering to use the 'binary' format.
// 'binary' can serve as a bridge between the various encodings.
// (Buffer.from replaces the deprecated new Buffer(str1, 'binary').)
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');

// str2 is the converted result. By default it is decoded to a Unicode string,
// which is presumably the original intent of iconv-lite.
console.log(str2);
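Why the 'binary' encoding matters can be checked with Node's built-in Buffer alone. This is a minimal sketch with illustrative variable names: the sample bytes are the Windows-1251 encoding of "Привет", and a round trip through a 'binary' string is compared with a round trip through the default UTF-8 decoding.

```javascript
// Six bytes that spell "Привет" in Windows-1251 (sample data for illustration).
const original = Buffer.from([207, 240, 232, 226, 229, 242]);

// 'binary' (an alias of 'latin1') maps every byte to exactly one character,
// so converting to a string and back loses nothing.
const viaBinary = Buffer.from(original.toString('binary'), 'binary');

// Decoding as UTF-8 replaces each invalid byte with U+FFFD,
// so the original bytes cannot be recovered afterwards.
const viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8');

console.log(viaBinary.equals(original)); // prints true
console.log(viaUtf8.equals(original));   // prints false
```

This is why the body must be requested as binary before it is handed to a converter: once it has been decoded as UTF-8, the bytes are already damaged.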
3.
Install and use the Node.js module iconv; for usage instructions see https://github.com/bnoordhuis/node-iconv
(In essence this means installing node-gyp. I had not read the official instructions carefully beforehand.)
After a simple first attempt, the output was still garbled. For more information see:
http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8
Solution: fetch the data as binary by setting encoding: 'binary' in the request (the default encoding is UTF-8).
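var request = require('request');
var iconv = require('iconv');

request({
    uri: website_url,
    method: 'GET',
    encoding: 'binary'  // return the body as a binary string, not decoded as UTF-8
}, function (error, response, body) {
    // Rebuild the raw bytes from the binary string
    body = Buffer.from(body, 'binary');
    // Convert from Windows-1251 to UTF-8
    var conv = new iconv.Iconv('WINDOWS-1251', 'utf8');
    body = conv.convert(body).toString();
});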
Note also that iconv has some environment dependencies; see the official instructions: https://github.com/TooTallNate/node-gyp
Therefore:
First, Python (such as version 2.7) is required;
Second, build tools are required (most errors occur on Windows).
The error is similar to this:
If no specific or later version is set, Node uses the VS2005 build tools by default (so the fixes suggested by the error message are generally based on VS2005 and Framework SDK 2.0).
Solution:
1. Install Visual Studio 2010.
2. Specify the VS build tools version (for VS2012, use 2012):
npm config set msvs_version 2010 --global
(In some cases the version is picked up automatically and this command is not needed.)
3. If the Framework SDK still cannot be found, add its installation path to the PATH system environment variable.
(2010 corresponds to SDK 4.0; similarly 2008 to SDK 3.5 and 2012 to SDK 4.5?)
Remember that only the first matching entry in the environment variable is used!
For example, if the path of SDK 2.0 is already in the system environment variable, then after you add a path for SDK 4.0, only the first one takes effect.
Therefore, either delete the previous one,
or put the path you want to add in front of it.
II. gzip page processing
Sometimes a page loads normally in the browser, but the response to the simulated request is garbled. Check the response headers of the browser's request: if they include Content-Encoding: gzip, the page is most likely gzip-compressed, and the following option needs to be added to the request:
gzip: true
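What this option does can be approximated with Node's built-in zlib module. In this minimal sketch, decodeBody is a hypothetical helper (not part of request), and the compressed response is simulated locally rather than fetched from a server.

```javascript
const zlib = require('zlib');

// Hypothetical helper: decode a raw response body according to its
// Content-Encoding header, then interpret the bytes as UTF-8 text.
function decodeBody(rawBody, headers) {
  const encoding = (headers['content-encoding'] || '').toLowerCase();
  const bytes = encoding === 'gzip' ? zlib.gunzipSync(rawBody) : rawBody;
  return bytes.toString('utf8');
}

// Simulate a server that sends a gzip-compressed page.
const page = '<html><body>Hello, crawler</body></html>';
const compressed = zlib.gzipSync(Buffer.from(page, 'utf8'));

// Reading the compressed bytes directly as text gives garbage;
// decompressing first restores the page.
const restored = decodeBody(compressed, { 'content-encoding': 'gzip' });
console.log(restored === page); // prints true
```

If the page is both gzip-compressed and in a non-UTF-8 encoding, the body has to be decompressed first and converted second.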