Handling garbled data in Node.js crawlers

Source: Internet
Author: User


I. Non-UTF-8 page processing

1. Background

Windows-1251 encoding

For example, the Russian site https://vk.com/cciinniikk

Embarrassingly, it turned out to use this encoding.

Here we mainly discuss converting between Windows-1251 (cp1251) and UTF-8. Other encodings such as GBK are not covered for now.

2. Solution

1. Use native JavaScript for the conversion

I have not found a native solution for converting Windows-1251 to UTF-8 yet.

For the opposite direction, UTF-8 to Windows-1251, there is this approach: http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript

// Unicode code point -> Windows-1251 byte value
var DMap = {
  0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15,
  16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31,
  32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47,
  48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63,
  64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79,
  80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95,
  96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111,
  112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127,
  1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154,
  1049: 201, 1045: 197, 1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164,
  166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205,
  176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187,
  1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211,
  1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217,
  1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149,
  1071: 223, 1072: 224, 8482: 153, 1073: 225, 8240: 137, 1118: 162, 1074: 226, 1110: 179,
  8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159,
  1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151,
  1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 142, 1086: 238, 1087: 239,
  1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134,
  1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249,
  1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252, 1038: 161, 8220: 147,
  1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143,
  1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190
};

// Returns a string in which each character code is the Windows-1251 byte value
function UnicodeToWin1251(s) {
  var L = [];
  for (var i = 0; i < s.length; i++) {
    var ord = s.charCodeAt(i);
    if (!(ord in DMap))
      throw "Character " + s.charAt(i) + " isn't supported by win1251!";
    L.push(String.fromCharCode(DMap[ord]));
  }
  return L.join('');
}

This is a good approach. DMap stores the mapping from Unicode code points to Windows-1251 byte values.

So I planned to do the conversion by reversing this mapping.

However, charCodeAt only returns the Unicode code units of a JavaScript string. How do I get at the byte values of text in another encoding? Since this is Node.js, I looked for a suitable module.
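The idea of reversing the table can be sketched as follows. This is only an illustration under some assumptions: partialMap is a hypothetical six-entry subset with the same shape as DMap, and invertMap and win1251BytesToString are names made up for this example. The key point is that the raw bytes have to come from a Buffer, not from a string that has already been decoded.

```javascript
// Hypothetical subset of a Unicode -> Windows-1251 table (same shape as DMap).
var partialMap = { 1055: 207, 1088: 240, 1080: 232, 1074: 226, 1077: 229, 1090: 242 };

// Reverse the table: Windows-1251 byte value -> Unicode code point.
function invertMap(map) {
  var inverse = {};
  Object.keys(map).forEach(function (codePoint) {
    inverse[map[codePoint]] = Number(codePoint);
  });
  return inverse;
}

// Decode raw Windows-1251 bytes into a JavaScript string.
function win1251BytesToString(bytes, inverse) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 128) {
      // The ASCII range is identical in both encodings.
      out.push(String.fromCharCode(b));
    } else if (b in inverse) {
      out.push(String.fromCharCode(inverse[b]));
    } else {
      throw new Error('Byte ' + b + ' is not in the table');
    }
  }
  return out.join('');
}

// The bytes of "Привет" in Windows-1251.
var bytes = Buffer.from([207, 240, 232, 226, 229, 242]);
console.log(win1251BytesToString(bytes, invertMap(partialMap))); // Привет
```

With the full table the same two functions would cover the whole code page, but a maintained module is less error-prone than a hand-written table.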

2. Install the Node.js module iconv-lite; for usage instructions see https://www.npmjs.com/package/iconv-lite

Use it as follows.

var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;

// Convert Windows-1251 encoded data to UTF-8.
// str1 is the data returned by http.get or request. Besides the basic
// parameters, remember to pass encoding: 'binary' with the request,
// otherwise the result may be wrong. At this point str1 still looks garbled.

// Turn the data into a Buffer, using 'binary' as the format:
// 'binary' maps each character to exactly one byte, so it works for any encoding.
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');

// str2 is the converted text, an ordinary (Unicode) JavaScript string.
console.log(str2);

3. Install and use the Node.js module iconv; for usage instructions see https://github.com/bnoordhuis/node-iconv

(In practice this mostly comes down to getting node-gyp to work. I had not read the official instructions carefully beforehand.)

After a simple first attempt, the output was still garbled. For more information, see:

http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8

Solution: request the data with encoding: 'binary' (the default encoding is UTF-8), then convert it:

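var request = require('request');
var iconv = require('iconv');

request({
  uri: website_url,
  method: 'GET',
  encoding: 'binary'
}, function (error, response, body) {
  // body is a 'binary' string: one character per original byte
  var buf = Buffer.from(body, 'binary');
  var conv = new iconv.Iconv('WINDOWS-1251', 'utf8');
  body = conv.convert(buf).toString();
});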
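Why encoding: 'binary' matters can be seen with Buffer alone. The following is a minimal sketch that uses the Windows-1251 bytes of "Привет" as a stand-in for a response body; the variable names are made up for the illustration. Decoding the bytes as UTF-8 replaces every invalid sequence with U+FFFD, so the original bytes cannot be recovered, while 'binary' (an alias of latin1) maps each byte to one character and back without loss.

```javascript
// The bytes of "Привет" in Windows-1251, standing in for a response body.
var original = Buffer.from([207, 240, 232, 226, 229, 242]);

// Default behaviour: interpret the bytes as UTF-8. The sequences are invalid,
// so they are replaced and the original bytes are lost.
var asUtf8 = original.toString('utf8');
var backFromUtf8 = Buffer.from(asUtf8, 'utf8');

// 'binary' keeps a one-to-one mapping between bytes and characters.
var asBinary = original.toString('binary');
var backFromBinary = Buffer.from(asBinary, 'binary');

console.log(backFromUtf8.equals(original));   // false
console.log(backFromBinary.equals(original)); // true
```

Only the second Buffer can still be handed to a converter such as iconv or iconv-lite.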

Note also that iconv has some environment dependencies; see the official node-gyp instructions: https://github.com/TooTallNate/node-gyp

Therefore:

First, Python (such as version 2.7) is required;

Second, build tools are required (most errors occur on Windows).

A build error may appear at this point. If no specific or later version is configured, the VS2005 build tools are used by default, which is why the error message usually refers to VS2005 and Framework SDK 2.0.

Solution:

1. Install Visual Studio 2010

2. Specify the VS build tools version (for VS2012, use 2012)

(In some cases the version is detected automatically, so the command npm config set msvs_version 2010 --global is not always needed.)

3. If the Framework SDK still cannot be found, add its installation path to the system PATH environment variable.

(2010 corresponds to SDK 4.0; similarly 2008 to SDK 3.5 and 2012 to SDK 4.5?)

Remember that only the first matching entry in the environment variable is used!

For example, if the path of SDK 2.0 is already in the system environment variable and you then add the path of SDK 4.0 after it, only the first one takes effect.

Therefore, either delete the earlier entry or put the path you want to add in front of it.

II. gzip page processing

Sometimes the browser displays a page normally, but the response to a simulated request is garbled. Check the response headers of the browser request. If they contain Content-Encoding: gzip, the page is most likely gzip-compressed, and the following parameter needs to be added to the request:

gzip: true
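To see why a compressed body looks garbled, and what decompression has to do, here is a minimal sketch using Node's built-in zlib module. The page content is made up, and looksGzipped and decodeBody are hypothetical helper names for this illustration; they are not part of request.

```javascript
var zlib = require('zlib');

// Stand-in for a page served with Content-Encoding: gzip (made-up content).
var html = '<html><body>Привет</body></html>';
var compressed = zlib.gzipSync(Buffer.from(html, 'utf8'));

// Reading the compressed bytes directly as text gives garbage.
var garbled = compressed.toString('utf8');

// gzip data starts with the magic bytes 0x1f 0x8b.
function looksGzipped(buf) {
  return buf.length > 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Decompress if needed, then decode the text as UTF-8.
function decodeBody(buf) {
  var raw = looksGzipped(buf) ? zlib.gunzipSync(buf) : buf;
  return raw.toString('utf8');
}

console.log(garbled === html);                // false
console.log(decodeBody(compressed) === html); // true
```

Checking the magic bytes is only a fallback for this sketch; in a real crawler the Content-Encoding response header is the more reliable signal.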

The above is all the content of this article. I hope you will like it.
