Summary: fixing garbled data when crawling with a Node.js crawler

Source: Internet
Author: User

One, processing pages that are not UTF-8

1. Background

The page is encoded in windows-1251.

For example, this Russian website: https://vk.com/cciinniikk

Embarrassingly, it turned out to use this encoding.

This section is therefore mainly about converting between windows-1251 (cp1251) and UTF-8; other encodings such as GBK are not considered for now.
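Before converting anything, a crawler has to know which encoding a page declares. The following is a minimal sketch, not part of the original article: `detectCharset` is a hypothetical helper that looks at the Content-Type header first and then at a `<meta>` tag, and the UTF-8 fallback is an assumption.

```javascript
// Hypothetical helper: guess a page's charset from the HTTP header, then from a <meta> tag.
// `contentType` is the Content-Type header value (may be undefined);
// `head` is the start of the body read as a 'binary' (latin1) string.
function detectCharset(contentType, head) {
  var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '');
  if (fromHeader) return fromHeader[1].toLowerCase();
  var fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head || '');
  if (fromMeta) return fromMeta[1].toLowerCase();
  return 'utf-8'; // assumption: fall back to UTF-8 when nothing is declared
}

console.log(detectCharset('text/html; charset=windows-1251', ''));
console.log(detectCharset('text/html', '<meta charset="UTF-8">'));
```

Pages do not always declare their encoding correctly, so the result is best treated as a hint.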

2. The solution

1.

Converting with native JS code

I haven't found a way to do this yet.

Going the other way, from UTF-8 to windows-1251, is possible; see http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript.

// Maps a Unicode code point (key) to its windows-1251 byte value (value).
var DMap = {
  0:0, 1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10, 11:11, 12:12, 13:13, 14:14, 15:15,
  16:16, 17:17, 18:18, 19:19, 20:20, 21:21, 22:22, 23:23, 24:24, 25:25, 26:26, 27:27, 28:28, 29:29, 30:30, 31:31,
  32:32, 33:33, 34:34, 35:35, 36:36, 37:37, 38:38, 39:39, 40:40, 41:41, 42:42, 43:43, 44:44, 45:45, 46:46, 47:47,
  48:48, 49:49, 50:50, 51:51, 52:52, 53:53, 54:54, 55:55, 56:56, 57:57, 58:58, 59:59, 60:60, 61:61, 62:62, 63:63,
  64:64, 65:65, 66:66, 67:67, 68:68, 69:69, 70:70, 71:71, 72:72, 73:73, 74:74, 75:75, 76:76, 77:77, 78:78, 79:79,
  80:80, 81:81, 82:82, 83:83, 84:84, 85:85, 86:86, 87:87, 88:88, 89:89, 90:90, 91:91, 92:92, 93:93, 94:94, 95:95,
  96:96, 97:97, 98:98, 99:99, 100:100, 101:101, 102:102, 103:103, 104:104, 105:105, 106:106, 107:107, 108:108, 109:109, 110:110, 111:111,
  112:112, 113:113, 114:114, 115:115, 116:116, 117:117, 118:118, 119:119, 120:120, 121:121, 122:122, 123:123, 124:124, 125:125, 126:126, 127:127,
  1027:129, 8225:135, 1046:198, 8222:132, 1047:199, 1168:165, 1048:200, 1113:154, 1049:201, 1045:197, 1050:202, 1028:170,
  160:160, 1040:192, 1051:203, 164:164, 166:166, 167:167, 169:169, 171:171, 172:172, 173:173, 174:174, 1053:205,
  176:176, 177:177, 1114:156, 181:181, 182:182, 183:183, 8221:148, 187:187, 1029:189, 1056:208, 1057:209, 1058:210,
  8364:136, 1112:188, 1115:158, 1059:211, 1060:212, 1030:178, 1061:213, 1062:214, 1063:215, 1116:157, 1064:216, 1065:217,
  1031:175, 1066:218, 1067:219, 1068:220, 1069:221, 1070:222, 1032:163, 8226:149, 1071:223, 1072:224, 8482:153, 1073:225,
  8240:137, 1118:162, 1074:226, 1110:179, 8230:133, 1075:227, 1033:138, 1076:228, 1077:229, 8211:150, 1078:230, 1119:159,
  1079:231, 1042:194, 1080:232, 1034:140, 1025:168, 1081:233, 1082:234, 8212:151, 1083:235, 1169:180, 1084:236, 1052:204,
  1085:237, 1035:142, 1086:238, 1087:239, 1088:240, 1089:241, 1090:242, 1036:141, 1041:193, 1091:243, 1092:244, 8224:134,
  1093:245, 8470:185, 1094:246, 1054:206, 1095:247, 1096:248, 8249:139, 1097:249, 1098:250, 1044:196, 1099:251, 1111:191,
  1055:207, 1100:252, 1038:161, 8220:147, 1101:253, 8250:155, 1102:254, 8216:145, 1103:255, 1043:195, 1105:184, 1039:143,
  1026:128, 1106:144, 8218:130, 1107:131, 8217:146, 1108:186, 1109:190
};

// Returns a string in which each character's code is the windows-1251 byte value.
function UnicodeToWin1251(s) {
  var L = [];
  for (var i = 0; i < s.length; i++) {
    var ord = s.charCodeAt(i);
    if (!(ord in DMap))
      throw "Character " + s.charAt(i) + " isn't supported by win1251!";
    L.push(String.fromCharCode(DMap[ord]));
  }
  return L.join('');
}

That is a good idea: DMap simply stores the mapping between Unicode code points and windows-1251 byte values.

So I planned to apply the same mapping in reverse.

However, charCodeAt only returns Unicode (UTF-16) code units, so how would one get at the byte values of text in another encoding? Since this is Node.js, the next step was to look for a suitable module.
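One way around the charCodeAt limitation is to work on the raw bytes in a Buffer rather than on a string. The sketch below is an illustration, not the author's method: `decodeWin1251Basic` is a hypothetical name, and it only covers ASCII, the contiguous range 0xC0-0xFF (which the map above pairs with U+0410 to U+044F), and Ё/ё. A complete decoder would invert the whole map.

```javascript
// Decode windows-1251 bytes into a JavaScript string without any module.
// Only ASCII, the basic Russian alphabet (0xC0-0xFF) and Ё/ё are handled here;
// every other byte becomes U+FFFD in this sketch.
function decodeWin1251Basic(buf) {
  var out = '';
  for (var i = 0; i < buf.length; i++) {
    var b = buf[i];
    if (b < 0x80) out += String.fromCharCode(b);                          // ASCII is unchanged
    else if (b >= 0xC0) out += String.fromCharCode(0x0410 + (b - 0xC0)); // А..я
    else if (b === 0xA8) out += '\u0401';                                 // Ё
    else if (b === 0xB8) out += '\u0451';                                 // ё
    else out += '\uFFFD';                                                 // not covered here
  }
  return out;
}

// The bytes of "Привет" in windows-1251.
var sample = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);
console.log(decodeWin1251Basic(sample));
```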

2.

Install and use the Node.js module iconv-lite; for usage instructions see https://www.npmjs.com/package/iconv-lite

Usage should look roughly like this:

var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;
// Convert windows-1251 encoded data to UTF-8.
// str1 should be the data returned by http.get, or by request called with the
// right parameters; otherwise the result will be wrong.
// Besides the basic parameters, remember to pass encoding: 'binary'.
// For example (placeholder text; real data would be the 'binary' string from the response):
var str1 = 'ценностинив';
// Turn the fetched data into a Buffer, and remember to use the 'binary' encoding,
// which passes the bytes straight through.
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');
// str2 is the decoded result: iconv.decode returns an ordinary JavaScript (Unicode)
// string, which is presumably what iconv-lite is designed for.
console.log(str2);
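As an alternative that needs no extra module, a sketch using the built-in TextDecoder is shown below. It assumes the Node.js runtime was built with full ICU, which is what makes the 'windows-1251' label available; `decodeWithTextDecoder` is a hypothetical helper that returns null when the label is not supported.

```javascript
// Assumption: this Node.js build includes full ICU, so the global TextDecoder
// recognises the 'windows-1251' label. Returns null when it does not.
function decodeWithTextDecoder(buf) {
  try {
    return new TextDecoder('windows-1251').decode(buf);
  } catch (err) {
    return null; // encoding (or TextDecoder itself) not available in this runtime
  }
}

// The bytes of "Привет" in windows-1251.
var bytes = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);
console.log(decodeWithTextDecoder(bytes));
```

Because this works on a Buffer directly, the response body should be collected as raw bytes rather than as a string.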

3.

Install and use the Node.js module iconv; for usage instructions see https://github.com/bnoordhuis/node-iconv

(In practice the key step is getting node-gyp installed; I did not read the official instructions carefully.)

With naive usage the output is still garbled, for example: пїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕ

See http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8

The workaround is to read the data as binary by setting encoding: 'binary' (the default encoding is UTF-8):

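var request = require('request');
var iconv = require('iconv');

request({
  uri: website_url,
  method: 'GET',
  encoding: 'binary'   // keep the raw bytes instead of decoding them as UTF-8
}, function (error, response, body) {
  // Rebuild the original bytes, then convert them from windows-1251 to UTF-8.
  body = Buffer.from(body, 'binary');
  var conv = new iconv.Iconv('WINDOWS-1251', 'UTF8');
  body = conv.convert(body).toString();
});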
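To see why the 'binary' encoding matters, the short sketch below round-trips a few windows-1251 bytes through a string in two ways. It uses only the built-in Buffer; the sample bytes are an illustration.

```javascript
// The bytes of "Привет" in windows-1251; they are not valid UTF-8.
var raw = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);

// Default behaviour: bytes are decoded as UTF-8 and invalid sequences become U+FFFD.
var viaUtf8 = Buffer.from(raw.toString('utf8'), 'utf8');

// 'binary' (an alias of 'latin1') maps each byte to one character and back.
var viaBinary = Buffer.from(raw.toString('binary'), 'binary');

console.log(viaUtf8.equals(raw));   // false: the original bytes are gone
console.log(viaBinary.equals(raw)); // true: the bytes survive the round trip
```

Once the body has been decoded as UTF-8 the original bytes cannot be recovered, so no later conversion can repair the text.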

One more thing: iconv has some build-environment dependencies; see the official notes: https://github.com/TooTallNate/node-gyp

So:

First, a matching Python version (2.7) is required;

Second, a compiler toolchain is required (this is where most errors occur on Windows).

Unless a specific version is set, the VS2005 build tools are used by default, so the typical error complains about VS2005 and Framework SDK 2.0 not being found.

How to fix it:

1. Install Visual Studio 2010

2. Specify the VS build tool version (2010 here; for VS2012 use 2012)

(Sometimes it is detected automatically, so this command is not always needed: npm config set msvs_version 2010 --global)

3. If you are still told that the Framework SDK cannot be found, add its installation path to the system PATH environment variable

(VS2010 corresponds to SDK 4.0; similarly VS2008 would be SDK 3.5, and SDK 4.5 for the next version?)

Also remember that only the first matching entry in the environment variable is used!

For example, if an SDK 2.0 path is already in the system PATH and you now add an SDK 4.0 path after it, only the first one takes effect.

So:

Either delete the earlier entry,

or put the path you are adding at the front.
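To check which entry comes first, the PATH value can be split and filtered in order. This is a small sketch under stated assumptions: `matchingPathEntries` is a hypothetical helper and the example value is made up for illustration. In practice you would pass `process.env.PATH`.

```javascript
var path = require('path');

// List the PATH entries containing `needle`, in lookup order (the first one wins).
function matchingPathEntries(pathValue, needle, delimiter) {
  return pathValue
    .split(delimiter || path.delimiter)
    .filter(function (entry) {
      return entry.toLowerCase().indexOf(needle.toLowerCase()) !== -1;
    });
}

// Hypothetical Windows-style value, for illustration only.
var example = 'C:\\Tools;C:\\SDK\\v2.0\\bin;C:\\SDK\\v4.0\\bin';
console.log(matchingPathEntries(example, 'sdk', ';'));
```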

Two, processing gzip-compressed pages

Sometimes a page displays correctly in the browser, but the simulated request returns garbled data. Check the response headers in the browser: if they include Content-Encoding: gzip, the page was most likely gzip-compressed, and the following parameter needs to be added to the request:

gzip: true
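When the request is made without that option, for example with the plain http module, the body can be decompressed by hand. The following is a minimal sketch using Node's built-in zlib; `decodeBody` is a hypothetical helper, and it assumes the body was collected as a Buffer.

```javascript
var zlib = require('zlib');

// Return the raw page bytes, undoing gzip when the response header says so.
// `body` must be a Buffer holding the undecoded response; `contentEncoding` is
// the value of the Content-Encoding response header (may be undefined).
function decodeBody(body, contentEncoding) {
  if (/\bgzip\b/i.test(contentEncoding || '')) {
    return zlib.gunzipSync(body);
  }
  return body;
}

// Simulate a compressed response body for the demonstration.
var compressed = zlib.gzipSync(Buffer.from('hello, crawler'));
console.log(decodeBody(compressed, 'gzip').toString());
```

Decompression comes first; any character-set conversion is then applied to the decompressed bytes.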

That is all for this article; I hope you find it useful.
