One, Non-UTF-8 page processing
1. Background
The windows-1251 encoding.
For example, the Russian website https://vk.com/cciinniikk
turned out, embarrassingly, to use this encoding.
This article is mainly about converting between windows-1251 (cp1251) and UTF-8; other encodings such as GBK are not considered for now.
2. The solution
1.
Converting with native JS code
I have not yet found a way to do this for windows-1251 to UTF-8.
The opposite direction, UTF-8 to windows-1251, is possible: http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript
// Keys are Unicode code points, values are the corresponding windows-1251 byte values.
var DMap = {
  0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15,
  16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31,
  32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47,
  48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63,
  64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79,
  80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95,
  96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111,
  112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127,
  1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154, 1049: 201, 1045: 197, 1050: 202, 1028: 170,
  160: 160, 1040: 192, 1051: 203, 164: 164, 166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205,
  176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187, 1029: 189, 1056: 208, 1057: 209, 1058: 210,
  8364: 136, 1112: 188, 1115: 158, 1059: 211, 1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217,
  1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149, 1071: 223, 1072: 224, 8482: 153, 1073: 225,
  8240: 137, 1118: 162, 1074: 226, 1110: 179, 8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159,
  1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151, 1083: 235, 1169: 180, 1084: 236, 1052: 204,
  1085: 237, 1035: 142, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134,
  1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249, 1098: 250, 1044: 196, 1099: 251, 1111: 191,
  1055: 207, 1100: 252, 1038: 161, 8220: 147, 1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184,
  1039: 143, 1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190
};

function UnicodeToWin1251(s) {
  var L = [];
  for (var i = 0; i < s.length; i++) {
    var ord = s.charCodeAt(i);
    // Reject characters that have no windows-1251 equivalent.
    if (!(ord in DMap))
      throw "Character " + s.charAt(i) + " isn't supported by win1251!";
    // Each output character carries one windows-1251 byte value.
    L.push(String.fromCharCode(DMap[ord]));
  }
  return L.join('');
}
This is a neat idea: DMap simply stores the mapping from Unicode code points to windows-1251 byte values.
So I planned to apply the same mapping in reverse.
However, charCodeAt only returns Unicode (UTF-16) code units, so how would one get at the byte values of text in another encoding? Since this is Node.js, it makes more sense to use a suitable module.
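To make the reverse direction concrete, here is a minimal sketch that decodes windows-1251 bytes by hand, reading them from a Buffer instead of from a string. It is deliberately partial: it only covers ASCII, the contiguous Cyrillic block (bytes 0xC0 to 0xFF, which line up with U+0410 to U+044F in the table above) and the letters Ё/ё; every other byte becomes U+FFFD. The name decodeWin1251Partial is made up for this example.

```javascript
// Partial windows-1251 decoder: a sketch, not a complete code page table.
// decodeWin1251Partial is a hypothetical helper name.
function decodeWin1251Partial(buf) {
  var out = '';
  for (var i = 0; i < buf.length; i++) {
    var b = buf[i];
    var code;
    if (b < 0x80) {
      code = b;                 // ASCII maps to itself
    } else if (b >= 0xC0) {
      code = b - 0xC0 + 0x0410; // the main Cyrillic letters are contiguous in both tables
    } else if (b === 0xA8) {
      code = 0x0401;            // Ё
    } else if (b === 0xB8) {
      code = 0x0451;            // ё
    } else {
      code = 0xFFFD;            // byte not covered by this sketch
    }
    out += String.fromCharCode(code);
  }
  return out;
}

// The windows-1251 bytes of the word "Привет".
var sample = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);
console.log(decodeWin1251Partial(sample)); // Привет
```

A full decoder would need the complete table (for example the inverse of DMap), which is exactly what the modules below provide.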
2.
Install and use the Node.js module iconv-lite; for usage instructions see https://www.npmjs.com/package/iconv-lite
Usage should look roughly like this:
var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;
// Convert windows-1251 encoded data to UTF-8.
// str1 should be the data returned by http.get, or by request called with
// the right parameters; otherwise the result will be wrong.
// Besides the basic parameters, remember to pass encoding: 'binary'.
// For example:
var str1 = 'ценностинив';
// Convert the acquired data into a Buffer, and remember to use the 'binary' encoding
// ('binary' passes the bytes straight through, one character per byte).
// On old Node.js versions without Buffer.from, use new Buffer(str1, 'binary').
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');
// str2 is now converted. decode returns an ordinary JavaScript (Unicode) string,
// which is presumably what iconv-lite intends.
console.log(str2);
3.
Install and use the Node.js module iconv; for usage instructions see https://github.com/bnoordhuis/node-iconv
(In essence it should be enough to have node-gyp installed; I did not read the official instructions carefully.)
With plain, simple usage the output is still garbled, for example: пїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕпїѕ
http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8
The workaround is to read the data as binary by passing encoding: 'binary' (the default encoding is UTF-8):
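var request = require('request');
var iconv = require('iconv');

request({
    uri: website_url, // website_url is the address of the page to fetch
    method: 'GET',
    encoding: 'binary'
}, function (error, response, body) {
    // Turn the binary string back into the original bytes.
    body = Buffer.from(body, 'binary');
    var conv = new iconv.Iconv('WINDOWS-1251', 'UTF8');
    // convert returns a Buffer of UTF-8 bytes; toString decodes it as UTF-8.
    body = conv.convert(body).toString();
});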
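The reason encoding: 'binary' matters can be shown with Node's built-in Buffer alone. The sketch below uses a hand-written byte array as a stand-in for a raw windows-1251 response body (an assumption for illustration, not real response data) and compares a 'binary' round trip with a UTF-8 round trip.

```javascript
// Bytes of "Привет" in windows-1251, standing in for a raw response body.
var raw = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);

// 'binary' (an alias of 'latin1') maps every byte to exactly one character,
// so the original bytes can be recovered unchanged.
var asBinary = raw.toString('binary');
var restored = Buffer.from(asBinary, 'binary');
console.log(restored.equals(raw)); // true

// Decoding the same bytes as UTF-8 replaces the invalid sequences with U+FFFD,
// after which the original bytes cannot be recovered.
var asUtf8 = raw.toString('utf8');
var lossy = Buffer.from(asUtf8, 'utf8');
console.log(lossy.equals(raw)); // false
console.log(asUtf8.indexOf('\uFFFD') !== -1); // true
```

Once the body has been decoded as UTF-8 by default, no later conversion can repair it, which is why the binary encoding has to be requested up front.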
One more thing: using iconv requires some environment dependencies; see the official notes at https://github.com/TooTallNate/node-gyp
So:
First, a supported version of Python (2.7) is required;
Second, a compiler toolchain is required (this is where most errors occur on Windows).
The errors are typically of this kind:
unless a specific (or later) version is configured, node uses the VS2005 build tools by default (so the errors generally refer to VS2005 and Framework SDK 2.0).
Solution:
1. Install Visual Studio 2010
2. Specify the VS build tools version (for VS2012 the value is 2012)
(sometimes it is detected automatically, so this command is not always needed: npm config set msvs_version <version> --global)
3. If you are still told that the Framework SDK cannot be found, add its installation path to the system PATH environment variable
(VS2010 corresponds to SDK 4.0; similarly VS2008 to SDK 3.5; SDK 4.5?)
Also remember that only the first matching entry in the environment variable is used!
For example, if an SDK 2.0 path is already set in the system environment variable and you now add an SDK 4.0 path, only the first one takes effect.
So:
either delete the previous entry,
or put the path you want to add at the front.
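The "first entry wins" behaviour can be pictured with a small sketch that scans a PATH-style string from left to right. This only models a generic left-to-right lookup; it is not how node-gyp itself locates the SDK, and the helper name and directory names below are hypothetical.

```javascript
// Return the first PATH entry accepted by `matches`, scanning left to right.
// firstMatchingEntry is a hypothetical helper for illustration.
function firstMatchingEntry(pathValue, delimiter, matches) {
  var entries = pathValue.split(delimiter);
  for (var i = 0; i < entries.length; i++) {
    if (entries[i] && matches(entries[i])) return entries[i];
  }
  return null;
}

// Made-up directory names; real SDK install paths will differ.
var oldFirst = 'C:\\sdk\\v2.0;C:\\sdk\\v4.0';
var newFirst = 'C:\\sdk\\v4.0;C:\\sdk\\v2.0';
function isSdk(dir) { return dir.indexOf('\\sdk\\') !== -1; }

console.log(firstMatchingEntry(oldFirst, ';', isSdk)); // C:\sdk\v2.0
console.log(firstMatchingEntry(newFirst, ';', isSdk)); // C:\sdk\v4.0
```

Appending the new path leaves the old one in front, so either removing the old entry or prepending the new one changes which entry is found.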
Two, Gzip page processing
Sometimes a page displays correctly in the browser, but the same page fetched by a simulated request comes back garbled. Check the response headers in the browser: if they include content-encoding: gzip, the page was most likely gzip-compressed. In that case, add the following parameter to the request:
gzip: true
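As a rough picture of what this kind of option takes care of, the sketch below decompresses a gzip body by hand with Node's built-in zlib module. The helper name decodeBody and the simulated response are assumptions made for this example; no real request is sent.

```javascript
var zlib = require('zlib');

// decodeBody is a hypothetical helper: `raw` is the response body as a Buffer,
// `contentEncoding` is the value of the content-encoding response header.
function decodeBody(raw, contentEncoding) {
  if (contentEncoding && contentEncoding.toLowerCase() === 'gzip') {
    return zlib.gunzipSync(raw); // undo the gzip compression
  }
  return raw; // not compressed: return the bytes untouched
}

// Simulate a gzip-compressed response instead of making a real request.
var compressed = zlib.gzipSync(Buffer.from('<html>ok</html>', 'utf8'));
console.log(decodeBody(compressed, 'gzip').toString('utf8')); // <html>ok</html>
```

If the page is both gzip-compressed and windows-1251 encoded, the decompressed Buffer still has to go through one of the conversions from section One.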
That is the entire content of this article; I hope you find it useful.