I. Non-UTF-8 page processing
1. Background
The page uses the windows-1251 encoding.
For example, this Russian website: https://vk.com/cciinniikk
Embarrassingly, it turned out to use exactly this encoding.
This post is mainly about converting between windows-1251 (cp1251) and UTF-8; other encodings such as GBK are not considered for now.
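Before converting anything, a crawler has to know which encoding a page declares. The following is a minimal sketch, assuming the charset is announced either in the Content-Type response header or in a meta tag near the top of the HTML; detectCharset is a hypothetical helper name, not something from a library.

```javascript
// Hypothetical helper (not from any library): guess a page's charset from the
// Content-Type header first, then from a <meta> tag, else fall back to UTF-8.
function detectCharset(contentType, htmlHead) {
  var re = /charset\s*=\s*["']?([\w-]+)/i;
  var m = re.exec(contentType || '') || re.exec(htmlHead || '');
  return m ? m[1].toLowerCase() : 'utf-8';
}

console.log(detectCharset('text/html; charset=windows-1251', ''));
console.log(detectCharset('text/html', '<meta charset="UTF-8">'));
```

The htmlHead argument only needs to be readable as ASCII, so the first bytes of the body can be inspected before the real decoding step.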
2. Solution
1. Convert using native JS
I haven't found a way to do this yet.
For the opposite direction, UTF-8 to windows-1251, see http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript
    // Maps Unicode code points (keys) to windows-1251 byte values (values).
    var DMap = {
        0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15,
        16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31,
        32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47,
        48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63,
        64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79,
        80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95,
        96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111,
        112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127,
        1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154, 1049: 201, 1045: 197,
        1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164, 166: 166, 167: 167, 169: 169, 171: 171,
        172: 172, 173: 173, 174: 174, 1053: 205, 176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183,
        8221: 148, 187: 187, 1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211,
        1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217, 1031: 175, 1066: 218,
        1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149, 1071: 223, 1072: 224, 8482: 153, 1073: 225,
        8240: 137, 1118: 162, 1074: 226, 1110: 179, 8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150,
        1078: 230, 1119: 159, 1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151,
        1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 142, 1086: 238, 1087: 239, 1088: 240, 1089: 241,
        1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134, 1093: 245, 8470: 185, 1094: 246, 1054: 206,
        1095: 247, 1096: 248, 8249: 139, 1097: 249, 1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252,
        1038: 161, 8220: 147, 1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143,
        1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190
    };

    // Returns a string whose char codes are the windows-1251 byte values of s.
    function UnicodeToWin1251(s) {
        var L = [];
        for (var i = 0; i < s.length; i++) {
            var ord = s.charCodeAt(i);
            if (!(ord in DMap))
                throw "Character " + s.charAt(i) + " isn't supported by win1251!";
            L.push(String.fromCharCode(DMap[ord]));
        }
        return L.join('');
    }
That's a good idea: DMap simply stores the mapping between Unicode code points and windows-1251 byte values (see http://blog.csdn.net/yao_guet/article/details/7070364).
So in principle I only need to reverse the mapping.
But going the other way, I found that charCodeAt only works on Unicode strings, so how do I get at the byte values of text in another encoding? Since this is Node.js, I considered using a suitable module instead.
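On the question of how to reach the byte values: in Node.js, text in another encoding has to stay in a Buffer, whose elements are the raw bytes, because a JS string is always Unicode. The sketch below reverses the mapping by hand for a small subset only, assuming the standard windows-1251 layout where bytes 0xC0-0xFF are the letters А-я; decodeWin1251Basic is a hypothetical name, and a real crawler would use a full table or a module.

```javascript
// Sketch only: covers ASCII, the Russian letters А-я (0xC0-0xFF) and Ё/ё.
// Every other windows-1251 byte (quotes, Ukrainian letters, ...) becomes '?'.
function decodeWin1251Basic(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) out += String.fromCharCode(b);               // ASCII is unchanged
    else if (b >= 0xC0) out += String.fromCharCode(b + 0x350); // 0xC0 -> U+0410 'А'
    else if (b === 0xA8) out += '\u0401';                      // Ё
    else if (b === 0xB8) out += '\u0451';                      // ё
    else out += '?';
  }
  return out;
}

// Bytes of "чат" in windows-1251, as they would arrive in a Buffer.
var sample = Buffer.from([0xF7, 0xE0, 0xF2]);
console.log(decodeWin1251Basic(sample)); // чат
```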
2. Install and use the Node.js module iconv-lite
For usage instructions see https://www.npmjs.com/package/iconv-lite
According to its documentation, usage should look roughly like this:
    var iconv = require('iconv-lite');
    var Buffer = require('buffer').Buffer;

    // Convert from a windows-1251 encoded buffer to a UTF-8 string
    var str1 = 'ценностинив';
    var buf = Buffer.from(str1);   // Buffer.from replaces the deprecated new Buffer()
    var str2 = iconv.decode(buf, 'win1251');
    console.log(str2);
Surprisingly, the result was wrong, as if the input were not win1251-encoded at all. Trying a few similar encoding names did not help either. The official usage should be correct, so I moved on to the iconv module.
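A likely explanation for the failure above: the test string is a JS literal, so it is already Unicode, and turning it into a Buffer produces UTF-8 bytes rather than windows-1251 bytes. The short sketch below only inspects the bytes and uses no conversion module; the variable names are illustrative.

```javascript
// A string literal is already Unicode inside JS, so converting it to a
// Buffer yields UTF-8 bytes, not windows-1251 bytes.
var utf8Bytes = Buffer.from('ц');   // U+0446
console.log(utf8Bytes);             // <Buffer d1 86>

// The same letter in windows-1251 is the single byte 0xF6.
var cp1251Bytes = Buffer.from([0xF6]);
console.log(utf8Bytes.length, cp1251Bytes.length); // 2 1
```

Decoding with 'win1251' only makes sense for bytes that really are windows-1251, such as the raw body of a response from a windows-1251 page.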
3. Install and use the Node.js module iconv
For usage instructions see https://github.com/bnoordhuis/node-iconv
With plain, simple usage the output is still garbled, for example: chat?? chat?? chat?? chat?? (and so on)
See http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8
The workaround is to read the data as binary by setting encoding: 'binary' (the default encoding is UTF-8):
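    var request = require('request');
    var iconv = require('iconv');

    request({
        uri: website_url,          // website_url: the page to fetch, defined elsewhere
        method: 'GET',
        encoding: 'binary'         // keep the raw bytes instead of decoding as UTF-8
    }, function (error, response, body) {
        body = Buffer.from(body, 'binary');
        var conv = new iconv.Iconv('WINDOWS-1251', 'UTF8');
        body = conv.convert(body).toString();
    });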
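The key point of the workaround is that the body must stay as raw bytes until it is decoded exactly once with the right charset. The sketch below shows that step in isolation, assuming a Node.js build with full ICU data, where the built-in TextDecoder accepts 'windows-1251'; decodeBody is a hypothetical helper and the chunks are simulated rather than fetched. With request, passing encoding: null is another way to receive the body as a Buffer.

```javascript
// Sketch: collect raw response chunks as Buffers and decode them once at the end.
// decodeBody is a hypothetical helper; TextDecoder is a Node.js global.
function decodeBody(chunks, charset) {
  var raw = Buffer.concat(chunks);   // keep bytes, never a string, until decoding
  return new TextDecoder(charset).decode(raw);
}

// Simulated response split across two chunks; the bytes spell "чат" in windows-1251.
var chunks = [Buffer.from([0xF7, 0xE0]), Buffer.from([0xF2])];
console.log(decodeBody(chunks, 'windows-1251')); // чат
```

This avoids a native build step, which matters given the iconv dependencies described next.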
What's more, iconv needs some environment dependencies in order to build; see the official notes: https://github.com/TooTallNate/node-gyp
So:
First, a matching Python version (such as 2.7) is required;
Second, a compiler toolchain is required (this is where most errors occur on Windows).
A typical case:
Unless a specific or newer version is configured, node uses the VS2005 build tools by default (so the fixes suggested by the error message are generally based on VS2005 and Framework SDK 2.0).
Solutions:
1. Install Visual Studio 2010
2. Specify the VS build tools version (for VS2012 the value is 2012)
(sometimes it is detected automatically, so this command is not always needed: npm config set msvs_version 2012 --global)
3. If you are still told that the Framework SDK cannot be found, add its installation path to the system PATH environment variable
(VS2010 corresponds to SDK 4.0; similarly SDK 3.5 to SDK 4.5?)
II. Gzip page processing
Sometimes a page displays normally in the browser, but the same request made from code returns garbled data. Check the response headers in the browser: if they include content-encoding: gzip, the page was most likely gzip-compressed, and you need to add the following option to the request:
gzip: true
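For cases where the HTTP client does not decompress automatically, the same check can be done by hand with Node's built-in zlib module. This is a minimal sketch, assuming the raw body is available as a Buffer and the headers as an object with lowercase keys, as Node's HTTP client provides them; decompressBody is a hypothetical helper name.

```javascript
var zlib = require('zlib');

// Hypothetical helper: gunzip the raw body only when the response headers say so.
function decompressBody(headers, rawBody) {
  var enc = (headers['content-encoding'] || '').toLowerCase();
  return enc === 'gzip' ? zlib.gunzipSync(rawBody) : rawBody;
}

// Simulate a gzip-compressed response body.
var compressed = zlib.gzipSync(Buffer.from('hello, crawler', 'utf8'));
var body = decompressBody({ 'content-encoding': 'gzip' }, compressed);
console.log(body.toString('utf8')); // hello, crawler
```

Decompression comes before charset conversion: the gunzipped Buffer is what should then be decoded from windows-1251.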
Summary of garbled-text problems when fetching data with a Node.js crawler