Summary of garbled-text problems when fetching data with a Node.js crawler

Source: Internet
Author: User

First, non-UTF-8 page processing

1. Background

Some pages are encoded in windows-1251.

For example, this Russian website: https://vk.com/cciinniikk

It was frustrating to discover that the page uses this encoding.

So this article is mainly about the Windows-1251 (cp1251) and UTF-8 encodings; others, such as GBK, are not considered for now.

2. Solution

1. Convert using native JS code

I have not found a way to do this yet.

For the opposite direction, UTF-8 to windows-1251, you can refer to http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript:

// Maps Unicode code points to windows-1251 byte values.
var DMap = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15,
    16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31,
    32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47,
    48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63,
    64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79,
    80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95,
    96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111,
    112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127,
    1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154,
    1049: 201, 1045: 197, 1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164,
    166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205,
    176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187,
    1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211,
    1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217,
    1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149,
    1071: 223, 1072: 224, 8482: 153, 1073: 225, 8240: 137, 1118: 162, 1074: 226, 1110: 179,
    8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159,
    1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151,
    1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 142, 1086: 238, 1087: 239,
    1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134,
    1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249,
    1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252, 1038: 161, 8220: 147,
    1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143,
    1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190
};

// Converts a JS (Unicode) string into a string whose char codes are windows-1251 byte values.
function UnicodeToWin1251(s) {
    var L = [];
    for (var i = 0; i < s.length; i++) {
        var ord = s.charCodeAt(i);
        if (!(ord in DMap))
            throw "Character " + s.charAt(i) + " isn't supported by win1251!";
        L.push(String.fromCharCode(DMap[ord]));
    }
    return L.join('');
}

That is a neat idea: DMap actually stores the mapping between Unicode code points and windows-1251 byte values (see http://blog.csdn.net/yao_guet/article/details/7070364).

So all I should need to do is reverse the mapping.

But when I tried to reverse it, I found that charCodeAt only works on Unicode strings. How do you get at the code values of text in another encoding? Since this is Node.js, consider using a suitable module instead.
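The reversal idea can still be sketched if you work on the raw bytes (a Buffer) instead of a string, which sidesteps the charCodeAt limitation. The following is a minimal sketch, not a complete decoder: the helper name win1251ToUnicode is hypothetical and not from the original article, and it handles only ASCII, the letters А..я (bytes 0xC0..0xFF) and Ё/ё. Every other byte becomes U+FFFD.

```javascript
// Hypothetical helper (not from the original article): decode windows-1251
// bytes into a JavaScript string. Partial on purpose: ASCII, the letters
// А..я (0xC0..0xFF) and Ё/ё only; anything else becomes U+FFFD.
function win1251ToUnicode(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      out += String.fromCharCode(b);                   // ASCII is unchanged
    } else if (b >= 0xC0) {
      out += String.fromCharCode(0x0410 + (b - 0xC0)); // А..я are contiguous
    } else if (b === 0xA8) {
      out += '\u0401';                                 // Ё
    } else if (b === 0xB8) {
      out += '\u0451';                                 // ё
    } else {
      out += '\uFFFD';                                 // not covered by this sketch
    }
  }
  return out;
}

// The bytes of "Привет" as a windows-1251 page would send them.
var sample = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);
console.log(win1251ToUnicode(sample));
```

For real pages a full table is needed (punctuation, Ukrainian and Serbian letters, and so on), which is why a ready-made module is the more practical route.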

2. Install and use the Node.js module iconv-lite

For usage instructions, see https://www.npmjs.com/package/iconv-lite

According to the usage notes, it should be used roughly like this:

var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;

// Convert from a windows-1251 encoded buffer to UTF-8
var str1 = 'ценностинив';
var buf = new Buffer(str1);
var str2 = iconv.decode(buf, 'win1251');
console.log(str2);

But to my surprise, it reported an error saying the encoding is not win1251. Trying a few similar encoding names did not help either. The official documentation should be right, though. So I moved on to the iconv module.
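One thing worth checking in an attempt like this, offered here as an assumption rather than something the original confirms: str1 is already a Unicode string, so building a Buffer from it produces UTF-8 bytes, not windows-1251 bytes. Decoding those bytes as win1251 cannot give back the original text. The sketch below uses only Node's built-in Buffer to show the difference for a single letter.

```javascript
// A JS string is already Unicode; turning it into a Buffer encodes it as UTF-8.
var utf8Bytes = Buffer.from('ц');
console.log(utf8Bytes.toString('hex')); // hex of the UTF-8 encoding of U+0446

// On a windows-1251 page the same letter arrives as the single byte 0xF6
// (code point 1094 maps to 246 in the windows-1251 table).
var win1251Bytes = Buffer.from([0xF6]);
console.log(utf8Bytes.length, win1251Bytes.length);
```

To test a decoder meaningfully, the input should be bytes taken from the page itself (or constructed by hand as above), not a string literal typed into the source file.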

3. Install and use the Node.js module iconv

For usage instructions, see https://github.com/bnoordhuis/node-iconv

With straightforward use the output is still garbled, for example: chat?? chat?? chat?? chat??

See http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8

The workaround is to read the data as binary by setting encoding: 'binary' (the default encoding is UTF-8):

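var request = require('request');
var iconv = require('iconv');

// website_url is the address of the page to fetch.
request({
    uri: website_url,
    method: 'GET',
    encoding: 'binary'
}, function (error, response, body) {
    // Turn the binary string back into raw bytes, then convert them to UTF-8.
    body = new Buffer(body, 'binary');
    var conv = new iconv.Iconv('windows-1251', 'utf8');
    body = conv.convert(body).toString();
});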
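Why the binary encoding matters can be shown without any network access. The following sketch uses only Node's built-in Buffer, with a hand-made byte sequence standing in for a response body (an assumption for illustration). It compares a round trip through the default UTF-8 interpretation with a round trip through 'binary'.

```javascript
// Raw windows-1251 bytes for "Привет", standing in for a response body.
var original = Buffer.from([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);

// Default behaviour: the bytes are interpreted as UTF-8. They are not valid
// UTF-8, so they are replaced and the original bytes cannot be recovered.
var viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8');

// With 'binary' each byte becomes exactly one character, so the round trip
// is lossless and the bytes can still be converted afterwards.
var viaBinary = Buffer.from(original.toString('binary'), 'binary');

console.log(viaUtf8.equals(original));   // false
console.log(viaBinary.equals(original)); // true
```

Once the body has been decoded as UTF-8 the damage is done, which is why the conversion has to start from the untouched bytes.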

In addition, iconv has some build-environment dependencies; see the official notes: https://github.com/TooTallNate/node-gyp

So:

First, a matching Python version (such as 2.7) is required;

Second, a compiler toolchain is required (this is where most errors occur on Windows).

Otherwise the installation fails with a build error.

If no specific (or newer) version is configured, node uses the VS2005 build tools by default, so the fixes suggested by the error message are generally based on VS2005 and Framework SDK 2.0.

Solution:

1. Install Visual Studio 2010.

2. Specify the Visual Studio build tools version (for VS2012 the value is 2012).

For example, for VS2012: npm config set msvs_version 2012 --global (sometimes the version is detected automatically, so this command is not always needed).

3. If you are still told that the Framework SDK cannot be found, add its installation path to the system PATH environment variable.

(VS2010 corresponds to SDK 4.0; other versions are presumably similar, e.g. SDK 3.5 or SDK 4.5?)

Second, Gzip page processing

Sometimes a page looks fine in the browser, but the response to a simulated request comes back garbled. Check the response headers in the browser: if they contain Content-Encoding: gzip, the page was most likely gzip-compressed, and you need to add the following option when making the request:

gzip: true
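To see why a gzip-compressed body looks garbled when read as text, here is a minimal sketch using Node's built-in zlib module. The compressed buffer is created locally and only stands in for a server response (an assumption for illustration); with the request option above you would not call zlib yourself.

```javascript
var zlib = require('zlib');

// Stand-in for what a server sends when it answers with Content-Encoding: gzip.
var compressed = zlib.gzipSync(Buffer.from('<p>привет</p>', 'utf8'));

// Reading the compressed bytes as text gives garbage, not the page.
var garbled = compressed.toString('utf8');

// Decompressing first restores the page.
var page = zlib.gunzipSync(compressed).toString('utf8');
console.log(page);
```

If the page is both gzip-compressed and in a non-UTF-8 encoding, decompression has to happen first and the character conversion second.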
