Using iconv in C++ to convert web page character encodings

Source: Internet
Author: User
Tags: install, php, translit

When crawling HTML pages, you run into many different encodings. Rather than handling each encoding separately, it is usual to convert everything into a single encoding so that the text can be stored in the database easily. This is where iconv becomes a very convenient tool.

The iconv header file is "iconv.h". iconv converts text from one known character set to another; its job is to convert text between the many encodings in international use. The function prototype under Linux is:

    size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);

iconv is the name of both a computer program and a set of application programming interfaces. As a program, iconv offers a command-line interface that converts a file in one encoding to another encoding. The iconv source code is published under the GPL and is part of the GNU project. It can be used on a variety of UNIX operating systems; on Windows it needs a special environment such as Cygwin or GnuWin32. A Windows build is now also available on SourceForge, and it requires the gettext program to be installed as well. The current version is 2.3.26. The supported encodings include the Unicode-related ones, such as UTF-8 and UTF-16, and the national ("ANSI") encodings, including Chinese encodings such as GB2312 and BIG5.

As a programming interface, iconv consists mainly of three functions:

The iconv_open function initializes the internal buffer used for the conversion and specifies which encoding to convert from and which to convert to.

The iconv function does the actual conversion. It takes pointers to the input and output buffer pointers and pointers to the remaining byte counts. The function updates all of these values, so it is an error to pass iconv a pointer that cannot be written to.

The iconv_close function releases the buffer allocated by iconv_open.
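The three calls described above can be combined into one small helper. The following is only a minimal sketch: the name convertEncoding, the 4x output-size guess, and the exception-based error handling are assumptions made for illustration, not part of the iconv API. It also assumes the Linux prototype shown above, where inbuf is a char **.

```cpp
#include <iconv.h>
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of iconv): converts `input` from encoding
// `from` to encoding `to` and returns the converted bytes.
std::string convertEncoding(const std::string& input, const char* to, const char* from) {
    // iconv_open takes the TARGET encoding first, then the source encoding.
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // Assumption: four output bytes per input byte (plus room for a
    // byte-order mark) is enough for the common encodings.
    std::string output(input.size() * 4 + 4, '\0');

    // iconv() only reads the input, but its prototype wants a char**.
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();
    char* out = &output[0];
    size_t outLeft = output.size();

    size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    iconv_close(cd);  // always release the descriptor
    if (rc == (size_t)-1) throw std::runtime_error("iconv failed");

    // outLeft now holds the unused space, so trim the string to what was written.
    assert(outLeft <= output.size());
    output.resize(output.size() - outLeft);
    return output;
}
```

A call such as convertEncoding(page, "UTF-8", "GB2312") would then normalize a fetched page to UTF-8, provided the installed iconv supports both encodings. Stateful encodings may need an extra flushing call, which this sketch leaves out.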


Under Linux, the iconv command converts the encoding of a file, for example from UTF-8 to GB18030 and vice versa. A similar tool, native2ascii, is available in the JDK. The iconv development library under Linux provides C functions such as iconv_open, iconv_close, and iconv, which make it easy to convert character encodings in C/C++ programs. This is useful when crawling web pages, and the iconv command is needed when debugging such programs.

Usage: iconv [OPTION...] [FILE...]

Input/output format specification:
    -f, --from-code=NAME    encoding of the original text
    -t, --to-code=NAME      encoding for output
Information:
    -l, --list              list all known character sets
Output control:
    -c                      omit invalid characters from output
    -o, --output=FILE       output file
    -s, --silent            suppress warnings
    --verbose               print progress information
    -?, --help              give this help list
    --usage                 give a short usage message
    -V, --version           print program version

Example: list the supported character encodings:

    # iconv -l
    The following list contain all the coded character sets known. This does not necessarily mean that all combinations of these names can be used for the FROM and TO command line parameters. One coded character set can be listed with several different names (aliases).
All known character sets 437,500,500v1,850,851,852,855,856,857,860,861,862,863,864,865,866,866nav, 869,874,904,1026,1046,1047,8859_1,8859_2,8859_3,8859_4,8859_5,8859_6,8859_7,8859_8,8859_ 9,10646-1:1993,10646-1:1993/ucs4,ansi_x3.4-1968,ansi_x3.4-1986,ansi_x3.4,ansi_x3.110-1983,ansi_x3.110,arabic, Arabic7,armscii-8,ascii,asmo-708,ASMO_449,BALTIC,BIG-5,BIG-FIVE,BIG5-HKSCS,BIG5,BIG5HKSCS,BIGFIVE,BS_4730,CA,CN-BIG5,CN-GB,CN,CP-AR,CP-GR, cp-hu,cp037,cp038,cp273,cp274,cp275,cp278,cp280,cp281,cp282,cp284,cp285,cp290,cp297,cp367,cp420,cp423,cp424, cp437,cp500,cp737,cp775,cp813,cp819,cp850,cp851,cp852,cp855,cp856,cp857,cp860,cp861,cp862,cp863,cp864,cp865, cp866,cp866nav,cp868,cp869,cp870,cp871,cp874,cp875,cp880,cp891,cp903,cp904,cp905,cp912,cp915,cp916,cp918,cp920 , cp922,cp930,cp932,cp933,cp935,cp936,cp937,cp939,cp949,cp950,cp1004,cp1026,cp1046,cp1047,cp1070,cp1079,cp1081, cp1084,cp1089,cp1124,cp1125,cp1129,cp1132,cp1133,cp1160,cp1161,cp1162,cp1163,cp1164,cp1250,cp1251,cp1252, cp1253,cp1254,cp1255,cp1256,cp1257,cp1258,cp1361,cp10007,cpibm861,csa7-1,csa7-2,csascii,csa_t500-1983,csa_t500 , Csa_z243.4-1985-1,csa_z243.4-1985-2,csa_z243.419851,csa_z243.419852,csdecmcs,csebcdicatde,csebcdicatdea, Csebcdiccafr,csebcdicdkno,csebcdicdknoa,csebcdices,csebcdicesa,csebcdicess,csebcdicfise,csebcdicfisea, Csebcdicfr,csebcdicit,csebcdicpt,csebcdicuk,csebcdicus,cseuckr,cseucpkdfmtjapanese,csgb2312,cshproman8,csibm037,csibm038,csibm273,csibm274, csibm275,csibm277,csibm278,csibm280,csibm281,csibm284,csibm285,csibm290,csibm297,csibm420,csibm423,csibm424, csibm500,csibm851,csibm855,csibm856,csibm857,csibm860,csibm863,csibm864,csibm865,csibm866,csibm868,csibm869, csibm870,csibm871,csibm880,csibm891,csibm903,csibm904,csibm905,csibm918,csibm922,csibm930,csibm932,csibm933, csibm935,csibm937,csibm939,csibm943,csibm1026,csibm1124,csibm1129,csibm1132,csibm1133,csibm1160,csibm1161, Csibm1163,csibm1164,csibm11621162,csiso4unitedkingdom,csiso10swedish,csiso11swedishfornames,csiso14jisc6220ro, 
Csiso15italian,csiso16portugese,csiso17spanish,csiso18greek7old,csiso19latingreek,csiso21german,csiso25french, Csiso27latingreek1,csiso49inis,csiso50inis8,csiso51iniscyrillic,csiso58gb1988,csiso60danishnorwegian, Csiso60norwegian1,csiso61norwegian2,csiso69french,csiso84portuguese2,csiso85spanish2,csiso86hungarian, Csiso88greek7,csiso89asmo449,csiso90,csiso92jisc62991984b,csiso99naplps,csiso103t618bit,csiso111ecmacyrillic,csiso121canadian1,csiso122canadian2, Csiso139csn369103,csiso141jusib1002,csiso143iecp271,csiso150,csiso150greekccitt,csiso151cuba, csiso153gost1976874,csiso646danish,csiso2022cn,csiso2022jp,csiso2022jp2,csiso2022kr,csiso2033, Csiso5427cyrillic,csiso5427cyrillic1981,csiso5428greek,csiso10367box,csisolatin1,csisolatin2,csisolatin3, Csisolatin4,csisolatin5,csisolatin6,csisolatinarabic,csisolatincyrillic,csisolatingreek,csisolatinhebrew, Cskoi8r,csksc5636,csmacintosh,csnatsdano,csnatssefi,csn_369103,cspc8codepage437,cspc775baltic, Cspc850multilingual,cspc862latinhebrew,cspcp852,csshiftjis,csucs4,csunicode,cuba,cwi-2,cwi,cyrillic,de,dec-mcs , DEC,DECMCS,DIN_66003,DK,DS2089,DS_2089,E13B,EBCDIC-AT-DE-A,EBCDIC-AT-DE,EBCDIC-BE,EBCDIC-BR,EBCDIC-CA-FR, Ebcdic-cp-ar1,ebcdic-cp-ar2,ebcdic-cp-be,ebcdic-cp-ca,ebcdic-cp-ch,ebcdic-cp-dk,ebcdic-cp-es,ebcdic-cp-fi, Ebcdic-cp-fr,ebcdic-cp-gb,ebcdic-cp-gr,ebcdic-cp-he,ebcdic-cp-is,ebcdic-cp-it,ebcdic-cp-nl,ebcdic-cp-no,ebcdic-cp-roece,ebcdic-cp-se,ebcdic-cp-tr,ebcdic-cp-us,ebcdic-cp-wt,ebcdic-cp-yu, Ebcdic-cyrillic,ebcdic-dk-no-a,ebcdic-dk-no,ebcdic-es-a,ebcdic-es-s,ebcdic-es,ebcdic-fi-se-a,ebcdic-fi-se, EBCDIC-FR,EBCDIC-GREEK,EBCDIC-INT,EBCDIC-INT1,EBCDIC-IS-FRISS,EBCDIC-IT,EBCDIC-JP-E,EBCDIC-JP-KANA,EBCDIC-PT, Ebcdic-uk,ebcdic-us,ebcdicatde,ebcdicatdea,ebcdiccafr,ebcdicdkno,ebcdicdknoa,ebcdices,ebcdicesa,ebcdicess, ebcdicfise,ebcdicfisea,ebcdicfr,ebcdicisfriss,ebcdicit,ebcdicpt,ebcdicuk,ebcdicus,ecma-114,ecma-118,ecma-128, 
Ecma-cyrillic,ecmacyrillic,elot_928,es,es2,euc-cn,euc-jisx0213,euc-jp,euc-kr,euc-tw,euccn,eucjp,euckr,euctw,fi , fr,gb,gb2312,gb13000,gb18030,gbk,gb_1988-80,gb_198880,georgian-academy,georgian-ps,gost_19768-74,gost_19768, Gost_1976874,greek-ccitt,greek,greek7-old,greek7,greek7old,greek8,greekccitt,hebrew,hp-roman8,hproman8,hu, ibm-856,ibm-922,ibm-930,ibm-932,ibm-933,ibm-935,ibm-937,ibm-939,ibm-943,ibm-1046,ibm-1124,ibm-1129,ibm-1132, ibm-1133,ibm-1160,ibm-1161,ibm-1162,ibm-1163,ibm-1164,ibm037,ibm038,ibm256,ibm273,ibm274,ibm275,ibm277,ibm278,ibm280,ibm281,ibm284,ibm285,ibm290, ibm297,ibm367,ibm420,ibm423,ibm424,ibm437,ibm500,ibm775,ibm813,ibm819,ibm848,ibm850,ibm851,ibm852,ibm855, ibm856,ibm857,ibm860,ibm861,ibm862,ibm863,ibm864,ibm865,ibm866,ibm866nav,ibm868,ibm869,ibm870,ibm871,ibm874, ibm875,ibm880,ibm891,ibm903,ibm904,ibm905,ibm912,ibm915,ibm916,ibm918,ibm920,ibm922,ibm930,ibm932,ibm933, ibm935,ibm937,ibm939,ibm943,ibm1004,ibm1026,ibm1046,ibm1047,ibm1089,ibm1124,ibm1129,ibm1132,ibm1133,ibm1160, ibm1161,ibm1162,ibm1163,ibm1164,iec_p27-1,iec_p271,inis-8,inis-cyrillic,inis,inis8,iniscyrillic,isiri-3342, Isiri3342,iso-2022-cn-ext,iso-2022-cn,iso-2022-jp-2,iso-2022-jp-3,iso-2022-jp,iso-2022-kr,iso-8859-1, ISO-8859-2,ISO-8859-3,ISO-8859-4,ISO-8859-5,ISO-8859-6,ISO-8859-7,ISO-8859-8,ISO-8859-9,ISO-8859-10, iso-8859-11,iso-8859-13,iso-8859-14,iso-8859-15,iso-8859-16,iso-10646,iso-10646/ucs2,iso-10646/ucs4,iso-10646/ Utf-8,iso-10646/utf8,iso-celtic,iso-ir-4,iso-ir-6,iso-ir-8-1,iso-ir-9-1,iso-ir-10,iso-ir-11,iso-ir-14,iso-ir-15,iso-ir-16,iso-ir-17,iso-ir-18,iso-ir-19, ISO-IR-21,ISO-IR-25,ISO-IR-27,ISO-IR-37,ISO-IR-49,ISO-IR-50,ISO-IR-51,ISO-IR-54,ISO-IR-55,ISO-IR-57,ISO-IR-60, Iso-ir-61,iso-ir-69,iso-ir-84,iso-ir-85,iso-ir-86,iso-ir-88,iso-ir-89,iso-ir-90,iso-ir-92,iso-ir-98,iso-ir-99, iso-ir-100,iso-ir-101,iso-ir-103,iso-ir-109,iso-ir-110,iso-ir-111,iso-ir-121,iso-ir-122,iso-ir-126,iso-ir-127, 
iso-ir-138,iso-ir-139,iso-ir-141,iso-ir-143,iso-ir-144,iso-ir-148,iso-ir-150,iso-ir-151,iso-ir-153,iso-ir-155, iso-ir-156,iso-ir-157,iso-ir-166,iso-ir-179,iso-ir-193,iso-ir-197,iso-ir-199,iso-ir-203,iso-ir-209,iso-ir-226, ISO646-CA,ISO646-CA2,ISO646-CN,ISO646-CU,ISO646-DE,ISO646-DK,ISO646-ES,ISO646-ES2,ISO646-FI,ISO646-FR, ISO646-FR1,ISO646-GB,ISO646-HU,ISO646-IT,ISO646-JP-OCR-B,ISO646-JP,ISO646-KR,ISO646-NO,ISO646-NO2,ISO646-PT, ISO646-PT2,ISO646-SE,ISO646-SE2,ISO646-US,ISO646-YU,ISO2022CN,ISO2022CNEXT,ISO2022JP,ISO2022JP2,ISO2022KR, Iso6937,iso8859-1,iso8859-2,iSO8859-3,ISO8859-4,ISO8859-5,ISO8859-6,ISO8859-7,ISO8859-8,ISO8859-9,ISO8859-10,ISO8859-11,ISO8859-13, iso8859-14,iso8859-15,iso8859-16,iso88591,iso88592,iso88593,iso88594,iso88595,iso88596,iso88597,iso88598, iso88599,iso885910,iso885911,iso885913,iso885914,iso885915,iso885916,iso_646.irv:1991,iso_2033-1983,iso_2033, Iso_5427-ext,iso_5427,iso_5427:1981,iso_5427ext,iso_5428,iso_5428:1980,iso_6937-2,iso_6937-2:1983,iso_6937,iso _6937:1992,iso_8859-1,iso_8859-1:1987,iso_8859-2,iso_8859-2:1987,iso_8859-3,iso_8859-3:1988,iso_8859-4,iso_ 8859-4:1988,iso_8859-5,iso_8859-5:1988,iso_8859-6,iso_8859-6:1987,iso_8859-7,iso_8859-7:1987,iso_8859-8,iso_ 8859-8:1988,iso_8859-9,iso_8859-9:1989,iso_8859-10,iso_8859-10:1992,iso_8859-14,iso_8859-14:1998,iso_ 8859-15:1998,iso_9036,iso_10367-box,iso_10367box,iso_69372,it,jis_c6220-1969-ro,jis_c6229-1984-b,jis_ c62201969ro,jis_c62291984b,johab,jp-ocr-b,jp,js,jus_i.b1.002,koi-7,koi-8,koi8-r,koi8-t,koi8-u,koi8,koi8r,koi8u , Ksc5636,l1,l2,l3,l4,l5,l6,l7,l8,l10,latin-greeK-1,LATIN-GREEK,LATIN1,LATIN2,LATIN3,LATIN4,LATIN5,LATIN6,LATIN7,LATIN8,LATIN10,LATINGREEK,LATINGREEK1, Mac-cyrillic,mac-is,mac-sami,mac-uk,mac,maccyrillic,macintosh,macis,macuk,macukrainian,ms-ansi,ms-arab,ms-cyrl , Ms-ee,ms-greek,ms-hebr,ms-mac-cyrillic,ms-turk,mscp949,mscp1361,msmaccyrillic,msz_7795.3,ms_kanji,naplps, 
Nats-dano,nats-sefi,natsdano,natssefi,nc_nc0010,nc_nc00-10,nc_nc00-10:81,nf_z_62-010,nf_z_62-010_ (1973), NF_Z_ 62-010_1973,NF_Z_62010,NF_Z_62010_1973,NO,NO2,NS_4551-1,NS_4551-2,NS_45511,NS_45512,OS2LATIN1,OSF00010001, OSF00010002,OSF00010003,OSF00010004,OSF00010005,OSF00010006,OSF00010007,OSF00010008,OSF00010009,OSF0001000A, OSF00010020,OSF00010100,OSF00010101,OSF00010102,OSF00010104,OSF00010105,OSF00010106,OSF00030010,OSF0004000A, OSF0005000A,OSF05010001,OSF100201A4,OSF100201A8,OSF100201B5,OSF100201F4,OSF100203B5,OSF1002011C,OSF1002011D, osf1002035d,osf1002035e,osf1002035f,osf1002036b,osf1002037b,osf10010001,osf10020025,osf10020111,osf10020115, osf10020116,osf10020118,osf10020122,osf10020129,osf10020352,osf10020354,osf10020357,osf10020359,osf10020360,osf10020364,osf10020365,osf10020366, Osf10020367,osf10020370,osf10020387,osf10020388,osf10020396,osf10020402,osf10020417,pt,pt2,r8,roman8,ruscii,se , se2,sen_850200_b,sen_850200_c,shift-jis,shift_jis,shift_jisx0213,sjis,ss636127,st_sev_358-88,t.61-8bit,t.61, t.618bit,tcvn-5712,tcvn,tcvn5712-1,tcvn5712-1:1993,tis-620,tis620-0,tis620.2529-1,tis620.2533-0,tis620,ts-5881 , Tscii,ucs-2,ucs-2be,ucs-2le,ucs-4,ucs-4be,ucs-4le,ucs2,ucs4,uhc,ujis,uk,unicode,unicodebig,unicodelittle, Us-ascii,us,utf-7,utf-8,utf-16,utf-16be,utf-16le,utf-32,utf-32be,utf-32le,utf7,utf8,utf16,utf16be,utf16le, utf32,utf32be,utf32le,viscii,wchar_t,win-sami-2,winbaltrim,windows-1250,windows-1251,windows-1252,windows-1253 , Windows-1254,windows-1255,windows-1256,windows-1257,windows-1258,winsami2,ws2,yu
Example command:

    # iconv -f gb2312 -t utf-8 gb1.txt > gb2.txt

This converts gb1.txt from GB2312 to UTF-8 and redirects the result to gb2.txt.

In addition to the iconv command, section 3 of the man pages on a Linux system documents a set of iconv functions:

    iconv_t iconv_open(const char *tocode, const char *fromcode);
    size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
    int iconv_close(iconv_t cd);

The iconv_open function opens a conversion stream, the iconv function performs the actual conversion, and the iconv_close function closes the stream. The following example converts UTF-8 to GBK. It assumes a UTF-8 encoded input buffer inbuf and the length of that buffer, inlen (declared as size_t, so that its address can be passed to iconv without a cast).

    /* needs <iconv.h>, <stdio.h>, <stdlib.h>, <string.h>, <strings.h> */
    iconv_t cd = iconv_open("GBK", "UTF-8");       /* target first, then source */
    char *outbuf = (char *)malloc(inlen * 4);
    bzero(outbuf, inlen * 4);                      /* zero-fill so the result is NUL-terminated */
    char *in = inbuf;                              /* working copies: iconv() moves these */
    char *out = outbuf;
    size_t outlen = inlen * 4;
    iconv(cd, &in, &inlen, &out, &outlen);
    outlen = strlen(outbuf);                       /* length of the GBK result */
    printf("%s\n", outbuf);
    free(outbuf);                                  /* free the original pointer, not out */
    iconv_close(cd);

It is well worth noting that the iconv function changes where the in and out pointers point. Before the call, in and inbuf point to the same block of memory, as do out and outbuf; after the call, out no longer points to outbuf, and the same is true of in. That is why the copies char *in = inbuf; char *out = outbuf; are made: to use or free the memory afterwards, use the original pointers outbuf and inbuf.
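Because iconv advances the working pointers, the converted length can be read from the pointer difference instead of strlen, which matters when the target encoding contains zero bytes (UTF-16, for example). The sketch below illustrates this; the helper name convertInto and its return convention are assumptions made for illustration.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: runs one iconv() call on an already opened descriptor
// and returns the number of bytes written to outbuf, or (size_t)-1 on failure.
size_t convertInto(iconv_t cd, char* inbuf, size_t inlen, char* outbuf, size_t outcap) {
    char* in = inbuf;        // working copies; iconv() advances these
    char* out = outbuf;
    size_t inLeft = inlen;
    size_t outLeft = outcap;

    if (iconv(cd, &in, &inLeft, &out, &outLeft) == (size_t)-1) {
        return (size_t)-1;
    }

    // `out` now points just past the last byte written, while `outbuf` still
    // points to the start, so their distance is the converted length.
    assert(static_cast<size_t>(out - outbuf) == outcap - outLeft);
    return static_cast<size_t>(out - outbuf);
}
```

The caller keeps inbuf and outbuf unchanged and can therefore still print or free them after the call.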
Usage under PHP:

On UNIX, installing a PHP module requires recompiling PHP. On Windows, you only need to enable the corresponding DLL in php.ini. For example, to add GD library support, use the following settings:

    extension_dir = "c:/ipaddr/php/extensions/"

(Note: it is recommended to write the full path, followed by a trailing /. Very often the DLLs of other modules cannot be loaded because this is not set.) Then enable:

    extension=php_gd2.dll

For iconv, however, enabling php_iconv.dll in the same way is not enough, and the iconv module still cannot be loaded. The following configuration is needed:

a. From the official iconv download site, get the Windows build of iconv: libiconv-1.9.1.bin.woe32.zip. Extract the file and copy charset.dll, iconv.dll, and iconv.exe from bin/ to c:/windows/ (or another directory on the system path). (IPAddr reminds you that this step is necessary: php_iconv.dll calls the GNU iconv library, so the GNU iconv library must be installed first.)
b. Enable php_iconv.dll in php.ini.
c. Restart Apache, then check in phpinfo() whether iconv is enabled.

Recently, in a program that used the iconv function to convert fetched UTF-8 pages to gb2312, I found that some of the converted data went missing for no apparent reason. After being puzzled for a while, I searched online and learned that this is a bug in the iconv function: iconv fails when converting the character "-" to gb2312. The solution is simple: append "//IGNORE" to the target encoding, that is, to the second parameter of the iconv function, as follows:

    iconv("UTF-8", "GB2312//IGNORE", $data)

IGNORE means that errors during conversion are ignored. Without the IGNORE flag, everything in the string after the offending character is lost. The iconv() function is built into PHP 5.

    <?php
    echo $str = 'Hello, here is the coffee!';
    echo '<br/>';
    echo iconv('gb2312', 'utf-8', $str);       // convert the string from GB2312 to UTF-8
    echo '<br/>';
    echo iconv_substr($str, 1, 1, 'utf-8');    // extract by character count instead of bytes
    print_r(iconv_get_encoding());             // get the current page encoding information
    echo iconv_strlen($str, 'utf-8');          // get the string length in the given encoding
    // it can also be used like this
    $content = iconv("UTF-8", "GBK//TRANSLIT", $content);
    ?>

The signature is string iconv(string $in_charset, string $out_charset, string $str). Use this function with care for encoding conversion: when converting UTF-8 to gb2312, the string may be truncated. This can be resolved as follows:

    // author: zhxia
    $str = iconv('UTF-8', "GB2312//TRANSLIT", file_get_contents($filepath));

//TRANSLIT means that if a character of the source encoding has no match in the target encoding, a similar character is chosen for the conversion. //IGNORE can also be used here; it means that characters that cannot be converted are ignored.
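The same suffixes can be tried from C++ by appending them to the target encoding passed to iconv_open. This is a sketch under stated assumptions: "//IGNORE" and "//TRANSLIT" are extensions accepted by GNU libc and GNU libiconv, and other iconv implementations may reject them. The helper name convertWithSuffix is hypothetical, and treating EILSEQ with fully consumed input as success is an assumption about how some implementations report skipped characters.

```cpp
#include <iconv.h>
#include <cassert>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` to the encoding `to` + `suffix`,
// where `suffix` is "//IGNORE" or "//TRANSLIT" (implementation extensions).
std::string convertWithSuffix(const std::string& input, const std::string& to,
                              const std::string& from, const std::string& suffix) {
    const std::string target = to + suffix;
    iconv_t cd = iconv_open(target.c_str(), from.c_str());
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed for " + target);

    std::string output(input.size() * 4 + 4, '\0');  // assumed to be large enough
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();
    char* out = &output[0];
    size_t outLeft = output.size();

    size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    int savedErrno = errno;  // save before iconv_close can change it
    iconv_close(cd);

    // Assumption: an implementation may still report EILSEQ after skipping
    // unconvertible characters; if all input was consumed, accept the result.
    bool skippedOnly = (rc == (size_t)-1 && savedErrno == EILSEQ && inLeft == 0);
    if (rc == (size_t)-1 && !skippedOnly) throw std::runtime_error("iconv failed");

    assert(outLeft <= output.size());
    output.resize(output.size() - outLeft);
    return output;
}
```

For example, convertWithSuffix(data, "GB2312", "UTF-8", "//IGNORE") mirrors the PHP call shown above. What exactly is dropped or substituted depends on the iconv implementation and, for //TRANSLIT, possibly on the locale.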

The above is reproduced from: http://baike.baidu.com/link?url= 17s9vgaim-xyyluia00db7eoyuuewwmusp3r6htsrvwsy9bpccqmokbyngpyo32vdbz5nfdxuzzx_ds_c6xvua

Code example:

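    #include <iconv.h>
    #include <errno.h>

    // Open the descriptor once at startup: GB2312 input, UTF-8 output.
    iconv_t utfHandle = iconv_open("UTF-8", "GB2312");

    // Converts srcBuf into destBuf. Returns iconv's result on success,
    // or the errno value if the conversion fails.
    int Charetcov(char* srcBuf, int srcLen, char* destBuf, int destLen)
    {
        // iconv expects size_t counters; casting an int* to size_t* is unsafe.
        size_t srcLen1 = srcLen;
        size_t destLen1 = destLen;
        int err = (int)iconv(utfHandle, &srcBuf, &srcLen1, &destBuf, &destLen1);
        if (err < 0)
        {
            err = errno;
            errno = 0;
        }

        return err;
    }

    // Close the descriptor once at shutdown.
    iconv_close(utfHandle);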
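The errno value returned by a failed iconv call says why the conversion stopped: E2BIG means the output buffer is full, EILSEQ means the input contains an invalid byte sequence, and EINVAL means the input ends in an incomplete sequence. The sketch below shows one way to act on these values by draining a small output buffer in a loop. The helper name convertChunked and the 16-byte buffer are assumptions chosen for illustration.

```cpp
#include <iconv.h>
#include <cassert>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` with an already opened descriptor,
// emptying a small fixed buffer whenever iconv() reports E2BIG.
std::string convertChunked(iconv_t cd, const std::string& input) {
    std::string result;
    char chunk[16];  // deliberately small so that E2BIG actually occurs
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = chunk;
        size_t outLeft = sizeof chunk;
        size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int savedErrno = errno;  // read errno right after the call

        // Keep whatever was converted in this round, even on error.
        assert(outLeft <= sizeof chunk);
        result.append(chunk, sizeof chunk - outLeft);

        if (rc != (size_t)-1) break;        // all input consumed
        if (savedErrno == E2BIG) continue;  // buffer was full: go around again
        if (savedErrno == EILSEQ) throw std::runtime_error("invalid byte sequence in input");
        if (savedErrno == EINVAL) throw std::runtime_error("incomplete sequence at end of input");
        throw std::runtime_error("unexpected iconv error");
    }
    return result;
}
```

A caller would open the descriptor with iconv_open, pass it to convertChunked for each buffer, and close it with iconv_close when done.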
