Principles of iconv encoding conversion

Source: Internet
Author: User

Generalized character set conversion: iconv
In the previous chapter we saw two groups of functions for converting between multibyte strings (MBS) and wide-character strings (WCS). The first group gives no control over the shift state of the string, so it is unsuitable for MBS encodings with state changes (stateful encodings); the second group lets the caller manage the state directly and is therefore more widely applicable. Both groups, however, are quite restrictive in some situations. Broadly speaking they are all "character set conversion functions", but they are tied to the i18n and locale mechanisms: the program must set the correct locale before calling them. That makes them inconvenient, or even unusable, in the following situations:
- The program needs to convert between character set encodings A and B, but neither is the encoding of the program's current locale. With the WCS/MBS conversion functions, the only option is to call setlocale() to switch the program's LC_CTYPE locale to one that uses encoding A, convert string A to a WCS string, call setlocale() again to switch to a locale that uses encoding B, and then convert the WCS string to string B. If no locale using encoding A or encoding B is available, this approach does not work at all.
- The program needs to handle several character sets at the same time. The WCS/MBS conversion functions cannot do this, because once setlocale() sets the program's locale, the setting takes effect throughout the entire program. We cannot set locale A to convert encoding A and "at the same time" set locale B to convert encoding B.
We therefore need a more general character set conversion system, one that is completely independent of the locale, so that the requirements above can be met easily. For this reason the XPG2 standard defines a new set of function interfaces: iconv. In glibc, the WCS/MBS conversion functions look different from iconv on the surface, but underneath, the conversion between WCS and MBS is carried out by iconv. iconv is therefore the most fundamental function interface in the character set conversion system.
The iconv character set conversion system consists of just three functions, declared on most systems in iconv.h. They are used much like file I/O: first "open", then "operate", and finally "close". The functions are listed below (current POSIX and glibc headers declare inbuf as char ** rather than const char **):
- iconv_t iconv_open(const char *toenc, const char *fromenc)
- size_t iconv(iconv_t cd, const char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft)
- int iconv_close(iconv_t cd)
First, iconv_open() performs the "open" step. When we want to convert from encoding A to encoding B, we must call this function first, with fromenc set to the name of encoding A and toenc set to the name of encoding B. Much like opening a file, the function returns an iconv_t value that represents the conversion channel and is used in all later calls. In fact, the implementation treats iconv_open() like opening a file, so it is subject to the limit on the number of files that can be open in the system or in a single process: if the system or other parts of the program already have too many files open, iconv_open() may fail.
If iconv_open() fails, it returns (iconv_t)-1 and sets the global variable errno to indicate the cause. Possible causes are that the number of open files has reached the limit, that the system is out of memory, or that the system cannot convert between encodings A and B. Interested readers can consult info libc, under "Character Set Handling", which describes the errno values and their meanings.
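The open step and its failure check can be sketched as follows. This is only an illustration: the helper name conversion_supported is made up for this example, and it assumes the usual convention that errno is EINVAL when the requested pair of encodings is not supported.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the iconv API): returns 1 if the system
 * can convert `from` to `to`, 0 otherwise, and reports the reason. */
int conversion_supported(const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);      /* note the order: target first */
    if (cd == (iconv_t)-1) {
        if (errno == EINVAL)
            fprintf(stderr, "no conversion %s -> %s\n", from, to);
        else                                /* e.g. too many open files, no memory */
            fprintf(stderr, "iconv_open: %s\n", strerror(errno));
        return 0;
    }
    iconv_close(cd);                        /* always release the descriptor */
    return 1;
}
```

A caller could use conversion_supported("big5", "gb2312") to check a pair before processing any files.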
The second function, iconv(), performs the actual conversion. It can only be used after iconv_open() has returned an iconv_t value. As long as iconv_open() succeeded, the conversion can in principle be carried out regardless of whether the source and target encodings are MBS strings, WCS strings, or one of each. Note that a WCS string, even though it is stored in a wchar_t * array, is still passed to this function as a char *.
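As a small sketch of passing a wide-character buffer as char *, the function below converts UTF-8 text into a wchar_t array. The helper name utf8_to_wcs is hypothetical, and the code assumes that the implementation accepts "WCHAR_T" as an encoding name (glibc and GNU libiconv do); other systems may not.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string into wide
 * characters. Returns the number of wchar_t written, or (size_t)-1. */
size_t utf8_to_wcs(const char *src, wchar_t *dst, size_t dst_count)
{
    /* Assumption: "WCHAR_T" is a valid encoding name on this system. */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)src;             /* iconv() does not modify the input */
    size_t inleft = strlen(src);
    char *out = (char *)dst;            /* the wide buffer is passed as bytes */
    size_t outcap = dst_count * sizeof(wchar_t);
    size_t outleft = outcap;            /* sizes are always counted in bytes */

    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return (outcap - outleft) / sizeof(wchar_t);
}
```

The result is not NUL-terminated; a caller that needs a terminated wide string has to append L'\0' itself.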
To convert string A to string B, string A is passed in through *inbuf, and *inbytesleft holds the length of array A in bytes. The result is written through *outbuf, and *outbytesleft likewise holds the size of the *outbuf array in bytes. After a successful conversion, *inbuf points to the end of the data in array A and *inbytesleft holds the number of unconverted bytes left in the array; *outbuf and *outbytesleft are updated in the same way. This makes it possible to call the function repeatedly on the same arrays A and B. For example, if array A contains an invalid character that cannot be converted, *inbuf and *outbuf stop at the position where the conversion failed. The program can then save the result converted so far, skip the invalid character and continue with the rest, or apply some other special handling.
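The "skip and continue" strategy might look like the sketch below. The helper name convert_skipping_invalid is invented for this example. It drops one input byte per failure, which is a deliberate simplification: for a valid multibyte character that merely has no counterpart in the target encoding, a real program would need to skip the whole character rather than a single byte.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts `inlen` bytes and drops every input byte
 * that iconv() rejects with EILSEQ. Returns the number of bytes written,
 * or (size_t)-1 on any other failure; *skipped counts the dropped bytes. */
size_t convert_skipping_invalid(const char *from, const char *to,
                                const char *in, size_t inlen,
                                char *out, size_t outcap, size_t *skipped)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *ip = (char *)in;
    char *op = out;
    size_t ileft = inlen, oleft = outcap;
    *skipped = 0;

    while (ileft > 0) {
        if (iconv(cd, &ip, &ileft, &op, &oleft) != (size_t)-1)
            break;                      /* all remaining input was converted */
        if (errno != EILSEQ) {          /* E2BIG or EINVAL: give up */
            iconv_close(cd);
            return (size_t)-1;
        }
        /* ip stopped at the offending byte: step over it and resume.
         * op and oleft already account for the output produced so far. */
        ip++;
        ileft--;
        (*skipped)++;
    }
    iconv_close(cd);
    return outcap - oleft;
}
```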
iconv() can fail in several ways, for example when array A contains an invalid character, or when a character in A has no counterpart in B. The second case is the most common; glibc-2.1.x currently treats it as a conversion failure and does nothing further. This is obviously not the best way to handle it, so it may change in the future. According to info libc, when the conversion fails iconv() returns (size_t)-1 and sets errno to indicate the cause. We had expected a successful call to return the number of characters converted, but in our tests on glibc-2.1.3 the return value after a successful conversion was always 0. This is the specified behavior rather than a glibc bug: on success the return value counts only the characters that were converted in a non-reversible way, which is 0 for an exact conversion.
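Besides invalid or unconvertible characters (EILSEQ) and truncated input (EINVAL), iconv() also fails with E2BIG when the output array is full. The sketch below shows one way to react to E2BIG by enlarging the output buffer and calling iconv() again. The helper name convert_alloc and the tiny initial size are choices made for this illustration only.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: converts a buffer, doubling the output buffer each
 * time iconv() reports E2BIG. Returns a malloc'd result that the caller
 * must free, or NULL on failure; *outlen receives the byte count. */
char *convert_alloc(const char *from, const char *to,
                    const char *in, size_t inlen, size_t *outlen)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = 4;                     /* deliberately small to show growth */
    char *buf = malloc(cap);
    char *ip = (char *)in;
    size_t ileft = inlen, used = 0;

    while (buf != NULL && ileft > 0) {
        char *op = buf + used;
        size_t oleft = cap - used;
        size_t rc = iconv(cd, &ip, &ileft, &op, &oleft);
        used = cap - oleft;             /* keep what was converted so far */
        if (rc != (size_t)-1)
            break;
        if (errno != E2BIG) {           /* EILSEQ or EINVAL: real failure */
            free(buf);
            buf = NULL;
            break;
        }
        cap *= 2;                       /* output full: grow and retry */
        char *bigger = realloc(buf, cap);
        if (bigger == NULL)
            free(buf);
        buf = bigger;
    }
    iconv_close(cd);
    if (buf != NULL)
        *outlen = used;
    return buf;
}
```

Because iconv() advances the input pointer past everything it has already converted, the retry simply continues where the previous call stopped.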
iconv() also works correctly with stateful encodings. It records and updates the current shift state of the string throughout the conversion (presumably inside the iconv_t cd structure), so a string can be cut into several segments and converted piece by piece without problems. Before the first use of iconv(), however, the state of strings A and B must be initialized, just as mbsrtowcs(), discussed earlier, needs its state initialized before subsequent operations can work properly. For iconv(), this is done by calling it with inbuf (or *inbuf) and outbuf (or *outbuf) set to NULL.
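Piecewise conversion can be sketched as below. The helper name convert_in_chunks and the chunk parameter are hypothetical; the chunking only imitates data arriving in pieces. The sketch relies on the documented behavior that iconv() fails with EINVAL and leaves the input pointer at the start of a character that is cut off at the end of the supplied input.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: feeds `in` to iconv() at most `chunk` new bytes at a
 * time. Returns the number of bytes written to `out`, or (size_t)-1. */
size_t convert_in_chunks(const char *from, const char *to,
                         const char *in, size_t inlen, size_t chunk,
                         char *out, size_t outcap)
{
    if (chunk == 0)
        return (size_t)-1;
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    iconv(cd, NULL, NULL, NULL, NULL);      /* put cd in its initial state */

    char *ip = (char *)in, *op = out;
    const char *end = in + inlen;
    size_t oleft = outcap;
    size_t pending = 0;                     /* bytes of a split character */
    int ok = 1;

    while (ok && ip < end) {
        size_t fresh = (size_t)(end - ip) - pending;
        size_t take = fresh < chunk ? fresh : chunk;
        size_t ileft = pending + take;      /* leftover tail plus new bytes */
        size_t rc = iconv(cd, &ip, &ileft, &op, &oleft);
        if (rc == (size_t)-1 && errno != EINVAL)
            ok = 0;                         /* EILSEQ or E2BIG */
        else if (ileft > 0 && take == 0)
            ok = 0;                         /* input ends inside a character */
        pending = ileft;
    }
    /* Write any closing shift sequence of a stateful target encoding. */
    if (ok && iconv(cd, NULL, NULL, &op, &oleft) == (size_t)-1)
        ok = 0;
    iconv_close(cd);
    return ok ? outcap - oleft : (size_t)-1;
}
```

With a chunk size of 2, a two-byte character that straddles a chunk boundary is simply offered again together with the next chunk.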
The last function, iconv_close(), "closes the file" once the whole conversion is finished.
The following example program shows how to use the iconv interface:
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(int argc, char **argv)
{
    FILE *fin, *fout;
    char *encfrom, *encto;
    char bufin[1024], bufout[1024], *sin, *sout;
    size_t lenin, lenout, ret;
    int nline;
    iconv_t c_pt;

    if (argc != 5) {
        printf("Usage: a.out <encfrom> <encto> <fin> <fout>\n");
        return 0;
    }
    encfrom = argv[1];
    encto = argv[2];
    if ((fin = fopen(argv[3], "r")) == NULL) {
        printf("cannot open file: %s\n", argv[3]);
        return -1;
    }
    if ((fout = fopen(argv[4], "w")) == NULL) {
        printf("cannot open file: %s\n", argv[4]);
        return -1;
    }

    /* The target encoding comes first, the source encoding second. */
    if ((c_pt = iconv_open(encto, encfrom)) == (iconv_t) -1) {
        printf("iconv_open false: %s ==> %s\n", encfrom, encto);
        return -1;
    }
    /* Reset the conversion state before the first real call. */
    iconv(c_pt, NULL, NULL, NULL, NULL);

    nline = 0;
    while (fgets(bufin, 1024, fin) != NULL) {
        nline++;
        /* Include the terminating NUL so that bufout is terminated too. */
        lenin = strlen(bufin) + 1;
        lenout = 1024;
        sin = bufin;
        sout = bufout;
        ret = iconv(c_pt, &sin, &lenin, &sout, &lenout);
        printf("%s -> %s: %d: ret = %ld, len_in = %zu, len_out = %zu\n",
               encfrom, encto, nline, (long) ret, lenin, lenout);
        if (ret == (size_t) -1) {
            printf("stop at: %s\n", sin);
            break;
        }
        fprintf(fout, "%s", bufout);
    }
    iconv_close(c_pt);
    fclose(fin);
    fclose(fout);
    return 0;
}
This program takes the encoding names of the source and target files on the command line and converts the content of the source file into the target file. You may notice that the program never sets the locale or does anything related to it. It does not need to: the iconv interface is completely independent of the locale. Note also that our program is only suitable for converting between two MBS encodings. It is not suitable when one of the encodings is a WCS, because the program does not use wchar_t * arrays to handle WCS data. As mentioned earlier, WCS strings cannot be written to files directly, and our program converts file contents directly.
How robust is this kind of conversion program? Interested readers can test it with us. We described earlier how to install the zh_TW.Big5 and zh_CN.GB2312 locales on a GNU/Linux system. If they are installed correctly, the big5 and gb2312 gconv modules on your system should in theory work as well (in glibc, gconv can be called the heart of iconv; it is described in detail in the next section). We will therefore use the big5 and gb2312 encodings to test our program. First, prepare a file named f-big5 containing the following single line in big5 encoding:
I am a graduate student
Then compile the sample program. Assuming the executable is named a.out, run:
a.out big5 gb2312 f-big5 output
On a glibc-2.1.x system the conversion should succeed, and the program prints:
big5 -> gb2312: 1: ret = 0, len_in = 0, len_out = 1012
A ret value of 0 means the conversion succeeded. If you open a terminal that can display gb2312 and view the output file, you will see that the content of f-big5 has been converted to gb2312.
So far everything looks perfect. Let's test a little further and add a second line to f-big5, like this:
I am a graduate student
Currently conducting research and testing
Run the same command again. The result is now:
big5 -> gb2312: 1: ret = 0, len_in = 0, len_out = 1012
big5 -> gb2312: 2: ret = -1, len_in = 6, len_out = 1008
stop at: test
The first line is fine, but the second line fails: the program reports that the last two characters of the second line ("test") cannot be converted. Why?
To find out, we need to look more closely at how iconv converts between encodings. We have mentioned several times that converting between MBS and WCS really means converting the MBS into the system's base character set, and iconv works on the same principle. In glibc there are two ways to convert encoding A to encoding B. If iconv has a table that maps A to B directly, that table is used; this is the most reliable method. If no suitable table is found, A is first converted to the base character set and then from the base character set to B. In other words, the system's base character set acts as the intermediary.
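The second, indirect route can be imitated by hand with two iconv descriptors. The sketch below only illustrates the idea and is not how glibc implements it internally. The helper names are hypothetical, "UCS-4LE" is assumed to be an available encoding name for the intermediate form, and the fixed pivot buffer limits the input to 256 characters.

```c
#include <iconv.h>
#include <stddef.h>

/* One conversion step; returns the bytes written or (size_t)-1. */
static size_t convert_step(const char *from, const char *to,
                           const char *in, size_t inlen,
                           char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    char *ip = (char *)in;
    char *op = out;
    size_t ileft = inlen, oleft = outcap;
    size_t rc = iconv(cd, &ip, &ileft, &op, &oleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - oleft;
}

/* Hypothetical helper: converts A -> UCS4 -> B explicitly. */
size_t convert_via_ucs4(const char *from, const char *to,
                        const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    char pivot[1024];                   /* 4 bytes per character */
    size_t n = convert_step(from, "UCS-4LE", in, inlen, pivot, sizeof pivot);
    if (n == (size_t)-1)
        return (size_t)-1;
    return convert_step("UCS-4LE", to, pivot, n, out, outcap);
}
```

If a character of A lands on a UCS4 code that has no counterpart in B, the second step fails even though both steps are individually supported.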
Other UNIX systems may provide only the first method. glibc provides the second so that any two encodings can be converted into each other, as long as the system supports both of them. Moreover, since the base character set chosen by glibc in theory covers every character set currently in use in the world, using it as the intermediary is both reasonable and feasible.
Character set encodings are complicated, however, and this approach has its pitfalls. The experiment below shows why the big5 characters for "test" cannot be converted to gb2312. glibc uses UCS4 as its base character set, and the MBS form of UCS4 is the utf8 encoding. So we prepare two files: f-big5, containing the word "test" in big5 encoding, and f-gb, containing the word "test" in gb2312 encoding. We then use the sample program to convert both files to utf8:
a.out big5 utf8 f-big5 output1
a.out gb2312 utf8 f-gb output2
Both conversions succeed. But when we compare output1 and output2 with diff, we find that their contents differ!
This is why the conversion fails: the two characters, which mean the same thing in Chinese, are mapped to different UCS4 codes in big5 and in gb2312. Why that is so involves deeper questions about the UCS4 encoding rules and how they correspond to other encodings; we skip them here and will return to them if the opportunity arises. The point for now is that in glibc, if some characters of A and B map to different codes in the base character set, the conversion may fail. This is quite likely to happen, especially since glibc uses UCS4 directly as its base character set.
In addition, as mentioned above, the underlying mechanism of iconv may differ between UNIX systems, which leads to a more serious problem: even if a system supports both encodings A and B, there is no guarantee that A can be converted to B. Likewise, a system may be able to convert A to C and C to B, yet offer no guarantee that A can be converted to B, directly or indirectly.
This is a rather discouraging problem. Readers may now ask what iconv is good for. In our view, it is useful for converting between a small character set and a larger one that contains it. For example, the big5 and gb2312 character sets are both subsets of the character set encoded by utf8, so we can confidently use iconv to convert big5 or gb2312 to utf8, or to convert back in the opposite direction (provided, of course, that the utf8 string contains no characters outside big5 or gb2312). Similarly, big5 is a subset of GBK, so iconv can be used for that conversion as well.
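The proviso about characters outside the target character set can be checked in advance. The sketch below tests whether a UTF-8 string can be converted exactly into a given encoding. The helper name representable is invented for this example. Treating a non-zero return value as "not representable" is a choice made here, because some implementations substitute a replacement character and report it through the return value instead of failing with EILSEQ.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every character of the UTF-8 string
 * `text` exists in `charset`, 0 if not, and -1 if the system does not
 * support conversion to `charset`. */
int representable(const char *charset, const char *text)
{
    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char scratch[256];                  /* the converted bytes are discarded */
    char *ip = (char *)text;
    size_t ileft = strlen(text);
    int ok = 1;

    while (ileft > 0 && ok) {
        char *op = scratch;
        size_t oleft = sizeof scratch;
        size_t rc = iconv(cd, &ip, &ileft, &op, &oleft);
        if (rc == (size_t)-1 && errno == E2BIG)
            continue;                   /* scratch buffer full: keep going */
        if (rc != 0)                    /* error or non-reversible conversion */
            ok = 0;
    }
    iconv_close(cd);
    return ok;
}
```

As an illustration (the source does not preserve the original characters, so this pair is an assumption), the traditional character 測 (U+6E2C) and the simplified character 测 (U+6D4B) are distinct UCS4 codes: on a system with a gb2312 converter, representable("GB2312", ...) would be expected to accept the second but not the first.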
If you want to convert big5 to gb2312 directly with iconv, however, the only reliable way is to add a table to the iconv system that maps directly between big5 and gb2312. We also hope that such a table can become part of the system standard (or at least of glibc), so that our program becomes useful (at least on glibc systems). We now move on to the next section, which looks at the core of glibc's iconv and concludes this chapter.
