Character set and MySQL character set processing (i)

Source: Internet
Author: User
Tags: locale

I. Character set overview

Most of the material in this article is already well covered elsewhere; what follows are simply my own takeaways.

1. Although UTF-8 begins with UTF (Unicode Transformation Format), it is not Unicode itself. It is a re-encoding built on top of UCS, and it is a variable-length encoding.

2. According to the referenced article, while ISO was developing UCS (Universal Character Set), a consortium of manufacturers was developing a similar code called Unicode. The two groups later joined forces to develop a unified code, but each continued to publish its own standard document, which is why the UCS and Unicode code assignments are the same.

3. When we use the "locale -a" command to list all available locales under Linux, the character sets they name are all supersets of ASCII.

4. When we save a file as UTF-8 under Windows, Windows writes a three-byte signature (the UTF-8 BOM) at the very beginning of the file, which marks the file as UTF-8 rather than some other character set. Since UTF-8 is a variable-length encoding, this makes files easier to distinguish: a file containing only ASCII characters can still be treated as UTF-8, and if it later contains CJK (Chinese, Japanese, Korean) characters they can still be decoded correctly. But because of this signature, a file saved as UTF-8 under Windows and then transferred by FTP to Linux will fail to compile, even with gcc -finput-charset=UTF-8 set (the compiler reports unknown stray characters at the start of the file).

5. In general, the files we store (mainly configuration files and code files) should use UTF-8 encoding: on the one hand because GCC's default -finput-charset is UTF-8, and on the other hand because UTF-8 can represent all characters (Chinese and English alike). Java source files are commonly stored the same way.

6. Note the proper noun "C locale". This is the default locale a C program is in when it enters main (it can be inspected with setlocale(LC_ALL, NULL)). Its character set is ASCII-based; every C implementation supports it, and any program can run under it, hence the name "C locale".

II. GCC's support for character sets

The content here mainly comes from the referenced article and man gcc; this is a summary. Several character sets matter during compilation:

    • Character set A of the source code file
    • Character set B used internally by GCC (UTF-8)
    • Character set C used for strings in the binary GCC outputs (UTF-8 by default; can be specified with the -fexec-charset option)

Specifically, when we compile a source file with gcc, it simply assumes that the file's character set is whatever the -finput-charset option says (UTF-8 by default); it may have no idea what the file's actual character set is. It transcodes from the -finput-charset character set into character set B, which it uses internally, and after compilation the output is converted from the internal character set B into character set C.

Anyone who has studied compiler principles knows that a binary file consists mostly of instructions, so what needs character set C at all? Hard-coded strings, of course. For example, take the following code:

#include <stdio.h>

int main(void)
{
    printf("你好\n");
    return 0;
}

Assume the source file is UTF-8 encoded and we use gcc's default -fexec-charset. Then "你好" is encoded using character set C, which we can inspect with:

$ od -tc nihao.c
0000000   #   i   n   c   l   u   d   e       <   s   t   d   i   o   .
0000020   h   >  \n  \n   i   n   t       m   a   i   n   (   v   o   i
0000040   d   )  \n   {  \n  \t   p   r   i   n   t   f   (   " 344 275
0000060 240 345 245 275   \   n   "   )   ;  \n  \t   r   e   t   u   r
0000100   n       0   ;  \n   }  \n
0000107

where octal 344 275 240 (hex e4 bd a0) is the UTF-8 encoding of "你", and octal 345 245 275 (hex e5 a5 bd) is "好".

The conclusion is that we should write our source programs in UTF-8 whenever possible; then we need not set -finput-charset or -fexec-charset explicitly, which makes the code easy to port.

III. Program execution and character sets

1. What exactly does printf("%s", ...) do? See the reference documentation.

When we use printf("%s", ...) in a program, it actually writes the bytes of the string, starting at its first address, into the device file of the current terminal. If the terminal's driver recognizes UTF-8, the Chinese characters are printed; if it does not (as with a plain character terminal), they are not. In other words, the work of recognizing Chinese characters is done neither by the C compiler nor by libc: the compiler copies the UTF-8 bytes from the source file into the object file verbatim (assuming the source is UTF-8 and -finput-charset is not set otherwise), and libc simply hands the zero-terminated string to the write system call unchanged. It is the terminal driver that does the recognizing.

I had assumed the terminal would transcode for us because we set the "LANG" environment variable (which in effect sets LC_ALL), so I ran the following experiment.

#include <iostream>
#include <locale.h>
#include <cstdio>
#include <string>

using namespace std;

int main(int argc, char **argv)
{
    string s = "你好";
    cout << s << endl;

    char buff[10] = "你好";
    for (int i = 0; i < 6; i++)    // 6 bytes: the UTF-8 encoding of "你好"
    {
        printf("%2x ", (unsigned char)buff[i]);
    }
    cout << endl;

    return 0;
}

Now the string in the output binary is guaranteed to be UTF-8 encoded. The experiment ran as follows.

The output turned out to be independent of the current terminal's character set setting: the byte stream coming out of the program is the UTF-8 encoding of "你" and "好", and the device driver parsed it correctly.

2. What does setlocale() do?

While doing the experiment above, I actually expected the terminal to transcode input and output according to its character set. So in the code I also deliberately tried various setlocale(LC_ALL, "xxxx") calls to see what would happen. But the result was always the same as described above; garbled output never appeared.

After some analysis and a friend's nudge, I finally connected setlocale with wcstombs (wide-character string to multibyte string) and mbstowcs (multibyte string to wide-character string).

To clarify this, we must first start with wide-character strings and multibyte strings. Why introduce wide strings at all? There is a good explanation of this, excerpted below.

"The most fundamental reason is that the strings under ANSI identify the end of the string with ' Unicod ' (the" \0\0 "bundle), and the correct operation of many string functions is based on this. And we know that in the case of wide characters, a character in memory to occupy a word space, this will make the operation of the ANSI character string function does not operate correctly. Take the "Hello" string for example, under the wide character, its five characters are:

0x0048 0x0065 0x006c 0x006c 0x006f
In memory, the actual arrangement is:

48 00 65 00 6c 00 6c 00 6f 00 (little-endian: the low byte of each wide character comes first, and since these are ASCII characters the high byte is 00)

Thus an ANSI string function such as strlen, upon hitting the 00 byte right after the first 48, concludes the string has ended; applying strlen to this wide string will always yield 1!"

In fact, the wide string mentioned here is what we usually call a Unicode string, i.e. UCS-2 (that is the case on Windows; on Linux, wchar_t is four bytes, i.e. UCS-4).

mbstowcs converts a multibyte string to a wide string; wcstombs does the reverse.

    • How mbstowcs works

Using an online conversion tool, we first learn that the wide-character (Unicode) representation of the two characters "你好" is \u4f60 \u597d.

The following program is compiled with gcc, using all default options.

#include <locale.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    const char *source = "你好";

    setlocale(LC_ALL, "zh_CN.utf8");

    // Get the length (in wide characters)
    size_t wcs_size = mbstowcs(NULL, source, 0);

    // Allocate memory and initialize
    wchar_t *dest = new wchar_t[wcs_size + 1];
    wmemset(dest, L'\0', wcs_size + 1);

    // Convert the multibyte string to wide characters;
    // note that the third parameter is a byte count
    mbstowcs(dest, source, strlen(source) * sizeof(char));

    // Verify
    for (size_t i = 0; i < wcs_size; i++)
    {
        printf("%x ", dest[i]);
    }
    printf("\n");

    return 0;
}

The output is 4f60 597d, which shows what mbstowcs does.

The point is that mbstowcs uses LC_CTYPE as the encoding of the source (multibyte) string.

    • How wcstombs works

Using an online conversion tool, we first learn that the GBK encoding of the two characters "你好" is C4 E3 BA C3.

The experimental program:

#include <locale.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    const char *source = "你好";

    setlocale(LC_ALL, "zh_CN.utf8");

    // Get the length (in wide characters)
    size_t wcs_size = mbstowcs(NULL, source, 0);

    // Allocate memory and initialize
    wchar_t *dest = new wchar_t[wcs_size + 1];
    wmemset(dest, L'\0', wcs_size + 1);

    // Convert the multibyte string to wide characters;
    // the third parameter is a byte count
    mbstowcs(dest, source, strlen(source) * sizeof(char));

    // Switch to the GBK locale for the conversion back
    setlocale(LC_ALL, "zh_CN.GBK");

    // Get the length (in bytes)
    size_t mbs_size = wcstombs(NULL, dest, 0);

    // Allocate memory and initialize
    char *buf_mbs = new char[mbs_size + 1];
    memset(buf_mbs, 0, mbs_size + 1);

    // Convert the wide string back to a multibyte string;
    // the third parameter is again a byte count
    wcstombs(buf_mbs, dest, mbs_size);

    // Verify
    for (size_t i = 0; i < mbs_size; i++)
    {
        printf("%2x ", (unsigned char)buf_mbs[i]);
    }
    printf("\n");

    return 0;
}

The output, c4 e3 ba c3, matches the expectation, which shows what wcstombs does.

The point is that wcstombs uses LC_CTYPE as the encoding of the destination (multibyte) string.

Summary: setlocale needs to be used in conjunction with wcstombs and mbstowcs. According to the referenced article, a program usually uses wide characters for internal computation, and multibyte encodings when saving to disk, exporting to another program, or sending data over the network. This reminds me of reading the fourth edition of "Windows via C/C++", whose second chapter seems to cover this very issue; I never practiced it, so I forgot.
