How MSVC Handles UTF-8 Encoded Chinese Text in C++

Last Update:2017-08-03 Source: Internet

Author: User

Tags locale


Character encoding is a problem I first met at university, and it has stayed foggy for me ever since. Recently I hit a concrete case: I wanted to output a UTF-8 encoded stream of Chinese characters to the console from C++. Every attempt came out garbled. After spending some time reading up on it and talking it over with colleagues, I think I now roughly understand how C++ under MSVC handles UTF-8.

Character sets

First, a term: character set. If you have not heard of it, look it up first; for our purposes it is a character encoding format. ASCII, UTF-8, and GBK are all commonly used character sets.

To be clear from the start: from the moment you type a UTF-8 character in the editor to the moment it finally shows up on the console, the whole process involves three concepts: the source character set, the execution character set, and the parsing character set.

I will explain each in turn:

Source character set: the character set of your source code text file. A text editor such as Notepad++ can show you which encoding a file uses. On the hard drive your source file is just bytes, and Chinese and English are treated the same way: when you type a Chinese character, save, and close the file, the character is stored as bytes in the source character set.

Execution character set: in C++, given const char* str = "我";, which bytes does str actually hold? You might say the source character set has already decided how 我 is encoded. It has, but the execution character set gets to reinterpret it at this point. For example, my source character set may be UTF-8, yet I can have str end up holding GBK-encoded bytes.

Parsing character set: this comes into play when the stored bytes are finally turned back into something displayed. For example, when printf sends the str above to the console, the bytes are interpreted according to the parsing character set, and the matching characters are looked up and displayed.
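To make "which bytes does str hold" concrete, here is a minimal sketch of a byte dump. The helper name hex_bytes is hypothetical and not part of the original article; the escaped byte sequences in the usage note stand in for what a compiler would store under a UTF-8 or GBK execution character set.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the bytes of a C string as upper-case hex,
// separated by spaces, so the stored encoding can be inspected directly.
std::string hex_bytes(const char* s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Calling hex_bytes("\xE6\x88\x91") yields "E6 88 91", the UTF-8 bytes of 我, while hex_bytes("\xCE\xD2") yields "CE D2", its GBK bytes. Passing a real string literal instead shows which of the two the compiler actually stored.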

Laid out like this it does not look too messy, but there are plenty of pitfalls. How these character sets are handled depends on the specific compiler and even on the operating system, and compilers differ. I will only cover VS2013 (the MSVC compiler) on Windows 7.

The concept of character sets in VS2013

1. Source character set: in VS2013, Advanced Save Options in the File menu lets you view and set the source character set of the current source file.

2. Execution character set: by default VS2013 chooses it from the system locale. Most readers will be on a Chinese edition of Windows with the locale set to China, so it is GBK.

3. Parsing character set: from my experiments, unless you change it manually, standard output (printf) to the command line in VS2013 is also determined by the system locale, that is, GBK.

Case Analysis

Now let's work through a case. Suppose we save the following source code as UTF-8 (without BOM), and analyze what is displayed on the console.

const char* str = "我";
printf("%s\n", str);

1. First, in the text of this source file, the character 我 is encoded as the three bytes E6 88 91.

2. When the compiler compiles this code, the default execution character set is GBK, so to determine the bytes of str it converts the bytes stored in the source text to GBK. This raises an obvious question: to convert to GBK, it has to know which encoding it is converting from. How does MSVC know the source encoding? There is only one method: it checks whether the source file has a BOM. If there is a BOM, it takes the source encoding to be the one the BOM specifies (if you do not know what a BOM is, look it up first). If there is no BOM, it assumes the source character set is the one associated with the locale. As I said, we saved the source file as UTF-8 without BOM, so the compiler believes the 我 in the source text is stored as GBK.

3. Going from GBK to GBK, MSVC performs no conversion. There is one small catch worth pointing out: this code should actually fail to compile. A GBK Chinese character takes two bytes while the UTF-8 encoding of 我 takes three, so in order to pair the bytes up the compiler swallows the double quote that follows 我, producing two GBK character codes, E688 and 9122 (22 is the encoding of the double quote). With the closing quote gone, the compiler reports an error. The simplest workaround is to add one more Chinese character after it so that the byte count is even.

4. Supposing the program does compile and run, printf writes to the console. The parsing character set at this point is also GBK, so the E688 and 9122 in memory are looked up in the GBK character set and the console shows "鎴?". That is of course wrong.
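Steps 2 and 3 above can be sketched as a small simulation. This is not MSVC's actual implementation: the function names are hypothetical, and the pairing rule is a deliberate simplification (an assumption) in which any byte of 0x81 or above is treated as the first byte of a two-byte character and consumes whatever byte follows it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the buffer starts with the UTF-8 byte order mark EF BB BF.
// Without it, the article's account is that MSVC falls back to the
// locale's character set (GBK on a Chinese system).
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Simplified double-byte pairing (assumption, not the real GBK rules):
// a byte >= 0x81 is a lead byte and takes the next byte with it;
// any other byte stands alone.
std::vector<std::string> pair_double_byte(const std::string& bytes) {
    std::vector<std::string> units;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        const std::size_t len = (b >= 0x81 && i + 1 < bytes.size()) ? 2 : 1;
        units.push_back(bytes.substr(i, len));
        i += len;
    }
    return units;
}
```

Feeding it the bytes E6 88 91 22 (one 我 followed by the closing quote) gives two units, E6 88 and 91 22, so the quote is swallowed. Feeding it E6 88 91 E6 88 91 22 (two characters and the quote) gives four units, the last of which is the quote on its own, which matches the workaround of making the byte count even.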

You can look up character encodings at this website: http://www.mytju.com/classcode/tools/encode_utf8.asp

Solution

That was where I first went wrong. Now that the problem is understood, how do we fix it? To get the 我 in a UTF-8 encoded source file to display on the command line, we need to work through the following:

1. First, the bytes of str must be UTF-8 at compile time, which means the execution character set has to be UTF-8. Previously the MSVC execution character set was determined by the locale and could not be changed, but Microsoft later added a preprocessor directive, #pragma execution_character_set("utf-8"), which tells the compiler to use UTF-8 as the execution character set.

2. To convert to the execution character set at compile time, the compiler needs to know the source character set. Earlier we saved without a BOM, which led MSVC to treat our source file as GBK when it was really UTF-8. So we need to save the source as UTF-8 with BOM instead. That removes the problem.

3. Finally, the display. Since the bytes in memory are UTF-8, they must also be parsed as UTF-8, so we need to change the default parsing character set from GBK to UTF-8. The simplest way is to call system("chcp 65001");
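Put together, the three steps might look like the sketch below. The function names are hypothetical, and the preprocessor guards are my addition so that the file also builds with compilers other than MSVC; the file itself is assumed to be saved as UTF-8 with BOM when built with VS2013.

```cpp
#include <cstdio>
#include <cstdlib>

// Step 1: ask MSVC to store narrow string literals as UTF-8.
// Other compilers do not know this pragma, hence the guard.
#ifdef _MSC_VER
#pragma execution_character_set("utf-8")
#endif

// Step 2 is not visible in code: save this file as UTF-8 with BOM
// so that MSVC recognizes the source character set.
const char* sample_text() {
    return "我";
}

// Step 3: switch the console code page to UTF-8 (65001).
// Only meaningful on Windows; elsewhere this does nothing.
void use_utf8_console() {
#ifdef _WIN32
    std::system("chcp 65001");
#endif
}

// Print the sample text and return printf's result (bytes written).
int print_sample() {
    use_utf8_console();
    return std::printf("%s\n", sample_text());
}
```

Calling print_sample() from main is all that is needed. Keeping the chcp call in its own function makes it easy to run once at startup rather than before every print.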

With that, the UTF-8 characters should display correctly. One problem remains: if str is output with cout, the result is still garbled. This may be because cout has its own parsing character set that does not change with the chcp command. This needs further study; if anyone knows the answer, please leave a comment.

One more point: #pragma execution_character_set("utf-8") is no longer needed in C++11, which lets you specify the execution character set of a string literal directly, as in u8"我". It is that simple, but VS2013 does not support this feature.

This article is not really about how to output a C++ literal as UTF-8 to the console. Rather, it uses that as an example of how MSVC C++ handles UTF-8 characters.

Please respect the work of others. If you repost this article, credit the author Esfog and the original address: http://www.cnblogs.com/Esfog/p/MSVC_UTF8_CHARSET_HANDLE.html


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
