We know that CLabel's strings use UTF-8 encoding.
Under Xcode, writing CLabel *p = ...; p->setText("汉字") displays the Chinese characters without any problem,
but in VS2013 the same code shows garbled characters.
Many people ask about this in the group, and I answer: please use UTF-8 encoding.
They reply: but my file is already in UTF-8 format. Well, it is not that simple: the encoding of the file and the encoding of the string literals inside it are two different concepts.
VS2013 has a preprocessor directive for this:
#pragma execution_character_set("utf-8")
With it, p->setText("汉字") and p->setText("中文") work safely in VS2013.
Since it is just a preprocessor directive, it is usually best to put this line right after the #include group, as in the sketch below.
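For reference, here is a minimal sketch of where the pragma goes. It assumes a Qt-style label class with a setText method standing in for the CLabel of the opening example; the class name, header, and function name are illustrative only, and the source file is assumed to be saved as UTF-8 with a BOM.

// Sketch only: QLabel is used here as a stand-in for the label class above.
#include <QLabel>

#pragma execution_character_set("utf-8")   // tells VC the string literals below are UTF-8

void showChinese(QLabel *p)                // hypothetical helper, for illustration
{
    p->setText("汉字");                    // displays correctly in VS2013 with the pragma
    p->setText("中文");
}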
Chinese characters in C++ source strings under VC2010
To figure this out, you first need to understand encodings. But encodings are a big, messy topic, and there is no way to cover them fully here.
Let's start from a concrete example: what are the Unicode code points, the UTF-8 encoding, and the GBK encoding of "中文"?
First go to this site and look up the Unicode code points and UTF-8 encoding of "中文": http://www.mytju.com/classcode/tools/encode_utf8.asp
The Unicode code points are (decimal): 中 (20013), 文 (25991). The corresponding UTF-8 encodings are (hexadecimal): 中 (E4 B8 AD), 文 (E6 96 87).
Then go to the following site and look up the GBK encoding of "中文": http://www.mytju.com/classcode/tools/encode_gb2312.asp
The GBK encodings are (hexadecimal): 中 (D6 D0), 文 (CE C4).
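As a quick cross-check (not part of the original experiment), a compiler with C++11 u8 literals can print these UTF-8 bytes directly. A minimal sketch, assuming a pre-C++20 compiler where u8"..." is still a plain char array (so not VC2010 itself):

#include <cstdio>

int main()
{
    const char* u = u8"中文";                        // literal forced to UTF-8 regardless of file encoding
    for (int i = 0; u[i] != '\0'; ++i)
        std::printf("%02X ", (unsigned char)u[i]);   // prints E4 B8 AD E6 96 87
    return 0;
}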
Now we know the exact UTF-8 and GBK encodings of "中文". Let's see what VC2010 does with them.
1. First, look at UTF-8 source without a BOM (utf8_no_bom.cpp)
// utf8_no_bom.cpp: saved as UTF-8 without a BOM
// (compiling it produces warning C4819: the file contains characters that
//  cannot be represented in the current code page (936))
#include <stdio.h>

int main()
{
    const char* str = "中文";
    // sizeof(str) is the size of the pointer (4 on a 32-bit build),
    // so only the first 4 bytes of the literal are printed.
    for (int i = 0; i < sizeof(str); ++i) {
        printf("0x%x ", str[i] & 0xff);
    }
    return 0;
}
// Output: 0xe4 0xb8 0xad 0xe6
The output is 0xe4 0xb8 0xad 0xe6, which is the beginning of the UTF-8 encoding of "中文". It looks right.
But don't celebrate yet: VC emits a warning while compiling: utf8_no_bom.cpp : warning C4819: The file contains characters that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss.
The subtext is: this source file contains characters that GBK cannot represent, so please save it in Unicode. In other words, VC did not treat utf8_no_bom.cpp as UTF-8 at all; it simply treated it as GBK.
So why was the output still correct?
Because VC reads utf8_no_bom.cpp as GBK, and at compile time it also converts it to the local code page, which is GBK as well, so the bytes pass through unchanged. The UTF-8 byte sequence of "中文" (E4 B8 AD E6 96 87) is therefore kept in the program, but VC treats those bytes as whatever other GBK characters they happen to spell; VC no longer knows that they stand for "中文".
However, while treating the file as GBK (it is really UTF-8 without a BOM), VC found byte sequences that are not valid GBK, which is why the C4819 warning appeared.
2. Now look at how UTF-8 with a BOM is handled (utf8_with_bom.cpp)
// utf8_with_bom.cpp: saved as UTF-8 with a BOM
#include <stdio.h>

int main()
{
    const char* str = "中文";
    // again, sizeof(str) == 4 on a 32-bit build, so 4 bytes are printed
    for (int i = 0; i < sizeof(str); ++i) {
        printf("0x%x ", str[i] & 0xff);
    }
    return 0;
}
// Output: 0xd6 0xd0 0xce 0xc4
There is no compiler warning this time, but the output is puzzling: 0xd6 0xd0 0xce 0xc4.
The source file clearly stores the string as UTF-8 (0xe4 0xb8 0xad ...); how did it become 0xd6 0xd0 0xce 0xc4, which is the GBK encoding?
This is a "favor" VC does behind your back: it quietly converts the UTF-8 source into GBK before processing it.
Why would VC do such a silly thing?
For compatibility with older VC versions: earlier VC compilers could not handle UTF-8 and processed everything in the local encoding.
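The effect of that conversion can be reproduced by hand with the Win32 conversion APIs, using the byte values looked up earlier. This is only an illustration of the UTF-8 to UTF-16 to GBK round trip, not what the compiler literally calls:

// Sketch: convert the UTF-8 bytes of "中文" to GBK (code page 936). Win32 only.
#include <windows.h>
#include <stdio.h>

int main()
{
    const char utf8[] = "\xe4\xb8\xad\xe6\x96\x87";   // UTF-8 bytes of "中文"
    wchar_t wide[8] = {0};
    char gbk[8] = {0};

    // UTF-8 -> UTF-16
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 8);
    // UTF-16 -> GBK (code page 936)
    WideCharToMultiByte(936, 0, wide, wlen, gbk, 8, NULL, NULL);

    for (int i = 0; gbk[i] != '\0'; ++i)
        printf("0x%x ", (unsigned char)gbk[i]);       // prints 0xd6 0xd0 0xce 0xc4
    return 0;
}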
3. Now look at how real GBK source is handled (gbk.cpp)
// gbk.cpp: saved in GBK (the local code page, 936)
#include <stdio.h>

int main()
{
    const char* str = "中文";
    for (int i = 0; i < sizeof(str); ++i) {
        printf("0x%x ", str[i] & 0xff);
    }
    return 0;
}
// Output: 0xd6 0xd0 0xce 0xc4
No compilation warnings, and the output matches the source: 0xd6 0xd0 0xce 0xc4.
Because the source file is GBK, cl converts GBK to GBK at compile time, which leaves the string unchanged.
But these days many people do not want to use GBK (it is only used in China and cannot represent the characters of the rest of the world).
At this point we can briefly summarize:
The VC editor and the VC compiler are two different things; the editor supporting UTF-8 does not mean the compiler supports UTF-8 too.
The VC editor has supported UTF-8 with a BOM since (roughly) VC2008; UTF-8 without a BOM is still not supported, because it conflicts with the local encoding.
The VC compiler only supports UTF-8 from VC2010 onward (and the way it is supported is not elegant).
4. Now see how VC2010 handles UTF-8 with a BOM plus the pragma (utf8_with_bom_2010.cpp)
VC2010 added UTF-8 support in the compiler via #pragma execution_character_set("utf-8"); see:
http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/2f328917-4e99-40be-adfa-35cc17c9cdec
// utf8_with_bom_2010.cpp: saved as UTF-8 with a BOM, compiled with VC2010
#pragma execution_character_set("utf-8")   // this line is the key!
#include <stdio.h>

int main()
{
    const char* str = "中文";
    for (int i = 0; i < sizeof(str); ++i) {
        printf("0x%x ", str[i] & 0xff);
    }
    return 0;
}
// Output: 0xe4 0xb8 0xad 0xe6
No compilation warnings, and the output matches the bytes in the source file: 0xe4 0xb8 0xad 0xe6.
UTF-8 source, UTF-8 output. Perfect!
Why does the VC editor support UTF-8 only with a BOM (the point noted in the summary above)? The cause:
Microsoft uses a BOM with UTF-8 because it allows UTF-8 files to be told apart from ASCII/locally encoded files. The BOM byte sequence carries no real meaning in UTF-8 itself; it only has meaning in UTF-16 or UTF-32, where it marks the byte order.
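For completeness, the UTF-8 BOM is the three bytes EF BB BF at the start of the file. A small sketch of how one could check for it (the file name is just the example name used above):

// Sketch: detect a UTF-8 BOM (EF BB BF) at the start of a file.
#include <stdio.h>

int has_utf8_bom(const char* path)
{
    unsigned char b[3] = {0};
    FILE* f = fopen(path, "rb");
    if (!f) return 0;
    size_t n = fread(b, 1, 3, f);
    fclose(f);
    return n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF;
}

int main()
{
    printf("BOM: %d\n", has_utf8_bom("utf8_with_bom.cpp"));
    return 0;
}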
Reprinted from: https://m.oschina.net/blog/161676
http://www.zhihu.com/question/20167122
The encoding of a file and the encoding of the string literals inside that file are two different concepts.