C++ does not support Unicode, not even UTF-8

Source: Internet
Author: User

Today, treating strings as Unicode is common sense, but it is still a headache for some programming languages with a long history. Without a third-party library, C++ cannot really handle Unicode effectively, not even UTF-8. (Note: this article discusses how strings are encoded in memory, not in files or network traffic.)

When the STL's string template was born, Unicode was still the idealized fixed-width 16-bit encoding. Windows, Java, and others moved into the Unicode era one after another, while Unix/Linux was held back by backward compatibility and found it hard to change. At that time, C++ programming on Windows mostly meant the Win32 API, the STL was not yet popular, and Unix/Linux basically did not support Unicode. The STL's wstring, which simply replaces the char template parameter with wchar_t, seemed perfectly reasonable but had never been tested in practice. As a result, wstring on Windows has remained practically unusable because encoding conversion is problematic across the various I/O facilities, while wchar_t on Linux is 32 bits, which wastes memory and is not worth using at all. (The newer C++11 standard introduces char16_t and char32_t for Unicode, along with u16string and u32string.)

Why is wchar_t 32 bits on Linux? Because GCC began supporting wide characters at about the time the Unicode character set broke through the 16-bit limit. The code space that had been expected to be more than sufficient turned out not to be enough, and Unicode had to adjust. After the adjustment, there are three Unicode encodings to choose from:
UTF-32: each code unit is a fixed 4 bytes, and one code unit encodes one character.
UTF-16: compatible with the earlier 16-bit encoding; each code unit is 2 bytes, and 1 or 2 code units encode one character.
UTF-8: compatible with ASCII; each code unit is 1 byte, and 1 to 6 code units encode one character. (The current standard allows at most 4 code units, but for compatibility reasons 5- and 6-unit sequences may still be encountered, even though they are invalid encodings.)
(On top of this, combining characters and modifiers make the mapping from code points to characters even more complex; they are ignored here.)
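To make the difference in code units concrete, here is a minimal sketch that stores one character outside the BMP (U+1F600, chosen arbitrarily for illustration) in each of the three encodings. The helper names are made up for this example, and the UTF-8 bytes are written as escapes so the result does not depend on how the source file is encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+1F600 in each encoding form (illustrative helpers, not a library API).
inline std::string    sample_utf8()  { return "\xF0\x9F\x98\x80"; }
inline std::u16string sample_utf16() { return u"\U0001F600"; }
inline std::u32string sample_utf32() { return U"\U0001F600"; }

// Storage used by the code units of a string, in bytes.
template <class String>
std::size_t byte_size(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}

// size() reports code units, not characters: 4, 2, and 1 for the same character.
void demo() {
    assert(sample_utf8().size() == 4);   // four 1-byte code units
    assert(sample_utf16().size() == 2);  // two 2-byte code units (a surrogate pair)
    assert(sample_utf32().size() == 1);  // one 4-byte code unit
    assert(byte_size(sample_utf16()) == 4);
    assert(byte_size(sample_utf32()) == 4);
}
```

For this particular character all three forms occupy 4 bytes, yet each container reports a different size().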

The advent of UTF-8 gave Linux a new opportunity: since it is compatible with ASCII, wouldn't simply adding UTF-8 as one more system code page amount to Unicode support? Naturally, Linux went this way. Windows does not support UTF-8 as a system code page, which leaves many people who write cross-platform programs very dissatisfied with Microsoft. Consequently, more and more people claim that Windows, Java, .NET, and the rest chose the wrong Unicode encoding and that UTF-8 is the one true way.

I would love a perfect solution for character encoding, but sadly, whenever we try to make it simple, it gets more complicated. The most basic problem: as far as the C++ standard library is concerned, a UTF-8 encoded char array is just a byte buffer, and you simply cannot handle it as a string. Want the length of the string? You get the number of bytes, not the number of characters. Want the nth character? You get the nth byte. Want to upper-case a character, or test whether it is a letter? Those functions accept only a single char, so only ASCII characters are supported...
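The gap between bytes and characters can be sketched as follows. The function names are hypothetical helpers written for this example, and the counting routine assumes its input is already valid UTF-8; it only skips continuation bytes and performs no validation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Upper-cases ASCII letters only; bytes of multi-byte sequences are left
// untouched because std::toupper works on one char at a time.
std::string ascii_upper(std::string s) {
    for (char& c : s) {
        unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x80) c = static_cast<char>(std::toupper(u));
    }
    return s;
}

void demo() {
    const std::string word = "h\xC3\xA9llo";  // "héllo", with é as 2 bytes
    assert(word.size() == 6);                 // bytes
    assert(count_code_points(word) == 5);     // code points
    assert(ascii_upper(word) == "H\xC3\xA9LLO");  // é is not upper-cased
}
```

Even this hand-written count is in code points, not user-perceived characters, since combining sequences are ignored here as well.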

Even for a program that never needs to manipulate individual characters, C/C++ programs always allocate a buffer for input, so how many bytes should a buffer for an expected n-character input have? Right, you can only allocate the maximum possible, n*6+1. For a temporary memory allocation that is fine, but what about a database field? At that point even UTF-32 saves more storage space.
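The worst-case arithmetic can be written down as a small sketch. The function names and the default of 6 bytes per character are assumptions taken from the figures in this article (pass 4 to follow the current standard), and overflow of the multiplication is not checked.

```cpp
#include <cassert>
#include <cstddef>

// Worst-case bytes for n characters of UTF-8 plus a 1-byte terminator.
constexpr std::size_t utf8_buffer_size(std::size_t chars,
                                       std::size_t max_bytes_per_char = 6) {
    return chars * max_bytes_per_char + 1;
}

// Exact bytes for n characters of UTF-32 plus a 4-byte terminator.
constexpr std::size_t utf32_buffer_size(std::size_t chars) {
    return (chars + 1) * 4;
}

void demo() {
    assert(utf8_buffer_size(10) == 61);     // 10 * 6 + 1
    assert(utf8_buffer_size(10, 4) == 41);  // 10 * 4 + 1
    assert(utf32_buffer_size(10) == 44);    // (10 + 1) * 4
}
```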

Of course, most of these problems can be solved with a third-party Unicode library. If a dedicated Unicode library feels too heavy, there is at least the everyday Boost solution; you just have to replace all string operations with specialized functions... (If security matters to your program, you should also understand how the functions you use behave when they encounter an invalid encoding, because this is one of the ways attackers get past security checks.)
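As an illustration of the invalid-encoding concern, the bytes C0 AF are an overlong two-byte form of '/', and a lenient decoder that accepts them can let a path separator slip past a check that only looks for the byte 0x2F. The following is a minimal validation sketch, not the behavior of any particular library; the function name is made up, and it follows the current 4-byte limit, rejecting overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8 under the rules described above.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte, or a 5/6-byte lead

        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                       // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}

void demo() {
    assert(is_valid_utf8("/"));
    assert(is_valid_utf8("\xE2\x82\xAC"));               // U+20AC
    assert(!is_valid_utf8("\xC0\xAF"));                  // overlong '/'
    assert(!is_valid_utf8("\xED\xA0\x80"));              // surrogate U+D800
    assert(!is_valid_utf8("\xF8\x88\x80\x80\x80"));      // 5-byte sequence
    assert(!is_valid_utf8("\xE2\x82"));                  // truncated
}
```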

However, most programmers do not even understand Unicode encodings, let alone the complexity of encoding conversion. They are used to C-style byte operations, or they come from languages such as Java that use double-byte Unicode, and they do not write word-processing software, so they have little interest in learning a complex Unicode encoding system. As a result, UTF-8 in C++ is more of an expert's solution than one for ordinary developers. Ideally, C++ would use its powerful abstraction and encapsulation to wrap a string class that is truly character-based (Python 3 went this way, though not without plenty of criticism), but in reality that is hard to standardize.

What about the UTF-16 used by Windows, Java, .NET, iOS, and so on? In essence, their support is flawed. They originally supported UCS-2, the predecessor of UTF-16, in which every character is a fixed 2 bytes. When UTF-16 appeared, they simply kept treating it as UCS-2, so a 4-byte character made of two code units (what Unicode calls a character outside the BMP) is treated as two characters. If your program really wants to support two-unit characters correctly, you have to rewrite it to access strings through higher-level string functions instead of indexing by subscript directly. It is only because characters outside the BMP are used extremely rarely that programmers do not need to know these details. From the standpoint of correctness this is wrong, but from a practical standpoint it works.
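The same pitfall can be reproduced in C++ with std::u16string, as in the sketch below. The helper names are hypothetical, the sample character U+1F600 is an arbitrary choice, and the counting routine assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points by not counting the second half of each pair.
// Assumption: the input is well-formed UTF-16.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (!is_low_surrogate(u)) ++n;
    }
    return n;
}

// Decodes the code point that starts at code-unit index i.
// An unpaired surrogate is returned unchanged.
char32_t code_point_at(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    char16_t hi = s[i];
    if (is_high_surrogate(hi) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
        return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                       + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
    }
    return hi;
}

void demo() {
    const std::u16string s = u"a\U0001F600";  // 'a' plus one non-BMP character
    assert(s.size() == 3);                    // code units, UCS-2 style
    assert(s[1] == 0xD83D);                   // subscript yields half a character
    assert(utf16_code_points(s) == 2);
    assert(code_point_at(s, 1) == 0x1F600);
}
```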

Http://blog.csdn.net/nightmare/article/details/39780931#comments
