C++ does not support Unicode, not even UTF-8

Source: Internet
Author: User

Today it is common sense that strings should be Unicode, and nobody needs to be given reasons for it. For some programming languages with a long history, however, this is still a headache.

Even with the support of third-party libraries, C++ does not really support Unicode effectively, not even UTF-8. (Note: this article discusses how strings are encoded in memory, not how they are encoded in network data streams.)

When the STL's string template was born, Unicode was still an ideal, fixed 16-bit encoding. Then Windows, Java, and others leapt into the Unicode era one after another, while Unix/Linux was constrained by backward compatibility and found it difficult to change. At that time, C++ programming on Windows mostly used the Win32 API and the STL was not popular, and Unicode was generally not supported on Unix/Linux. The STL's wstring, which simply replaces the char template argument with wchar_t, looked perfectly reasonable but was never really tested in practice. As a result, wstring on Windows has remained practically unusable, with encoding-conversion problems in every kind of I/O. On Linux, wchar_t is 32 bits, which wastes memory and is hardly worth using. (The latest standards introduce char16_t and char32_t for Unicode, as well as u16string and u32string.)
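
As a rough illustration of the point about wstring, the sketch below checks at compile time that std::wstring is nothing more than basic_string instantiated with wchar_t, and exposes the platform-dependent width of wchar_t. The helper names wide_char_bits and storage_bytes are made up for this example; they are not part of the standard library.

```cpp
#include <climits>
#include <cstddef>
#include <string>
#include <type_traits>

// std::wstring is just basic_string with wchar_t as the character type.
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "wstring is basic_string<wchar_t>");

// C++11 added character types meant for UTF-16 and UTF-32 code units,
// together with the matching string aliases.
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value,
              "u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value,
              "u32string is basic_string<char32_t>");

// Width of wchar_t in bits. This is implementation-defined:
// typically 16 on Windows and 32 on Linux with GCC.
constexpr std::size_t wide_char_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Bytes needed to store n code units of the given character type
// (hypothetical helper, for comparing memory cost).
template <typename CharT>
constexpr std::size_t storage_bytes(std::size_t n) { return n * sizeof(CharT); }
```

On a platform where wchar_t is 32 bits, storage_bytes<wchar_t>(n) is four times storage_bytes<char>(n), which is the memory cost the paragraph above complains about.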



Why is wchar_t on Linux 32 bits? Because GCC began to support wide characters at roughly the time the Unicode character set outgrew the 16-bit encoding limit.

The code space that was originally expected to be more than sufficient unexpectedly turned out not to be enough, and Unicode had to be adjusted. After the adjustment, there are three Unicode encodings to choose from:
UTF-32: a code unit is fixed at 4 bytes, and one code unit represents one character.
UTF-16: compatible with the earlier 16-bit encoding; a code unit is 2 bytes, but 1 or 2 code units represent one character.
UTF-8: compatible with ASCII; a code unit is 1 byte, but 1 to 6 code units represent one character. (The current standard actually allows at most 4 code units. For the sake of compatibility, however, 5- and 6-unit sequences may still turn up, even though they are invalid encodings.)
(In addition, there are combining characters and modifiers, which make the mapping from codes to characters even more complex. They are ignored here.)
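
The three schemes can be compared by asking how many code units (the 1-, 2-, or 4-byte pieces described above) a single code point needs. The following is a minimal sketch under the current 4-byte limit for UTF-8; the function names are invented for this example, and surrogate code points (U+D800 to U+DFFF) are not treated specially.

```cpp
#include <cassert>
#include <cstddef>

// Highest code point that Unicode defines.
constexpr char32_t kMaxCodePoint = 0x10FFFF;

// Number of 1-byte code units needed to encode cp in UTF-8 (1 to 4).
inline std::size_t utf8_units(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    if (cp < 0x80) return 1;     // ASCII range
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;  // rest of the BMP
    return 4;                    // outside the BMP
}

// Number of 2-byte code units needed to encode cp in UTF-16 (1 or 2).
inline std::size_t utf16_units(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    return cp < 0x10000 ? 1 : 2;  // 2 means a surrogate pair
}

// UTF-32 always uses exactly one 4-byte code unit per code point.
inline std::size_t utf32_units(char32_t) { return 1; }
```

For example, U+4E2D needs 3 code units in UTF-8 but 1 in UTF-16, while U+1F600 needs 4 in UTF-8 and 2 in UTF-16.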



The advent of UTF-8 gave Linux systems a new opportunity. Since it is compatible with ASCII, wouldn't simply adding UTF-8 as one more system code page amount to supporting Unicode?

Quite naturally, Linux went this way.

Windows, however, does not support UTF-8 as a code page. This makes many people who write cross-platform programs very dissatisfied with Microsoft. As a result, more and more people claim that Windows, Java, .NET, and so on picked the wrong Unicode encoding and that UTF-8 is the right way.

I very much wish character encoding had a perfect solution, but unfortunately, whenever we try to make it simple, it becomes more complex. The main problem: a UTF-8 encoded char array is just a byte buffer. With only the support of the C++ standard library, you cannot handle it as a string at all.

Want the string length? It can only give you the number of bytes, not the number of characters.

Want to take the nth character? It can only give you the nth byte. Want to capitalize a character? Want to determine whether a character is a letter? Those functions only accept a single char, that is, only ASCII characters are supported...
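
To make the length problem concrete, here is a small sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The function name utf8_code_points is an assumption of this example, and the code assumes the input is already valid UTF-8; it does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string. Every code point contributes
// exactly one byte that is NOT a continuation byte (bit pattern 10xxxxxx),
// so counting those bytes counts the code points.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    // Sanity check: a string never has more code points than bytes.
    assert(count <= s.size());
    return count;
}
```

With the string "caf\xC3\xA9" (the word "café" with é written as its two UTF-8 bytes), size() reports 5 while utf8_code_points reports 4.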
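
The indexing and case-conversion problems can be sketched the same way. The helper names below are hypothetical, the input is assumed to be valid UTF-8, and the program is assumed to run in the default "C" locale, in which std::toupper only changes the ASCII letters a to z.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the byte offset at which the n-th code point (0-based) starts,
// or std::string::npos if the string has fewer code points than that.
inline std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) == 0x80) continue;  // continuation byte, not a start
        if (seen == n) return i;
        ++seen;
    }
    return std::string::npos;
}

// Applies std::toupper to each char. The bytes of multi-byte sequences
// are left untouched, so non-ASCII letters are never capitalized.
inline std::string ascii_upper(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// Usage sketch: "été", with each é spelled out as the bytes C3 A9.
inline void example() {
    const std::string s = "\xC3\xA9t\xC3\xA9";
    assert(s.size() == 5);              // bytes, not characters
    assert(utf8_offset_of(s, 1) == 2);  // 't' starts at byte 2, not byte 1
    assert(utf8_offset_of(s, 3) == std::string::npos);
    assert(ascii_upper(s) == "\xC3\xA9T\xC3\xA9");  // only the ASCII 't' changed
}
```

Note that finding the n-th code point is a linear scan here, whereas indexing the n-th byte is constant time; that difference is part of why byte-oriented string classes do not offer character indexing for free.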

Even in a program that does not need to manipulate characters, we always allocate buffers for input in C/C++ programs. Suppose we expect input of up to n characters: how many bytes should be allocated? You can only allocate the maximum possible, n*6+1. For temporarily allocated memory that is acceptable, but what do you do for a database field? Even UTF-32 saves storage space in that case.
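
The arithmetic behind that comparison can be written out as follows. This is only a sketch: the function names are invented here, the "+1" is taken to be a terminating NUL, and the UTF-32 figure assumes a 4-byte terminator as well.

```cpp
#include <cstddef>

// Worst case for n characters of UTF-8 if 6-byte sequences must be
// tolerated, plus one byte for the terminating NUL: n*6+1.
constexpr std::size_t utf8_worst_case_legacy(std::size_t n) { return n * 6 + 1; }

// Worst case under the current standard's 4-byte limit.
constexpr std::size_t utf8_worst_case(std::size_t n) { return n * 4 + 1; }

// UTF-32 needs exactly 4 bytes per character, plus a 4-byte terminator.
constexpr std::size_t utf32_bytes(std::size_t n) { return (n + 1) * 4; }

// For a 100-character field: 601 bytes versus 404 bytes.
static_assert(utf8_worst_case_legacy(100) == 601, "6-byte worst case");
static_assert(utf32_bytes(100) == 404, "fixed-width UTF-32");
```

Under the 4-byte limit the two worst cases are nearly equal (401 versus 404 bytes for 100 characters), so the storage argument depends on which limit the field has to honor.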



Of course, most problems can be solved by using a third-party Unicode library. If a dedicated Unicode library is too heavyweight, there is at least Boost as an everyday solution. Just replace all string operations with the specialized functions... (If security matters to your program, you also need to understand how the functions you use behave when they encounter an invalid encoding, because this is also one of the ways attackers get past security checks.)
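
One well-known example of such an invalid encoding is the "overlong" form: the two bytes C0 AF decode to '/' in a lenient decoder, so a check that scans the bytes for '/' can be bypassed. The sketch below shows what a strict validity check might look like; it is a hand-written illustration, not the behavior of any particular library, and is_valid_utf8 is a name chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true only for well-formed UTF-8 under the current rules:
// no overlong forms, no surrogates, nothing above U+10FFFF,
// and no 5- or 6-byte sequences.
inline bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        char32_t min;  // smallest code point that really needs `len` bytes
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte, or a 5/6-byte lead

        if (s.size() - i < len) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}

// Usage sketch.
inline void example() {
    assert(is_valid_utf8("caf\xC3\xA9"));  // ordinary 2-byte sequence
    assert(!is_valid_utf8("\xC0\xAF"));    // overlong encoding of '/'
}
```

A decoder that skips the overlong check would turn C0 AF into '/', which is why it matters to know whether the functions you call reject, replace, or silently accept such input.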

However, most programmers do not even understand Unicode encodings, let alone the complexities of encoding conversion. They are used either to C-style byte operations or to languages that use double-byte Unicode, such as Java. Unless they are writing word-processing software, they have little interest in learning a complex Unicode encoding system. As a result, UTF-8 in C++ is more of an expert's solution than one for ordinary developers. Ideally, C++ would use its powerful abstraction and encapsulation capabilities to wrap up a truly character-based string class (Python 3 went this way, though it has drawn a lot of criticism), but in reality this is very difficult to standardize.

What about the UTF-16 used by Windows, Java, .NET, iOS, and so on? In essence, their support is flawed as well.

They started out supporting UCS-2, the predecessor of UTF-16, in which every character is fixed at 2 bytes. When UTF-16 appeared, they kept treating it as UCS-2, which means that a 4-byte character made of two code units (a character outside the BMP in Unicode) is treated as two characters.

If your program really wants to support such two-unit characters correctly, you have to change the code and use higher-level string functions to access strings instead of indexing by subscript directly. But because characters outside the BMP are used extremely rarely, programmers usually do not need to know these details.

From the perspective of correctness, this is wrong. From a practical point of view, however, it is very useful.
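
The difference between subscript access and code-point-aware access can be sketched with std::u16string. The function name utf16_code_points is invented for this example, and unpaired surrogates are simply counted as one code point each rather than rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string. A high surrogate (D800..DBFF)
// immediately followed by a low surrogate (DC00..DFFF) forms one code
// point outside the BMP and is counted once.
inline std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        ++count;
        bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low half of the pair
        }
    }
    return count;
}

// Usage sketch: 'a' followed by U+1F600, written as its surrogate pair.
inline void example() {
    const std::u16string s = {u'a', char16_t(0xD83D), char16_t(0xDE00)};
    assert(s.size() == 3);               // code units, UCS-2 style
    assert(utf16_code_points(s) == 2);   // actual characters
    assert(s[1] == 0xD83D);              // subscript yields half a character
}
```

Code that treats s[1] as a character is doing exactly what the article describes: handling UTF-16 as if it were UCS-2.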

Copyright notice: This article is an original blog post and may not be reproduced without the author's consent.
