C++ does not support Unicode, not even UTF-8

Source: Internet
Author: User
Tags: object serialization

Unicode is common knowledge by now, yet it remains a headache for some long-established programming languages. Without third-party libraries, C++ does not effectively support Unicode, not even UTF-8. (Note: this article discusses how strings are encoded in memory, not in files or network data streams.)

When the STL string template was born, Unicode was still envisioned as a fixed 16-bit encoding. Windows, Java, and others moved into the Unicode era one after another, while Unix/Linux found it hard to change because of backward compatibility. At that time, C++ programming on Windows mainly used the Win32 API and the STL was not yet popular, while Unix/Linux had essentially no Unicode support. The STL wstring simply replaces the char template parameter with wchar_t, which looked entirely reasonable but had never been tested in practice. As a result, wstring on Windows has remained practically unusable, with encoding-conversion problems in all kinds of I/O, while wchar_t on Linux is 32 bits wide, which wastes memory and makes it not worth using. (The newer standard, C++11, introduced char16_t and char32_t for Unicode, along with u16string and u32string.)

Why is wchar_t 32 bits wide on Linux? gcc began supporting wide characters at about the time the Unicode character set outgrew its 16-bit limit. The code space was originally expected to be ample, but it unexpectedly ran out, so Unicode had to be adjusted. After the adjustment there were three Unicode encoding forms to choose from:
UTF-32: each code unit is a fixed 4 bytes, and one code unit encodes one character.
UTF-16: compatible with the earlier 16-bit encoding. Each code unit is 2 bytes, and one or two code units encode one character.
UTF-8: compatible with ASCII. Each code unit is 1 byte, and one to six code units encode one character. (The current standard allows at most four code units, but for compatibility reasons five- or six-unit sequences may still turn up, even though they are now invalid.)
(In addition, combining characters and modifiers make the mapping between code points and characters more complex; that is ignored here.)
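
The three encoding forms listed above can be compared with a small sketch that reports how many code units each one needs for a single code point. The function names are made up for illustration, and the sketch assumes the current rules (at most four UTF-8 code units, code points no larger than 0x10FFFF):

```cpp
#include <cstddef>

// Number of 8-bit code units UTF-8 needs for one code point.
// Assumes cp is a valid code point (<= 0x10FFFF); no checking is done.
constexpr std::size_t utf8_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of 16-bit code units UTF-16 needs for one code point:
// one inside the BMP, two (a surrogate pair) outside it.
constexpr std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one 32-bit code unit.
constexpr std::size_t utf32_units(char32_t) {
    return 1;
}
```

For example, U+4E2D (a Chinese character in the BMP) needs three UTF-8 code units but only one UTF-16 code unit, while U+1F600 (outside the BMP) needs four UTF-8 code units or two UTF-16 code units.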

The emergence of UTF-8 gave Linux a new opportunity. Because UTF-8 is compatible with ASCII, supporting Unicode only required adding UTF-8 as one more system code page. Naturally, Linux took this path. Windows, at least historically, did not support UTF-8 as a system code page, which left many people who write cross-platform programs very dissatisfied with Microsoft. Later, some people even claimed that the Unicode encoding chosen by Windows, Java, and .NET was a mistake, and that UTF-8 is king.

I would love a perfect solution for character encoding, but sadly things get more complicated whenever we try to make them simple. The most basic problem: a UTF-8-encoded char array is just a byte buffer. With only the C/C++ standard library, you cannot treat it as a string of characters. What is the length of the string? The library can only give you the number of bytes, not the number of characters. Want the nth character? It can only give you the nth byte. Want to convert a character to uppercase, or test whether a character is a letter? Those functions accept a single char, so in practice they only handle ASCII characters...
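
To make the gap between bytes and characters concrete, here is a minimal sketch of the helpers one ends up writing by hand. The names are hypothetical, and the code assumes the input is already valid UTF-8; it counts code points, which, as noted earlier, are still not the same thing as user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Counts code points by counting every byte that is NOT a continuation byte.
// Assumes s holds valid UTF-8; no validation is performed.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Byte offset at which the index-th code point (0-based) starts,
// or std::string::npos if the string has fewer code points than that.
inline std::size_t code_point_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i]))) continue;
        if (seen == index) return i;
        ++seen;
    }
    return std::string::npos;
}
```

For a string holding two Chinese characters followed by "ab", size() reports 8 bytes, while count_code_points reports 4. Finding the nth code point takes a linear scan, unlike indexing the nth byte.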

Even a program that never manipulates individual characters still has to allocate a buffer for its input in C/C++. So if we expect an input of n characters, how many bytes should we allocate? You can only allocate for the worst case: n * 6 + 1 bytes (n * 4 + 1 if only sequences valid under the current standard are accepted). That is tolerable for temporary memory, but what about a database field? In that case even UTF-32 may save storage space.

Of course, most of these problems can be solved with a third-party Unicode library. If a dedicated Unicode library feels too heavy, there is at least a solution in Boost, which many projects already use every day. You only need to replace all string operations with the library's dedicated functions... (If security matters to your program, you should also understand how the functions you use behave on invalid encodings, because this is one of the ways attackers get past security checks.)
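
As an illustration of the invalid-encoding concern, the following is a minimal sketch of a strict UTF-8 check, not the behavior of any particular library. The function name is made up, and the rules assumed are the current ones: at most four bytes per sequence, no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8 under the current four-byte rules.
inline bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min_cp = 0;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte, or a 5/6-byte lead byte

        if (s.size() - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}
```

The overlong case is the one that matters for security checks: the two bytes 0xC0 0xAF decode to "/" in a lenient decoder, so a filter that only looks for the single byte 0x2F would miss it. This sketch rejects that sequence.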

However, most programmers know little about Unicode encodings, let alone the complexity of encoding conversion. They are used to C-style byte operations, or they use double-byte Unicode languages such as Java, and since they are not writing word-processing software they have little interest in learning the complicated Unicode encoding system. UTF-8 in C/C++ is therefore more of an expert's solution than one for ordinary developers. Ideally, C++ could use its powerful abstraction and encapsulation capabilities to provide a character-based string class (Python 3 took this path, though it has drawn plenty of criticism), but in reality such a class is hard to standardize.

So what about UTF-16 on Windows, Java, .NET, and iOS? In essence, their support is flawed too. They started out supporting UCS-2, the predecessor of UTF-16, in which every character is a fixed 2 bytes. When UTF-16 appeared, they kept treating it as UCS-2; that is, a 4-byte character encoded as two code units (a character outside the BMP) is treated as two characters. If your program really needs to handle such characters correctly, you have to rewrite it to access strings through higher-level string functions instead of indexing by subscript. It is only because characters outside the BMP are used so rarely that programmers on these platforms do not need to understand the details. From the standpoint of correctness this is wrong, but from a practical standpoint it works well enough.
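
The same effect can be seen in C++ with std::u16string. The sketch below uses hypothetical helper names and counts an unpaired surrogate as one code point, which is an arbitrary choice made for illustration:

```cpp
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a high/low surrogate pair into the code point it encodes.
// Assumes the caller has already checked both halves.
inline char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}

// Counts code points rather than 16-bit code units.
inline std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        // A valid pair occupies two code units but is one code point.
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;
        }
    }
    return n;
}
```

For the literal u"a\U0001F600", size() returns 3 because the second character is stored as the surrogate pair 0xD83D 0xDE00, while count_code_points_utf16 returns 2. Indexing s[1] yields only half of that character.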
What is the association or difference between Unicode and UTF-8/UTF-16?

UTF stands for Unicode Transformation Format, meaning a way of converting Unicode code points into a particular format. In UTF-16, characters defined in the Basic Multilingual Plane (whether Latin letters, Chinese characters, or other characters or symbols) are stored in 2 bytes. Characters defined in the supplementary planes are stored as two 2-byte values in the form of a surrogate pair.

Unicode is a coded character set, in the same sense as ASCII, while a UTF is a storage format.

In the JVM, characters (strings) are Unicode both when the VM manages data in memory and when an object is serialized.
However, in the JVM's memory, characters are stored as char values. A char occupies 2 bytes (for example, you can define char c = '字'), so a Chinese character and "Z" each take 2 bytes. After object serialization, strings are stored in a (modified) UTF-8 form, in which a typical Chinese character occupies 3 bytes, while English letters and digits occupy only one byte. For more information, see the following link.

As a result, serialized text that is mostly English takes about half the space it takes in memory, while text that is entirely Chinese takes about one and a half times as much (3 bytes per character instead of 2).

The advantage of UTF-16 over UTF-8 is that most characters are stored in a fixed length of 2 bytes; however, UTF-16 is not compatible with ASCII.
Reference: blog.csdn.net/...6.aspx

What are the differences and associations between Unicode, UTF-8, and UTF-16?

Unicode:

An encoding scheme developed by unicode.org that aims to cover the commonly used scripts of the whole world.
In version 1.0 it was a 16-bit code, from U+0000 to U+FFFF, with each 2-byte code corresponding to one character. Starting with 2.0 the 16-bit limit was abandoned: the original 16-bit range became the basic plane, and 16 more planes were added, which gives a code space of roughly 21 bits, ranging from 0 to 0x10FFFF.

UTF: Unicode/UCS Transformation Format

UTF-8: an 8-bit encoding. ASCII characters are unchanged, and other characters use a variable-length encoding, so each character takes 1 to 4 bytes (1 to 3 bytes within the basic plane). It is usually used as an external encoding and has the following advantages:
* It is independent of CPU byte order, so data can be exchanged between different platforms.
* High fault tolerance. If any one byte is damaged, at most one character is lost, and no cascading error occurs (such as a single bad byte turning the rest of the line into garbage).
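
The fault tolerance described above comes from the fact that continuation bytes (bit pattern 10xxxxxx) can never be mistaken for the start of a sequence, so a decoder that hits damage can skip forward to the next starting byte. A minimal sketch, with a made-up function name:

```cpp
#include <cstddef>
#include <string>

// Returns the first index >= pos whose byte is not a UTF-8 continuation
// byte (10xxxxxx), i.e. the next place a decoder can safely restart.
// Returns s.size() if no such byte exists.
inline std::size_t next_sequence_start(const std::string& s, std::size_t pos) {
    while (pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        ++pos;
    }
    return pos;
}
```

If decoding starts in the middle of a three-byte character, next_sequence_start skips the remaining continuation bytes and lands on the first byte of the following character, so only the damaged character is lost.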

UTF-16: a 16-bit encoding. It is a variable-length code covering the values 0 to 0x10FFFF and is essentially the direct implementation of Unicode. Because its code units are 16 bits wide, it depends on CPU byte order, but because it is compact for text made up mostly of non-ASCII characters in the basic plane, it is often used as an external encoding for network transmission.
UTF-16 is often described as the preferred encoding of Unicode.
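
The byte-order dependence can be shown by writing one 16-bit code unit out as bytes in both orders. This is only a sketch with hypothetical function names; real data would normally signal its order with a byte order mark or by naming UTF-16BE or UTF-16LE explicitly.

```cpp
#include <array>
#include <cstdint>

// Big-endian (UTF-16BE): most significant byte first.
inline std::array<std::uint8_t, 2> to_utf16be(char16_t unit) {
    return {{static_cast<std::uint8_t>(unit >> 8),
             static_cast<std::uint8_t>(unit & 0xFF)}};
}

// Little-endian (UTF-16LE): least significant byte first.
inline std::array<std::uint8_t, 2> to_utf16le(char16_t unit) {
    return {{static_cast<std::uint8_t>(unit & 0xFF),
             static_cast<std::uint8_t>(unit >> 8)}};
}
```

The code unit 0x4E2D becomes the bytes 4E 2D in big-endian order and 2D 4E in little-endian order. A receiver that assumes the wrong order reads 0x2D4E, a different character. UTF-8 has no such ambiguity because its code units are single bytes.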

UTF-32: uses 32-bit code units restricted to the Unicode range (0 to 0x10FFFF), which makes it a subset of UCS-4.

UTF and unicode:

Unicode is a character set and can be viewed as an internal code.
UTF is an encoding method, needed because raw Unicode code points are not suitable for direct transmission and processing in some scenarios. UTF-16 stores code points of the basic plane directly, with no transformation, but the encoded bytes contain 0x00: for each of the first 256 code points, the high byte is 0x00, which has special meaning in the operating system (in C it terminates a string) and may cause problems. Converting Unicode to UTF-8 avoids this problem and brings some other advantages.
Reference: blog.csdn.net/..0.aspx
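
The 0x00 problem mentioned in this answer can be observed by looking at the raw bytes of a short UTF-16 string. The sketch below uses made-up helper names and assumes 8-bit bytes, so that a char16_t occupies two of them:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Copies the raw bytes of a UTF-16 string, in the machine's native byte order.
inline std::vector<char> raw_bytes(const std::u16string& s) {
    std::vector<char> bytes(s.size() * sizeof(char16_t));
    if (!bytes.empty()) {
        std::memcpy(bytes.data(), s.data(), bytes.size());
    }
    return bytes;
}

// Counts the 0x00 bytes that a C string function would treat as a terminator.
inline std::size_t count_zero_bytes(const std::vector<char>& bytes) {
    std::size_t n = 0;
    for (char b : bytes) {
        if (b == 0) ++n;
    }
    return n;
}
```

For u"Hi" the four raw bytes contain two zeros, whichever byte order the machine uses, so a byte-oriented function such as strlen stops after at most one byte. The UTF-8 form of the same text is simply the two bytes of "Hi" and contains no zero byte.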
