I. What Is Unicode
Let us start with ASCII. ASCII is an encoding standard used to represent English characters. Each ASCII character occupies 1 byte, so a byte-wide encoding can represent at most 256 values (00H-FFH). English does not need that many: standard ASCII uses only the first 128 values (00H-7FH, highest bit 0), which cover control characters, digits, uppercase and lowercase letters, and other symbols. The other 128 values (80H-FFH, highest bit 1) are called extended ASCII and are generally used for additional symbols such as box-drawing (table-line) characters and some phonetic or accented characters.
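The split between the two halves of the byte range comes down to a single bit. A minimal sketch follows, assuming nothing beyond the ranges given above; the function names are illustrative only.

```cpp
#include <cstddef>
#include <string>

// Returns true when the byte lies in the standard ASCII range 00H-7FH,
// that is, when its highest bit is 0.
bool is_standard_ascii(unsigned char byte) {
    return (byte & 0x80u) == 0;
}

// Counts how many bytes of `text` fall in the extended range 80H-FFH.
std::size_t count_extended_bytes(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if (!is_standard_ascii(byte)) {
            ++count;
        }
    }
    return count;
}
```

For example, count_extended_bytes("Hello") is 0, because every English letter has its highest bit clear.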
This encoding is clearly adequate for English. But for Chinese, Arabic, and other complex scripts, 256 characters are far from enough.
As a result, various countries and regions developed their own encoding standards. The Chinese standard is called "GB2312-80". It is compatible with ASCII and works by repurposing the extended ASCII range, which was never truly standardized: one Chinese character is represented by two bytes from the extended range, which distinguishes it from the ASCII portion.
But this approach has problems, the biggest being that the Chinese encoding overlaps with the extended ASCII codes. Many programs use the extended ASCII box-drawing characters to draw tables; when such software runs on a Chinese system, those table lines are mistaken for Chinese characters and appear as garbled text.
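The overlap can be seen in a simplified decoder. The sketch below applies only the rule described here (high bit clear means one ASCII character, high bit set means the start of a two-byte character). It is an assumption-laden illustration: real GB2312 also restricts both bytes to A1H-FEH, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Counts characters in a byte string using the simplified double-byte rule:
// a byte with the high bit clear is one ASCII character, and a byte with the
// high bit set is taken as the first half of a two-byte character.
// Assumption: the A1H-FEH range check of real GB2312 is omitted for brevity.
std::size_t count_characters(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const bool starts_pair = (lead & 0x80u) != 0 && i + 1 < bytes.size();
        i += starts_pair ? 2 : 1;  // skip both bytes of a pair
        ++count;
    }
    return count;
}
```

Under code page 437, C4H is a horizontal box-drawing line, so the two bytes "\xC4\xC4" are meant as two line segments. This decoder reports them as one character, which is exactly how table borders end up displayed as stray Chinese characters.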
In addition, because each country and region has its own encoding rules, the rules conflict with one another, which greatly complicates the exchange of information between them.
To really solve this problem, patching extended ASCII is not enough. What is needed is a completely new encoding system that unifies Chinese, French, German, and all other scripts, and assigns every character its own separate code.
So, Unicode was born.
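Unicode is also a character encoding scheme. In the 16-bit form discussed here, each character occupies two bytes (0000H-FFFFH), giving 65,536 code values, which was intended to accommodate the characters of all the world's languages. (Later versions of the standard extend the code space to 10FFFFH; in UTF-16, characters above FFFFH take two 16-bit units.)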
In Unicode, all characters are treated equally. A Chinese character is no longer stored as "two extended ASCII characters" but as "one Unicode character". In other words, all text is processed one character at a time, and every character has a unique Unicode code.
II. Benefits of Using Unicode Encoding
Using Unicode encoding lets your project support multiple languages at the same time, which makes it much easier to internationalize.
In addition, Windows NT was developed using Unicode, and the entire system is based on it. If you call an API function and pass it an ANSI string (the "ANSI character set" is the common name for ASCII and the compatible character sets derived from it, such as GB2312), the system first converts the string to Unicode and then passes the Unicode string to the operating system. If the function is expected to return an ANSI string, the system converts the Unicode result back to ANSI before returning it to your application. These conversions cost time and memory. If you develop your application in Unicode, you avoid them, and the application runs more efficiently.
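The extra work on the ANSI path can be modeled with a small portable sketch. Both functions are hypothetical stand-ins, not real Windows APIs, and the widening step assumes 7-bit ASCII input; a real system would convert according to the active code page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a Unicode-native system function. It only
// reports how many wide characters it received.
std::size_t show_text_wide(const std::wstring& text) {
    return text.size();
}

// Hypothetical ANSI entry point. Before it can forward the call, it has to
// allocate a second buffer and copy every character into it.
// Assumption: input is 7-bit ASCII, so each byte widens to one wchar_t.
std::size_t show_text_ansi(const std::string& text) {
    std::wstring wide;
    wide.reserve(text.size());
    for (unsigned char byte : text) {
        wide.push_back(static_cast<wchar_t>(byte));
    }
    return show_text_wide(wide);
}
```

A caller that already holds a wide string can call show_text_wide directly and skip the allocation and copy, which is the saving this section describes.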
The following table shows the encodings of several characters to demonstrate the difference between ANSI and Unicode:
Character | A | N | 和
ANSI code | 41H | 4EH | CDBAH
Unicode code | 0041H | 004EH | 548CH
(CDBAH is the GB2312 code read as a little-endian 16-bit word; the bytes in memory are BAH followed by CDH.)
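The values compared above can be checked with a short sketch. It assumes the GB2312 byte sequence BAH CDH for the third character and hard-codes it rather than calling a converter; the constant and function names are illustrative only.

```cpp
#include <cstddef>
#include <string>

// "AN" followed by the GB2312 bytes BAH CDH for the third character.
// Assumption: the bytes are hard-coded instead of produced by a converter.
const std::string kAnsiText = "AN\xBA\xCD";

// The same three characters as 16-bit Unicode code units.
const std::u16string kUnicodeText = u"AN\u548C";

// Storage used by the ANSI form: 1 + 1 + 2 bytes.
std::size_t ansi_byte_count() {
    return kAnsiText.size();
}

// Number of 16-bit units in the Unicode form: one per character.
std::size_t unicode_unit_count() {
    return kUnicodeText.size();
}

// Storage used by the Unicode form.
std::size_t unicode_byte_count() {
    return kUnicodeText.size() * sizeof(char16_t);
}
```

In the ANSI form the characters have different widths, so the byte count does not equal the character count. In the Unicode form every character is one unit of the same size.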
III. Unicode Programming in C++
Wide-character support is part of the ANSI C standard, which allows a single character to be represented by more than one byte. Wide characters and Unicode are not exactly equivalent: Unicode is just one encoding that wide characters can hold.
1. Definition of wide characters
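In ANSI, one character (char) is one byte long. With Unicode, one character occupies one word (two bytes). The most basic wide-character type is wchar_t. In C it is a typedef supplied by headers such as wchar.h, and Windows compilers have traditionally defined it as shown below. (In standard C++, wchar_t is a built-in type, and some platforms make it 32 bits wide.)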
typedef unsigned short wchar_t;
From this we can see clearly that, under this definition, a so-called wide character is simply an unsigned short integer.
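Because the width of wchar_t depends on the platform, it is worth checking rather than assuming. The sketch below uses only standard headers; the function names are illustrative. Typical results are 16 bits with Windows compilers and 32 bits with GCC or Clang on Linux, and the signedness also varies.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// Number of bits in one wchar_t on the current platform.
constexpr std::size_t wide_char_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when wchar_t behaves as an unsigned integer type on this platform.
constexpr bool wide_char_is_unsigned() {
    return std::is_unsigned<wchar_t>::value;
}
```

Code that must hold exactly one 16-bit Unicode unit regardless of platform can use char16_t instead, which C++11 added for that purpose.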
2. Wide-character string constants
For C++ programmers, constructing string constants is a routine task. So how do you construct a wide-character string constant? Simply put a capital L in front of the string constant, for example:
const wchar_t *str1 = L"Hello";
This L is very important: only with it does the compiler know that you want each character of the string stored as one word. Also note that there must be no space between the L and the string.
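The effect of the L prefix shows up in how much storage the literal takes. A minimal sketch using only the standard library follows; the constant and function names are illustrative.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

const char    *kNarrow = "Hello";   // one byte per character
const wchar_t *kWide   = L"Hello";  // one wchar_t per character

// Length in characters of the narrow string.
std::size_t narrow_length() {
    return std::strlen(kNarrow);
}

// Length in characters of the wide string (wcslen is the wide strlen).
std::size_t wide_length() {
    return std::wcslen(kWide);
}

// Bytes occupied by the wide string, including its terminating L'\0'.
std::size_t wide_bytes() {
    return (std::wcslen(kWide) + 1) * sizeof(wchar_t);
}
```

Both strings have the same length in characters, but the wide one occupies sizeof(wchar_t) bytes per character, which is why the narrow string functions such as strlen cannot be used on it.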