I. Introduction to wide characters
First of all, what is ASCII? ASCII is an encoding standard used to represent English characters. Each ASCII character occupies 1 byte, so a one-byte encoding can represent at most 256 characters (00h-FFh).
In fact, English does not need that many characters. Standard ASCII uses only the first 128 values (00h-7Fh, highest bit 0), which cover control characters, digits, uppercase and lowercase letters, and other symbols.
The other 128 values (80h-FFh, highest bit 1) are called "extended ASCII" and are generally used for additional symbols such as table-drawing characters and phonetic symbols.
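The split between standard and extended ASCII comes down to testing one bit. Here is a minimal sketch of that test; the function names are only illustrative and do not come from any library:

```cpp
#include <cassert>

// True when the byte lies in the standard ASCII range 00h-7Fh,
// that is, when its highest bit is 0.
bool is_standard_ascii(unsigned char byte) {
    return (byte & 0x80) == 0;
}

// Small self-check of the boundary values.
void check_ascii_ranges() {
    assert(is_standard_ascii(0x00));
    assert(is_standard_ascii(0x7F));
    assert(!is_standard_ascii(0x80));
    assert(!is_standard_ascii(0xFF));
}
```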
For example: char a_char[] = "hello";
Each character in a_char occupies one byte, that is, 8 bits. Because this is a C-style string, there is also a '\0' terminator at the end, so sizeof(a_char) is 6 bytes. This is ASCII encoding.
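Now put Chinese characters in the string, for example three of them: char a_char[] = "你好啊";
In this case, cout << sizeof(a_char) << endl; prints 7 when the source is compiled with a double-byte ANSI code page such as GBK. Each Chinese character occupies 2 bytes, each English character occupies only one byte, and there is still a '\0' at the end. This is the multi-byte (ANSI) encoding, in which a Chinese character is stored as two extended ASCII bytes.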
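The result of sizeof depends on how the compiler encodes the source file, so a check that works everywhere has to spell the bytes out. The sketch below assumes GBK and writes three Chinese characters (你好啊) as explicit byte escapes; the variable and function names are only illustrative:

```cpp
#include <cassert>
#include <cstring>

// Plain ASCII: 5 characters plus the '\0' terminator.
const char ascii_text[] = "hello";

// Three Chinese characters written as explicit GBK bytes
// (assumed values: C4 E3, BA C3, B0 A1), plus the '\0' terminator.
const char gbk_text[] = "\xC4\xE3\xBA\xC3\xB0\xA1";

void check_narrow_sizes() {
    assert(sizeof(ascii_text) == 6);     // 5 bytes + terminator
    assert(sizeof(gbk_text) == 7);       // 3 * 2 bytes + terminator
    assert(std::strlen(gbk_text) == 6);  // strlen counts bytes, not characters
}
```

Note that strlen reports 6 for a string a reader would call three characters long, because the narrow string functions only count bytes.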
Before explaining what a wide character is, we first need to look at Unicode.
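Unicode is also a character encoding standard. In the 16-bit form used for Windows wide characters, each character occupies two bytes (0000h-FFFFh), which gives 65,536 code values, enough for the commonly used characters of nearly all of the world's languages. (Modern Unicode defines code points beyond this range; UTF-16 represents them with a pair of 16-bit units.)
In Unicode, all characters are treated equally. A Chinese character is no longer represented by two extended ASCII bytes but by one Unicode character. In other words, every character in every script is handled as a single character and has its own unique Unicode code.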
The following table illustrates the differences between ANSI and Unicode:
Character       A        N        和
ANSI code       41h      4Eh      BACDh (GBK bytes BA CD)
Unicode code    0041h    004Eh    548Ch
After reading what I said above, I believe that the encoding here is not hard for you to understand.
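The comparison for the Chinese character can be written down in code. This is only a sketch: it assumes GBK as the ANSI code page and uses the C++11 char16_t type to get a guaranteed 16-bit unit; the names are illustrative.

```cpp
#include <cassert>
#include <cstring>

// ANSI (GBK, assumed) form of the character U+548C: two bytes, BA CD.
const char ansi_char[] = "\xBA\xCD";

// Unicode form of the same character: one 16-bit code unit.
const char16_t unicode_char[] = u"\u548C";

void compare_encodings() {
    assert(std::strlen(ansi_char) == 2);                   // two extended ASCII bytes
    assert(unicode_char[0] == 0x548C);                     // one Unicode code
    assert(sizeof(unicode_char) == 2 * sizeof(char16_t));  // unit + terminator
}
```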
So what is a wide character? Support for wide characters is actually part of the ANSI C standard, where it allows a single character to be represented by more than one byte. A wide character is not exactly the same thing as a Unicode character: Unicode is only one of the encodings that wide characters can hold.
1. Definition of wide characters
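In ANSI, the length of a character (char) is one byte. When Unicode is used, a character occupies one word (two bytes). The most basic wide character type is wchar_t, which C declares in the wchar.h header file (in C++ it is a built-in type). On Windows it is defined as:
typedef unsigned short wchar_t;
Here we can clearly see that, on Windows, a so-called wide character is simply an unsigned short integer. Other platforms may use a different size; on Linux, for example, wchar_t is typically 32 bits.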
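The width of wchar_t is a property of the platform, not of the language, so it is worth checking rather than assuming. A minimal sketch (the function names are illustrative; the 16-bit and 32-bit values are the two widths I would expect on common desktop platforms, not a guarantee):

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Number of bits in one wchar_t on the current platform.
std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

void check_wchar_width() {
    // Assumption: 16 bits on Windows, 32 bits on typical Linux and macOS builds.
    assert(wchar_bits() == 16 || wchar_bits() == 32);
}
```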
2. Use of wide characters
This is simple: const wchar_t *str1 = L"hello";
The L prefix is very important. Only with it does the compiler know that you want the string stored as wide characters. Note that there must be no space between the L and the string literal.
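The effect of the prefix shows up in both the element type and the size of the literal. A small sketch with illustrative names:

```cpp
#include <cassert>
#include <type_traits>

const char    narrow_text[] = "hello";   // array of char
const wchar_t wide_text[]   = L"hello";  // array of wchar_t

void check_literal_types() {
    // The prefix changes the type of the literal itself.
    static_assert(std::is_same<decltype('h'), char>::value, "'h' is a char");
    static_assert(std::is_same<decltype(L'h'), wchar_t>::value, "L'h' is a wchar_t");
    assert(sizeof(narrow_text) == 6);                   // 5 chars + '\0'
    assert(sizeof(wide_text) == 6 * sizeof(wchar_t));   // 5 wide chars + L'\0'
}
```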
3. Wide string library functions
C++ provides a dedicated set of functions for operating on wide strings. For example, the function that returns the length of a wide string is
size_t __cdecl wcslen(const wchar_t *);
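Most of the narrow string functions have a wide counterpart whose name starts with wcs instead of str. A short sketch using a few of the standard ones from the cwchar header (the wrapper function name is illustrative):

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

void check_wide_counterparts() {
    const char*    narrow = "hello";
    const wchar_t* wide   = L"hello";
    assert(std::strlen(narrow) == 5);             // counts bytes
    assert(std::wcslen(wide) == 5);               // counts wchar_t units
    assert(std::wcscmp(wide, L"hello") == 0);     // wide version of strcmp
    assert(std::wcschr(wide, L'l') == wide + 2);  // wide version of strchr
}
```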
Why do we need to define these functions? The most fundamental reason is that every ANSI string is terminated by a single '\0' byte (a Unicode string ends with a two-byte terminator, "\0\0"), and many string functions rely on this to work correctly. With wide characters, however, each character occupies one word in memory, so the string functions written for ANSI characters no longer work correctly. Take the string "hello" as an example.
The five characters are:
0x0048 0x0065 0x006c 0x006c 0x006f
In memory (little-endian byte order), the actual layout is:
48 00 65 00 6C 00 6C 00 6F 00
Therefore, when an ANSI string function such as strlen reaches the first 00 after 48, it considers the string finished. For a wide string like this one, whose first character is in the ASCII range, strlen will always return 1!
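This can be reproduced on any platform by laying the bytes out by hand. The sketch below stores "hello" as little-endian 16-bit units with a two-byte terminator; the array and function names are illustrative, and utf16_length is a hand-written helper, not a library function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "hello" as 16-bit little-endian units, followed by the two-byte terminator.
const unsigned char hello_wide_bytes[] = {
    0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F, 0x00, 0x00, 0x00
};

// What strlen sees: it stops at the 00 that follows 48.
std::size_t narrow_length_of_wide_bytes() {
    return std::strlen(reinterpret_cast<const char*>(hello_wide_bytes));
}

// Counting 16-bit units until both bytes of a unit are zero.
std::size_t utf16_length(const unsigned char* bytes) {
    std::size_t units = 0;
    while (bytes[2 * units] != 0 || bytes[2 * units + 1] != 0) {
        ++units;
    }
    return units;
}
```

Here narrow_length_of_wide_bytes() returns 1, while utf16_length(hello_wide_bytes) returns 5, which is what a wide string function has to do.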
II. Conversion between wide characters and ASCII characters
On Windows, conversion between wide characters and ASCII (multi-byte) strings is mostly done with two Windows API functions, MultiByteToWideChar and WideCharToMultiByte.
Consider this code:
#include <windows.h>
#include <iostream>
using namespace std;

int main()
{
    wchar_t wtext[] = {L"宽字符转换实例!OK!"};
    cout << sizeof(wtext) << endl; // bytes occupied by the wide string: 24
    // Get the number of bytes needed to convert the wide array wtext to a multi-byte string
    DWORD dwnum = WideCharToMultiByte(CP_OEMCP, 0, wtext, -1, NULL, 0, NULL, NULL);
    cout << dwnum << endl; // 19 bytes with a GBK code page, including the '\0'
    char *pstext = new char[dwnum];
    // Perform the conversion; the result is written to the memory at pstext
    WideCharToMultiByte(CP_OEMCP, 0, wtext, -1, pstext, dwnum, NULL, NULL);
    cout << pstext << endl;
    delete[] pstext;
    return 0;
}
The code above converts a wide-character string to an ASCII (multi-byte) string.
#include <windows.h>
#include <clocale>
#include <iostream>
using namespace std;

int main()
{
    char stext[] = {"多字节字符串!OK!"};
    cout << sizeof(stext) << endl; // 17 bytes with a GBK code page, including the '\0'
    // Get the number of wide characters needed for the conversion
    DWORD dwnum = MultiByteToWideChar(CP_ACP, 0, stext, -1, NULL, 0);
    cout << dwnum << endl; // 11 wide characters; the '\0' is converted too
    wchar_t *pwtext = new wchar_t[dwnum];
    MultiByteToWideChar(CP_ACP, 0, stext, -1, pwtext, dwnum); // perform the conversion
    setlocale(LC_ALL, ""); // must be set before wide characters can be output
    wcout << pwtext << endl; // note: wide characters are output with wcout
    delete[] pwtext;
    return 0;
}
The code above converts an ASCII (multi-byte) string to a wide-character string.
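Outside Windows, the same two-step pattern (ask for the required length first, then convert into a buffer of that size) can be sketched with the standard C++ functions mbsrtowcs and wcsrtombs. This is a swapped-in, portable alternative rather than the Windows API shown above; the helper names widen and narrow are hypothetical, and the result depends on the current C locale (in the default "C" locale only plain ASCII text is guaranteed to convert).

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Multi-byte -> wide, using the encoding of the current C locale.
std::wstring widen(const std::string& text) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = text.c_str();
    // First call: null destination, returns the number of wide characters needed.
    std::size_t count = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (count == static_cast<std::size_t>(-1)) return std::wstring();  // invalid input
    std::vector<wchar_t> buffer(count + 1);
    state = std::mbstate_t();
    src = text.c_str();
    std::mbsrtowcs(buffer.data(), &src, buffer.size(), &state);  // second call converts
    return std::wstring(buffer.data(), count);
}

// Wide -> multi-byte, using the encoding of the current C locale.
std::string narrow(const std::wstring& text) {
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = text.c_str();
    // First call: null destination, returns the number of bytes needed.
    std::size_t count = std::wcsrtombs(nullptr, &src, 0, &state);
    if (count == static_cast<std::size_t>(-1)) return std::string();  // not representable
    std::vector<char> buffer(count + 1);
    state = std::mbstate_t();
    src = text.c_str();
    std::wcsrtombs(buffer.data(), &src, buffer.size(), &state);  // second call converts
    return std::string(buffer.data(), count);
}
```

To convert non-ASCII text this way, the program would first need to select a suitable locale with setlocale, just as the wcout example above does.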
By now, you should have a general understanding of wide-character Unicode and multi-byte ASCII encoding.