Unicode and string processing

Last Update:2013-12-08 Source: Internet

Author: User



ASCII stands for American Standard Code for Information Interchange. It originated in the late 1950s and was finalized in 1967. ASCII is a 7-bit code with 128 code points in total: 26 lowercase letters, 26 uppercase letters, 10 digits, 32 symbols, 33 control codes, and a space. ASCII is widely used and very reliable, but it is a truly American standard and does not even meet the needs of other English-speaking countries. For example, it has no pound sign (£).

Some writing systems, such as Chinese, have far more characters than the 256 values a single byte can represent. Double-byte character sets (DBCS) were created to support them. In a DBCS, a character consists of one or two bytes. Working with DBCS is a nightmare for programmers, because the code must check whether each byte is the lead byte of a two-byte character.

In contrast to the chaos of DBCS, Unicode text on Windows uses a 16-bit encoding, UTF-16. UTF stands for Unicode Transformation Format. UTF-16 encodes the most commonly used characters as a single 16-bit code unit (2 bytes); characters outside the Basic Multilingual Plane take two code units, called a surrogate pair. Because most text uses one code unit per character, an application can traverse a string and compute its length easily.

1. The char data type

C uses the char data type to represent an 8-bit ANSI character. When a string is declared in code, the compiler converts its characters into an array of 8-bit char values. For example:

[cpp]
char c = 'a';
char szBuffer[100] = "A String";

You can define a pointer to a string:

[cpp]
char *p;

On 32-bit Windows, the pointer variable p needs 4 bytes of storage. You can also define and initialize a pointer to a string:

[cpp]
const char *p = "Hello!";  // const: a string literal must not be modified

As before, the variable p needs only 4 bytes.
The string itself is stored in static memory and occupies 7 bytes: 6 for the characters and 1 for the terminating '\0'.

2. The wchar_t type

The Microsoft C/C++ compiler defines a built-in data type, wchar_t, which represents a 16-bit Unicode (UTF-16) character. Unicode characters and strings are declared as follows:

[cpp]
wchar_t c = L'a';
wchar_t szBuffer[100] = L"A String";

The uppercase L before the literal tells the compiler to compile the string as a Unicode string. When the compiler places the string in the program's data segment, each character is encoded in UTF-16.

To set itself apart from the C language, the Windows development team defined its own data types:

[cpp]
typedef char CHAR;      // an 8-bit character
typedef wchar_t WCHAR;  // a 16-bit character

Windows also defines a series of convenience types for pointers to characters and strings:

[cpp]
// Pointers to 8-bit characters and strings
typedef CHAR *PCHAR;
typedef CHAR *PSTR;
typedef const CHAR *PCSTR;

// Pointers to 16-bit characters and strings
typedef WCHAR *PWCHAR;
typedef WCHAR *PWSTR;
typedef const WCHAR *PCWSTR;

3. Maintaining a single source code base

When writing code, you can use either ANSI or Unicode characters and strings. So that the same source compiles either way, Windows defines the following types and macros:

[cpp]
#ifdef UNICODE
typedef WCHAR TCHAR, *PTCHAR, *PTSTR;
typedef const WCHAR *PCTSTR;
#define __TEXT(quote) L##quote
#else
typedef CHAR TCHAR, *PTCHAR, *PTSTR;
typedef const CHAR *PCTSTR;
#define __TEXT(quote) quote
#endif
#define TEXT(quote) __TEXT(quote)

With these types and macros, the same code compiles as either ANSI or Unicode:

[cpp]
// A 16-bit character if UNICODE is defined, otherwise an 8-bit character
TCHAR c = TEXT('A');
// A 16-bit string if UNICODE is defined, otherwise an 8-bit string
TCHAR szBuffer[100] = TEXT("A String");

4. Unicode and ANSI functions in Windows

If the parameter list of a Windows function contains a string, the function usually has two versions. For example, MessageBox has two entry points: MessageBoxA accepts ANSI strings and MessageBoxW accepts Unicode strings. MessageBoxA is declared as follows:

[cpp]
int WINAPI MessageBoxA(HWND hWnd, LPCSTR lpText, LPCSTR lpCaption, UINT uType);

MessageBoxW is declared as follows:

[cpp]
int WINAPI MessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption, UINT uType);

The second and third parameters point to 8-bit strings in MessageBoxA and to 16-bit strings in MessageBoxW. When writing code, you only need to call MessageBox. MessageBoxA or MessageBoxW is selected automatically depending on whether the UNICODE symbol is defined:

[cpp]
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif

5. Unicode and ANSI functions in the C run-time library

In the C run-time library, strlen returns the length of an ANSI string. Its counterpart, wcslen, returns the length of a Unicode string. For convenience, the following macro is defined. Note that the C run-time library checks _UNICODE (with a leading underscore), whereas the Windows headers check UNICODE:

[cpp]
#ifdef _UNICODE
#define _tcslen wcslen
#else
#define _tcslen strlen
#endif

With this macro, you only need to call _tcslen in your code to get the length of a string.

6. Recommended ways to work with characters and strings

  1. Think of a text string as an array of characters, not as an array of char or bytes.
  2. Use generic data types (such as TCHAR and PTSTR) for text characters and strings.
  3. Use explicit data types (such as BYTE and PBYTE) for bytes, pointers to bytes, and data buffers.
  4. Use the TEXT or _T macro for literal characters and strings, but pick one and use it consistently instead of mixing them.
  5. Define the UNICODE and _UNICODE symbols together, or define neither.
  6. Avoid the printf family of functions, especially the %s and %S field types, for converting between ANSI and Unicode strings. Use the MultiByteToWideChar and WideCharToMultiByte functions instead.
  7. Adjust string arithmetic. For example, functions usually expect a buffer size in characters, not in bytes, so pass _countof(szBuffer) instead of sizeof(szBuffer). Memory, on the other hand, is allocated in bytes, so to allocate a block for a string call malloc(nCharacters * sizeof(TCHAR)) rather than malloc(nCharacters).
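The last recommendation (count in characters, allocate in bytes) can be illustrated with a minimal, portable sketch. This is not Windows code: XCHAR, count_of, alloc_chars, and copy_chars are hypothetical names introduced here for illustration. They stand in for TCHAR (with UNICODE defined), the _countof macro, and a typical function that takes a buffer size in characters. wchar_t is 16 bits with the Microsoft compiler but is often 32 bits elsewhere, which is one more reason to compute byte sizes with sizeof rather than hard-coding them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cwchar>

// Hypothetical stand-in for TCHAR when UNICODE is defined.
typedef wchar_t XCHAR;

// Hypothetical stand-in for _countof: the number of ELEMENTS in a
// fixed-size array, as opposed to sizeof, which gives the number of BYTES.
template <typename T, std::size_t N>
constexpr std::size_t count_of(const T (&)[N]) {
    return N;
}

// Allocate room for n_chars characters plus the terminating null.
// malloc works in bytes, so the character count is scaled by sizeof(XCHAR).
XCHAR* alloc_chars(std::size_t n_chars) {
    return static_cast<XCHAR*>(std::malloc((n_chars + 1) * sizeof(XCHAR)));
}

// Copy src into dst. dst_count is the capacity of dst in CHARACTERS,
// including the terminator. The copy is truncated if needed and is always
// null-terminated. Returns the number of characters copied.
std::size_t copy_chars(XCHAR* dst, std::size_t dst_count, const XCHAR* src) {
    assert(dst != nullptr && src != nullptr && dst_count > 0);
    std::size_t n = std::wcslen(src);
    if (n >= dst_count) {
        n = dst_count - 1;  // leave room for the terminator
    }
    std::wmemcpy(dst, src, n);  // wmemcpy also counts in characters
    dst[n] = L'\0';
    return n;
}
```

Passing sizeof(buf) where count_of(buf) is expected would overstate the capacity by a factor of sizeof(XCHAR), which is exactly the kind of buffer overrun the recommendation is meant to prevent.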

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
