Windows programming--wide character sets and Unicode

Source: Internet
Author: User

From ASCII code to Unicode

Double-byte Character set

So far, we have seen character sets of 256 characters that extend ASCII. But the ideographs used in China, Japan, and Korea number about 21,000. How can these languages be accommodated while still maintaining some compatibility with ASCII?

The solution (if it can be called that) is the double-byte character set (DBCS). A DBCS starts with 256 codes, just like ASCII. As with any well-behaved code page, the first 128 codes are ASCII. However, some of the upper 128 codes are always followed by a second byte. These two bytes together (called the lead byte and the trail byte) define a single character, usually a complex ideograph.

Although Chinese, Japanese, and Korean share some of the same ideographs, the three languages are clearly different, and the same ideograph often means three different things in the three languages. Windows supports four different double-byte character sets: code page 932 (Japanese), 936 (Simplified Chinese), 949 (Korean), and 950 (Traditional Chinese). DBCS is supported only in the versions of Windows produced for these countries (regions).

The problem with double-byte character sets is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This causes additional programming problems. For example, the number of characters in a string cannot be determined from the number of bytes in the string. The string must be parsed to determine its length, and each byte must be examined to see whether it is the lead byte of a double-byte character. If you have a pointer into the middle of a DBCS string, what is the address of the previous character in the string? The usual solution is to parse the string again from its beginning!
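The parsing described above can be sketched in portable C. This is a minimal illustration, not Windows API code: the function names `is_lead_byte_932`, `dbcs_char_count`, and `dbcs_prev` are hypothetical, and the lead-byte ranges are hard-coded for code page 932 (Shift-JIS) as an assumption. A real Windows program would more likely ask the system whether a byte is a lead byte.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: lead-byte ranges of code page 932 (Shift-JIS), hard-coded
   for illustration only. */
static int is_lead_byte_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Number of bytes occupied by the character starting at s. */
static size_t dbcs_char_size(const char *s)
{
    /* A lead byte with no trail byte after it is treated as one byte. */
    return (is_lead_byte_932((unsigned char)s[0]) && s[1] != '\0') ? 2 : 1;
}

/* Count characters (not bytes) in a NUL-terminated DBCS string. */
size_t dbcs_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        s += dbcs_char_size(s);
        count++;
    }
    return count;
}

/* Return the start of the character that precedes pos. There is no way
   to step backward safely, so the scan restarts at the beginning of the
   string. If pos is the start of the string, start is returned. */
const char *dbcs_prev(const char *start, const char *pos)
{
    const char *prev = start;
    const char *cur = start;
    assert(start != NULL && pos != NULL);
    while (cur < pos && *cur != '\0') {
        prev = cur;
        cur += dbcs_char_size(cur);
    }
    return prev;
}
```

For the 5-byte string `"A\x93\xFA\x96\x7B"` (the letter A followed by two double-byte characters), `dbcs_char_count` returns 3. Note that the last trail byte, 0x7B, is also the ASCII code for `{`, which is exactly why a byte cannot be interpreted without knowing what came before it.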

Unicode Solutions

The basic problem we face is that the world's written languages cannot be represented by 256 8-bit codes. The previous solutions, code pages and DBCS, have proven clumsy and unsatisfying. So what is the real solution?

As programmers, we have met this kind of problem before. If there are too many things to represent with 8-bit values, we try a wider value, such as a 16-bit value. That is exactly the reasoning behind Unicode. Unlike the confusion of multiple 256-character code pages, or double-byte character sets that mix 1-byte and 2-byte codes, Unicode as described here is a uniform 16-bit system, allowing 65,536 characters to be represented. This was intended to be sufficient for all the characters of the world's written languages, including those that use ideographs, along with a collection of mathematical, symbol, and currency characters. (Later versions of Unicode grew beyond 65,536 code points; Windows stores text as UTF-16, in which characters outside the first 65,536 occupy two 16-bit code units.)

It is important to understand the difference between Unicode and DBCS. Unicode is a "wide character" set (particularly in the context of the C programming language). Each character in Unicode is 16 bits wide instead of 8 bits wide. In Unicode, an 8-bit value has no meaning on its own. In contrast, in a double-byte character set we still deal with 8-bit values. Some bytes define characters by themselves, while others indicate that a further byte is needed to define the character.

Handling DBCS strings is messy, but working with Unicode text is much like working with ordinary text. You might be pleased to learn that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are the ASCII characters, and the next 128 Unicode characters (codes 0x0080 through 0x00FF) are the ISO 8859-1 extensions to ASCII. Other parts of Unicode are likewise based on existing standards, which eases conversion. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic uses codes 0x0400 through 0x04FF, Armenian uses codes 0x0530 through 0x058F, and Hebrew uses codes 0x0590 through 0x05FF. The Chinese, Japanese, and Korean ideographs (collectively called CJK) occupy codes 0x3000 through 0x9FFF.

The biggest benefit of Unicode is that there is only one character set, so there is no ambiguity. Unicode is the result of cooperation among almost every important company in the personal computer industry, and it corresponds code for code with the ISO 10646-1 standard. An important reference is The Unicode Standard, Version 2.0 (Addison-Wesley, 1996). It is an extraordinary book that reveals the richness and diversity of the world's written languages in a way few other documents do. The book also provides the rationale and details behind the development of Unicode.

Does Unicode have any drawbacks? Of course. A Unicode string occupies twice as much memory as an ASCII string. (Compressing files, however, greatly reduces the disk space they occupy.) But perhaps the worst drawback is that people are still relatively unaccustomed to using Unicode. As programmers, changing that is our job.
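Because every character is a single 16-bit value, finding the script a character belongs to is a plain range comparison, with no lead-byte parsing. The sketch below encodes only the ranges quoted in this section (0x0400 to 0x04FF being the Cyrillic block and 0x0530 to 0x058F the Armenian block); the enum and the function name `unicode_block` are hypothetical, and the table is far from a complete list of Unicode blocks.

```c
#include <stdint.h>

/* Hypothetical labels for the code ranges mentioned in the text. */
enum unicode_block {
    BLOCK_ASCII,      /* 0x0000 - 0x007F */
    BLOCK_LATIN1,     /* 0x0080 - 0x00FF, ISO 8859-1 extensions */
    BLOCK_GREEK,      /* 0x0370 - 0x03FF */
    BLOCK_CYRILLIC,   /* 0x0400 - 0x04FF */
    BLOCK_ARMENIAN,   /* 0x0530 - 0x058F */
    BLOCK_HEBREW,     /* 0x0590 - 0x05FF */
    BLOCK_CJK,        /* 0x3000 - 0x9FFF */
    BLOCK_OTHER       /* anything not listed above */
};

/* Classify one 16-bit code by simple range checks. */
enum unicode_block unicode_block(uint16_t code)
{
    if (code <= 0x007F) return BLOCK_ASCII;
    if (code <= 0x00FF) return BLOCK_LATIN1;
    if (code >= 0x0370 && code <= 0x03FF) return BLOCK_GREEK;
    if (code >= 0x0400 && code <= 0x04FF) return BLOCK_CYRILLIC;
    if (code >= 0x0530 && code <= 0x058F) return BLOCK_ARMENIAN;
    if (code >= 0x0590 && code <= 0x05FF) return BLOCK_HEBREW;
    if (code >= 0x3000 && code <= 0x9FFF) return BLOCK_CJK;
    return BLOCK_OTHER;
}
```

For example, `unicode_block(0x0041)` yields `BLOCK_ASCII` for the letter A, and `unicode_block(0x4E2D)` yields `BLOCK_CJK`.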

Wide characters and C

For C programmers, the whole idea of 16-bit characters can be unsettling. That a char is the same width as a byte is one of the few certainties we have. Few programmers are aware that ANSI/ISO 9899-1990, the "American National Standard for Programming Languages - C" (also known as "ANSI C"), supports character sets that need more than one byte per character through a concept called "wide characters." These wide characters coexist well with ordinary characters.

ANSI C also supports multibyte character sets, such as those supported by the Chinese, Japanese, and Korean versions of Windows. However, a multibyte string is treated as a string of single-byte values in which some bytes alter the meaning of the bytes that follow. Multibyte character sets mostly affect the C run-time library functions. In contrast, wide characters are uniformly wider than normal characters, and they raise some compiler issues.

Wide characters are not necessarily Unicode. Unicode is one possible wide-character encoding. However, because the focus here is on Windows rather than on an abstract implementation of C, I will use wide characters and Unicode as synonyms.

Definition of "character" in Windows programming

Standard character definitions in Win32:

/* Standard C character definitions (each line is a separate illustration) */
char c = 'A';         // c occupies 1 byte and is initialized to 0x41, the ASCII code of the letter A
char * p;             // In 32-bit Windows the pointer p occupies 4 bytes (8 bytes in a 64-bit build)
char * p = "hello!";  // p still occupies 4 bytes; the string is stored in static memory and occupies 7 bytes: 6 for the characters and 1 for the terminating 0
char a[10];           // The compiler reserves 10 bytes for the array; sizeof (a) returns 10
char a[] = "hello!";  // An array can be sized by its initializer; including the terminating '\0' it requires 7 bytes

Unicode, or wide characters, does not change the meaning of the char data type in C. A char is still 1 byte of storage, and sizeof (char) still returns 1. In theory, a byte in C can be wider than 8 bits, but for most of us a byte (and hence a char) is 8 bits wide.

// Wide characters in C are based on the wchar_t data type, which is defined
// in several header files, including WCHAR.H, like this:
typedef unsigned short wchar_t;

// So wchar_t is the same as an unsigned short integer: 16 bits wide.
// To define a variable that contains a single wide character:
wchar_t c = 'A';      // the compiler widens the character; c holds the 2-byte value 0x0041

// To define a wide-character string:
wchar_t * p = L"hello!";
static wchar_t a[] = L"hello!";

// Note the uppercase letter L (for "long") immediately before the first quotation mark.
// It tells the compiler to store the string with wide characters, that is, with each character occupying 2 bytes.

As before, the pointer variable p takes 4 bytes, while the string itself requires 14 bytes: 2 bytes for each of the 6 characters, plus 2 bytes for the terminating 0.
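The byte counts above can be checked with `sizeof`. The following is a small sketch with hypothetical helper names (`narrow_bytes`, `wide_bytes`, `wide_chars`). One assumption to keep in mind: the 14-byte figure holds for compilers where `wchar_t` is 16 bits wide, as with Microsoft's; on many other platforms `wchar_t` is 32 bits wide, so the code expresses the size in terms of `sizeof (wchar_t)` rather than a fixed number.

```c
#include <stddef.h>
#include <wchar.h>

static const char    narrow[] = "hello!";   /* 6 chars + terminating '\0' */
static const wchar_t wide[]   = L"hello!";  /* 6 wide chars + terminating L'\0' */

/* Storage used by the narrow string, in bytes. */
size_t narrow_bytes(void)
{
    return sizeof narrow;
}

/* Storage used by the wide string, in bytes: 7 * sizeof (wchar_t). */
size_t wide_bytes(void)
{
    return sizeof wide;
}

/* Number of elements in the wide array, terminator included. */
size_t wide_chars(void)
{
    return sizeof wide / sizeof wide[0];
}
```

Here `narrow_bytes()` is 7 everywhere, `wide_chars()` is also 7, and `wide_bytes()` is 14 only where `sizeof (wchar_t)` is 2. Dividing by the element size, as `wide_chars` does, is the usual way to turn a byte count back into a character count.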

However, some problems remain to be solved:

1. strlen () works correctly only on standard C char strings, so every function that takes a char * parameter needs a wide-character counterpart.

2. How can code be written so that it can be built both for systems that use single-byte characters and for systems that use wide characters?

For the first question: wide-character versions of these functions already exist. A Windows program only needs #include <Windows.h>; the functions themselves are declared in <string.h> and <wchar.h>. For example, the wide-character version of strlen is wcslen (wide-character string length).

// The wide-character functions are declared in <string.h> and <wchar.h>
// strlen is declared as:
size_t __cdecl strlen (const char *);
// wcslen is declared as:
size_t __cdecl wcslen (const wchar_t *);

// Then:
wchar_t pw[] = L"hello!";
int iLength = wcslen (pw);
int size = sizeof (pw);
// iLength = 6, size = 14 (with a 2-byte wchar_t)
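The difference between a length in characters (what `wcslen` returns) and a size in bytes (what `sizeof` returns) matters most when sizing buffers. Below is a minimal sketch using only standard functions from `<wchar.h>`; the helper `copy_wide` and the macro `ELEMENT_COUNT` are hypothetical names introduced for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Number of elements in an array, independent of the element width. */
#define ELEMENT_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Copy src into dst, truncating if needed. dst_count is the capacity of
   dst in wide characters, NOT in bytes. Returns the number of characters
   copied, excluding the terminator. */
size_t copy_wide(wchar_t *dst, size_t dst_count, const wchar_t *src)
{
    size_t len;
    assert(dst != NULL && src != NULL);
    if (dst_count == 0)
        return 0;
    len = wcslen(src);            /* length in characters */
    if (len >= dst_count)
        len = dst_count - 1;      /* leave room for the terminator */
    wmemcpy(dst, src, len);       /* copies len wide characters */
    dst[len] = L'\0';
    return len;
}
```

A call such as `copy_wide (buf, ELEMENT_COUNT (buf), L"hello!")` passes the capacity in characters. Passing `sizeof (buf)` instead would overstate the capacity by a factor of `sizeof (wchar_t)` and invite a buffer overrun.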

For the second question: it arises because using Unicode also has drawbacks. The first and most important is that every string in the program occupies twice as much storage. In addition, the wide-character functions in the run-time library are larger than the regular functions. For this reason, you might want to build two versions of a program, one that handles ASCII strings and another that handles Unicode strings. The best solution is to maintain a single source code file that can be compiled for either ASCII or Unicode.

Even for a small program this is awkward: the run-time library functions have different names, the characters are defined with different types, and string literals need the L prefix in one build but not in the other.

One approach is to use the TCHAR.H header file included with Microsoft Visual C++. This header file is not part of the ANSI C standard, so every function and macro defined in it is preceded by an underscore (_). TCHAR.H provides a set of alternative names (for example, _tprintf and _tcslen) for the standard run-time library functions that take string parameters. These are sometimes called "generic" function names because they can refer to either the Unicode or the non-Unicode version of a function.

The generic names are implemented by the preprocessor: conditional compilation with #ifdef _UNICODE defines each alternative name differently.

// How the header file <TCHAR.H> maps the alternative names to different versions

// If the _UNICODE identifier is defined, _TCHAR is wchar_t:
typedef wchar_t _TCHAR;

// If the _UNICODE identifier is not defined, _TCHAR is char:
typedef char _TCHAR;
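The same conditional-compilation pattern can be sketched in portable C without TCHAR.H. Everything named `DEMO_UNICODE`, `demo_tchar`, `DEMO_T`, and `demo_tcslen` below is hypothetical and only imitates the roles that `_UNICODE`, `_TCHAR`, the literal-wrapping macro, and `_tcslen` play in the real header; the actual header is more elaborate.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for the generic-text names. Compile with
   -DDEMO_UNICODE (or /DDEMO_UNICODE) to select the wide versions. */
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x)    L##x      /* paste an L prefix onto the literal */
#define demo_tcslen  wcslen
#else
typedef char demo_tchar;
#define DEMO_T(x)    x         /* leave the literal unchanged */
#define demo_tcslen  strlen
#endif

/* The same source line works in both builds. */
static const demo_tchar greeting[] = DEMO_T("hello!");

/* Length in characters: identical in both builds. */
size_t greeting_length(void)
{
    return demo_tcslen(greeting);
}

/* Storage in bytes: depends on the character width chosen at build time. */
size_t greeting_bytes(void)
{
    return sizeof greeting;
}
```

In either build `greeting_length ()` is 6, while `greeting_bytes ()` is 7 times the size of the selected character type. The source code never mentions char, wchar_t, or the L prefix directly, which is what allows a single file to be compiled both ways.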
