"Windows Programming" series fourth: Programming with Unicode

Source: Internet
Author: User

The earlier parts covered text and font output in Windows programs, and their examples used the Windows TEXT macro. Some readers left messages asking about ANSI and Unicode programming, so this part covers the basics of programming for ANSI and Unicode on Windows.

Computers were born in the United States, so English was the first language used to interact with them. English has only 26 letters, and a single byte (range -128 to 127 as a signed value) is more than enough to represent those letters and some commonly used control characters. This is the ASCII encoding. As a result, the earliest programming languages represented strings as byte arrays, and that met all the needs of programming at the time. As computers spread beyond the English-speaking countries, character encoding became a problem, because many languages have more characters than one byte can represent. Chinese, for example, has more than 4,000 commonly used characters, and far more once the less common ones are included. A one-byte string encoding no longer works, so two-byte and multibyte encodings naturally appeared.

Besides basic ASCII, the character encodings in common use today include MBCS, GB2312, GBK, BIG5, UTF-8, UTF-16, and UTF-32. In Windows programming, "Unicode" in practice means the UTF-16 encoding. All current systems support multibyte encodings. Windows 98 and earlier supported Unicode poorly, and many system functions had to convert strings before processing them. Starting with Windows NT, the system kernel was rewritten to use Unicode almost throughout, and non-Unicode strings are converted before being passed into the kernel.

When the C language was created, multibyte strings were not yet an issue, and Unicode and similar encodings did not exist. The standard C library functions assume ASCII when processing strings, so using them on multibyte character encodings causes problems, and each system handles such encodings in its own way internally. So the question is: since the standard C string functions were designed around single-byte characters, how do we program with Unicode? How do we specify whether a string in a program is ASCII or Unicode, or use both in the same program? Better yet, how do we write a program that can be compiled as either an ASCII or a Unicode (wide-character, below) version as needed? This article addresses these questions.

The C/C++ compiler provided by Microsoft offers the wchar_t type, which in C is an unsigned 16-bit integer defined through a typedef. We use it to define wide characters and wide strings, while ordinary ANSI characters and strings still use the standard C char.

    • Use of wide strings

Let's compare how ASCII and Unicode characters, strings, and constants are defined.

ASCII version:

char c = 'A';
char str[] = "Hello, world";

Wide character version:

wchar_t wch = L'A';
wchar_t wstr[] = L"Hello, World";

The compiler uses the capital letter "L" prefix to recognize that the character or string literal that follows is to be compiled as Unicode. Note that there must be no space between the L and the opening quote.
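To make the difference concrete, here is a minimal sketch in standard C that compares the storage needed by the two kinds of string. The helper names wide_storage_bytes and narrow_storage_bytes are hypothetical. Note that sizeof(wchar_t) is 2 with the Microsoft compiler but may differ elsewhere (commonly 4 on Linux), so the code does not hard-code it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: bytes occupied by a wide string, terminator included. */
size_t wide_storage_bytes(const wchar_t *ws)
{
    assert(ws != NULL);
    return (wcslen(ws) + 1) * sizeof(wchar_t);
}

/* Same idea for a narrow (char) string. */
size_t narrow_storage_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

Both "Hello" and L"Hello" contain 5 characters, but the narrow string needs 6 bytes while the wide one needs 6 * sizeof(wchar_t) bytes.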

Look at the following example:

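#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char c = 'A';
    char str[] = "Hello, ANSI";
    wchar_t wch = L'A';
    wchar_t wstr[] = L"Hello, Unicode";

    printf("1 --> %c\n", c);
    printf("2 --> %s\n", str);
    printf("3 --> %c\n", wch);       // wide character with narrow %c
    printf("4 --> %s\n", wstr);      // deliberate mismatch: wide string with narrow %s
    printf("5 --> %C\n", wch);       // uppercase %C: wide character
    printf("6 --> %S\n", wstr);      // uppercase %S: wide string
    wprintf(L"7 --> %c\n", wch);
    wprintf(L"8 --> %s\n\n", wstr);  // in Microsoft's wprintf, %s means a wide string

    system("pause");
    return 0;
}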

From the output of this small program we can see the following:

    1. printf outputs ANSI characters and strings, as expected.
    2. wprintf outputs Unicode characters and strings.
    3. printf can also output wide characters and strings when given the uppercase format specifiers %C and %S.
    4. Outputs 3 and 4 pass wide data to printf with the lowercase %c and %s. The wide character appears to print correctly, but the wide string prints only its first character. Even the character works only by luck. A Windows Unicode character is two bytes, so the wide character "A" has the value 0x0041. Windows is a little-endian system, so it sits in memory as "41 00", and %c prints just the low byte 0x41, which is "A". Output 4 prints only "H" for the same reason. The string wstr is laid out in memory as "48 00 65 00 6C 00 6C 00 6F 00 ...".

The first character, "H", is stored as "48 00" on a little-endian system. By the rules of C, a string ends at the null character 0x00. When printf handles %s it does not know that the data is wide; it reads bytes until it meets a null, which here is the very second byte, so only "H" is printed.
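The effect can be reproduced without any Windows function. The following is a small sketch, assuming the UTF-16 little-endian layout described above; the array utf16le_hello and the helper length_as_narrow are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hello" encoded as UTF-16 little-endian, the way the Microsoft compiler
   stores L"Hello": each ASCII letter is followed by a 00 byte. */
static const unsigned char utf16le_hello[] = {
    0x48, 0x00,  /* H */
    0x65, 0x00,  /* e */
    0x6C, 0x00,  /* l */
    0x6C, 0x00,  /* l */
    0x6F, 0x00,  /* o */
    0x00, 0x00   /* terminating wide null */
};

/* Hypothetical helper: the length seen by byte-oriented code such as
   printf("%s"), which stops at the first 00 byte. */
size_t length_as_narrow(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}
```

Calling length_as_narrow(utf16le_hello) yields 1: byte-oriented code sees the single character "H" followed immediately by a terminator.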

The same is true of the scanf function:

scanf("%s", str);     // normal C usage
scanf("%s", wstr);    // runs, but what is stored is an ANSI-format string
scanf("%S", wstr);    // correctly receives a wide-character string
wscanf(L"%s", wstr);  // the usual way to receive a wide-character string

Using the uppercase %S (and %C) with printf and scanf to handle wide characters is a Microsoft extension and is not necessarily supported by other compilers.
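For code that should also build with other compilers, standard C offers the %ls specifier for wchar_t strings. The sketch below uses the standard swprintf function; the helper name format_greeting is hypothetical. In Microsoft's wide-character functions %s already means a wide string, whereas in standard C it means a narrow one, but %ls means a wide string in both.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: builds L"Hello, <name>" in out (capacity counted in
   wchar_t). Returns the number of wide characters written, or a negative
   value on failure. */
int format_greeting(wchar_t *out, size_t capacity, const wchar_t *name)
{
    assert(out != NULL && name != NULL);
    /* %ls is the standard C specifier for a wchar_t string argument. */
    return swprintf(out, capacity, L"Hello, %ls", name);
}
```

With a buffer of 32 wide characters, format_greeting(buf, 32, L"Unicode") fills buf with L"Hello, Unicode".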

    • Unicode string support functions

From the above we can see that wide-character and wide-string constants take an uppercase "L" prefix, which tells the compiler to treat what follows as Unicode rather than ANSI. printf and scanf likewise have wide-character counterparts, wprintf and wscanf. MSDN shows that the character and string functions generally come in two versions: for example, _wfopen, _getws, wcslen, wcscpy, and wcscat are the wide-character versions of the standard C functions fopen, gets, strlen, strcpy, and strcat. Beyond these C functions, Windows APIs that take strings also have ANSI and Unicode versions. CreateWindowA and CreateProcessA, which create windows and processes, are the ANSI versions; the corresponding CreateWindowW and CreateProcessW are the Unicode versions, and the strings they take must be wchar_t strings.

In a program we can use the ANSI functions on char strings and the Unicode functions on wchar_t strings, as in the example above, but the function and the string type must match. Otherwise a compile error may result or, more troublingly, the code may compile yet produce the wrong result, like the 4th output above.
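As an illustration of keeping function and string type matched, here is a minimal sketch that uses only the wide-character functions on wchar_t data. The helper name join_wide and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: writes a followed by b into dst.
   Returns 0 on success, -1 if dst (capacity counted in wchar_t) is too small. */
int join_wide(wchar_t *dst, size_t capacity, const wchar_t *a, const wchar_t *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    if (wcslen(a) + wcslen(b) + 1 > capacity)  /* wide counterpart of strlen */
        return -1;
    wcscpy(dst, a);                            /* wide counterpart of strcpy */
    wcscat(dst, b);                            /* wide counterpart of strcat */
    return 0;
}
```

Every call here takes wchar_t arguments, so a mismatch such as passing a wide string to strlen cannot slip in unnoticed.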

Unless it is truly necessary, do not switch back and forth between ANSI and Unicode within one program; doing so hurts future portability and makes multilingual support and internationalization harder. Writing programs in Unicode is strongly recommended, and it is the clear trend. If you want to port a Windows program from the PC to Windows CE, Microsoft's embedded platform, it must be Unicode: for simplicity and generality, Windows CE supports only Unicode. Unicode is also more efficient, because the Windows kernel now works entirely in Unicode; an ANSI string passed to it must first be converted to a Unicode string before the internal functions process it.

    • Supporting both encodings at the same time

The ideal, of course, is to write one body of source code that can be compiled as an ANSI version or a Unicode version as needed, which gives the best portability and generality. Microsoft thought of this long ago.

Microsoft provides a set of platform-specific macros that wrap the standard C string functions. These macros are defined by Microsoft itself and are available only on Windows; they are not part of the standard. Each one resolves to a different function depending on the build: if the macro _UNICODE is defined, the Unicode version of the C/C++ run-time function is used; otherwise the ANSI version is used. Take strlen as an example of how this is defined:

#ifdef _UNICODE
#define _tcslen    wcslen
#else
#define _tcslen    strlen
#endif

Here _tcslen is the platform-specific macro for getting a string's length in characters, and we can use it just like a function. If _UNICODE is defined, _tcslen becomes wcslen at compile time; otherwise it becomes strlen. Opening the VS header file tchar.h shows many macros that begin with an underscore; these are the platform-specific versions of the common string-handling library functions.

To use these functions, include this header file.

In addition, if the UNICODE macro is defined, the Windows API also uses its Unicode versions; otherwise the ANSI versions are used. For example, CreateWindow is defined as follows:

#ifdef UNICODE
#define CreateWindow  CreateWindowW
#else
#define CreateWindow  CreateWindowA
#endif // !UNICODE

So CreateWindow is actually a macro, but that does not stop us from using it like a function. The same holds for the other Windows APIs that take strings as parameters.
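The same A/W naming pattern can be sketched with ordinary functions. In the example below, CountCharsA, CountCharsW, and CountChars are hypothetical names invented for illustration; they are not Windows APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical pair of functions following the A/W naming convention. */
size_t CountCharsA(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

size_t CountCharsW(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* The generic name resolves to one of the two at compile time. */
#ifdef UNICODE
#define CountChars  CountCharsW
#else
#define CountChars  CountCharsA
#endif
```

Callers write CountChars(...) everywhere, and the preprocessor picks the W or A function depending on whether UNICODE is defined for the build.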

By default, a project created in VS has both the _UNICODE and UNICODE macros defined, so a wizard-generated project is a Unicode build. Removing these two definitions in the project configuration compiles the ANSI version of the program.

With the function names settled, how do we declare character and string variables so that the _UNICODE and UNICODE definitions also control their types and constants? Microsoft solves this with another set of definitions. TCHAR is the generic character type, used for both characters and strings: if UNICODE is defined, TCHAR is wchar_t; otherwise it is char. Its definition can be found in winnt.h.

For string constants, VS provides several macros, including TEXT and __TEXT (in winnt.h) and _T (in tchar.h). When UNICODE is defined (or _UNICODE, for the tchar.h macros), the literal becomes a Unicode literal; otherwise it stays ANSI. From here on, we should therefore use these macros for string variables, constants, and processing functions. The following is a recommended simple example:

#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>

int _tmain(void)
{
    TCHAR c = TEXT('A');
    TCHAR buf[16];
    const TCHAR *str = TEXT("Hello, world!");

    _tprintf(TEXT("1 --> %c\n"), c);
    _tprintf(TEXT("2 --> %s\n"), str);
    _tscanf(_T("%15s"), buf);   // the width keeps the input within buf[16]
    _tprintf(_T("%s\n"), buf);

    _tsystem(TEXT("pause"));
    return 0;
}

In this example, every function that may take a string literal is called through its generic name, so the program compiles correctly as both a Unicode version and an ANSI version.
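The mechanism behind TCHAR and TEXT can be imitated in a few lines of portable C. This is a simplified sketch, not the contents of the Microsoft headers: XCHAR, XTEXT, xcslen, and count_until are hypothetical names, and the real tchar.h uses an extra macro level so that macro arguments are expanded before the L prefix is pasted on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Simplified imitation of the generic-text idea (hypothetical names). */
#ifdef _UNICODE
typedef wchar_t XCHAR;
#define XTEXT(s)  L##s      /* paste the L prefix onto the literal */
#define xcslen    wcslen
#else
typedef char XCHAR;
#define XTEXT(s)  s
#define xcslen    strlen
#endif

/* Written once; compiles as either the narrow or the wide version.
   Returns the number of characters before stop, or the full length
   if stop does not occur. */
size_t count_until(const XCHAR *s, XCHAR stop)
{
    size_t i = 0;
    assert(s != NULL);
    while (s[i] != XTEXT('\0') && s[i] != stop)
        i++;
    return i;
}
```

The source of count_until never mentions char or wchar_t directly, which is the property that lets one code base produce both builds.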

    • Unicode and ANSI string conversions

Sometimes we still need to convert between encodings, which the APIs provided by Windows can do.

The MultiByteToWideChar and WideCharToMultiByte functions convert between ANSI and Unicode strings in either direction. Their parameters are very similar, and their prototypes are:

int MultiByteToWideChar(UINT CodePage, DWORD dwFlags, LPCSTR lpMultiByteStr, int cbMultiByte, LPWSTR lpWideCharStr, int cchWideChar);

int WideCharToMultiByte(UINT CodePage, DWORD dwFlags, LPCWSTR lpWideCharStr, int cchWideChar, LPSTR lpMultiByteStr, int cbMultiByte, LPCSTR lpDefaultChar, LPBOOL lpUsedDefaultChar);

See MSDN for detailed usage; there are also many explanations and examples on the web, so they are not described here.

Here is an example to demonstrate the conversion between ANSI and Unicode:

#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>

int _tmain(void)
{
    int nWch;
    char ansiStr[] = "Hello, world!";
    wchar_t wszBuf[20] = {0};

    // Get the number of Unicode characters the conversion produces; it is then
    // passed to the actual conversion as the size of the result buffer
    nWch = MultiByteToWideChar(CP_ACP, 0, ansiStr, -1, NULL, 0);
    // Convert and receive the result
    MultiByteToWideChar(CP_ACP, 0, ansiStr, -1, wszBuf, nWch);
    wprintf(L"nWch = %d, %s\n", nWch, wszBuf);

    int nCh;
    char ansiBuf[20] = {0};

    // Get the number of ANSI characters the conversion produces; it is then
    // passed to the actual conversion as the size of the result buffer
    nCh = WideCharToMultiByte(CP_ACP, 0, wszBuf, -1, NULL, 0, NULL, NULL);
    // Convert and receive the result
    WideCharToMultiByte(CP_ACP, 0, wszBuf, -1, ansiBuf, nCh, NULL, NULL);
    printf("nCh = %d, %s\n", nCh, ansiBuf);

    _tsystem(TEXT("pause"));
    return 0;
}

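Note that besides performing the conversion, each function can also report how many characters the converted output needs: call it with a null output buffer and a size of 0. Because -1 is passed as the input length, the returned count includes the terminating null, so for the 13-character string above both nWch and nCh are 14.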
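The same two-call pattern (first ask for the size, then convert) also exists in the C run-time library. The sketch below swaps in the standard mbstowcs function in place of MultiByteToWideChar, so it follows the current C locale instead of the CP_ACP code page; the helper name to_wide_alloc is hypothetical. It also relies on mbstowcs accepting a NULL destination to report the required length, which POSIX and Microsoft's run-time library document.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string (interpreted in the
   current C locale) to a newly allocated wide string.
   Returns NULL on failure; the caller frees the result. */
wchar_t *to_wide_alloc(const char *src)
{
    size_t needed;
    wchar_t *dst;

    assert(src != NULL);

    /* Pass 1: a NULL destination asks only for the length, which here
       excludes the terminator. */
    needed = mbstowcs(NULL, src, 0);
    if (needed == (size_t)-1)
        return NULL;                 /* invalid multibyte sequence */

    dst = malloc((needed + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Pass 2: perform the conversion into the right-sized buffer. */
    if (mbstowcs(dst, src, needed + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

One difference is worth noting: MultiByteToWideChar called with an input length of -1 counts the terminating null in its result, while mbstowcs does not, which is why the allocation adds 1.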

That is all for this part. The next part continues our tour of the Windows Programming series, so stay tuned!

When reproducing this article, please credit the source. Thank you for your cooperation!