Summary of Windows Core Programming, Chapter 2: Character and String Processing (2018.5.27)

Source: Internet
Author: User

Learning Goals

The second chapter covers character and string processing. To make the chapter easier to follow, I have added some background material: byte order (big-endian and little-endian storage) and character encoding schemes. The learning goals for this chapter are:
1. Big-endian and little-endian storage
2. Character encoding schemes
3. ANSI and Unicode characters and strings; Windows custom data types (for ANSI and Unicode compatibility)
4. ANSI and Unicode functions in Windows
5. ANSI and Unicode functions in the C runtime library
6. Secure string functions in the C runtime library
7. Secure string functions in the C runtime library (advanced)
8. String comparison functions
9. Conversion functions between wide characters and multibyte (ANSI) characters

Essential knowledge: big-endian and little-endian storage

How should we understand big-endian and little-endian storage?
Take an example:
    int a = 1;
Written in hexadecimal, the value is 0x00 00 00 01.
How is it stored in memory?
On an Intel x86 CPU, the bytes in memory, from the lowest address upward, are 0x01 0x00 0x00 0x00. This is little-endian: the least significant byte is stored at the lowest address.
On a big-endian CPU (for example the Motorola 68000 family or SPARC), the bytes in memory are 0x00 0x00 0x00 0x01: the most significant byte is stored at the lowest address. AMD's x86-compatible CPUs are little-endian, like Intel's.
Most CPUs in common use today run little-endian.
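
The byte layouts described above can be reproduced in a few lines of portable C. This is a minimal sketch, not code from the book; the function names `is_little_endian`, `store_le32`, and `store_be32` are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if this machine stores the least significant byte first. */
int is_little_endian(void)
{
    uint32_t value = 1;
    unsigned char bytes[sizeof value];
    memcpy(bytes, &value, sizeof value);   /* look at the in-memory layout */
    return bytes[0] == 0x01;
}

/* Writes value into out[0..3] in little-endian order on any host. */
void store_le32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)(value & 0xFF);          /* lowest byte first */
    out[1] = (unsigned char)((value >> 8) & 0xFF);
    out[2] = (unsigned char)((value >> 16) & 0xFF);
    out[3] = (unsigned char)((value >> 24) & 0xFF);
}

/* Writes value into out[0..3] in big-endian order on any host. */
void store_be32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)((value >> 24) & 0xFF);  /* highest byte first */
    out[1] = (unsigned char)((value >> 16) & 0xFF);
    out[2] = (unsigned char)((value >> 8) & 0xFF);
    out[3] = (unsigned char)(value & 0xFF);
}
```

For the value 1, `store_le32` produces 01 00 00 00 and `store_be32` produces 00 00 00 01, matching the two layouts in the text; `is_little_endian` reports which of the two the current machine uses natively.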

Character encoding

Evolution: ASCII -> extended ASCII -> GB2312 -> GBK -> GB18030

The United States was the first to use computers. A byte has eight bits and can represent 256 different states. Control characters, spaces, punctuation, digits, and uppercase and lowercase letters were assigned to consecutive byte values, from 0 up to 127, so that computers could store English text in bytes. This scheme is called ASCII.

Later, computers spread around the world, and 128 codes could not represent the characters of other countries. ASCII was therefore extended: the values after 127 were assigned to additional characters and symbols, up to 255. The characters from 128 to 255 are called the extended character set.

When computers came into use in China, ASCII alone was not enough, so China devised its own solution. The rule is: a byte of 127 or less means the same as in ASCII, but two bytes greater than 127 placed together represent one Chinese character. In other words, an English letter still takes one byte and a Chinese character takes two bytes; when both bytes have their highest bit set (both are greater than 127), the pair is recognized as one Chinese character. This leaves room for about 7,000 Simplified Chinese characters. Mathematical symbols, and even the digits, punctuation, and letters that ASCII already contained, were encoded again as two-byte characters. These are the "full-width" characters, while the original ones below 128 are called "half-width" characters. This Chinese character encoding scheme is called "GB2312".

Chinese has a very large number of characters, and GB2312 did not cover them all. So the rule was relaxed: as long as the first (high) byte is greater than 127, it marks the start of a Chinese character. This scheme is the GBK standard. When decoding GBK, if the highest bit of the first byte is 0, the byte is decoded as ASCII; if the highest bit is 1, the character is decoded with the GBK table.

GBK was followed by the GB18030 standard, which adds several thousand more characters and code points. GB18030 mixes 2-byte and 4-byte codes, which complicates software, so for years after its release it was not widely used. These Chinese encodings are collectively called "DBCS", that is, double-byte character sets.

This history created a serious problem: every country, like China, made its own encoding standard, and computers could not read each other's text because nobody supported anyone else's encoding. Eventually it was decided to drop the regional encoding schemes and design a single scheme covering all the cultures, scripts, and symbols on earth. This encoding scheme is called "Unicode".
Unicode can be encoded in the following ways:
UTF-8: Variable length. ASCII characters take 1 byte; other characters take 2, 3, or 4 bytes.
UTF-16: Most characters take 2 bytes; the rest take 4 bytes. The default Unicode encoding on the Windows platform is little-endian UTF-16.
UTF-32: Every character takes 4 bytes.
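
To make the variable lengths concrete, here is a minimal sketch of a UTF-8 encoder together with a helper that counts UTF-16 code units. The names `utf8_encode` and `utf16_units` are hypothetical and chosen for this illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is not a
   valid code point (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Number of 16-bit code units a valid code point needs in UTF-16:
   1 inside the Basic Multilingual Plane, 2 (a surrogate pair) above it. */
size_t utf16_units(uint32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;
}
```

For example, the letter A (U+0041) encodes to the single byte 41, while the Chinese character U+4E2D encodes to the three bytes E4 B8 AD in UTF-8 and to one 16-bit unit in UTF-16.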

ANSI and Unicode characters, strings, Windows custom data types

An ANSI character is represented in C by the char data type, which holds one 8-bit character. An ANSI string is an array of char, holding a string of multiple bytes. For example:

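    // The character constant 'a' takes 1 byte in the constant storage area; a takes 1 byte on the stack.
    char a = 'a';
    // The string constant "abcdefg" takes 8 bytes in the constant storage area; szBuffer takes 10 bytes on the stack.
    char szBuffer[10] = "abcdefg";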

A Unicode character is represented by wchar_t, a two-byte wide character. Older C header files contained the definition typedef unsigned short wchar_t, so wchar_t was really just an unsigned short integer. Later compilers made wchar_t a built-in data type, like int, so in newer compilers you will no longer find the statement typedef unsigned short wchar_t. To write a character constant or string constant in its Unicode form, put an L in front of it. For example:

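    // The character constant L'a' takes 2 bytes in the constant storage area; c takes 2 bytes on the stack.
    wchar_t c = L'a';
    // The string constant L"abcdefg" takes 16 bytes in the constant storage area; szBuffer takes 20 bytes on the stack.
    wchar_t szBuffer[10] = L"abcdefg";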
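
The byte counts quoted above assume a 2-byte wchar_t, which is what Windows compilers use; many other platforms use 4 bytes. The sketch below lets the compiler report the sizes instead of hard-coding them. The function names are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes in the narrow literal: 7 characters plus the terminating '\0'. */
size_t narrow_literal_bytes(void)
{
    return sizeof("abcdefg");
}

/* Elements in the wide literal: also 7 characters plus the terminator. */
size_t wide_literal_units(void)
{
    return sizeof(L"abcdefg") / sizeof(wchar_t);
}

/* Bytes in the wide literal: 8 * sizeof(wchar_t), i.e. 16 when wchar_t is 2 bytes. */
size_t wide_literal_bytes(void)
{
    return sizeof(L"abcdefg");
}
```

Both literals hold 8 elements; only the element size differs, which is why the wide string takes twice as many bytes as the narrow one on Windows.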

To distinguish itself slightly from plain C, and to support both ANSI and Unicode characters and strings, Windows defines some data types of its own: the TCHAR data type and the TEXT macro.
The header file defines the TCHAR data type and the TEXT macro as follows:

    #ifdef  UNICODE                     // r_winnt

    #ifndef _TCHAR_DEFINED
    typedef WCHAR TCHAR, *PTCHAR;
    typedef WCHAR TBYTE, *PTBYTE;
    #define _TCHAR_DEFINED
    #endif /* !_TCHAR_DEFINED */

    typedef LPWCH LPTCH, PTCH;
    typedef LPCWCH LPCTCH, PCTCH;
    typedef LPWSTR PTSTR, LPTSTR;
    typedef LPCWSTR PCTSTR, LPCTSTR;
    typedef LPUWSTR PUTSTR, LPUTSTR;
    typedef LPCUWSTR PCUTSTR, LPCUTSTR;
    typedef LPWSTR LP;
    typedef PZZWSTR PZZTSTR;
    typedef PCZZWSTR PCZZTSTR;
    typedef PUZZWSTR PUZZTSTR;
    typedef PCUZZWSTR PCUZZTSTR;
    typedef PZPWSTR PZPTSTR;
    typedef PNZWCH PNZTCH;
    typedef PCNZWCH PCNZTCH;
    typedef PUNZWCH PUNZTCH;
    typedef PCUNZWCH PCUNZTCH;
    #define __TEXT(quote) L##quote      // r_winnt

    #else   /* UNICODE */               // r_winnt

    #ifndef _TCHAR_DEFINED
    typedef char TCHAR, *PTCHAR;
    typedef unsigned char TBYTE, *PTBYTE;
    #define _TCHAR_DEFINED
    #endif /* !_TCHAR_DEFINED */

    typedef LPCH LPTCH, PTCH;
    typedef LPCCH LPCTCH, PCTCH;
    typedef LPSTR PTSTR, LPTSTR, PUTSTR, LPUTSTR;
    typedef LPCSTR PCTSTR, LPCTSTR, PCUTSTR, LPCUTSTR;
    typedef PZZSTR PZZTSTR, PUZZTSTR;
    typedef PCZZSTR PCZZTSTR, PCUZZTSTR;
    typedef PZPSTR PZPTSTR;
    typedef PNZCH PNZTCH, PUNZTCH;
    typedef PCNZCH PCNZTCH, PCUNZTCH;
    #define __TEXT(quote) quote         // r_winnt

    #endif /* UNICODE */                // r_winnt
    #define TEXT(quote) __TEXT(quote)   // r_winnt

From this header we can see that TCHAR has two possible definitions. If UNICODE is defined, TCHAR is WCHAR (which is wchar_t, a wide character); if UNICODE is not defined (multibyte character set), TCHAR is char (a narrow character). A new Visual Studio project uses the Unicode character set by default, and that option is equivalent to writing #define UNICODE in our code, so the TCHAR we write is really wchar_t. If we change the character set option to the multibyte character set, UNICODE is not defined and the TCHAR we write is char. The TEXT macro works the same way: with the Unicode character set it expands to L##quote, which puts an L in front of quote (quote can be a character or a string); with the multibyte character set it expands to quote with nothing added. Here is an example:

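    // Unicode character set
    // TEXT('a') is equivalent to L'a', which takes 2 bytes in the constant storage area; c takes 2 bytes on the stack.
    TCHAR c = TEXT('a');
    // TEXT("abcdefg") is equivalent to L"abcdefg", which takes 16 bytes in the constant storage area; szBuffer takes 20 bytes on the stack.
    TCHAR szBuffer[10] = TEXT("abcdefg");

    // Multibyte character set
    // TEXT('a') is equivalent to 'a', which takes 1 byte in the constant storage area; c takes 1 byte on the stack.
    TCHAR c = TEXT('a');
    // TEXT("abcdefg") is equivalent to "abcdefg", which takes 8 bytes in the constant storage area; szBuffer takes 10 bytes on the stack.
    TCHAR szBuffer[10] = TEXT("abcdefg");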

Isn't Windows clever? With TCHAR and the TEXT macro, the same source code supports both ANSI and Unicode and automatically uses the encoding selected for the build.
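
The mechanism behind TCHAR and TEXT can be imitated in portable C. The sketch below is not the Windows header; `MY_UNICODE`, `MY_TCHAR`, `MY_TEXT`, and the two helper functions are hypothetical names used only to show the technique of switching a type and a literal prefix with one macro.

```c
#include <stddef.h>
#include <wchar.h>

#define MY_UNICODE   /* hypothetical switch; remove this line for the narrow build */

#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT_IMPL(quote) L##quote   /* paste an L prefix onto the literal */
#else
typedef char MY_TCHAR;
#define MY_TEXT_IMPL(quote) quote      /* leave the literal unchanged */
#endif

/* Two macro levels, so an argument that is itself a macro is expanded
   before the L prefix is pasted onto it. */
#define MY_TEXT(quote) MY_TEXT_IMPL(quote)

/* Size of one character in the selected build. */
size_t my_tchar_size(void)
{
    return sizeof(MY_TCHAR);
}

/* Elements in the literal, terminator included: 8 in either build. */
size_t my_sample_units(void)
{
    MY_TCHAR buffer[10] = MY_TEXT("abcdefg");
    (void)buffer;   /* the array exists only to show the declaration compiles */
    return sizeof(MY_TEXT("abcdefg")) / sizeof(MY_TCHAR);
}
```

The calling code never mentions char or wchar_t directly, which is the design point: one definition at the top of the build decides the character type everywhere.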

ANSI functions and Unicode functions for Windows
Windows provides both a Unicode version and an ANSI version of functions that take strings, for example the CreateWindowEx function.
WinUser.h contains the following definitions:

    #ifdef UNICODE
    #define CreateWindowEx  CreateWindowExW
    #else
    #define CreateWindowEx  CreateWindowExA
    #endif // !UNICODE

From this header we can see that CreateWindowExW is the version that takes Unicode strings and CreateWindowExA is the version that takes ANSI strings. Because Windows functions have to handle both kinds of string, the two versions sit behind the single name CreateWindowEx, and the macro selects the right one for the build. Internally, CreateWindowExA is only a translation layer: it allocates memory, converts the ANSI strings to Unicode strings, and calls CreateWindowExW with the converted strings. When CreateWindowExW returns, CreateWindowExA frees its memory buffer and returns the window handle. In short, calling CreateWindowExA converts the ANSI strings to Unicode, calls CreateWindowExW, frees the memory, and returns the window handle that CreateWindowExW returned.
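
The translation-layer pattern described above can be sketched in portable C. This is an illustration of the pattern only, not how Windows implements CreateWindowExA. The functions `CountCharsW` and `CountCharsA` and the macros `MY_UNICODE` and `CountChars` are hypothetical, and the conversion uses the standard mbstowcs function rather than a Windows API.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "W" function: the real work is done on wide strings. */
size_t CountCharsW(const wchar_t *text)
{
    return wcslen(text);
}

/* Hypothetical "A" function: convert, forward to the W version, free.
   Returns (size_t)-1 if allocation or conversion fails. */
size_t CountCharsA(const char *text)
{
    /* Every multibyte character takes at least 1 byte, so strlen + 1
       wide characters is always enough room. */
    size_t capacity = strlen(text) + 1;
    wchar_t *wide = malloc(capacity * sizeof *wide);
    size_t result = (size_t)-1;

    if (wide != NULL && mbstowcs(wide, text, capacity) != (size_t)-1) {
        wide[capacity - 1] = L'\0';      /* defensive: guarantee termination */
        result = CountCharsW(wide);      /* forward to the wide version */
    }
    free(wide);                          /* release the temporary buffer */
    return result;
}

/* One name for callers; the build setting picks the version. */
#ifdef MY_UNICODE
#define CountChars CountCharsW
#else
#define CountChars CountCharsA
#endif
```

The extra allocation and conversion in the A version is the cost the text refers to: every ANSI call pays for a temporary wide copy before the real function runs.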

ANSI and Unicode functions of the C runtime library

The C runtime library provides string functions for both ANSI and Unicode characters. For example, strlen works on ANSI strings and wcslen works on Unicode strings.

    // The character set is the Unicode character set
    char szBuffer1[5] = "abcd";
    printf("%d\n", (int)strlen(szBuffer1));   // strlen returns size_t, so cast for %d
    TCHAR szBuffer2[5] = TEXT("abcd");
    printf("%d\n", (int)wcslen(szBuffer2));

To support both ANSI and Unicode in the same source, the C runtime library provides the _tcslen function. It requires the header file tchar.h, and _UNICODE must be defined to select the wide version.
The tchar.h header file defines the following macros:

    #ifdef _UNICODE
    #define _tcslen wcslen
    #else
    #define _tcslen strlen
    #endif

If you include tchar.h and the project's character set is the Unicode character set, _UNICODE is already defined and you can use _tcslen directly. The reason is that the Visual Studio Unicode character set option defines both UNICODE (used by the Windows headers) and _UNICODE (used by the C runtime headers) for the compiler. Here is an example:

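    // The character set is already set to the Unicode character set
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>    // printf
    #include <stdlib.h>   // system

    int main()
    {
        TCHAR szBuffer3[5] = TEXT("abcd");
        printf("%d\n", (int)_tcslen(szBuffer3));
        system("pause");
        return 0;
    }
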
Secure string functions of the C runtime library

When programming, prefer the secure string functions. For example, strcpy is an unsafe function: when you use it, the compiler issues a warning and suggests a replacement, and you should follow that suggestion. The compiler recommends the strcpy_s function; look it up and you will also find its tchar.h version (_tcscpy_s). The functions are not hard to use: whichever string function you need, look up the matching secure version in MSDN. Functions such as strlen, wcslen, and _tcslen are not affected and can be used with confidence, because they do not modify the string passed in.

Secure string functions (advanced)

In addition to the C runtime functions above, the header strsafe.h provides newer functions that give more control over string processing, for example StringCchLength and StringCchPrintf. See MSDN for the full list.
The following describes the StringCchPrintf function.
StringCchPrintf writes a formatted string to the specified buffer. Unlike wsprintf, it also takes the size of the destination buffer, so it never writes out of bounds. (With wsprintf, if the buffer is too small for the formatted string, the function writes past the end of the buffer, which can crash the program.) Because StringCchPrintf knows the destination size, when the buffer is too small it truncates the output to what fits within parameter 2 (the buffer size, terminator included) instead of overflowing. Header file: strsafe.h.

Function prototype:
HRESULT StringCchPrintf(
  __out  LPTSTR  pszDest,
  __in   size_t  cchDest,
  __in   LPCTSTR pszFormat,
  __in   ...
);

Parameter 1: The destination buffer to write to
Parameter 2: The size of the destination buffer, in characters
Parameter 3: The format string
Parameter 4: The variable arguments

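    TCHAR szBuffer[10];
    // When the destination buffer is too small for the source, wsprintf overflows it and the program may crash.
    wsprintf(szBuffer, TEXT("%s"), TEXT("woainiaifbgfbfgbfgbgf"));
    // The safe function takes the buffer size as an extra parameter; output that does not fit is truncated automatically, which avoids the overflow.
    StringCchPrintf(szBuffer, 10, TEXT("%s"), TEXT("wwoainiaifbgfbfgbfgbgf"));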
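
The truncating behavior can be imitated with the standard vsnprintf function. The sketch below is a portable analogue, not the strsafe.h implementation; `format_bounded` is a hypothetical helper, and it returns 0 or -1 rather than an HRESULT.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats into dest, which has room for dest_count
   characters including the terminator.
   Returns 0 if the whole output fit, -1 if it was truncated or an error
   occurred. dest is always NUL-terminated when dest_count > 0. */
int format_bounded(char *dest, size_t dest_count, const char *format, ...)
{
    va_list args;
    int needed;

    if (dest == NULL || dest_count == 0)
        return -1;

    va_start(args, format);
    needed = vsnprintf(dest, dest_count, format, args);  /* never writes past dest_count */
    va_end(args);

    if (needed < 0) {            /* encoding or format error */
        dest[0] = '\0';
        return -1;
    }
    /* needed is the length the full output would have had. */
    return (size_t)needed < dest_count ? 0 : -1;
}
```

With a 10-character buffer and the 21-character string from the example above, the buffer ends up holding the first 9 characters plus the terminator, and the return value tells the caller that truncation happened.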

The following describes the StringCchLength function.
StringCchLength determines whether a string exceeds a specified length. Unlike lstrlen, it takes the maximum number of characters allowed in the string being checked. Note that if the string being checked (a double-quoted string) is longer than the maximum allowed number of characters, the function fails and sets parameter 3 to 0. If the string being checked (an array of single-quoted characters) has no string terminator within the first cchMax characters, the function likewise fails and sets parameter 3 to 0.

Function prototype:
HRESULT StringCchLength(
  __in   LPCTSTR psz,
  __in   size_t  cchMax,
  __out  size_t  *pcch
);
Parameter 1: Pointer to the string to be checked
Parameter 2: The maximum number of characters allowed in the psz parameter
Parameter 3: Receives the number of characters in the string, not including the terminating '\0'

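    size_t iTarget1, iTarget2, iTarget3;

    // The string is longer than the maximum of 5 characters: the call fails and sets iTarget1 to 0. No crash.
    TCHAR szBuffer1[10] = TEXT("但是我依然很开心呀");
    StringCchLength(szBuffer1, 5, &iTarget1);

    // The array has no string terminator: the call fails and sets iTarget2 to 0. No crash.
    // cchMax is kept at the array size (3) so the function never reads past the end of the array.
    TCHAR szBuffer2[3] = { TEXT('a'), TEXT('b'), TEXT('c') };
    StringCchLength(szBuffer2, 3, &iTarget2);

    // 9 characters plus the terminator fit within cchMax = 10: the call succeeds.
    TCHAR szBuffer3[10] = TEXT("但是我依然很开心呀");
    StringCchLength(szBuffer3, 10, &iTarget3);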

In summary, the StringCch* functions are safe because they are told the buffer size: they truncate or fail instead of overflowing and crashing.
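
A bounded length check of this kind can be sketched in portable C with memchr. This is an analogue for narrow strings under my own naming, not the strsafe.h code; `bounded_length` is hypothetical and returns 0 or -1 rather than an HRESULT.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: looks for the terminator within the first
   max_count characters of text.
   On success returns 0 and stores the length (terminator excluded).
   On failure returns -1 and stores 0, as the text describes for
   parameter 3 of StringCchLength. */
int bounded_length(const char *text, size_t max_count, size_t *length)
{
    const char *end;

    *length = 0;
    if (text == NULL)
        return -1;

    end = memchr(text, '\0', max_count);   /* never examines more than max_count bytes */
    if (end == NULL)
        return -1;                          /* no terminator within the limit */

    *length = (size_t)(end - text);
    return 0;
}
```

The caller should pass a max_count no larger than the real buffer size; the limit is what keeps the search inside the buffer when the terminator is missing.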

String comparison functions
int CompareString(
  __in  LCID    Locale,
  __in  DWORD   dwCmpFlags,
  __in  LPCTSTR lpString1,
  __in  int     cchCount1,
  __in  LPCTSTR lpString2,
  __in  int     cchCount2
);
int CompareStringOrdinal(
  __in  LPCWSTR lpString1,
  __in  int     cchCount1,
  __in  LPCWSTR lpString2,
  __in  int     cchCount2,
  __in  BOOL    bIgnoreCase
);

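CompareStringOrdinal is language-independent and faster, so it is the recommended choice when the comparison does not need to follow the rules of a particular language. Because the details of these string functions can be looked up in MSDN when they are needed, I will come back and fill in this section later.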
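
What "ordinal" means can be shown with a small portable sketch: the strings are compared by the numeric value of each character, with no language rules involved. The function `compare_ordinal` and the `ORD_*` constants are hypothetical; the return values 1, 2, and 3 follow the same less/equal/greater convention as the Windows constants CSTR_LESS_THAN, CSTR_EQUAL, and CSTR_GREATER_THAN. The bIgnoreCase option is left out of this sketch.

```c
#include <stddef.h>
#include <wchar.h>

enum { ORD_LESS = 1, ORD_EQUAL = 2, ORD_GREATER = 3 };

/* Hypothetical helper: compares two counted wide strings purely by the
   numeric value of their characters. */
int compare_ordinal(const wchar_t *a, size_t a_count,
                    const wchar_t *b, size_t b_count)
{
    size_t common = a_count < b_count ? a_count : b_count;
    size_t i;

    for (i = 0; i < common; ++i) {
        unsigned long ca = (unsigned long)a[i];
        unsigned long cb = (unsigned long)b[i];
        if (ca != cb)
            return ca < cb ? ORD_LESS : ORD_GREATER;
    }
    /* All shared positions match: the shorter string sorts first. */
    if (a_count == b_count)
        return ORD_EQUAL;
    return a_count < b_count ? ORD_LESS : ORD_GREATER;
}
```

Under this rule an uppercase B sorts before a lowercase a, because 66 is less than 97. A linguistic comparison would typically order them the other way, which is why ordinal comparison suits identifiers and file names rather than text shown to users.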

Conversion functions between wide characters and multibyte (ANSI) characters
int MultiByteToWideChar(
  __in   UINT   CodePage,
  __in   DWORD  dwFlags,
  __in   LPCSTR lpMultiByteStr,
  __in   int    cbMultiByte,
  __out  LPWSTR lpWideCharStr,
  __in   int    cchWideChar
);
int WideCharToMultiByte(
  __in   UINT    CodePage,
  __in   DWORD   dwFlags,
  __in   LPCWSTR lpWideCharStr,
  __in   int     cchWideChar,
  __out  LPSTR   lpMultiByteStr,
  __in   int     cbMultiByte,
  __in   LPCSTR  lpDefaultChar,
  __out  LPBOOL  lpUsedDefaultChar
);

As with the comparison functions, the details can be looked up in MSDN when they are needed, so I will fill in this section later.
