ANSI and Unicode encoding

Source: Internet
Author: User

Brief description:
    • ANSI here refers to the family of legacy, locale-specific character encodings (code pages). To let the computer support more languages, these encodings typically use 2 bytes in the 0x80~0xFF range to represent 1 character.
    • Unicode (also called unified code, universal code, or single code) is an industry standard in computer science that covers a character set, encoding schemes, and more. It was created to address the limitations of traditional character encoding schemes: it assigns a uniform and unique binary code to each character in each language, to meet the requirements of cross-language, cross-platform text conversion and processing. Version 1.0 of the standard was published in 1991.

Advantages and Disadvantages
    • ANSI uses char to represent one character, which takes up one byte of storage. A single byte can hold only 256 distinct values, which is enough for English but not for Chinese, Japanese, Korean, and other languages with large character sets; those need multi-byte sequences.
    • Unicode, as used on Windows, represents a character with an unsigned short, defined as wchar_t, which occupies two bytes of storage. This is enough to cover the characters of most natural languages in common use. Disadvantage: storage doubles for text that would otherwise fit in one byte per character, so network transmissions are larger too.

ANSI Code = Narrow character

Unicode = Wide character

Character code data type:

ANSI: char, char *, const char * (C++)

CHAR, (PCHAR, PSTR, LPSTR), LPCSTR (VC++)

Unicode: wchar_t, wchar_t *, const wchar_t * (C++)

WCHAR, (PWCHAR, PWSTR, LPWSTR), LPCWSTR (VC++)

T generic types: TCHAR, (TCHAR *, PTCHAR, PTSTR, LPTSTR), LPCTSTR

In the names above: P means pointer; STR means string; L means long pointer, which can be ignored on the Win32 platform; C means const; W means wide (wide character); and T can be understood as the generic type.

The generic types are universal character types that Microsoft defined for convenience. Depending on whether the _UNICODE macro is defined in the build, they automatically resolve to char or wchar_t.
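The switching described above can be sketched in portable C++. This is only an illustration of the idea: MY_UNICODE, my_tchar, my_tcslen, and MY_T are hypothetical stand-ins invented for this example, not the real definitions from <tchar.h>.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical build switch; remove this line to get the narrow variant.
#define MY_UNICODE 1

#ifdef MY_UNICODE
typedef wchar_t my_tchar;            // generic character type -> wide
#define my_tcslen std::wcslen        // generic function name -> wide version
#define MY_T(x)   L##x               // generic literal -> wide literal
#else
typedef char my_tchar;               // generic character type -> narrow
#define my_tcslen std::strlen        // generic function name -> narrow version
#define MY_T(x)   x                  // generic literal -> unchanged
#endif

// Written once against the generic names; compiles in either build.
std::size_t greeting_length()
{
    const my_tchar *greeting = MY_T("Hello");
    return my_tcslen(greeting);      // length in characters, not bytes
}
```

The calling code never mentions char or wchar_t directly, so the same source builds as either an ANSI or a Unicode program.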

Defining string objects:

ANSI: const char *pAnsiStr = "Hello";

Unicode: const wchar_t *pUnicodeStr = L"Hello";

Generic type: const TCHAR *ptStr = _T("Hello"); or const TCHAR *ptStr = _TEXT("Hello");

Dynamically allocated memory: TCHAR *pszBuf = new TCHAR[100]; // spell the identifier exactly: TCHAR

_TEXT and _T are the same. They are defined as follows:

#define _T(x)       __T(x)
#define _TEXT(x)    __T(x)
// See the final definition of __T:
#ifdef  _UNICODE
#define __T(x)      L ## x    // convert to Unicode
#else
#define __T(x)      x         // equals itself
#endif

Here, ## is the token-pasting operator: it joins the two tokens on either side of it into one.
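As a small illustration of why the definition goes through two macro levels, here is a minimal sketch. WIDEN, WIDEN_IMPL, and APP_NAME are hypothetical names chosen for this example; they mirror the shape of _T and __T but are not the Microsoft macros.

```cpp
#include <cwchar>

#define WIDEN_IMPL(x) L##x            // pastes the L prefix onto the literal
#define WIDEN(x)      WIDEN_IMPL(x)   // extra level: x is macro-expanded first

#define APP_NAME "Demo"               // hypothetical macro holding a narrow literal

// WIDEN(APP_NAME) first expands APP_NAME to "Demo", then pastes L onto it,
// producing the wide literal L"Demo".
const wchar_t *app_name_wide()
{
    return WIDEN(APP_NAME);
}
```

Calling WIDEN_IMPL(APP_NAME) directly would instead paste L onto the name APP_NAME itself and produce the unknown identifier LAPP_NAME, which is why the outer wrapper exists.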

Commonly used string-handling functions (see MSDN for details):

String length:

ANSI: size_t strlen(const char *str); // Gets the string length. In these names, "str" and "cs" stand for C string and "len" for length; the w and _t prefixes indicate the string type, which makes the names easy to remember.

Unicode: size_t wcslen(const wchar_t *str);

Generic function: size_t _tcslen(const TCHAR *str);

String to integer:

ANSI: int atoi(const char *str); // Converts a string to an integer. The names atoi, _wtoi, and _tstoi all read as <string type> + "to" + <target type>, where a, w, and ts name the string type.

Unicode: int _wtoi(const wchar_t *str);

Generic function: int _tstoi(const TCHAR *str);
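The narrow and wide variants can be compared side by side with a short sketch. Because _wtoi is Microsoft-specific, the wide conversion below substitutes the standard std::wcstol; the wrapper names are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>

// Length in characters (not bytes), excluding the terminating null.
std::size_t narrow_length(const char *text)  { return std::strlen(text); }
std::size_t wide_length(const wchar_t *text) { return std::wcslen(text); }

// Narrow string to int with the standard atoi.
int parse_narrow(const char *text)
{
    return std::atoi(text);
}

// Wide string to int. std::wcstol is used here as a portable stand-in
// for the Microsoft-specific _wtoi.
int parse_wide(const wchar_t *text)
{
    return static_cast<int>(std::wcstol(text, nullptr, 10));
}
```

Both length functions report 5 for "Hello" and L"Hello", even though the wide string occupies more bytes.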

String copy:

ANSI: strcpy(char *strDestination, const char *strSource); // Copies the string. "cpy" is short for copy; the w and _t prefixes again indicate the string type.

Unicode: wcscpy(wchar_t *strDestination, const wchar_t *strSource);

Generic function: _tcscpy(TCHAR *strDestination, const TCHAR *strSource);

The functions above are unsafe, and the compilers in VS2005 and later emit a warning for them. The following are the secure versions (not supported in VC++ 6.0):

ANSI: strcpy_s(char *strDestination, size_t numberOfElements, const char *strSource); // _s can be read as short for "safe".

Unicode: wcscpy_s(wchar_t *strDestination, size_t numberOfElements, const wchar_t *strSource);

Generic function: _tcscpy_s(TCHAR *strDestination, size_t numberOfElements, const TCHAR *strSource);

numberOfElements: the size of the destination string buffer. This size is counted in characters (char units for strcpy_s, wchar_t units for wcscpy_s), not in bytes!
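To make the role of numberOfElements concrete, here is a minimal portable sketch of a bounded copy. bounded_copy is a hypothetical helper written for this example; it is a simplification and not the CRT implementation of strcpy_s (which, among other things, reports errors differently).

```cpp
#include <cstddef>

// Copies src into dest only if the whole string, including the terminating
// null, fits in numberOfElements characters. The size is an element count,
// so the same code works for char and wchar_t buffers.
template <typename CharT>
bool bounded_copy(CharT *dest, std::size_t numberOfElements, const CharT *src)
{
    if (!dest || numberOfElements == 0)
        return false;
    if (!src)
    {
        dest[0] = CharT();            // leave an empty string behind
        return false;
    }
    for (std::size_t i = 0; i < numberOfElements; ++i)
    {
        dest[i] = src[i];
        if (src[i] == CharT())        // copied the terminator: success
            return true;
    }
    dest[0] = CharT();                // did not fit: empty the destination
    return false;
}
```

For a fixed array, the element count can be written as sizeof(buffer) / sizeof(buffer[0]); passing sizeof(buffer) alone would overstate the capacity of a wchar_t buffer.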

size_t is an unsigned integer type. MSDN describes it as the "result of the sizeof operator". Why have a separate size_t type? Because the sizes of int, long, and similar types differ between platforms (32-bit versus 64-bit operating systems), size_t is defined differently on each platform. This is somewhat similar to the TCHAR type:

#ifndef _SIZE_T_DEFINED
#ifdef  _WIN64
typedef unsigned __int64    size_t;   // 8 bytes, 64-bit
#else
typedef _W64 unsigned int   size_t;   // 4 bytes, 32-bit
#endif
#define _SIZE_T_DEFINED
#endif
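The two properties mentioned above (size_t is unsigned, and it is the type that sizeof produces) can be checked at compile time. The sketch below uses only the standard library; element_count is a hypothetical helper added for illustration.

```cpp
#include <cstddef>
#include <type_traits>

// sizeof yields a value of type std::size_t on every platform.
static_assert(std::is_same<decltype(sizeof(int)), std::size_t>::value,
              "sizeof yields std::size_t");

// std::size_t is always an unsigned integer type.
static_assert(std::is_unsigned<std::size_t>::value,
              "std::size_t is unsigned");

// Number of elements in a fixed-size array, expressed as std::size_t.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N])
{
    return N;
}
```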

The number of bytes consumed by a string:

ANSI:

char szStr[] = "ABC";

Number of bytes (including the terminating null): sizeof(szStr);

const char *psz = "DEFGH";

Number of bytes (excluding the terminating null): strlen(psz) * sizeof(char);

Unicode:

wchar_t szwStr[] = L"abc";

Number of bytes (including the terminating null): sizeof(szwStr);

const wchar_t *pwsz = L"DEFGH";

Number of bytes (excluding the terminating null): wcslen(pwsz) * sizeof(wchar_t);

Generic types:

TCHAR szStr[] = _T("abc");

Number of bytes (including the terminating null): sizeof(szStr);

const TCHAR *psz = _T("DEFGH");

Number of bytes (excluding the terminating null): _tcslen(psz) * sizeof(TCHAR);
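The two ways of counting bytes can be sketched in portable C++ as follows. array_bytes and payload_bytes are hypothetical helper names for this example. Results are expressed in terms of sizeof(wchar_t) because that size is 2 on Windows but may differ on other platforms.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Bytes occupied by a character array, including its terminating null.
// This only works on real arrays; applying sizeof to a pointer would give
// the size of the pointer instead.
template <typename CharT, std::size_t N>
constexpr std::size_t array_bytes(const CharT (&)[N])
{
    return N * sizeof(CharT);
}

// Bytes occupied by the characters a pointer refers to, excluding the
// terminating null.
std::size_t payload_bytes(const char *text)
{
    return std::strlen(text) * sizeof(char);
}

std::size_t payload_bytes(const wchar_t *text)
{
    return std::wcslen(text) * sizeof(wchar_t);
}
```

For the array "ABC" the first helper reports 4 bytes (three letters plus the null), while for the pointer to "DEFGH" the second reports 5.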

The most fundamental API functions for conversion:

WideCharToMultiByte converts wide characters to narrow (multibyte) characters. // See MSDN for the function parameters.

MultiByteToWideChar converts narrow (multibyte) characters to wide characters.

In WideCharToMultiByte, the code page parameter identifies the code page of the newly converted string.

In MultiByteToWideChar, the code page parameter identifies the code page of the multibyte source string.

[1] The commonly used code pages are CP_ACP and CP_UTF8. Use the CP_ACP code page to convert between ANSI and Unicode.

Use the CP_UTF8 code page to convert between UTF-8 and Unicode.

[2] The dwFlags parameter allows additional control over the conversion, but in general it is not needed; simply pass 0.

[3] lpDefaultChar and pfUsedDefaultChar:

WideCharToMultiByte uses these two parameters only when it encounters a wide character that has no representation in the code page identified by the uCodePage parameter. If a wide character cannot be converted, the function uses the character pointed to by the lpDefaultChar argument. If that argument is NULL (which is the usual case), the function uses the system's default character. The default character is usually a question mark. This is dangerous for file names, because the question mark is a wildcard character. The pfUsedDefaultChar parameter points to a Boolean variable. If at least one character in the Unicode string cannot be converted to an equivalent multibyte character, the function sets the variable to TRUE. If all characters are converted successfully, the function sets the variable to FALSE. You can test the variable after the function returns to check whether the wide string was converted successfully.
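The default-character behaviour can be illustrated with a small portable sketch. It does not call WideCharToMultiByte; to_ascii_with_default is a hypothetical function that treats plain ASCII as the target "code page", purely to show how a default character and a "used default" flag interact.

```cpp
#include <string>

// Converts src to ASCII. Any character outside the ASCII range is replaced
// by defaultChar, and usedDefault records whether that ever happened.
std::string to_ascii_with_default(const std::wstring &src,
                                  char defaultChar,
                                  bool &usedDefault)
{
    usedDefault = false;
    std::string out;
    out.reserve(src.size());
    for (wchar_t ch : src)
    {
        if (static_cast<unsigned long>(ch) < 0x80)
        {
            out.push_back(static_cast<char>(ch));   // representable as-is
        }
        else
        {
            out.push_back(defaultChar);             // substitute the default
            usedDefault = true;
        }
    }
    return out;
}
```

A caller that is building a file name could check the flag and reject the result rather than silently accept a name containing question marks.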

Examples of using the two conversion functions:

char *CCTryWideCharToAnsi(wchar_t *pWideChar)
{
    if (!pWideChar) return NULL;
    char *pszBuf = NULL;
    // First call: ask how many bytes are needed (including the terminating null).
    int needBytes = WideCharToMultiByte(CP_ACP, 0, pWideChar, -1, NULL, 0, NULL, NULL);
    if (needBytes > 0)
    {
        pszBuf = new char[needBytes + 1];
        ZeroMemory(pszBuf, (needBytes + 1) * sizeof(char));
        // Second call: perform the conversion into the buffer.
        WideCharToMultiByte(CP_ACP, 0, pWideChar, -1, pszBuf, needBytes, NULL, NULL);
    }
    return pszBuf;
}

wchar_t *CCTryAnsiCharToWide(char *pChar)
{
    if (!pChar) return NULL;
    wchar_t *pszBuf = NULL;
    // First call: ask how many wide characters are needed.
    int needWChar = MultiByteToWideChar(CP_ACP, 0, pChar, -1, NULL, 0);
    if (needWChar > 0)
    {
        pszBuf = new wchar_t[needWChar + 1];
        // ZeroMemory takes a byte count, so multiply by sizeof(wchar_t).
        ZeroMemory(pszBuf, (needWChar + 1) * sizeof(wchar_t));
        MultiByteToWideChar(CP_ACP, 0, pChar, -1, pszBuf, needWChar);
    }
    return pszBuf;
}

Don't forget to free the returned buffer with delete[] after use.
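The same "ask for the size, then convert" pattern can be sketched with the C standard library for readers who want to try it outside Windows. This is an analogue, not the Win32 API: it uses std::wcstombs and std::mbstowcs, relies on the POSIX and MSVC behaviour that a null destination returns the required length, and converts according to the current C locale rather than a code page. wide_to_narrow and narrow_to_wide are hypothetical names. Returning std::string and std::wstring also removes the need to free the buffer by hand.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Wide -> narrow in the current C locale. Returns false on failure.
bool wide_to_narrow(const wchar_t *src, std::string &out)
{
    if (!src) return false;
    // Pass 1: a null destination asks for the required byte count
    // (excluding the terminating null). Assumes POSIX/MSVC behaviour.
    std::size_t need = std::wcstombs(nullptr, src, 0);
    if (need == static_cast<std::size_t>(-1)) return false;
    std::vector<char> buf(need + 1, '\0');
    // Pass 2: convert into the correctly sized buffer.
    std::wcstombs(buf.data(), src, buf.size());
    out.assign(buf.data(), need);
    return true;
}

// Narrow -> wide in the current C locale. Returns false on failure.
bool narrow_to_wide(const char *src, std::wstring &out)
{
    if (!src) return false;
    // Pass 1: required count of wide characters, excluding the null.
    std::size_t need = std::mbstowcs(nullptr, src, 0);
    if (need == static_cast<std::size_t>(-1)) return false;
    std::vector<wchar_t> buf(need + 1, L'\0');
    // Pass 2: the size argument here is a character count, not bytes.
    std::mbstowcs(buf.data(), src, buf.size());
    out.assign(buf.data(), need);
    return true;
}
```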

Conversion macros

Using the A2W, W2A, T2A, and T2W macros, and precautions:
[1] The macros allocate space with the alloca() function, so the address they return is on the stack and does not need to be released. This raises a scope issue: the memory stays allocated until the enclosing function returns. See MSDN for details.
Loosely speaking, this can be understood as "backward compatible".
[2] Do not use conversion macros such as A2W inside a loop body in a function; this may cause a stack overflow.
For example, the following pattern should be avoided:

#include <atlconv.h>

void func()
{
    while (true)
    {
        {
            USES_CONVERSION;
            // Each iteration takes more stack space, and none of it is
            // released until func() returns.
            TestFunc(A2W("abc"));
        }
    }
}
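One way to avoid the problem is to keep the converted string in heap-backed storage that is released at the end of every iteration. The sketch below is a portable illustration under stated assumptions: widen_ascii is a hypothetical ASCII-only conversion standing in for A2W, and this TestFunc is a dummy consumer that just counts calls.

```cpp
#include <string>

// Dummy consumer standing in for TestFunc in the example above.
int g_calls = 0;
void TestFunc(const wchar_t *text)
{
    if (text && text[0] == L'a')
        ++g_calls;
}

// ASCII-only widening into a std::wstring (heap-backed, freed automatically).
std::wstring widen_ascii(const char *text)
{
    std::wstring out;
    for (; text && *text; ++text)
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*text)));
    return out;
}

// The temporary string is destroyed at the end of each iteration, so the
// loop's memory use stays constant no matter how many times it runs.
int run_loop(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        std::wstring wide = widen_ascii("abc");
        TestFunc(wide.c_str());
    }
    return g_calls;
}
```

Another common arrangement is to move the macro call into a small helper function, so that the stack space is released each time the helper returns.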
