Differences between UNICODE and ANSI

Last Update:2018-12-07 Source: Internet

Author: User

What is ANSI, and what is UNICODE? They are two different ways of encoding characters. In Windows terminology, "ANSI" refers to the 8-bit code pages, and "UNICODE" refers to UTF-16, which uses 16-bit code units. An ANSI code page stores an English character in a single byte and, on double-byte code pages, a Chinese character in two bytes. In UTF-16, English characters and most Chinese characters each take two bytes (characters outside the Basic Multilingual Plane take four). UNICODE is an international standard, and its 16-bit encoding is not binary-compatible with ANSI code pages. It is widely used in networking, in Windows, and in many large software packages. A single-byte encoding can represent only 256 characters, which is more than enough for the 26 English letters but far too few for scripts with thousands of characters, such as Chinese characters and Korean Hangul. The UNICODE standard was introduced to solve this problem.
In software development, and especially in C string-processing functions, ANSI and UNICODE are handled separately. How do we define ANSI and UNICODE strings, how do we use them, and how do we convert between them?
I. Definition:
ANSI: char str[1024];
UNICODE: wchar_t str[1024];
II. Available functions:
ANSI (char): strcat(), strcpy(), strlen(), and other functions whose names start with str.
UNICODE (wchar_t): wcscat(), wcscpy(), wcslen(), and other functions whose names start with wcs.
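As a small illustration of the two definitions and function families above, the following sketch builds the same text once in a char buffer and once in a wchar_t buffer. The function names build_ansi and build_wide are made up for this example; only the str and wcs functions come from the C library.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Builds "Hello, world" in an ANSI (char) buffer and returns its length
   in characters. dst must have room for at least 13 chars. */
size_t build_ansi(char *dst)
{
    strcpy(dst, "Hello");
    strcat(dst, ", world");
    return strlen(dst);
}

/* Same steps with a wide (wchar_t) buffer and the wcs functions.
   dst must have room for at least 13 wide characters. */
size_t build_wide(wchar_t *dst)
{
    wcscpy(dst, L"Hello");
    wcscat(dst, L", world");
    return wcslen(dst);
}
```

Both functions report the same character count, but the wide buffer occupies sizeof(wchar_t) bytes per character rather than one.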
III. System support:
Windows 98: supports only ANSI.
Windows 2000: supports both ANSI and UNICODE.
Windows CE: supports only UNICODE.
Notes:
1. COM supports only UNICODE.
2. Windows 2000 is UNICODE-based throughout, so using ANSI on Windows 2000 comes at a cost. You do not have to write any conversion code yourself, but the conversion still happens behind the scenes and consumes CPU time and memory.
3. If UNICODE must be used on Windows 98, you must convert the encoding manually.
IV. How to differentiate:
In software development we often need to support both ANSI and UNICODE. It is impractical to change every string type and every string function by hand each time the character type changes. Therefore, the standard C runtime library and Windows provide a macro-based mechanism.
The C runtime library uses the _UNICODE macro (with an underscore), and the Windows headers use the UNICODE macro (without an underscore). If both _UNICODE and UNICODE are defined, the code is compiled as the UNICODE version. Otherwise, it is compiled as the ANSI version.
Defining the macros alone does not convert anything automatically. The code must also be written with a set of generic character definitions:
1. TCHAR
If the UNICODE macro is defined, TCHAR is defined as wchar_t.
typedef wchar_t TCHAR;
Otherwise, TCHAR is defined as char.
typedef char TCHAR;
2. LPTSTR
If the UNICODE macro is defined, LPTSTR is defined as LPWSTR (a pointer to a null-terminated wide-character string).
typedef LPWSTR LPTSTR;
Otherwise, LPTSTR is defined as LPSTR.
typedef LPSTR LPTSTR;
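The switching described above can be imitated in a few lines. This is only a sketch of the idea, not the actual contents of the Windows headers: my_tchar, my_lptstr, my_tcslen, and generic_length are hypothetical stand-ins so that the code compiles without windows.h.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR, LPTSTR, and a generic length function.
   Define UNICODE before this point to get the wide build. */
#ifdef UNICODE
typedef wchar_t my_tchar;
#define my_tcslen wcslen
#else
typedef char my_tchar;
#define my_tcslen strlen
#endif
typedef my_tchar *my_lptstr;   /* pointer to a generic string */

/* Written once against the generic names; compiles in either mode. */
size_t generic_length(const my_tchar *s)
{
    return my_tcslen(s);
}
```

Compiled without UNICODE, generic_length takes a char string; compiled with UNICODE, the same source takes a wchar_t string.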
Addendum:
UTF-8 is a byte-oriented encoding that is suitable for transmitting data as a stream, while Unicode is the character set that it encodes.
As I understand it, UTF-8 is one concrete encoding of Unicode. UTF-16 and others are similar encodings.
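To make the relationship concrete, here is a minimal sketch of how one Unicode code point is turned into UTF-8 bytes. The function name utf8_encode is made up for this example; the bit layout follows the published UTF-8 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8.
   Writes 1 to 4 bytes to out and returns the count, or 0 if cp is not
   a valid scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* three bytes, e.g. most Chinese characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The same code point U+4E2D takes one 16-bit unit in UTF-16 but three bytes in UTF-8, while U+0041 ('A') takes two bytes in UTF-16 and one in UTF-8.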

ANSI/Unicode characters and strings
TChar.h is a counterpart to String.h that is used to write generic ANSI/Unicode string code.

Each character in a Unicode string is a 16-bit value.

Win9x supports only ANSI; Windows 2000/XP/2003 support both ANSI and Unicode; Windows CE supports only Unicode.
Note: a few Unicode functions can also be called on Win9x, but unexpected errors may occur.

wchar_t is the data type of a Unicode character.

All Unicode string functions start with wcs, and all ANSI string functions start with str. ANSI C specifies that the C runtime library supports both ANSI and Unicode characters and strings.
ANSI                                      Unicode
char *strcat(char *, const char *)        wchar_t *wcscat(wchar_t *, const wchar_t *)
char *strchr(const char *, int)           wchar_t *wcschr(const wchar_t *, wchar_t)
int strcmp(const char *, const char *)    int wcscmp(const wchar_t *, const wchar_t *)
char *strcpy(char *, const char *)        wchar_t *wcscpy(wchar_t *, const wchar_t *)
size_t strlen(const char *)               size_t wcslen(const wchar_t *)

L"wash": the L prefix makes the compiler store the literal as a Unicode (wide-character) string.
_TEXT("wash"): expands to an ANSI or a Unicode literal depending on whether _UNICODE is defined (the Windows TEXT macro does the same based on UNICODE).
Note: _UNICODE is used by the C runtime library headers; UNICODE is used by the Windows header files.
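A macro of this kind can be sketched with the preprocessor's token-pasting operator. MY_TEXT and my_tchar below are hypothetical names chosen for illustration; the real macros are defined by the C runtime and Windows headers.

```c
#include <stddef.h>

/* Hypothetical MY_TEXT macro, modeled on the idea behind TEXT/_TEXT. */
#ifdef UNICODE
#define MY_TEXT_(s) L##s        /* paste the L prefix onto the literal */
typedef wchar_t my_tchar;
#else
#define MY_TEXT_(s) s           /* leave the literal as a char string */
typedef char my_tchar;
#endif
#define MY_TEXT(s) MY_TEXT_(s)  /* extra level so macro arguments expand first */

/* One source line, two possible literal types depending on the build. */
static const my_tchar greeting[] = MY_TEXT("wash");

/* The L prefix always produces a wide literal, whatever the build mode. */
static const wchar_t always_wide[] = L"wash";
```

In both build modes greeting holds five elements (four characters plus the terminator); only the element type changes.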

ANSI/Unicode generic data types
Generic (ANSI/Unicode)    ANSI      Unicode
LPCTSTR                   LPCSTR    LPCWSTR
LPTSTR                    LPSTR     LPWSTR
PCTSTR                    PCSTR     PCWSTR
PTSTR                     PSTR      PWSTR
TBYTE (TCHAR)             CHAR      WCHAR

When designing a DLL, it is best to export both an ANSI and a Unicode version of each function. The ANSI version should only allocate memory, convert the string to Unicode, and then call the Unicode version.
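The pattern might look like the following sketch. The function pair count_upper_w / count_upper_a is invented for illustration, and the conversion uses the standard mbstowcs function so that the example builds without the Windows headers. Calling mbstowcs with a NULL destination to obtain the required length is documented by POSIX and by the Microsoft C runtime.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* "Unicode" version: does the real work on wide characters. */
size_t count_upper_w(const wchar_t *s)
{
    size_t n = 0;
    for (; *s != L'\0'; ++s)
        if (iswupper((wint_t)*s))
            ++n;
    return n;
}

/* "ANSI" version: allocates a buffer, converts to wide characters,
   and forwards to the wide version. Returns (size_t)-1 on failure. */
size_t count_upper_a(const char *s)
{
    size_t len = mbstowcs(NULL, s, 0);   /* required length, terminator excluded */
    if (len == (size_t)-1)
        return (size_t)-1;

    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return (size_t)-1;

    mbstowcs(wide, s, len + 1);
    size_t result = count_upper_w(wide);
    free(wide);
    return result;
}
```

Keeping the logic in one place means a bug fix in the wide version automatically applies to the ANSI entry point as well.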

Prefer the operating system's string functions, and use the C runtime functions sparingly.
E.g., the operating system string functions (ShlWApi.h):
StrCat(), StrChr(), StrCmp(), StrCpy(), etc.
Note that these names are case-sensitive and that each function has an ANSI and a Unicode version.
Note: the ANSI version appends an uppercase A to the function name;
the Unicode version appends an uppercase W to the function name.

Making an application ANSI- and Unicode-ready
- Treat a text string as an array of characters, not as an array of chars or an array of bytes.
- Use generic data types (such as TCHAR and PTSTR) for text characters and strings.
- Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.
- Use the TEXT macro for literal characters and strings.
- Revise string arithmetic.
For example, sizeof(szBuffer) -> sizeof(szBuffer) / sizeof(TCHAR)
malloc(charNum) -> malloc(charNum * sizeof(TCHAR))
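The two arithmetic fixes above can be sketched with wchar_t standing in for TCHAR. COUNT_OF, alloc_wide, and copy_bounded are hypothetical helpers written for this example.

```c
#include <stdlib.h>
#include <wchar.h>

/* Number of elements in a fixed-size array (not valid for pointers). */
#define COUNT_OF(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Allocates room for char_count wide characters plus a terminator.
   The size passed to malloc is in bytes, so it is scaled by the
   character size. The caller frees the result. */
wchar_t *alloc_wide(size_t char_count)
{
    return malloc((char_count + 1) * sizeof(wchar_t));
}

/* Copies src into dst, truncating to fit. dst_count is the capacity in
   characters, not bytes. Returns the number of characters copied. */
size_t copy_bounded(wchar_t *dst, size_t dst_count, const wchar_t *src)
{
    if (dst_count == 0)
        return 0;
    size_t n = wcslen(src);
    if (n >= dst_count)
        n = dst_count - 1;
    wmemcpy(dst, src, n);
    dst[n] = L'\0';
    return n;
}
```

Passing sizeof(buffer) instead of COUNT_OF(buffer) as the capacity would overstate it by a factor of sizeof(wchar_t) and allow a buffer overrun.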

Windows functions for string operations (each also has an ANSI and a Unicode version):
lstrcat(), lstrcmp()/lstrcmpi() [these call CompareString() internally], lstrcpy(), lstrlen()
These are implemented as macros.

C runtime function    Windows function
tolower()             PTSTR CharLower(PTSTR pszString)
toupper()             PTSTR CharUpper(PTSTR pszString)
isalpha()             BOOL IsCharAlpha(TCHAR ch)
isalnum()             BOOL IsCharAlphaNumeric(TCHAR ch)
islower()             BOOL IsCharLower(TCHAR ch)
isupper()             BOOL IsCharUpper(TCHAR ch)
sprintf()             wsprintf()
Converting a buffer:
DWORD CharLowerBuff(PTSTR pszString, DWORD cchString)
DWORD CharUpperBuff(PTSTR pszString, DWORD cchString)
You can also convert a single character, for example: TCHAR cLowerCaseChar = (TCHAR)(DWORD_PTR) CharLower((PTSTR)(DWORD_PTR) szString[0]);

Determining whether text is ANSI or Unicode:
BOOL IsTextUnicode(
const VOID *pBuffer, // input buffer to be examined
int cb, // size of input buffer, in bytes
LPINT lpi // options
);
Note: on Win9x this function is not implemented and always returns FALSE.

Conversion between Unicode and ANSI
char szA[40];
wchar_t szW[40];
// Normal sprintf: all strings are ANSI
sprintf(szA, "%s", "ANSI str");
// Convert a Unicode string to ANSI (%S is Microsoft-specific)
sprintf(szA, "%S", L"Unicode str");
// Normal swprintf: all strings are Unicode (40 is the buffer size in characters)
swprintf(szW, 40, L"%s", L"Unicode str");
// Convert an ANSI string to Unicode (%S is Microsoft-specific)
swprintf(szW, 40, L"%S", "ANSI str");
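The %S conversion used above is specific to the Microsoft C runtime. As a hedged alternative, the sketch below uses the standard "%ls" conversion, which marks the argument as a wchar_t string in both the narrow and the wide printf families. The function names wide_to_narrow and wide_to_wide are made up for this example, and the conversion depends on the current C locale.

```c
#include <stdio.h>
#include <wchar.h>

/* Wide -> narrow: "%ls" tells snprintf that the argument is a wchar_t
   string, which is converted to multi-byte using the current locale.
   dst_size is in bytes. */
int wide_to_narrow(char *dst, size_t dst_size, const wchar_t *src)
{
    return snprintf(dst, dst_size, "%ls", src);
}

/* Wide -> wide: "%ls" in swprintf copies a wchar_t string.
   dst_count is the buffer size in characters. */
int wide_to_wide(wchar_t *dst, size_t dst_count, const wchar_t *src)
{
    return swprintf(dst, dst_count, L"%ls", src);
}
```

Both functions return the number of characters written, excluding the terminator, or a negative value on error.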

int MultiByteToWideChar(
UINT uCodePage, // code page (0 = CP_ACP)
DWORD dwFlags, // character-type options, usually 0
PCSTR pMultiByteStr, // source string address
int cchMultiByte, // source string length, in bytes
PWSTR pWideCharStr, // destination buffer address
int cchWideChar // destination buffer size, in characters
);
The uCodePage parameter identifies the code page associated with the multi-byte string. The dwFlags parameter provides additional control that affects characters with diacritical marks such as accents. These flags are usually not needed, so 0 is passed for dwFlags. The pMultiByteStr parameter specifies the string to convert, and the cchMultiByte parameter gives its length in bytes. If -1 is passed for cchMultiByte, the function determines the length of the source string itself. The converted Unicode string is written to the buffer whose address is given by the pWideCharStr parameter. The maximum size of that buffer, measured in characters, must be specified in the cchWideChar parameter. If you call MultiByteToWideChar and pass 0 for cchWideChar, the function performs no conversion and instead returns the buffer size, in characters, required for the conversion to succeed.

To convert a multi-byte string to its Unicode equivalent, follow these steps:
1) Call MultiByteToWideChar, passing NULL for the pWideCharStr parameter and 0 for the cchWideChar parameter.
2) Allocate a block of memory large enough to hold the converted Unicode string. The required size is the value returned by the call to MultiByteToWideChar in step 1.
3) Call MultiByteToWideChar again, this time passing the address of the buffer as the pWideCharStr parameter and the size returned by the first call as the cchWideChar parameter.
4) Use the converted string.
5) Free the memory block that holds the Unicode string.
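The steps above can be sketched as a single helper. to_wide is a hypothetical name; on Windows it calls MultiByteToWideChar with CP_ACP, and on other platforms it falls back to the standard mbstowcs function (an assumption made only so that the sketch builds everywhere), which follows the same measure-allocate-convert shape.

```c
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Converts a multi-byte string to a newly allocated wide string.
   Returns NULL on failure. The caller uses the result (step 4) and
   then frees it (step 5). */
wchar_t *to_wide(const char *src)
{
#ifdef _WIN32
    /* Step 1: ask for the required size in characters. Because the
       source length is -1, the count includes the terminator. */
    int count = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    if (count <= 0)
        return NULL;

    /* Step 2: allocate the buffer (characters scaled to bytes). */
    wchar_t *dst = malloc((size_t)count * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Step 3: convert for real. */
    if (MultiByteToWideChar(CP_ACP, 0, src, -1, dst, count) == 0) {
        free(dst);
        return NULL;
    }
    return dst;
#else
    /* Fallback with the same two-call shape, using mbstowcs. */
    size_t len = mbstowcs(NULL, src, 0);   /* terminator excluded */
    if (len == (size_t)-1)
        return NULL;

    wchar_t *dst = malloc((len + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    mbstowcs(dst, src, len + 1);
    return dst;
#endif
}
```

Measuring first avoids guessing a buffer size, at the cost of scanning the source string twice.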

int WideCharToMultiByte(
UINT CodePage, // code page
DWORD dwFlags, // performance and mapping flags
LPCWSTR lpWideCharStr, // wide-character string
int cchWideChar, // number of chars in string
LPSTR lpMultiByteStr, // buffer for new string
int cbMultiByte, // size of buffer, in bytes
LPCSTR lpDefaultChar, // default for unmappable chars
LPBOOL lpUsedDefaultChar // set when default char used
);
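The reverse direction follows the same measure-allocate-convert pattern. The helper name to_narrow is hypothetical, and the non-Windows branch uses the standard wcstombs function as a stand-in so that the sketch builds without the Windows headers.

```c
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Converts a wide string to a newly allocated multi-byte string.
   Returns NULL on failure; the caller frees the result. */
char *to_narrow(const wchar_t *src)
{
#ifdef _WIN32
    /* First call: required size in bytes, terminator included. */
    int bytes = WideCharToMultiByte(CP_ACP, 0, src, -1, NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return NULL;

    char *dst = malloc((size_t)bytes);
    if (dst == NULL)
        return NULL;

    /* Second call: perform the conversion into the buffer. */
    if (WideCharToMultiByte(CP_ACP, 0, src, -1, dst, bytes, NULL, NULL) == 0) {
        free(dst);
        return NULL;
    }
    return dst;
#else
    size_t len = wcstombs(NULL, src, 0);   /* bytes needed, terminator excluded */
    if (len == (size_t)-1)
        return NULL;

    char *dst = malloc(len + 1);
    if (dst == NULL)
        return NULL;

    wcstombs(dst, src, len + 1);
    return dst;
#endif
}
```

Here the buffer size is measured in bytes rather than characters, because one wide character may need more than one byte in the target code page.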

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
