Conversion between strings and numbers (Unicode)

Source: Internet
Author: User

1 Unicode-encoded strings converted to numeric types

CString str = _T("1234");
int i = _ttoi(str); double d = _tstof(str); // _tstof returns a double
2 Number converted to wchar_t

wchar_t c[10]; int num = 100; _itow_s(num, c, 10, 10); // buffer size 10, radix 10 (decimal)
std::wstring str(c);
3 wstring conversion to int

std::wstring str; int i = _wtoi(str.c_str());
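For comparison, the same three conversions can be sketched with only the standard C++ library, without the Microsoft-specific _ttoi and _itow_s. The helper names below (parseInt, parseDouble, formatInt) are illustrative and do not come from the original article.

```cpp
#include <string>

// Hypothetical helpers; the names are illustrative only.

// Parse a decimal integer from a wide string.
// std::stoi throws std::invalid_argument if no digits are found.
int parseInt(const std::wstring& text) {
    return std::stoi(text);
}

// Parse a floating-point value from a wide string.
double parseDouble(const std::wstring& text) {
    return std::stod(text);
}

// Format an integer as a decimal wide string.
std::wstring formatInt(int value) {
    return std::to_wstring(value);
}
```

Unlike _wtoi, which returns 0 on bad input, std::stoi reports failures by throwing, so callers that expect malformed text should catch the exception.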
So what exactly is Unicode, and how do you program with it?

Unicode environment settings
When you install Visual Studio, include the Unicode option when selecting the VC++ components, so that the relevant library files are copied to system32.


Unicode compilation settings:
C/C++, Preprocessor definitions: remove _MBCS, add _UNICODE and UNICODE
Set the entry point to wWinMainCRTStartup in Project Settings/Link/Output
The reverse settings give an MBCS (ANSI) build.


Unicode: wide character set


1. How do I get the number of characters in a string that contains both single-byte and double-byte characters?
You can call the _mbslen function in the Microsoft Visual C++ run-time library, which operates on multibyte strings (strings containing both single-byte and double-byte characters).
Calling strlen does not tell you how many characters are in such a string; it only tells you how many bytes come before the terminating 0.


2. How do I manipulate a DBCS (double-byte character set) string?
Function                              Description
PTSTR CharNext(LPCTSTR);              Returns the address of the next character in a string
PTSTR CharPrev(LPCTSTR, LPCTSTR);     Returns the address of the previous character in a string
BOOL IsDBCSLeadByte(BYTE);            Returns a nonzero value if the byte is the first byte of a DBCS character


3. Why use Unicode?
(1) It is easy to exchange data between different languages.
(2) It lets you distribute a single binary .exe or DLL file that supports all languages.
(3) It improves the runtime efficiency of the application.
Windows 2000 was developed from scratch using Unicode, and if you call any of the Windows functions and pass an ANSI string to it, the system first converts the string to Unicode, and then passes the Unicode string to the operating system. If you want the function to return an ANSI string, the system first converts the Unicode string to an ANSI string, and then returns the result to your application. The conversion of these strings takes up the time and memory of the system. By developing an application from scratch with Unicode, you can make your application run more efficiently.
Windows CE is itself a Unicode operating system and does not support the ANSI Windows functions at all.
Windows 98 supports only ANSI, so applications for it can only be developed for ANSI.
When Microsoft Corporation converted COM from 16-bit Windows to Win32, the company decided that all COM interface methods that needed strings would accept only Unicode strings.


4. How do I write Unicode source code?
Microsoft designed the Windows API for Unicode in a way that minimizes the impact on your code. In fact, you can write a single source file that compiles with or without Unicode. You only need to define two macros (UNICODE and _UNICODE) and recompile the source file.
The _UNICODE macro is used by the C run-time header files, while the UNICODE macro is used by the Windows header files. When you compile a source module, you typically have to define both macros.


5. What are the Unicode data types defined by Windows?
Data type    Description
WCHAR        A Unicode character
PWSTR        Pointer to a Unicode string
PCWSTR       Pointer to a constant Unicode string
The corresponding ANSI data types are CHAR, LPSTR, and LPCSTR.
The generic ANSI/Unicode data types are TCHAR, PTSTR, and LPCTSTR.


6. How do I work with Unicode?
Character set    Characteristic              Example
ANSI             Functions start with str    strcpy
Unicode          Functions start with wcs    wcscpy
MBCS             Functions start with _mbs   _mbscpy
ANSI/Unicode     Functions start with _tcs   _tcscpy (C run-time library)
ANSI/Unicode     Functions start with lstr   lstrcpy (Windows functions)
In Windows 2000, all functions, new and obsolete, have both an ANSI and a Unicode version. The name of the ANSI version ends with A, and the name of the Unicode version ends with W. The Windows headers define the generic name as follows:
#ifdef UNICODE
#define CreateWindowEx CreateWindowExW
#else
#define CreateWindowEx CreateWindowExA
#endif // !UNICODE


7. How do I represent Unicode string constants?
Character set    Example
ANSI             "string"
Unicode          L"string"
ANSI/Unicode     TEXT("string") or _TEXT("string"), for example: if (szError[0] == _TEXT('J')) {}
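The mechanism behind the generic names in items 5 to 7 can be sketched without the Windows headers. Everything below is hypothetical: MY_UNICODE, my_tchar, MY_TEXT, and the Describe functions merely imitate the way UNICODE, TCHAR, TEXT, and the A/W function pairs are wired together.

```cpp
#include <string>

// Hypothetical API with an ANSI ("A") and a Unicode ("W") version.
std::string DescribeA(const char*) { return "ANSI version"; }
std::string DescribeW(const wchar_t*) { return "Unicode version"; }

#define MY_UNICODE  // remove this line to build the ANSI variant

#ifdef MY_UNICODE
typedef wchar_t my_tchar;      // plays the role of TCHAR
#define MY_TEXT(s) L##s        // plays the role of TEXT: adds the L prefix
#define Describe DescribeW     // generic name maps to the W version
#else
typedef char my_tchar;
#define MY_TEXT(s) s
#define Describe DescribeA     // generic name maps to the A version
#endif

// Caller code uses only the generic names and compiles either way.
std::string describeGreeting() {
    const my_tchar* greeting = MY_TEXT("hello");
    if (greeting[0] != MY_TEXT('h')) {
        return "unexpected first character";
    }
    return Describe(greeting);
}
```

With MY_UNICODE defined, describeGreeting passes a wide string to DescribeW; without it, the same source passes a narrow string to DescribeA.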


8. Why should I use operating system functions as much as possible?
This slightly improves the performance of your application, because the operating system string functions are used heavily by large applications such as the operating system shell process, Explorer.exe. Because these functions are used so much, they are probably already loaded into RAM while your application is running.
Examples: StrCat, StrChr, StrCmp, and StrCpy.


9. How do I write an ANSI-and Unicode-compliant application?
(1) Treat text strings as arrays of characters, not as arrays of chars or arrays of bytes.
(2) Use generic data types (such as TCHAR and PTSTR) for text characters and strings.
(3) Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.
(4) Use the TEXT macro for literal characters and strings.
(5) Perform global replacements (for example, replace PSTR with PTSTR).
(6) Fix string arithmetic problems. For example, functions usually expect a buffer size in characters, not bytes. This means you should pass sizeof(szBuffer) / sizeof(TCHAR) rather than sizeof(szBuffer). Also, if you need to allocate a memory block for a string and you have the number of characters in that string, remember that memory is allocated in bytes. This means you should call malloc(nCharacters * sizeof(TCHAR)) instead of malloc(nCharacters).
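The character-versus-byte arithmetic in rule (6) can be sketched in portable C++. Here wchar_t stands in for TCHAR in a Unicode build, and the helper names (charCount, allocWide, duplicateWide) are illustrative only.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cwchar>

// Number of characters (not bytes) that a fixed-size array can hold,
// equivalent to sizeof(buffer) / sizeof(buffer[0]).
template <typename CharT, std::size_t N>
constexpr std::size_t charCount(const CharT (&)[N]) {
    return N;
}

// Allocate room for nCharacters wide characters plus a terminator.
// malloc takes a size in bytes, so multiply by sizeof(wchar_t).
wchar_t* allocWide(std::size_t nCharacters) {
    return static_cast<wchar_t*>(
        std::malloc((nCharacters + 1) * sizeof(wchar_t)));
}

// Copy src into a freshly allocated buffer; the caller must free() it.
wchar_t* duplicateWide(const wchar_t* src) {
    std::size_t n = std::wcslen(src);
    wchar_t* copy = allocWide(n);
    if (copy != nullptr) {
        std::wmemcpy(copy, src, n + 1);  // count is in characters
    }
    return copy;
}
```

For a declaration such as wchar_t buffer[64], charCount(buffer) is 64 while sizeof(buffer) is 64 * sizeof(wchar_t), which is the difference rule (6) warns about.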


10. How do I make a selective comparison of strings?
This is done by calling CompareString.
Flag                   Meaning
NORM_IGNORECASE        Ignore letter case
NORM_IGNOREKANATYPE    Do not distinguish between hiragana and katakana characters
NORM_IGNORENONSPACE    Ignore nonspacing characters
NORM_IGNORESYMBOLS     Ignore symbols
NORM_IGNOREWIDTH       Do not distinguish between a single-byte character and the same character as a double-byte character
SORT_STRINGSORT        Treat punctuation the same as ordinary symbols
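As a rough illustration of what a flag such as NORM_IGNORECASE asks for, the sketch below compares two wide strings while ignoring letter case. It is not CompareString: the function name is hypothetical, it uses the standard towlower, and it does not model the locale-specific rules that CompareString applies.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Compare two wide strings, ignoring letter case.
// Returns a negative value, 0, or a positive value, like strcmp.
int compareIgnoreCase(const std::wstring& a, const std::wstring& b) {
    std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::wint_t ca = std::towlower(static_cast<std::wint_t>(a[i]));
        std::wint_t cb = std::towlower(static_cast<std::wint_t>(b[i]));
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;  // the shorter string sorts first
}
```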


11. How can I tell if a text file is ANSI or Unicode?
Check the first two bytes of the text file: if they are 0xFF and 0xFE (the UTF-16 little-endian byte order mark), the file is Unicode; otherwise treat it as ANSI.
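The two-byte check can be sketched as a small function over a buffer that already holds the start of the file. The enum and function names are hypothetical, and a file without the 0xFF 0xFE signature is simply reported as "ANSI or unknown", since a missing signature proves nothing.

```cpp
#include <cstddef>

// Hypothetical result type for the check described above.
enum class TextEncodingGuess { Utf16LittleEndian, AnsiOrUnknown };

// Inspect the first two bytes of a file's contents for the FF FE signature.
TextEncodingGuess guessEncoding(const unsigned char* data, std::size_t size) {
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return TextEncodingGuess::Utf16LittleEndian;
    }
    return TextEncodingGuess::AnsiOrUnknown;
}
```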


12. How can I tell if a string is ANSI or Unicode?
Use IsTextUnicode. IsTextUnicode applies a series of statistical and deterministic tests to guess the content of the buffer. Because this is not an exact science, IsTextUnicode may return incorrect results.


13. How do I convert a string between Unicode and ANSI?
The Windows function MultiByteToWideChar is used to convert a multibyte string into a wide string; the function WideCharToMultiByte converts a wide string into an equivalent multibyte string.
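Outside Windows, the closest standard-library counterparts are mbstowcs and wcstombs, which convert according to the current C locale rather than a code page argument. The sketch below is only an analogy to MultiByteToWideChar and WideCharToMultiByte, and the wrapper names widen and narrow are hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>

// Convert a multibyte string (in the current C locale's encoding) to a
// wide string. Returns an empty string on an invalid byte sequence.
std::wstring widen(const char* src) {
    std::size_t maxChars = std::strlen(src);  // wide count <= byte count
    if (maxChars == 0) return std::wstring();
    std::wstring out(maxChars, L'\0');
    std::size_t written = std::mbstowcs(&out[0], src, maxChars);
    if (written == static_cast<std::size_t>(-1)) return std::wstring();
    out.resize(written);
    return out;
}

// Convert a wide string to a multibyte string in the current C locale.
// Returns an empty string if a character cannot be represented.
std::string narrow(const wchar_t* src) {
    std::size_t maxBytes = std::wcslen(src) * MB_CUR_MAX;
    if (maxBytes == 0) return std::string();
    std::string out(maxBytes, '\0');
    std::size_t written = std::wcstombs(&out[0], src, maxBytes);
    if (written == static_cast<std::size_t>(-1)) return std::string();
    out.resize(written);
    return out;
}
```

In the default "C" locale only ASCII text is guaranteed to convert; other text needs a suitable locale selected with setlocale first.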


14. The difference between Unicode and DBCS
Unicode uses a "wide character set" (particularly in the context of the C programming language). Every character in Unicode is 16 bits wide rather than 8 bits wide. In Unicode, an 8-bit value has no meaning on its own. In contrast, in a double-byte character set we still deal with 8-bit values: some bytes define characters by themselves, while others indicate that they must be combined with another byte to define a character.
Handling DBCS strings is messy, but working with Unicode text is much like working with regular text. You may be pleased to learn that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are the ASCII characters, and the next 128 Unicode characters (codes 0x0080 through 0x00FF) are the ISO 8859-1 extensions to ASCII. Characters in other parts of Unicode are likewise based on existing standards, which makes conversion easier. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic uses codes 0x0400 through 0x04FF, Armenian uses codes 0x0530 through 0x058F, and Hebrew uses codes 0x0590 through 0x05FF. The ideographs of Chinese, Japanese, and Korean (collectively called CJK) occupy codes 0x3000 through 0x9FFF. The biggest benefit of Unicode is that there is only one character set, so there is no ambiguity.


15. Derivative standards
Unicode is a standard, and UTF-8 is one specific encoding defined within it. Unicode is the overall effort to give all of the world's characters a unified coding standard; UTF-8 is an encoding form of the Unicode (ISO 10646) standard.
UTF stands for Unicode/UCS Transformation Format. Two UTFs are common: UTF-8 and UTF-16.
However, UTF-16 is less commonly used. The correspondence between Unicode code points and UTF-8 is as follows:
Unicode 0000-007F is encoded in UTF-8 as: 0xxxxxxx
Unicode 0080-07FF is encoded in UTF-8 as: 110xxxxx 10xxxxxx
Unicode 0800-FFFF is encoded in UTF-8 as: 1110xxxx 10xxxxxx 10xxxxxx
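The one-, two-, and three-byte bit patterns can be turned into a short encoder. This is a minimal sketch for 16-bit code points only (U+0000 to U+FFFF); the function name is hypothetical, and surrogates and code points above U+FFFF are deliberately not handled.

```cpp
#include <string>

// Encode one 16-bit code point as UTF-8.
//   U+0000..U+007F -> 0xxxxxxx
//   U+0080..U+07FF -> 110xxxxx 10xxxxxx
//   U+0800..U+FFFF -> 1110xxxx 10xxxxxx 10xxxxxx
// Surrogate values are not validated in this sketch.
std::string encodeUtf8Bmp(char16_t cp) {
    std::string out;
    if (cp <= 0x007F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x07FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));          // top 5 bits
        out += static_cast<char>(0x80 | (cp & 0x3F));        // low 6 bits
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));         // top 4 bits
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F)); // middle 6 bits
        out += static_cast<char>(0x80 | (cp & 0x3F));        // low 6 bits
    }
    return out;
}
```

For example, U+0041 stays the single byte 0x41, U+00E9 becomes the two bytes C3 A9, and U+4E2D becomes the three bytes E4 B8 AD.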


UTF-8 is one of several encoding forms of Unicode. The Unicode character codes discussed so far are 16 bits wide, which is not actually enough to place all of the world's characters, including rare characters and minority scripts, in one flat space. The code space was therefore extended beyond 16 bits, and UTF-8 represents the larger code points with longer byte sequences (up to four bytes in the current standard). The idea of Unicode is to unify all characters under a single standard.
BIG5 and GB are independent character sets, also called Far East character sets. Text in these encodings may run into character encoding conflicts when it is brought to, for example, a German version of Windows.
The default character set of early Windows was ANSI, and text typed into Notepad was stored in the local (native) encoding, whereas NT/2000 support Unicode directly internally. Notepad.exe uses ANSI characters on Windows 95 and 98 and Unicode on NT. ANSI and Unicode can be mapped to each other easily, which is what conversion means.
ASCII is a character set with an 8-bit range and cannot represent characters outside that range, such as Chinese characters. Unicode is a character set with a 16-bit range; it is a character encoding standard developed jointly by several IT giants to cover the characters of different regions. A character occupies two bytes (16 bits) in a Unicode environment such as Windows NT, and one byte (8 bits) in an ANSI environment such as Windows 98. A Unicode character is 16 bits wide, which allows at most 65,536 characters, and its data type is called WCHAR.
For existing ANSI characters, Unicode simply widens them to 16 bits: for example, ANSI "A" = 0x41, and the corresponding Unicode is
"A" = 0x0041
ASCII is truly an American standard, with 7 bits for 128 characters, so it cannot meet the needs of other countries, for example for Cyrillic letters and Chinese characters. The Windows ANSI character set is an extended ASCII with 8-bit characters: the lower 128 codes keep the original ASCII codes,
and the upper 128 codes add characters such as the Greek alphabet.
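The widening of ASCII characters to 16 bits described above amounts to zero-extending each byte, so "A" (0x41) becomes 0x0041. A minimal sketch, with a hypothetical function name, valid only for 7-bit ASCII input (0x00 to 0x7F):

```cpp
#include <string>

// Widen a 7-bit ASCII string to 16-bit code units by zero-extension.
// Bytes above 0x7F depend on the code page and need a real conversion.
std::u16string widenAscii(const std::string& ascii) {
    std::u16string out;
    out.reserve(ascii.size());
    for (char c : ascii) {
        out += static_cast<char16_t>(static_cast<unsigned char>(c));
    }
    return out;
}
```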
#ifdef UNICODE
typedef WCHAR TCHAR;
#else
typedef char TCHAR;
#endif
You need to add UNICODE and _UNICODE in Project/Settings/C/C++/Preprocessor definitions.
Both UNICODE and _UNICODE must be defined. If only _UNICODE is defined and UNICODE is not, a generic call such as SetText(HWND, LPCTSTR) is resolved to SetTextA(HWND, LPCSTR), and the API then treats your Unicode string as an ANSI string and displays garbled characters. The Windows API is already compiled into DLLs, and whether a string is Unicode or ANSI, the API simply sees a buffer. For example, given the bytes "0B A3 3C 00 00", reading the buffer as ANSI stops at the first single '\0' byte, because an ANSI string ends with '\0'; reading it as Unicode continues until the two-byte terminator '\0\0'.
Because a Unicode string carries no extra indicator bit, the system must be told which format you are providing. In addition, UNICODE is the macro used by the Windows SDK headers, while _UNICODE is the one used by the C run-time headers. If you are not writing a Windows program, defining _UNICODE alone is enough.
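The truncation effect is easy to reproduce with explicit bytes. In the sketch below (hypothetical helper names), the same buffer is measured once as an ANSI string, which ends at the first zero byte, and once as a UTF-16 string, which ends at the first zero 16-bit unit.

```cpp
#include <cstddef>

// Length in bytes when the buffer is read as an ANSI string:
// stop at the first zero byte.
std::size_t lengthAsAnsi(const unsigned char* bytes, std::size_t size) {
    std::size_t n = 0;
    while (n < size && bytes[n] != 0) {
        ++n;
    }
    return n;
}

// Length in 16-bit units when the buffer is read as UTF-16:
// stop at the first unit whose two bytes are both zero.
std::size_t lengthAsUtf16(const unsigned char* bytes, std::size_t size) {
    std::size_t n = 0;
    while (2 * n + 1 < size &&
           (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)) {
        ++n;
    }
    return n;
}
```

For the UTF-16 little-endian bytes of "AB" (41 00 42 00 00 00), the ANSI reading sees a one-character string, while the UTF-16 reading sees both characters.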


Development process:
This part covers file reading and writing and string processing. Two main file types are involved: .txt and .ini files.
1. Strings are handled differently in Unicode and non-Unicode environments, so refer to items 9 and 10 above to meet the string processing requirements of both environments.
The same applies to reading and writing files. Whenever you call a related interface function, wrap the string arguments in a macro such as _TEXT. If the file you write must be saved in Unicode format, write a byte order mark (BOM) at the start of the file when you create it.
CFile file;
WCHAR szwBuffer[128];

const WCHAR *pszUnicode = L"Unicode string\n"; // Unicode string
const CHAR *pszAnsi = "Ansi string\n";         // ANSI string
WORD wSignature = 0xFEFF;                      // byte order mark, stored as FF FE

file.Open(TEXT("Test.txt"), CFile::modeCreate | CFile::modeWrite);

file.Write(&wSignature, 2);

// Explicitly use the lstrlenW function; the length is converted to bytes
file.Write(pszUnicode, lstrlenW(pszUnicode) * sizeof(WCHAR));

// Convert the ANSI string to Unicode before writing it
MultiByteToWideChar(CP_ACP, 0, pszAnsi, -1, szwBuffer, 128);

file.Write(szwBuffer, lstrlenW(szwBuffer) * sizeof(WCHAR));

file.Close();
The above code works in both Unicode and non-Unicode environments, because it explicitly performs every operation with the Unicode (wide-character) functions.
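The byte layout that the code above produces can be sketched without MFC. The function below (a hypothetical name) serializes 16-bit text as UTF-16 little-endian bytes preceded by the FF FE signature; the resulting bytes could then be written with any binary file API.

```cpp
#include <string>
#include <vector>

// Serialize text as UTF-16 little-endian, preceded by the byte order mark.
std::vector<unsigned char> toUtf16LeWithBom(const std::u16string& text) {
    std::vector<unsigned char> bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes.push_back(0xFF);  // BOM 0xFEFF stored low byte first
    bytes.push_back(0xFE);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));  // low byte
        bytes.push_back(static_cast<unsigned char>(unit >> 8));    // high byte
    }
    return bytes;
}
```

Writing the bytes explicitly keeps the output identical regardless of the byte order of the machine that runs the code.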


2. In a non-Unicode environment, strings are ANSI by default, and TCHAR maps to char unless WCHAR is used explicitly. In this environment, to read a Unicode file you first skip the 2-byte signature; the data that follows is wide-character (Unicode) data, which must be converted with WideCharToMultiByte before it can be used with the ANSI functions.


3. In a Unicode environment, strings are Unicode (wide characters) by default, TCHAR maps to WCHAR, and the related API calls resolve to the wide-character versions. Reading a Unicode file works the same way as above, and the data read is WCHAR data; if you need it in ANSI format, call WideCharToMultiByte. If you read an ANSI file, there is no need to skip two bytes: read the data directly and convert it as needed.
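The "skip two bytes, then treat the rest as wide characters" step can be sketched portably as follows. The function name is hypothetical, and the input is assumed to be the raw contents of a UTF-16 little-endian file already loaded into memory.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 little-endian bytes into 16-bit code units,
// skipping the FF FE signature if it is present.
std::u16string decodeUtf16Le(const unsigned char* data, std::size_t size) {
    std::size_t pos = 0;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        pos = 2;  // skip the byte order mark
    }
    std::u16string text;
    for (; pos + 1 < size; pos += 2) {
        // Low byte first, then high byte.
        text += static_cast<char16_t>(data[pos] | (data[pos + 1] << 8));
    }
    return text;
}
```

A trailing odd byte, which cannot form a complete 16-bit unit, is ignored in this sketch.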


Some languages (such as Korean) must be displayed in a Unicode environment. In a non-Unicode environment, even string conversion functions cannot make the text display correctly, because the API functions being called are the ANSI versions (the system processes text as Unicode underneath, but what is displayed depends on the API the programmer calls). Such applications therefore have to be developed with Unicode.
