[Zz] Unicode knowledge and skills

Source: Internet
Author: User
Document directory
  • Unicode compilation settings:
  • UNICODE: Wide-Character Set
  • Development Process:

UNICODE macro and _UNICODE macro

In Windows programming, Unicode programs are usually built by adding the UNICODE or _UNICODE preprocessor definitions to the project configuration. Which one should be used?

Jeffrey Richter says in Windows Core Programming that the _UNICODE macro is used by the C run-time header files, while the UNICODE macro is used by the Windows header files, and that both macros must be defined when compiling a source module. What exactly is going on here?

I searched the MFC header files and found this code in afxv_w32.h:

#ifdef _UNICODE
#ifndef UNICODE
#define UNICODE
#endif
#endif

#ifdef UNICODE
#ifndef _UNICODE
#define _UNICODE
#endif
#endif

 

Therefore, in an MFC program you only need to define one of the two.

 

In SDK programs, however, a search of the header files shows that UNICODE is used far more often; _UNICODE appears in only a few files, and the headers rarely define one macro from the other. Therefore, to build an SDK program as a Unicode program, it appears you only need to define the UNICODE macro.

This article comes from the Hacker Manual.
Link: http://www.nohack.cn/code/other/2006-10-05/8850.html

////////////////

[It168 Knowledge Base]

A collection of Unicode encoding and programming skills

I. Regular expressions for matching Unicode characters

Original article: http://tech.it168.com/KnowledgeBase/Articles/2/2/0/220fcae070b4f62461e3e99e17e30306.htm

Here are the main non-English character ranges (found via Google):

2E80–33FFh: CJK symbols area. It contains Kangxi Dictionary radicals, CJK auxiliary radicals, Bopomofo phonetic symbols, Japanese kana, Hangul compatibility jamo, CJK symbols and punctuation, enclosed and parenthesized numbers and months, as well as Japanese kana combinations, units, era names, months, dates, and times.

3400–4DFFh: CJK Unified Ideographs Extension A, containing a total of 6,582 CJK ideographs.

4E00–9FFFh: CJK Unified Ideographs area, containing a total of 20,902 CJK ideographs.

A000–A4FFh: Yi script area, containing the syllables and radicals of the Yi people of southern China.

AC00–D7FFh: Hangul syllables area, containing characters composed from Hangul jamo.

F900–FAFFh: CJK Compatibility Ideographs area, containing a total of 302 CJK ideographs.

FB00–FFFDh: presentation forms area, containing combining Latin characters, Hebrew, Arabic, CJK vertical punctuation, small form variants, and halfwidth and fullwidth forms.

For example, to match all CJK non-symbol characters, the regular expression should be ^[\u3400-\u9fff]+$.
In theory that should work, but when I copied some Korean text from msn.co.kr to test it, it did not match; the same happened with some Japanese text copied from msn.co.jp.

Then I expanded the range to ^[\u2e80-\u9fff]+$, and everything passed. This should be the regular expression that matches CJK characters, including the traditional Chinese characters still in use.

The regular expression for matching Chinese characters only should be ^[\u4e00-\u9fff]+$, which is very close to the commonly used ^[\u4e00-\u9fa5]+$.

Note that ^[\u4e00-\u9fa5]+$ is usually described as the regular expression for matching simplified Chinese, but traditional Chinese characters also fall within this range; I tested the traditional form of "People's Republic of China" (中華人民共和國) with a regex tester and it passed. Of course, ^[\u4e00-\u9fff]+$ gives the same result.
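The same ranges can be used outside JavaScript as well. Below is a minimal C++ sketch (the test strings are arbitrary examples) that applies ^[\u4e00-\u9fff]+$ with std::wregex; the \u escapes are resolved by the compiler, so the regex engine only sees a literal character range:

#include <regex>
#include <string>
#include <iostream>

int main()
{
    // The compiler turns the \u escapes into wide characters,
    // so the class below is simply the range U+4E00 - U+9FFF.
    std::wregex cjk(L"^[\u4e00-\u9fff]+$");

    std::wstring han = L"\u4e2d\u6587";   // two CJK ideographs ("中文")
    std::wstring mixed = L"abc\u4e2d";

    std::wcout << std::regex_match(han, cjk) << L"\n";    // 1
    std::wcout << std::regex_match(mixed, cjk) << L"\n";  // 0
}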

II. Using the Unicode range of Chinese characters in JavaScript to check whether a string consists of Chinese characters

1. function checkChinese(str) {
    var re1 = new RegExp("^[\u4e00-\u9fa5]*$"); // range of Chinese characters
    var re2 = new RegExp("^[\ue7c7-\ue7f3]*$");
    str = str.replace(/(^\s*)|(\s*$)/g, '');
    if (str == '') { return false; }
    if (!(re1.test(str) && (!re2.test(str)))) {
        return false;
    }
    return true;
}

2. How to determine whether a string contains only single-byte characters

if (/[^\x00-\xff]/g.test(s))
    alert("contains Chinese (double-byte) characters");
else
    alert("all single-byte characters");

3. How to determine whether Chinese characters are contained

if (escape(str).indexOf("%u") != -1)
    alert("contains Chinese characters");
else
    alert("all single-byte characters");

4. Checking for Chinese characters via a String prototype method

String.prototype.existChinese = function () {
    // [\u4e00-\u9fa5] covers Chinese characters; [\ufe30-\uffa0] covers fullwidth forms
    return /[\u4e00-\u9fa5]/.test(this);
};

III. Other tips

1. A regular expression that allows only Chinese characters, digits, uppercase and lowercase English letters, commas, and periods:

if (/[^\u4e00-\u9fa5\w,\.]/.test(obj.value)) { obj.value = ""; return false; }

function specChar(obj) {
    if (event.type == "keyup") {
        if (/[^\u4e00-\u9fa5\w,\.]/.test(obj.value))
            obj.value = obj.value.substring(0, obj.value.length - 1);
        return false;
    }
}

IV. Summary of Unicode programming in Windows

Unicode Environment Settings

When installing Visual Studio, be sure to select the Unicode option for VC++ so that the relevant library files are copied to system32.

Unicode compilation settings:

Under C/C++, Preprocessor Definitions: remove _MBCS and add _UNICODE and UNICODE.

In Project Settings / Link / Output, set the entry point to wWinMainCRTStartup.

Otherwise the program is compiled as MBCS (ANSI).
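For reference, a minimal sketch of what such a project then contains (the message text is an arbitrary example): with both macros defined and the entry point set to wWinMainCRTStartup, the CRT calls the wide-character entry point wWinMain.

#include <windows.h>

// Wide-character entry point of a Unicode GUI program; the entry-point
// setting above makes the CRT call wWinMain instead of WinMain.
int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
                    PWSTR pCmdLine, int nCmdShow)
{
    MessageBoxW(NULL, L"Unicode build", L"Demo", MB_OK);
    return 0;
}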

UNICODE: Wide-Character Set

1. How do I obtain the number of characters in a string that contains both single-byte and double-byte characters?

You can call the Microsoft Visual C++ run-time library function _mbslen to work on multibyte strings (which may contain both single-byte and double-byte characters).

Calling strlen does not tell you how many characters are in the string; it only tells you how many bytes there are before the terminating 0.
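A minimal sketch of the difference (the double-byte sequence 0xD6 0xD0 and code page 936 are illustrative assumptions; _setmbcp just makes the assumed code page explicit):

#include <mbctype.h>   // _setmbcp
#include <mbstring.h>  // _mbslen
#include <string.h>
#include <stdio.h>

int main(void)
{
    _setmbcp(936);  // assume code page 936 (GBK) so lead bytes are recognized

    // "AB" followed by one double-byte GBK character (0xD6 0xD0)
    const char *s = "AB\xd6\xd0";

    printf("strlen:  %u bytes\n", (unsigned)strlen(s));                              // 4
    printf("_mbslen: %u characters\n", (unsigned)_mbslen((const unsigned char *)s)); // 3
    return 0;
}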

2. How do I operate on DBCS strings?

Function / Description

PTSTR CharNext(LPCTSTR); returns the address of the next character in a string.

PTSTR CharPrev(LPCTSTR, LPCTSTR); returns the address of the previous character in a string.

BOOL IsDBCSLeadByte(BYTE); returns a non-zero value if the byte is the first byte of a DBCS character.
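A minimal sketch of walking a DBCS string with these functions (the byte sequence, and the assumption that a DBCS system code page such as 936 is active, are for illustration only):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // "AB", one double-byte character (0xD6 0xD0), then "C"
    const char *s = "AB\xd6\xd0" "C";

    int count = 0;
    for (const char *p = s; *p != '\0'; p = CharNextA(p))
        ++count;                                   // CharNextA skips lead/trail byte pairs

    printf("characters: %d\n", count);             // 4 under a DBCS code page
    printf("0xD6 is a lead byte? %d\n", IsDBCSLeadByte((BYTE)0xD6));
    return 0;
}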

3. Why Unicode?

(1) It is easy to exchange data between different languages.

(2) It enables you to distribute a single .exe or .dll file that supports all languages.

(3) It improves the running efficiency of applications.

Windows 2000 was developed from the ground up using Unicode. If you call any Windows function and pass it an ANSI string, the system must first convert the string to Unicode and then pass the Unicode string to the operating system. If you want the function to return an ANSI string, the system first converts the Unicode string to ANSI and then returns the result to your application. These conversions cost time and memory. By developing your application with Unicode from the start, you make it run more efficiently.

Windows CE itself is an operating system that uses Unicode and does not support ANSI Windows functions.

Windows 98 supports only ANSI, so applications for it can only be developed for ANSI.

When Microsoft ported COM from 16-bit Windows to Win32, it decided that all COM interface methods that take strings would accept only Unicode strings.

4. How to compile Unicode source code?

Microsoft designed the Windows API for Unicode in a way that minimizes the impact on your code. In fact, you can write a single source file that compiles with or without Unicode; you only need to define the two macros (UNICODE and _UNICODE) and recompile the source file.

The _UNICODE macro is used by the C run-time header files, while the UNICODE macro is used by the Windows header files. Both macros must be defined when compiling a source module.
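A minimal sketch of the effect (a small console program; in practice the macros are normally set in the project settings rather than in source):

// Defining both macros before any headers selects the wide-character
// mappings for this translation unit.
#define UNICODE      // Windows headers: CreateFile -> CreateFileW, TCHAR -> WCHAR, ...
#define _UNICODE     // CRT headers:     _tcslen -> wcslen, _tprintf -> wprintf, ...
#include <windows.h>
#include <tchar.h>

int main(void)
{
    const TCHAR *msg = TEXT("Hello");     // becomes L"Hello" in a Unicode build
    return (int)_tcslen(msg);             // compiled as wcslen here
}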

5. What Unicode data types are defined in Windows?

Data type / Description

WCHAR: Unicode character

PWSTR: pointer to a Unicode string

PCWSTR: pointer to a constant Unicode string

The corresponding ANSI data types are CHAR, LPSTR, and LPCSTR.

The generic ANSI/Unicode data types are TCHAR, PTSTR, and LPCTSTR.
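A minimal sketch showing the three families side by side (the values are arbitrary):

#include <windows.h>
#include <tchar.h>

// Unicode-only types
WCHAR   wc  = L'A';             // 16-bit Unicode character
PCWSTR  pcw = L"wide string";   // pointer to a constant Unicode string

// ANSI counterparts
CHAR    c   = 'A';
LPCSTR  pca = "ansi string";

// Generic types: resolve to the ANSI or Unicode family depending on UNICODE
TCHAR   tc  = TEXT('A');
LPCTSTR pct = TEXT("generic string");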

6. How do I operate on Unicode strings?

Character set / Prefix / Example

ANSI functions start with str, e.g. strcpy.

Unicode functions start with wcs, e.g. wcscpy.

MBCS functions start with _mbs, e.g. _mbscpy.

Generic ANSI/Unicode functions start with _tcs, e.g. _tcscpy (C run-time library).

Generic ANSI/Unicode functions start with lstr, e.g. lstrcpy (Windows functions).

In Windows 2000, every new and existing function has both an ANSI and a Unicode version. ANSI versions end with A, and Unicode versions end with W. Windows defines them like this:

#ifdef UNICODE

#define CreateWindowEx CreateWindowExW

#else

#define CreateWindowEx CreateWindowExA

#endif // !UNICODE

7. How do I represent Unicode string constants?

Character set / Example

ANSI: "string"

Unicode: L"string"

ANSI/Unicode: TEXT("string") or _TEXT("string"), for example if (szError[0] == _TEXT('J')) {}

8. Why should I try to use the operating system's string functions?

Because these functions are heavily used, they are likely to already be loaded into RAM while the application is running.

Examples: strcat, strchr, strcmp, and strcpy.

9. How do I write ANSI and Unicode-compliant applications?

(1) Treat text strings as arrays of characters, not as arrays of chars or bytes.

(2) Use the generic data types (such as TCHAR and PTSTR) for text characters and strings.

(3) Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.

(4) Use the TEXT macro for literal characters and strings.

(5) Perform global replacements (for example, replace PSTR with PTSTR).

(6) Modify your string arithmetic. For example, functions usually expect a buffer size in characters, not bytes, so you should pass sizeof(szBuffer) / sizeof(TCHAR) rather than sizeof(szBuffer). On the other hand, if you need to allocate a memory block for a string and you know the number of characters, remember that memory is allocated in bytes: call malloc(nCharacters * sizeof(TCHAR)) instead of malloc(nCharacters).
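A minimal sketch of both rules, using GetWindowText as an example of an API that expects a size in characters:

#include <windows.h>
#include <tchar.h>
#include <stdlib.h>

void SizeDemo(HWND hwnd)
{
    TCHAR szBuffer[100];

    // APIs that take a buffer size usually want characters, not bytes:
    GetWindowText(hwnd, szBuffer, sizeof(szBuffer) / sizeof(TCHAR));

    // Memory, however, is allocated in bytes:
    int nCharacters = 64;
    TCHAR *p = (TCHAR *)malloc(nCharacters * sizeof(TCHAR));
    free(p);
}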

10. How do I compare strings with selectable options?

This is done by calling CompareString.

Flag / Meaning

NORM_IGNORECASE: ignore case.

NORM_IGNOREKANATYPE: do not distinguish hiragana from katakana.

NORM_IGNORENONSPACE: ignore nonspacing characters.

NORM_IGNORESYMBOLS: ignore symbols.

NORM_IGNOREWIDTH: do not distinguish between single-byte and double-byte forms of the same character.

SORT_STRINGSORT: treat punctuation the same as ordinary symbols.
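A minimal sketch of a case-insensitive comparison (the locale and strings are arbitrary examples):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    int r = CompareString(LOCALE_USER_DEFAULT, NORM_IGNORECASE,
                          TEXT("hello"), -1, TEXT("HELLO"), -1);

    if (r == CSTR_EQUAL)
        printf("equal under NORM_IGNORECASE\n");
    else if (r == CSTR_LESS_THAN || r == CSTR_GREATER_THAN)
        printf("not equal\n");
    return 0;
}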

11. How can I determine whether a text file is ANSI or Unicode?

If the first two bytes of the text file are 0xFF and 0xFE, it is Unicode; otherwise it is ANSI.
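A minimal sketch of checking those two bytes (plain C stdio; the path is whatever you pass in):

#include <stdio.h>

// Returns 1 if the file starts with the UTF-16 LE byte-order mark 0xFF 0xFE.
int StartsWithUnicodeBom(const char *path)
{
    unsigned char bom[2] = { 0, 0 };
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t n = fread(bom, 1, 2, f);
    fclose(f);
    return n == 2 && bom[0] == 0xFF && bom[1] == 0xFE;
}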

12. How can I determine whether a string is ANSI or Unicode?

Use IsTextUnicode. IsTextUnicode uses a series of statistical and deterministic methods to guess at the contents of the buffer. Because this is not an exact science, IsTextUnicode may return an incorrect result.
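A minimal sketch of calling it (keep in mind that the answer is only a guess):

#include <windows.h>

// Heuristic check: asks IsTextUnicode to run its standard Unicode tests
// on the buffer; the answer may still be wrong for ambiguous data.
BOOL LooksLikeUnicode(const void *buffer, int cbBytes)
{
    INT tests = IS_TEXT_UNICODE_UNICODE_MASK;
    return IsTextUnicode(buffer, cbBytes, &tests);
}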

13. How to convert a string between Unicode and ANSI?

The Windows function MultiByteToWideChar converts a multibyte string to a wide-character string, and WideCharToMultiByte converts a wide-character string to the equivalent multibyte string.
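A minimal sketch of a round trip between the two (buffer sizes are arbitrary):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *ansi = "Hello";
    wchar_t wide[64];
    char back[64];

    // ANSI (current code page) -> Unicode
    MultiByteToWideChar(CP_ACP, 0, ansi, -1, wide, 64);

    // Unicode -> ANSI
    WideCharToMultiByte(CP_ACP, 0, wide, -1, back, 64, NULL, NULL);

    printf("%s\n", back);   // "Hello"
    return 0;
}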

14. Differences between Unicode and DBCS

Unicode is what is called (especially in the C programming environment) a "wide character set": every character in Unicode is 16 bits wide rather than 8 bits wide, and an 8-bit value by itself has no meaning in Unicode. In a double-byte character set, by contrast, we still deal with 8-bit values: some bytes define a character on their own, while others indicate that they must be combined with the next byte to define a single character.

Working with DBCS strings is messy, but working with Unicode text is like working with ordinary, ordered text. You may be glad to know that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are the ASCII characters, and the next 128 (codes 0x0080 through 0x00FF) are the ISO 8859-1 extension of ASCII. Characters in other parts of Unicode are also based on existing standards, which makes conversion easier. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic (the Slavic languages) uses 0x0400 through 0x04FF, Armenian uses 0x0530 through 0x058F, and Hebrew uses 0x0590 through 0x05FF. The Chinese, Japanese, and Korean ideographs (CJK) occupy codes 0x3000 through 0x9FFF. The biggest advantage of Unicode is that there is only one character set, with no ambiguity.

15. Derived standards

Unicode is a standard; UTF-8 is a concrete encoding form of it. Unicode was created to meet the demand for a single, worldwide character encoding, and UTF-8 is a transformation format of the Unicode (ISO 10646) standard.

UTF stands for Unicode/UCS Transformation Format. In practice there are two common UTFs: UTF-8 and UTF-16.

UTF-16 is used less often. The UTF-8 correspondence is as follows:

Unicode range 0000–007F: encoded in UTF-8 as 0xxxxxxx

Unicode range 0080–07FF: encoded in UTF-8 as 110xxxxx 10xxxxxx

Unicode range 0800–FFFF: encoded in UTF-8 as 1110xxxx 10xxxxxx 10xxxxxx
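A minimal sketch that demonstrates the third row of the table (U+4E2D is just an example character; CP_UTF8 asks the Windows conversion function for UTF-8 output):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t wide[] = L"\u4e2d";   // U+4E2D falls in the 0800-FFFF range
    unsigned char utf8[8] = { 0 };

    int n = WideCharToMultiByte(CP_UTF8, 0, wide, 1,
                                (char *)utf8, sizeof(utf8), NULL, NULL);
    for (int i = 0; i < n; ++i)
        printf("%02X ", utf8[i]);        // prints E4 B8 AD: 1110xxxx 10xxxxxx 10xxxxxx
    printf("\n");
    return 0;
}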

UTF-8 is a newer Unicode encoding scheme; in fact there are several Unicode-related standards. The Unicode character code used for a long time is 16 bits wide, which cannot actually accommodate every script in the world (Tibetan and other smaller scripts used in China, for example), so UTF-8 extends to 32-bit code values; in theory it can therefore hold on the order of 2^32 characters. Unicode was designed to encode all characters in a unified way.

Big5 and GB are independent character sets, sometimes called Far East character sets. If you take text encoded with them to, say, a German version of Windows, it can cause character-encoding conflicts. In earlier versions of Windows the default character set was ANSI: Chinese characters typed into Notepad were saved in the local encoding, whereas NT/2000 supports Unicode directly. Notepad.exe stores ANSI characters under Windows 95/98 and Unicode under NT. ANSI and Unicode can be mapped to each other easily.

ASCII is an eight-bit character set and cannot represent characters outside that range, such as Chinese characters. Unicode is a character set within a 16-bit range, with characters for different regions assigned to different blocks; it is a character-encoding standard developed jointly by several IT giants. In a Unicode environment such as Windows NT a character occupies 16 bits, while in an ANSI environment such as Windows 98 a character occupies 8 bits. A Unicode character is 16 bits wide, which allows at most 65,535 characters; the corresponding data type is called WCHAR.

For existing ANSI characters, Unicode simply extends them to 16 bits: for example, the ANSI "A" = 0x41 corresponds to the Unicode "A" = 0x0041.

ASCII uses seven bits to store 128 characters. ASCII is truly an American standard, so it cannot meet the needs of other countries; characters from the Slavic languages, for example, appear instead in the Windows ANSI character set, which is an extended ASCII code stored in 8 bits: the low 128 characters keep the original ASCII codes, and the high 128 characters add Greek letters and other characters.

#ifdef UNICODE

typedef WCHAR TCHAR;

#else

typedef char TCHAR;

#endif

You need to add UNICODE and _UNICODE under Project \ Settings \ C/C++ \ Preprocessor Definitions.

Both UNICODE and _UNICODE must be defined. If they are not, SetText(hwnd, LPCTSTR) resolves to SetTextA(hwnd, LPSTR); the API then treats the Unicode string you pass as an ANSI string, and garbage is displayed. The Windows APIs are compiled into DLLs, so to them both Unicode and ANSI strings are just a buffer, for example "0B A3 00 35 24 3C 00 00". Read as ANSI, the string ends at the first '\0', so only the two bytes "0B A3" are read; read as Unicode, the string ends at '\0\0' and is read in full.
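A minimal sketch of this buffer-interpretation difference, using an assumed byte sequence rather than the one above:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // UTF-16 LE bytes for L"AB": 41 00 42 00, followed by 00 00.
    const wchar_t wide[] = L"AB";

    printf("read as Unicode: %d characters\n", lstrlenW(wide));          // stops at 0x0000 -> 2
    printf("read as ANSI:    %d characters\n", lstrlenA((LPCSTR)wide));  // stops at the first 0x00 -> 1
    return 0;
}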

Since the buffer itself carries no extra indication, the system has to know in advance the format of the string you are passing. Also, _UNICODE is the convention used by the ANSI C run-time headers, while UNICODE comes from the Windows SDK; if you are not writing a Windows program, you only need to define _UNICODE.

Development Process:

The work centers on file reading/writing and string handling. Two kinds of files are involved: .txt files and .ini files.

1. For handling strings differently in Unicode and non-Unicode builds, follow points 9 and 10 above; that covers string processing in both environments.

The same applies to reading and writing files: when calling the relevant interface functions, wrap string parameters in _TEXT and the related macros. If the file being written must be saved in Unicode format, you must also write a byte-order signature when the file is created, as in the example below.

CFile file;

WCHAR szwBuffer[128];

WCHAR *pszUnicode = L"Unicode string\n"; // Unicode string
char *pszAnsi = "ANSI string\n";         // ANSI string
WORD wSignature = 0xFEFF;

file.Open(TEXT("test.txt"), CFile::modeCreate | CFile::modeWrite);

file.Write(&wSignature, 2);

file.Write(pszUnicode, lstrlenW(pszUnicode) * sizeof(WCHAR));
// Explicitly use the lstrlenW function

MultiByteToWideChar(CP_ACP, 0, pszAnsi, -1, szwBuffer, 128);

file.Write(szwBuffer, lstrlenW(szwBuffer) * sizeof(WCHAR));

file.Close();

// The code above works in both Unicode and non-Unicode builds; it explicitly performs the operations in Unicode.

2. In a non-Unicode build, strings default to ANSI, and TCHAR resolves to char unless WCHAR is used explicitly. Therefore, to read a Unicode file in this environment you must first skip the two-byte signature; the data you then read is Unicode and must be converted (with WideCharToMultiByte) before it can be used as ANSI strings.

3. In a Unicode build, strings default to Unicode (wide characters): TCHAR resolves to WCHAR and the API macros resolve to the wide-character functions. Reading a Unicode file works the same way as above, except that the data read is already WCHAR; to convert it to ANSI you call WideCharToMultiByte. To read ANSI data you do not need to skip two bytes: read it directly and convert it as needed.
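A minimal sketch of items 2 and 3 (assuming the test.txt written earlier, and plain C stdio instead of CFile):

#include <windows.h>
#include <stdio.h>

void ReadUnicodeFile(void)
{
    FILE *f = fopen("test.txt", "rb");
    if (!f) return;

    WORD signature = 0;
    fread(&signature, sizeof(signature), 1, f);   // skip the 0xFEFF byte-order mark

    wchar_t wide[128] = { 0 };
    size_t n = fread(wide, sizeof(wchar_t), 127, f);
    wide[n] = L'\0';
    fclose(f);

    // In a non-Unicode build, convert the wide data to ANSI before using it:
    char ansi[256];
    WideCharToMultiByte(CP_ACP, 0, wide, -1, ansi, sizeof(ansi), NULL, NULL);
    printf("%s", ansi);
}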

Some languages (Korean, for example) can only be displayed properly in a Unicode build. In that case, developing in a non-Unicode environment cannot display the text correctly even with string-conversion functions, because the ANSI versions of the APIs are being called (the system may handle them as Unicode underneath, but the result depends on which API the programmer calls). Such programs must therefore be developed with Unicode.

 
