Sender: Gege (I am making a fortune), Board: C++
Title: Unicode Programming
Posting site: Ethereal cloudification room (Thu Apr 25 21:00:30 2002), internal mail

This is a question that many people (myself included) have been, or still are, confused about. Here we only discuss UTF-16, treated as the double-byte version; characters outside the basic 16-bit range, which need two 16-bit units, are not covered.

1. About Unicode

Unicode programming mainly uses the WCHAR character type (wchar_t), which VC defines as unsigned short. As the definition shows, this is a double-byte type: each character occupies 2 bytes, so up to 65,536 distinct values can be represented. The ASCII codes occupy 0x0000-0x007F (with Latin-1 filling the rest up to 0x00FF), while Chinese characters (including those from Big5) are distributed between 0x4E00 and 0x9FFF. Unicode covers almost all of the world's writing systems. For more information about Unicode, see the following webpage:
http://www.unicode.org/unicode/standard/translations/s-chinese.html

2. Why Unicode?

1) COM: the COM specification requires strings to be Unicode, a decision Microsoft made with cross-platform use in mind. This is why the BSTR type (essentially WCHAR *) is seen so often in COM.

2) Win2000 and WinNT: on these two platforms the system processes characters as Unicode by default. Even if you write a non-Unicode (multibyte) program, the system still converts your strings at run time, which wastes CPU time. Using Unicode therefore improves run-time efficiency, though only on these two platforms. XP will of course behave the same way.

3) Versatility: with Unicode we no longer need to treat Chinese and English characters differently (both take two bytes).

3. How to use Unicode

1) The recommended type is TCHAR (the generic character type). When the Unicode macro is defined (UNICODE for the Windows headers, _UNICODE for the C runtime's tchar.h), TCHAR is WCHAR. If the macro is not defined, TCHAR is char. Quite neat.
Let's take a look at the definition of TCHAR:

    #ifdef UNICODE                      // r_winnt
    typedef WCHAR TCHAR, *PTCHAR;
    #else   /* UNICODE */               // r_winnt
    typedef char TCHAR, *PTCHAR;
    #endif  /* UNICODE */

The code above comes from winnt.h; I have removed some irrelevant parts. Now everything is clear. With TCHAR, we only need the following code:

    TCHAR tstr[] = _T("T code");
    MessageBox(tstr);

and both the Unicode and the multibyte builds are supported. The _T macro makes the literal wide or narrow to match TCHAR. (The one-argument MessageBox here is the MFC CWnd member.)

2) About other processing

The first thing to mention is the commonly used CString, which itself supports Unicode. The following example illustrates its usage:

    CString *pFileName = new CString("C://tmpfile.txt");
    #ifdef _UNICODE
    // AllocSysString() returns a new BSTR; the caller must
    // release it with SysFreeString() or it leaks.
    m_hFile = CreateFile(pFileName->AllocSysString(),
                         GENERIC_READ | GENERIC_WRITE,
                         FILE_SHARE_READ,
                         NULL,
                         OPEN_EXISTING,
                         FILE_ATTRIBUTE_NORMAL,
                         NULL);
    #else
    // GetBuffer() should be paired with ReleaseBuffer().
    m_hFile = CreateFile(pFileName->GetBuffer(pFileName->GetLength()),
                         GENERIC_READ | GENERIC_WRITE,
                         FILE_SHARE_READ,
                         NULL,
                         OPEN_EXISTING,
                         FILE_ATTRIBUTE_NORMAL,
                         NULL);
    #endif

Because CString converts implicitly to LPCTSTR, passing *pFileName directly to CreateFile also works in both builds and needs neither the #ifdef nor the cleanup.

3) When we need to assign a string constant in Unicode mode, we can use the L prefix, for example:

    const WCHAR *wcsstr = L"Unicode";

Assigning this way is simple, but an L-prefixed literal is always a wide (Unicode) string; if you assign it to a multibyte string, characters may be truncated. A real BSTR also carries a length prefix, so allocate one with SysAllocString(L"Unicode") rather than pointing it at a literal.

In addition, VC provides functions such as WideCharToMultiByte and MultiByteToWideChar, along with some macros, to support conversion. See MSDN for details.

4. Compiler settings

First, add _UNICODE to the Preprocessor definitions on the Project -> Settings -> C/C++ property page. Then, on the Link property page, select Output in Category and enter wWinMainCRTStartup as the Entry-point symbol. With that, our Unicode project is complete.

Sender: olddog (Wang Wangwang), Board: C++
Title: Re: Unicode Programming
Posting site: Ethereal cloud (Thu Apr 25 21:44:26 2002), Email Forwarding

To add: character sets include Unicode, ASCII, MBCS, etc.
Unicode is an extension of ASCII that encodes each character in 16 bits.

MBCS is an alternative to Unicode in which a character is represented by one or two bytes. In the two-byte case the first byte is a lead byte, which signals that it and the following byte together represent one character. Which byte values count as lead bytes depends on the character set (code page): for example, one range n1-n2 applies under the Japanese code page and another range n3-n4 under the Chinese one.

If the program is going to be released internationally, use MBCS or Unicode, or write the program so that it can be compiled in either mode. DBCS is the most common case of MBCS.

When using wide or multibyte characters, pay attention to the following:
1. File names
2. Character operations (delete, the right arrow key moving by one character, ...)
3. String length
4. Program entry functions

The CRT and MFC support single-byte, MBCS, and Unicode strings. String processing functions generally come in the following versions:

    str...      single-byte
    _mbs...     MBCS
    wcs...      Unicode

MFC class member functions are generally portable across the three character types.

The _tcs prefix in tchar.h unifies the three families of string functions: each _tcs... name maps to the matching version according to the macro switches defined at compile time, so you can choose the build mode as needed. tchar.h also defines _TCHAR, which is wchar_t when compiled as Unicode and char when compiled as single-byte or MBCS. In general, we use the _tcs... functions to operate on _TCHAR.

Differences between Unicode and MBCS:

                Platforms                               Character size
    Unicode     Win NT and Win 2K only (not Win 95)     always 16 bits
    MBCS        any Win32 platform                      1 or 2 bytes
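
To make the string-length caveat for MBCS concrete, here is a minimal portable sketch that counts characters rather than bytes in a double-byte string. It is only an illustration under stated assumptions: is_lead_byte and dbcs_char_count are hypothetical helpers, not CRT or Win32 functions, and the lead-byte range 0x81-0xFE is assumed from the GBK (code page 936) encoding. On Windows you would normally rely on _mbslen or IsDBCSLeadByte instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: true if b is a lead byte, assuming the
// GBK (code page 936) rule that lead bytes are 0x81..0xFE.
bool is_lead_byte(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// Hypothetical helper: count characters (not bytes) in a
// NUL-terminated DBCS string.
std::size_t dbcs_char_count(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte plus the byte after it form one character;
        // a dangling lead byte at the end is counted on its own.
        if (is_lead_byte(b) && s[1] != '\0') {
            s += 2;
        } else {
            s += 1;
        }
        ++count;
    }
    return count;
}

// Two Chinese characters followed by "abc", as GBK bytes
// (D6 D0 and CE C4 are the two double-byte characters).
const char kGbkSample[] = "\xD6\xD0\xCE\xC4" "abc";

// The same text as a wide string (U+4E2D, U+6587, then "abc").
const wchar_t kWideSample[] = L"\u4e2d\u6587" L"abc";
```

For the sample text, std::strlen(kGbkSample) reports 7 bytes, while dbcs_char_count(kGbkSample) and std::wcslen(kWideSample) both report 5 characters. This is why buffer sizes, cursor movement, and deletion must step by character rather than by byte in an MBCS build.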