Multi-Byte Character Set and Unicode Character Set

Last Update:2014-10-05 Source: Internet

Author: User


In a computer, a character is generally not stored as an image. Each character is represented by a numeric code, and the code assigned to a character depends on the character set in use.

Multi-Byte Character Set:

At the beginning, there was only one character set on the Internet: the ANSI ASCII character set. It uses 7 bits per character and represents 128 characters in total, including English letters, digits, and punctuation marks. It was later extended to 8 bits per character, giving 256 characters, with some special characters added to the original 7-bit set. As more languages needed support, ASCII could no longer meet the needs of information exchange.

To represent their own scripts, various countries developed character sets based on ASCII. These ANSI-derived character sets are commonly called ANSI character sets, and their formal name is MBCS (Multi-Byte Character Set). They remain compatible with 7-bit ASCII (codes 0 to 127) and use a byte value of 0x80 or above as a lead byte. The lead byte and the second (or sometimes third) byte that follows it together form the actual code of one character. There are many such character sets; GB-2312 is one of them.
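The lead-byte rule can be sketched in a few lines of C++. This is a minimal illustration, assuming a strictly double-byte encoding in the style of GB-2312 in which every byte of 0x80 or above starts a two-byte character; the function name `count_mbcs_chars` is hypothetical, and real code pages define narrower lead-byte and trail-byte ranges.

```cpp
#include <cstddef>
#include <string>

// Counts characters in a GB-2312-style byte string (assumed layout):
// a byte below 0x80 is one ASCII character, and a byte of 0x80 or above
// is treated as a lead byte that also consumes the following trail byte.
std::size_t count_mbcs_chars(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++count) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        // Step over one byte for ASCII, two for a lead/trail pair.
        i += (b < 0x80) ? 1 : 2;
    }
    return count;
}
```

For the bytes `41 D6 D0 42` (the letter A, one two-byte Chinese character, the letter B) the function counts three characters, whereas `strlen()` would report four bytes.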

Unicode Character Set:

The formal name associated with Unicode is "Universal Multiple-Octet Coded Character Set" (the name used by the ISO/IEC 10646 standard), abbreviated UCS. UCS can also be read informally as the abbreviation of "Unicode Character Set". UCS only specifies which code is assigned to each character; it does not specify how those codes are transmitted or stored. UTF is the abbreviation of "UCS Transformation Format".

The Unicode character set can be encoded in several formats. It originally used 16 bits (two bytes) to represent a single character, which allows 65,536 characters and covers the common characters of almost all languages in the world, facilitating information exchange. This two-byte form is called UTF-16. Unicode has since grown beyond 65,536 code points, and UTF-16 represents the additional ones with a pair of 16-bit units (a surrogate pair). Later, so that double-byte Unicode could be transmitted correctly through existing systems that process single bytes, UTF-8 was introduced (note that UTF-8 is an encoding of the Unicode character set, not a separate character set). It encodes Unicode in a way similar to MBCS. UTF-8 is encoded byte by byte, so it has no byte-order issue. UTF-16 is encoded in two-byte units, so byte order matters.
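To make the relationship between a Unicode code and its UTF-8 bytes concrete, here is a minimal sketch of a UTF-8 encoder. The function name `encode_utf8` is an assumption for illustration, and the sketch does not reject surrogate code points or values above U+10FFFF.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// Validation of surrogates and out-of-range values is omitted.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx (same as ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

ASCII characters keep their single-byte codes, and every byte of a longer sequence has its high bit set, which is why the scheme resembles the lead-byte approach of MBCS.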

UTF-16 comes in three forms: UTF-16, UTF-16BE (big endian), and UTF-16LE (little endian). Plain UTF-16 needs a marker at the start of the file, called the BOM (byte order mark), to indicate whether the file is big endian or little endian. The BOM is the method the Unicode specification recommends for marking byte order. UCS contains a character named "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE is not a character in UCS, so it should not appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then sees the bytes FE FF, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. For this reason, the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
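The effect of byte order on the BOM can be shown with a short sketch that serializes 16-bit code units in either order. The function name `utf16_with_bom` and its `big_endian` parameter are assumptions made for this illustration.

```cpp
#include <string>

// Serializes UTF-16 code units to bytes in the requested byte order,
// prefixed with U+FEFF so that a receiver can detect that order.
std::string utf16_with_bom(const std::u16string& text, bool big_endian) {
    std::u16string units(1, static_cast<char16_t>(0xFEFF));
    units += text;

    std::string bytes;
    for (char16_t u : units) {
        char hi = static_cast<char>(u >> 8);    // most significant byte
        char lo = static_cast<char>(u & 0xFF);  // least significant byte
        if (big_endian) {
            bytes += hi;
            bytes += lo;
        } else {
            bytes += lo;
            bytes += hi;
        }
    }
    return bytes;
}
```

For the text "A" the big-endian output is the bytes `FE FF 00 41`, and the little-endian output is `FF FE 41 00`.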

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this by applying the UTF-8 encoding rules to FEFF). So if the receiver gets a byte stream that starts with EF BB BF, it knows that the stream is UTF-8 encoded.
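On the receiving side, the check described above amounts to comparing the first few bytes against the known marks. The following is a minimal sketch; the function name `detect_bom` and the returned labels are assumptions, and UTF-32 marks are deliberately not handled (a UTF-32LE stream also begins with FF FE).

```cpp
#include <string>

// Returns a label for the byte order mark at the start of `bytes`,
// or "none" when no known mark is present.
std::string detect_bom(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "UTF-8";
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return "UTF-16BE";
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return "UTF-16LE";
    return "none";
}
```

A stream without a mark returns "none", in which case the encoding has to be determined by other means, such as metadata or a default.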

Windows uses the BOM to mark the encoding of text files.

The prefix L marks a character or string literal as a wide character (string). In Visual Studio 2005 and later, a project can be built with either of two character sets, multi-byte or Unicode. In Unicode mode, you need to add L before character and string constants to tell the compiler that they are wide. Microsoft defines several related macros for this: _T (defined in tchar.h) and _TEXT (also defined in tchar.h).

TEXT(): if UNICODE is defined, the literal is compiled as a Unicode string; otherwise it is compiled as an ANSI string. Using the Unicode character set in VS2010:

// MessageBox("test");          // error
// MessageBox(_T("test"));
// MessageBox(TEXT("test"));
MessageBox(_TEXT("test"));
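The way such macros switch between narrow and wide literals can be sketched with the preprocessor alone. The names `MY_UNICODE`, `my_tchar`, `MY_T`, and `greeting` below are hypothetical stand-ins, not the actual definitions in tchar.h, which are more elaborate.

```cpp
// Hypothetical stand-ins for UNICODE, TCHAR and _T (not the real tchar.h).
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(x) L##x   // paste the L prefix onto the literal
#else
typedef char my_tchar;
#define MY_T(x) x      // leave the literal narrow
#endif

// The same source line yields a narrow or a wide string,
// depending on whether MY_UNICODE is defined for the build.
const my_tchar* greeting() {
    return MY_T("test");
}
```

Built without `MY_UNICODE`, `greeting()` returns a `const char*`; built with `-DMY_UNICODE`, the same line returns a `const wchar_t*`, so calling code does not need to change.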

Why Unicode? (Refer to Windows Core Programming)
We strongly recommend that you use Unicode characters and strings when developing applications for the following reasons:

Unicode makes localization easier;
With Unicode, you only need to ship a single binary (.exe or DLL) to support all languages;
Unicode code runs faster and uses less memory, improving application efficiency. Since Windows 2000, the Windows kernel has worked entirely in Unicode, and the ANSI versions of API functions convert their strings to Unicode before passing them to the underlying layer. Using Unicode from the start therefore avoids the conversion time and memory overhead;
With Unicode, your application can easily call all non-deprecated Windows functions, because some Windows functions are provided only in versions that process Unicode characters and strings;
With Unicode, your code integrates easily with COM (which requires Unicode characters and strings);
With Unicode, your code integrates easily with the .NET Framework (which requires Unicode characters and strings);
Unicode ensures that your code can easily manipulate your own resources (whose strings are always Unicode);
Most programs in the world use Unicode, because it favors internationalization and standardization.

　　Conversion between wchar_t and char:

#include <iostream>
#include <cstring>
#include <windows.h>
using namespace std;

class CUser
{
public:
    CUser();
    virtual ~CUser();
    char* WcharToChar(const wchar_t* wc);   // wide characters to multi-byte
    wchar_t* CharToWchar(const char* c);    // multi-byte to wide characters
    void Release();                         // release resources
private:
    char* m_char;
    wchar_t* m_wchar;
};

////////////////////////////////////////////////////////////
/* Character type          wchar_t     char
 * Get string length       wcslen()    strlen()
 * Concatenate strings     wcscat()    strcat()
 * Copy a string           wcscpy()    strcpy()
 * Compare two strings     wcscmp()    strcmp()
 * For specific parameters, see www.linuxidc.com */
////////////////////////////////////////////////////////////

CUser::CUser() : m_char(NULL), m_wchar(NULL) {}

CUser::~CUser() { Release(); }

// Convert wide characters to multi-byte (ANSI code page)
char* CUser::WcharToChar(const wchar_t* wc)
{
    Release();
    int srcLen = (int)wcslen(wc);
    // First call: ask how many bytes the converted string needs
    int len = WideCharToMultiByte(CP_ACP, 0, wc, srcLen, NULL, 0, NULL, NULL);
    m_char = new char[len + 1];
    WideCharToMultiByte(CP_ACP, 0, wc, srcLen, m_char, len, NULL, NULL);
    m_char[len] = '\0';
    return m_char;
}

// Convert multi-byte (ANSI code page) to wide characters
wchar_t* CUser::CharToWchar(const char* c)
{
    Release();
    int srcLen = (int)strlen(c);
    // First call: ask how many wide characters the converted string needs
    int len = MultiByteToWideChar(CP_ACP, 0, c, srcLen, NULL, 0);
    m_wchar = new wchar_t[len + 1];
    MultiByteToWideChar(CP_ACP, 0, c, srcLen, m_wchar, len);
    m_wchar[len] = L'\0';
    return m_wchar;
}

// Release resources (buffers come from new[], so use delete[])
void CUser::Release()
{
    delete[] m_char;
    m_char = NULL;
    delete[] m_wchar;
    m_wchar = NULL;
}

Use:

wchar_t wc[] = L"test";
CUser u;
char* c = u.WcharToChar(wc);   // the buffer is owned by u and freed in its destructor
cout << c << endl;
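Outside Windows, a comparable two-pass conversion can be sketched with the C++ standard library. This is an alternative to the Win32 calls above, not a replacement for them: `std::wcsrtombs` and `std::mbsrtowcs` convert according to the current C locale rather than a Windows code page, and the function names `narrow` and `widen` are hypothetical. Returning `std::string` and `std::wstring` avoids manual buffer management.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Wide string to multi-byte string, using the current C locale.
// Returns an empty string if a character cannot be converted.
std::string narrow(const wchar_t* wide) {
    std::mbstate_t state{};
    const wchar_t* src = wide;
    // First pass: a null destination returns the required byte count.
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) return {};
    std::string out(len, '\0');
    state = std::mbstate_t{};
    src = wide;
    std::wcsrtombs(&out[0], &src, len, &state);  // second pass: convert
    return out;
}

// Multi-byte string to wide string, using the current C locale.
std::wstring widen(const char* narrow_text) {
    std::mbstate_t state{};
    const char* src = narrow_text;
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) return {};
    std::wstring out(len, L'\0');
    state = std::mbstate_t{};
    src = narrow_text;
    std::mbsrtowcs(&out[0], &src, len, &state);
    return out;
}
```

In the default "C" locale only ASCII text is guaranteed to convert. For other text, the program would first select a suitable locale with `std::setlocale`.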


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
