C++ UTF-8 encoding conversion: CChineseCode

Source: Internet
Author: User
Tags scalar

One, background knowledge
1, character: A character is the smallest abstract unit of text. It has no fixed shape (the shape is a glyph) and no numeric value. "A" is a character, and so is "€" (the euro sign, the currency symbol of Germany, France, and many other European countries). "中国" (China) is two Chinese characters. A character stands only for a symbol; it carries no numeric value of its own.
2, character set: A character set is a collection of characters. For example, the Chinese characters originated in China and are used in Chinese, Japanese, Korean, and Vietnamese writing. This also shows how the two terms relate: characters make up a character set (ISO-8859-1, GB2312/GBK, Unicode).
3, code point: Each character in a character set is assigned a "code point". Each code point has a unique numeric value, called its scalar value, which is usually written in hexadecimal.
4, code unit: In each encoding form, a code point is mapped to one or more code units. A code unit is the basic unit of an encoding form, and its size is the number of bits that form works in:
UTF-8: the code unit is 8 bits. Because the unit is small, most code points need several of them: a code point maps to one, two, three, or four code units;
UTF-16: the code unit is 16 bits, twice the size of an 8-bit unit. A code point with a scalar value below U+10000 is encoded as a single code unit, and any higher code point takes two (a surrogate pair);
UTF-32: the code unit is 32 bits, large enough to encode every code point as a single code unit;
GB18030: the code unit is 8 bits. As with UTF-8, the small unit means most code points need several: a code point maps to one, two, or four code units.
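The mapping from code points to code units described above can be sketched for the three UTF forms. This is a minimal illustration; the names `CodeUnitCounts` and `countCodeUnits` are made up for this example and do not come from the article.

```cpp
#include <cassert>
#include <cstdint>

// How many code units one Unicode scalar value needs in each encoding form.
struct CodeUnitCounts {
    int utf8;   // 8-bit code units
    int utf16;  // 16-bit code units
    int utf32;  // 32-bit code units
};

CodeUnitCounts countCodeUnits(std::uint32_t cp) {
    // Scalar values stop at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    CodeUnitCounts n;
    n.utf8  = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    n.utf16 = cp < 0x10000 ? 1 : 2;  // U+10000 and above need a surrogate pair
    n.utf32 = 1;                     // always a single 32-bit unit
    return n;
}
```

For instance, `countCodeUnits(0x20AC)` (the euro sign) reports three UTF-8 code units but only one in UTF-16 and UTF-32.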
5, an example:
"中国北京香蕉是个大笨蛋" ("China, Beijing, the banana is a big idiot"): these characters make up a character set that I define and call AKA. The code point of each character is:
北 (north) 00000001
京 (capital) 00000010
香 (fragrant) 10000001
蕉 (banana) 10000010
是 (is) 10000100
个 (a) 10001000
大 (big) 10010000
笨 (stupid) 10100000
蛋 (egg) 11000000
中 (middle) 00000100
国 (country) 00001000
Here is an 8-bit encoding scheme that I define and call Zixia. It assigns a code unit to every character of the AKA character set:
北 (north) 10000001
京 (capital) 10000010
香 (fragrant) 00000001
蕉 (banana) 00000010
是 (is) 00000100
个 (a) 00001000
大 (big) 00010000
笨 (stupid) 00100000
蛋 (egg) 01000000
中 (middle) 10000100
国 (country) 10001000
A so-called text file is binary data that we interpret as text under some encoding. Take a file whose content is 00000001 00000010 00000100 00001000 00010000 00100000 01000000. Opened in a notepad that supports the Zixia encoding and the AKA character set, it displays as "香蕉是个大笨蛋" ("the banana is a big idiot") according to that encoding scheme.
If I save the same characters as GBK instead, the bytes are entirely different:
1100111111100011 1011110110110110 1100101011000111 1011100011110110 1011010011110011 1011000110111111 1011010110110000 110100001010
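Going back to the two AKA/Zixia tables above: comparing them row by row, every Zixia code unit is the AKA code point with its highest bit flipped. Under that observation, the toy encoding can be sketched as below. The function names are invented for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Toy "Zixia" encoding: the code unit is the AKA code point with bit 7 flipped.
std::uint8_t zixiaEncode(std::uint8_t akaCodePoint) {
    return static_cast<std::uint8_t>(akaCodePoint ^ 0x80);
}

// Flipping the same bit again recovers the code point.
std::uint8_t zixiaDecode(std::uint8_t codeUnit) {
    return static_cast<std::uint8_t>(codeUnit ^ 0x80);
}

// Decodes a whole "text file" (a byte sequence) into AKA code points.
std::vector<std::uint8_t> zixiaDecodeAll(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint8_t> codePoints;
    codePoints.reserve(bytes.size());
    for (std::uint8_t b : bytes) {
        codePoints.push_back(zixiaDecode(b));
    }
    return codePoints;
}
```

Decoding the seven-byte file from the example yields the code points 10000001, 10000010, 10000100, 10001000, 10010000, 10100000, 11000000, which are then looked up in the AKA table to get the displayed characters.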
Two, character sets
1, common character sets
ASCII and its extended character sets
Role: represents English and Western European languages.
Width: ASCII uses 7 bits and can represent 128 characters; its extensions use 8 bits and can represent 256 characters.
Range: ASCII runs from 00 to 7F, the extensions from 00 to FF.
ISO-8859-1 character set
Role: extends ASCII to cover Western European languages.
Width: 8 bits.
Range: from 00 to FF, compatible with the ASCII character set.
GB2312 character set
Role: the national standard character set for Simplified Chinese, compatible with ASCII.
Width: 2 bytes per character; it defines 7,445 symbols, including 6,763 Chinese characters, which covers almost all high-frequency Chinese characters.
Range: high byte from A1 to F7, low byte from A1 to FE. Each byte is the character's row number or cell number in the GB2312 code table plus 0xA0.
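The 0xA0 offset can be sketched as follows, assuming the usual GB2312 code table layout in which a character is identified by a row number and a cell number within the row. The function name `gb2312FromRowCell` is hypothetical, and the bounds simply mirror the byte ranges given above.

```cpp
#include <cassert>
#include <utility>

// Builds the two GB2312 bytes from a row and cell number by adding 0xA0 to each.
// Rows 1..87 give high bytes A1..F7; cells 1..94 give low bytes A1..FE.
std::pair<unsigned char, unsigned char> gb2312FromRowCell(int row, int cell) {
    assert(row >= 1 && row <= 87 && cell >= 1 && cell <= 94);
    return {static_cast<unsigned char>(0xA0 + row),
            static_cast<unsigned char>(0xA0 + cell)};
}
```

For example, row 16, cell 1 (the first Chinese character in the table, "啊") becomes the byte pair B0 A1.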
Big5 character set
Role: the unified encoding for Traditional Chinese characters.
Width: 2 bytes per character; it represents 13,053 Chinese characters.
Range: high byte from A1 to F9, low byte from 40 to 7E and from A1 to FE.
GBK character set
Role: an extension of GB2312 that adds Traditional Chinese characters and stays compatible with GB2312.
Width: 2 bytes per character; it represents 21,886 characters.
Range: high byte from 81 to FE, low byte from 40 to FE (excluding 7F).
GB18030 character set
Role: covers Chinese, Japanese, Korean, and other scripts, and is compatible with GBK.
Width: variable, 1 byte (ASCII), 2 bytes, or 4 bytes per character; it can represent 27,484 Chinese characters.
Range: 1 byte: from 00 to 7F; 2 bytes: high byte from 81 to FE, low byte from 40 to 7E and from 80 to FE; 4 bytes: first and third bytes from 81 to FE, second and fourth bytes from 30 to 39.
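The byte ranges above are enough to tell how long a GB18030 sequence is from its first two bytes. The following is a minimal sketch of that check; `gb18030SequenceLength` is an illustrative name, and the function only validates byte ranges, not whether the code is actually assigned.

```cpp
#include <cstddef>

// Returns 1, 2, or 4 for a sequence whose bytes fall in the GB18030 ranges,
// or 0 if the bytes at p (n of them available) are out of range or truncated.
int gb18030SequenceLength(const unsigned char* p, std::size_t n) {
    if (n == 0) return 0;
    if (p[0] <= 0x7F) return 1;                         // single byte (ASCII)
    if (p[0] < 0x81 || p[0] > 0xFE || n < 2) return 0;  // bad lead byte or truncated
    if ((p[1] >= 0x40 && p[1] <= 0x7E) || (p[1] >= 0x80 && p[1] <= 0xFE)) {
        return 2;                                       // two-byte form (GBK range)
    }
    if (p[1] >= 0x30 && p[1] <= 0x39) {                 // second byte is a digit: four-byte form
        if (n >= 4 && p[2] >= 0x81 && p[2] <= 0xFE && p[3] >= 0x30 && p[3] <= 0x39) {
            return 4;
        }
    }
    return 0;
}
```

The second byte decides the case: a value from 30 to 39 can never be the low byte of a two-byte character, so it unambiguously announces a four-byte sequence.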
UCS character set
Role: the international standard ISO 10646 defines the Universal Character Set (UCS). ISO and the Unicode Consortium keep their standards aligned, so UCS and Unicode are compatible with each other.
Width: it has two formats, UCS-2 and UCS-4, of 2 bytes and 4 bytes respectively.
Range: at present, the UCS-4 code of a character that also exists in UCS-2 is simply its UCS-2 code with 0x0000 prepended.
Unicode character set
Role: a unified encoding for the world's languages (about 650 of them); its first 256 code points match ISO-8859-1.
Width: the Unicode character set can be encoded in several ways: UTF-8, UTF-16, and UTF-32.
2, classification by language
Language                        Character set         Formal name
English, Western European       ASCII, ISO-8859-1     MBCS (multibyte)
Simplified Chinese              GB2312                MBCS (multibyte)
Traditional Chinese             Big5                  MBCS (multibyte)
Simplified Chinese              GBK                   MBCS (multibyte)
Chinese, Japanese, and Korean   GB18030               MBCS (multibyte)
All languages                   Unicode, UCS          Wide character
Three, encodings
UTF-8: uses a variable number of bytes (1 for ASCII, 2 for letters such as Greek, 3 for Chinese characters, 4 for supplementary-plane symbols). In network transmission, one corrupted byte does not affect the characters around it, whereas in a double-byte encoding one wrong byte throws off everything after it. The rules are as follows:
A single-byte character has its highest bit set to 0. In a multibyte character, the first byte starts with as many consecutive 1 bits as there are bytes in the sequence, followed by a 0, and every remaining byte begins with 10. The original UTF-8 design allowed up to 6 bytes; because Unicode stops at U+10FFFF, valid UTF-8 today uses at most 4.
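The lead-byte and continuation-byte rules above can be turned into a small encoder. This is a sketch rather than a reference implementation; the name `encodeUtf8` is chosen for this example, and returning an empty string for invalid input is an arbitrary convention of the sketch. It is limited to the 4-byte forms needed for code points up to U+10FFFF.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because every continuation byte starts with 10 and every lead byte does not, a decoder that hits a damaged byte can skip forward to the next lead byte, which is the resynchronization property mentioned above.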
UTF-16: uses 2 bytes for each of the characters listed here (a code point above U+FFFF takes two 2-byte units). Unicode laid out its blocks to follow existing standards, which makes conversion easy. 0x0000 to 0x007F are the ASCII characters, and 0x0080 to 0x00FF are the ISO-8859-1 extension of ASCII. Greek uses 0x0370 to 0x03FF, Cyrillic 0x0400 to 0x04FF, Armenian 0x0530 to 0x058F, and Hebrew 0x0590 to 0x05FF. Chinese, Japanese, and Korean ideographs (collectively called CJK) occupy 0x3000 to 0x9FFF. Because a 0x00 byte has a special meaning in C strings and in operating system file names, text often has to be saved as UTF-8 instead, which contains no 0x00 bytes (except for U+0000 itself). For example:
UTF-16: 0x0080 = 0000 0000 1000 0000
UTF-8: 0xC280 = 1100 0010 1000 0000
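The UTF-16 side of this comparison can be sketched as a small encoder. The name `encodeUtf16` is invented for this example. Code points above U+FFFF, which lie beyond the ranges listed here, are written as two code units (a surrogate pair), and returning an empty string for invalid input is just a convention of this sketch.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode scalar value as UTF-16 code units.
// Returns an empty string for surrogates and values above U+10FFFF.
std::u16string encodeUtf16(std::uint32_t cp) {
    std::u16string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // reserved for surrogate pairs
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);          // one code unit, same value
    } else if (cp <= 0x10FFFF) {
        cp -= 0x10000;                             // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate: top 10 bits
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate: bottom 10 bits
    }
    return out;
}
```

For U+0080 the result is the single unit 0x0080, whose high byte is 0x00; that zero byte is exactly what makes UTF-16 text awkward for C string functions.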
UTF-32: uses a fixed 4 bytes for every character.
Advantages and disadvantages
UTF-8, UTF-16, and UTF-32 can all represent every character in the valid Unicode code space (U+0000 to U+10FFFF).
In UTF-8, an ASCII character takes only 1 byte, so storage is efficient; it saves space for text that is mostly Latin characters.
For most non-Latin characters (such as Chinese and Japanese), UTF-16 needs the least storage space, only 2 bytes per character where UTF-8 needs 3.
The Windows NT kernel works in Unicode (UTF-16), so calling the system API with UTF-16 strings needs no conversion and is faster.
UTF-16 and UTF-32 come in big-endian and little-endian variants, whereas UTF-8 has no byte order problem, which makes UTF-8 well suited to transmission and communication.
UTF-32 uses 4 bytes for every character. Processing is relatively fast, but it wastes a lot of space and slows transmission, so it is rarely used.
Four, how to determine the character set
1, byte order
Byte order affects encoding. There are two byte orders, big endian and little endian, and different processors may use different ones, so the receiver must be told the byte order of the encoded data when it is transmitted. In big-endian order the high byte is stored at the low address and the low byte at the high address; little-endian order is the opposite. For example, 0x03AB is stored as follows.
Big-endian byte order:
0000: 03
0001: AB
Little-endian byte order:
0000: AB
0001: 03
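The two layouts of 0x03AB can be reproduced in code. In this sketch the helper names are illustrative; `hostIsLittleEndian` inspects how the machine running the program stores a 16-bit value.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Serializes a 16-bit value with the high byte first (big endian).
std::array<std::uint8_t, 2> toBigEndian(std::uint16_t v) {
    return {{static_cast<std::uint8_t>(v >> 8), static_cast<std::uint8_t>(v & 0xFF)}};
}

// Serializes a 16-bit value with the low byte first (little endian).
std::array<std::uint8_t, 2> toLittleEndian(std::uint16_t v) {
    return {{static_cast<std::uint8_t>(v & 0xFF), static_cast<std::uint8_t>(v >> 8)}};
}

// Reports the byte order of the host by looking at the first byte in memory.
bool hostIsLittleEndian() {
    const std::uint16_t probe = 0x03AB;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0xAB;
}
```

Writing bytes out explicitly with shifts, as the first two functions do, gives the same result on every processor, which is why serialization code usually avoids copying the in-memory representation directly.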
2, encoding detection
For Unicode, the encoding in use can be determined from the first few bytes of the data, which is known as the byte order mark (BOM) method:
UTF-8: EF BB BF (a well-formed UTF-8 sequence, see above; it encodes U+FEFF, which has no visible meaning as a character)
UTF-16 big endian: FE FF (no meaning in UCS-2)
UTF-16 little endian: FF FE (no meaning in UCS-2)
UTF-32 big endian: 00 00 FE FF (no meaning in UCS-4)
UTF-32 little endian: FF FE 00 00 (no meaning in UCS-4)
GB2312: the highest bit of both the high byte and the low byte is 1.
Big5, GBK, and GB18030: the highest bit of the high byte is 1. The operating system has a default encoding, often GBK, and support for others can be downloaded and installed.

So checking the highest bit of the high byte tells you whether a byte is ASCII or the start of a Chinese character code.
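The BOM checks listed above can be sketched as a single function. The enum and function names are invented for this illustration. The order of the tests matters: the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so the 4-byte patterns are tried first.

```cpp
#include <cstddef>

enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Identifies a byte order mark at the start of a buffer of n bytes.
// Note: FF FE 00 00 is reported as UTF-32 LE, although it could also be
// a UTF-16 LE BOM followed by U+0000; a BOM alone cannot settle that case.
Bom detectBom(const unsigned char* p, std::size_t n) {
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return Bom::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return Bom::Utf32LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Bom::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Bom::Utf16LE;
    return Bom::None;  // no BOM: fall back to heuristics such as the high-bit test
}
```

A result of `Bom::None` does not mean the data is not Unicode, since a BOM is optional; it only means the encoding has to be guessed some other way.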

// Class declaration (Windows only: the GB2312 conversions use the Win32 API)
#include <windows.h>
#include <string>

class CChineseCode
{
public:
    static void UTF_8ToUnicode(wchar_t* pOut, const char* pText);               // UTF-8 to Unicode
    static void UnicodeToUTF_8(char* pOut, const wchar_t* pText);               // Unicode to UTF-8
    static void UnicodeToGB2312(char* pOut, wchar_t uData);                     // Unicode to GB2312
    static void Gb2312ToUnicode(wchar_t* pOut, const char* gbBuffer);           // GB2312 to Unicode
    static void GB2312ToUTF_8(std::string& pOut, const char* pText, int pLen);  // GB2312 to UTF-8
    static void UTF_8ToGB2312(std::string& pOut, const char* pText, int pLen);  // UTF-8 to GB2312
};

// Class implementation

// Code page 936 is GBK, a superset of GB2312. Naming it explicitly avoids
// depending on the system default code page (CP_ACP).
static const UINT kCodePageGbk = 936;

// Decodes one 3-byte UTF-8 sequence (U+0800 to U+FFFF) into one UTF-16 code unit.
void CChineseCode::UTF_8ToUnicode(wchar_t* pOut, const char* pText)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(pText);
    *pOut = static_cast<wchar_t>(((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F));
}

// Encodes one UTF-16 code unit as a 3-byte UTF-8 sequence (no terminator is written).
// Only correct for U+0800 to U+FFFF; smaller values need the 1- or 2-byte form.
void CChineseCode::UnicodeToUTF_8(char* pOut, const wchar_t* pText)
{
    const unsigned int c = *pText;
    pOut[0] = static_cast<char>(0xE0 | ((c >> 12) & 0x0F));
    pOut[1] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    pOut[2] = static_cast<char>(0x80 | (c & 0x3F));
}

// Converts one UTF-16 code unit to a 2-byte GB2312 character.
void CChineseCode::UnicodeToGB2312(char* pOut, wchar_t uData)
{
    WideCharToMultiByte(kCodePageGbk, 0, &uData, 1, pOut, 2, NULL, NULL);
}

// Converts one 2-byte GB2312 character to a UTF-16 code unit.
void CChineseCode::Gb2312ToUnicode(wchar_t* pOut, const char* gbBuffer)
{
    ::MultiByteToWideChar(kCodePageGbk, MB_PRECOMPOSED, gbBuffer, 2, pOut, 1);
}

void CChineseCode::GB2312ToUTF_8(std::string& pOut, const char* pText, int pLen)
{
    pOut.clear();
    int i = 0;
    while (i < pLen)
    {
        if (static_cast<unsigned char>(pText[i]) < 0x80)
        {
            pOut += pText[i++];  // ASCII is copied unchanged
        }
        else if (i + 1 < pLen)
        {
            wchar_t wide = 0;
            char buf[3];
            Gb2312ToUnicode(&wide, pText + i);
            UnicodeToUTF_8(buf, &wide);
            pOut.append(buf, 3);
            i += 2;
        }
        else
        {
            break;  // truncated double-byte character at the end of the input
        }
    }
}

void CChineseCode::UTF_8ToGB2312(std::string& pOut, const char* pText, int pLen)
{
    pOut.clear();
    int i = 0;
    while (i < pLen)
    {
        if (static_cast<unsigned char>(pText[i]) < 0x80)
        {
            pOut += pText[i++];  // ASCII is copied unchanged
        }
        else if (i + 2 < pLen)
        {
            // Every non-ASCII character is assumed to be a 3-byte sequence.
            wchar_t wide = 0;
            char buf[2] = {0, 0};
            UTF_8ToUnicode(&wide, pText + i);
            UnicodeToGB2312(buf, wide);
            pOut.append(buf, 2);
            i += 3;
        }
        else
        {
            break;  // truncated 3-byte sequence at the end of the input
        }
    }
}
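The conversion functions above treat every non-ASCII UTF-8 character as exactly three bytes long, which holds for Chinese characters but not for 2-byte or 4-byte sequences. A more general decoding step might look like the sketch below. It is portable C++ and not part of the CChineseCode class; the name `decodeUtf8` and its return convention (0 on error) are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one UTF-8 sequence starting at s, with n bytes available.
// On success stores the scalar value in cp and returns the sequence length (1 to 4).
// Returns 0 for malformed, overlong, or truncated input.
std::size_t decodeUtf8(const unsigned char* s, std::size_t n, std::uint32_t& cp) {
    if (n == 0) return 0;
    std::size_t len = 0;
    std::uint32_t minValue = 0;  // smallest value that legitimately needs len bytes
    if (s[0] < 0x80) { cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; minValue = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; minValue = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; minValue = 0x10000; }
    else return 0;               // stray continuation byte or invalid lead byte
    if (n < len) return 0;       // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  // continuation bytes must be 10xxxxxx
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}
```

A conversion loop can then advance by the returned length instead of a fixed 3, and skip or replace input for which the function returns 0.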




