One, Background knowledge
1, Character: A character is the smallest abstract unit of text. It has no fixed shape (a particular shape is a glyph) and no numeric value. "A" is a character, and "€" (the sign of the euro, the common currency of Germany, France, and many other European countries) is also a character. "中国" (China) is two Chinese characters. A character stands for a symbol only; by itself it carries no numeric value.
2, Character set: A character set is a collection of characters. For example, Chinese characters were first invented by the Chinese and are used in Chinese, Japanese, Korean, and Vietnamese writing. This illustrates the relationship between the two: characters make up a character set (for example ISO-8859-1, GB2312/GBK, Unicode).
3, Code point: Each character in a character set is assigned a "code point". Each code point has a unique numeric value, called its scalar value, which is usually written in hexadecimal.
4, Code unit: In each encoding form, a code point is mapped to one or more code units. A "code unit" is the basic unit of a given encoding form, and its size is the number of bits that encoding form works in:
UTF-8: The code unit is 8 bits. Because the code unit is small, many code points need several code units; a code point is mapped to one, two, three, or four code units.
UTF-16: The code unit is 16 bits, twice the size of an 8-bit code unit. A code point with a scalar value below U+10000 is therefore encoded as a single code unit; code points from U+10000 upward take two code units (a surrogate pair).
UTF-32: The code unit is 32 bits, which is large enough to encode every code point as a single code unit.
GB18030: The code unit is 8 bits. As in UTF-8, the small code unit means many code points need several code units; a code point is mapped to one, two, or four code units.
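The code-unit counts for the three Unicode encoding forms can be computed directly from the scalar value. The following is a minimal sketch; the function names are made up for illustration, and GB18030 is left out because its mapping needs lookup tables rather than simple range checks.

```cpp
#include <cassert>
#include <cstdint>

// Number of 8-bit code units UTF-8 needs for one code point.
int utf8_code_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);           // Unicode code points end at U+10FFFF
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of 16-bit code units UTF-16 needs for one code point.
int utf16_code_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;      // two units form a surrogate pair
}

// UTF-32 always needs exactly one 32-bit code unit.
int utf32_code_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return 1;
}
```

For instance, U+4E2D (中) takes three code units in UTF-8 but only one in UTF-16, while a code point such as U+1F600 takes four in UTF-8 and two in UTF-16.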
5, Example:
Take "中国北京香蕉是个大笨蛋" ("China, Beijing, the banana is a big idiot") as the AKA character set, which is my own definition. The code point of each character is:
北 (north)     00000001
京 (capital)   00000010
香 (fragrant)  10000001
蕉 (banana)    10000010
是 (is)        10000100
个 (a)         10001000
大 (big)       10010000
笨 (stupid)    10100000
蛋 (egg)       11000000
中 (middle)    00000100
国 (country)   00001000
Here is the Zixia encoding scheme (8-bit), also my own definition. It assigns a code unit to every character in the AKA character set:
北 (north)     10000001
京 (capital)   10000010
香 (fragrant)  00000001
蕉 (banana)    00000010
是 (is)        00000100
个 (a)         00001000
大 (big)       00010000
笨 (stupid)    00100000
蛋 (egg)       01000000
中 (middle)    10000100
国 (country)   10001000
A so-called text file is binary data that is decoded into text according to an encoding. Take a file whose content is 00000001 00000010 00000100 00001000 00010000 00100000 01000000. Opened in a Notepad that supports the Zixia encoding and the AKA character set, it is displayed as "香蕉是个大笨蛋" ("the banana is a big idiot") according to that encoding scheme.
If I save the same characters as GBK instead, the bytes are certainly not these, but:
1100111111100011 1011110110110110 1100101011000111 1011100011110110 1011010011110011 1011000110111111 1011010110110000 110100001010
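Decoding the toy Zixia file amounts to a table lookup from each 8-bit code unit to its AKA code point. The sketch below copies the two tables from the example above as hexadecimal values; `zixia_table` and `decode_zixia` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Zixia code unit -> AKA code point, transcribed from the two tables above.
const std::map<std::uint8_t, std::uint8_t>& zixia_table() {
    static const std::map<std::uint8_t, std::uint8_t> table = {
        {0x81, 0x01}, {0x82, 0x02},                 // 10000001, 10000010
        {0x01, 0x81}, {0x02, 0x82}, {0x04, 0x84},   // 00000001, 00000010, 00000100
        {0x08, 0x88}, {0x10, 0x90}, {0x20, 0xA0},   // 00001000, 00010000, 00100000
        {0x40, 0xC0}, {0x84, 0x04}, {0x88, 0x08},   // 01000000, 10000100, 10001000
    };
    return table;
}

// Decode a Zixia-encoded file into a sequence of AKA code points.
std::vector<std::uint8_t> decode_zixia(const std::vector<std::uint8_t>& file) {
    std::vector<std::uint8_t> code_points;
    for (std::uint8_t unit : file) {
        auto it = zixia_table().find(unit);
        assert(it != zixia_table().end());   // byte is not a valid Zixia code unit
        code_points.push_back(it->second);
    }
    return code_points;
}
```

Feeding it the seven bytes 01 02 04 08 10 20 40 from the example yields the AKA code points 81 82 84 88 90 A0 C0, the last seven characters of the set. The same bytes read under a different table, such as GBK, would produce different characters or none at all.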
Two, Character sets
1, Common character sets
ASCII and its extended character sets
Role: represents English and Western European languages.
Number of bits: ASCII uses 7 bits and can represent 128 characters; its extensions use 8 bits and can represent 256 characters.
Range: ASCII from 00 to 7F; extended ASCII from 00 to FF.
ISO-8859-1 Character Set
Role: extends ASCII to cover Western European languages (Greek has its own part, ISO-8859-7).
Number of bits: 8 bits.
Range: from 00 to FF; compatible with the ASCII character set.
GB2312 Character Set
Role: the national Simplified Chinese character set, compatible with ASCII.
Number of bits: 2 bytes; it can represent 7,445 symbols, including 6,763 Chinese characters, which covers almost all high-frequency Chinese characters.
Range: high byte from A1 to F7, low byte from A1 to FE. The high and low bytes are obtained by adding 0xA0 to the character's row number and cell number, respectively.
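The "add 0xA0" rule can be written out in a few lines. GB2312 arranges its characters in a grid of rows and cells, each numbered from 1; the sketch below assumes that layout and uses a hypothetical helper name, `gb2312_bytes`. The row limit of 87 follows from the high-byte range A1 to F7 given above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Convert a GB2312 row/cell position to its two-byte code.
// Rows 1..87 give high bytes A1..F7; cells 1..94 give low bytes A1..FE.
std::array<std::uint8_t, 2> gb2312_bytes(int row, int cell) {
    assert(row >= 1 && row <= 87);
    assert(cell >= 1 && cell <= 94);
    return {{ static_cast<std::uint8_t>(row + 0xA0),
              static_cast<std::uint8_t>(cell + 0xA0) }};
}
```

For example, the character 啊 sits at row 16, cell 1, so its GB2312 code is B0 A1. Because both bytes are at least A1, neither can be confused with a 7-bit ASCII byte.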
BIG5 Character Set
Role: the common encoding for Traditional Chinese characters.
Number of bits: 2 bytes; it can represent 13,053 Chinese characters.
Range: high byte from A1 to F9; low byte from 40 to 7E and from A1 to FE.
GBK Character Set
Role: an extension of GB2312 that adds support for Traditional characters; compatible with GB2312.
Number of bits: 2 bytes; it can represent 21,886 characters.
Range: high byte from 81 to FE, low byte from 40 to FE (excluding 7F).
GB18030 Character Set
Role: covers the encoding of Chinese, Japanese, Korean, and other scripts; compatible with GBK.
Number of bits: variable length: 1 byte (ASCII), 2 bytes, or 4 bytes. It can represent 27,484 characters.
Range: 1 byte: from 00 to 7F. 2 bytes: high byte from 81 to FE, low byte from 40 to 7E and from 80 to FE. 4 bytes: first and third bytes from 81 to FE, second and fourth bytes from 30 to 39.
UCS Character Set
Role: the international standard ISO 10646 defines the Universal Character Set (UCS). It is compatible with Unicode, the character set of the Unicode Consortium; UCS-2 and Unicode assign the same codes.
Number of bits: it has two formats, UCS-2 (2 bytes) and UCS-4 (4 bytes).
Range: at present, UCS-4 is simply UCS-2 with 0x0000 added in front.
Unicode Character Set
Role: a unified encoding for 650 of the world's languages, compatible with ISO-8859-1.
Number of bits: the Unicode character set has several encoding forms: UTF-8, UTF-16, and UTF-32.
2, Classification by language
Language                             Character set        Formal name
English, Western European            ASCII, ISO-8859-1    MBCS (multibyte)
Simplified Chinese                   GB2312               MBCS (multibyte)
Traditional Chinese                  BIG5                 MBCS (multibyte)
Simplified and Traditional Chinese   GBK                  MBCS (multibyte)
Chinese, Japanese, and Korean        GB18030              MBCS (multibyte)
All languages                        Unicode, UCS         DBCS (wide characters)
Three, Encodings
UTF-8: Uses a variable number of bytes (1 for ASCII, 2 for Greek letters, 3 for Chinese characters, 4 for supplementary-plane symbols). In network transmission, one corrupted byte does not affect the other characters, whereas in a double-byte encoding one wrong byte can throw off everything after it. The format is as follows:
If a character is a single byte, its highest bit is 0. If it is multibyte, the number of consecutive 1 bits starting at the highest bit of the first byte gives the number of bytes in the sequence, and each remaining byte begins with 10. The original UTF-8 design allowed up to 6 bytes; since RFC 3629 it is limited to 4 bytes, which is enough for U+0000 to U+10FFFF.
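The bit layout described above translates into a short encoder. This is a minimal sketch with illustrative function names; it limits itself to the four-byte maximum and does not reject surrogate values (U+D800 to U+DFFF), which a strict encoder would.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8.
std::string utf8_encode(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Sequence length read from the run of leading 1 bits in the first byte.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}
```

Because continuation bytes always start with 10 and lead bytes never do, a decoder that hits a damaged byte can skip forward to the next lead byte and resume, which is the resynchronization property mentioned above.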
UTF-16: Uses 2-byte code units. The layout of the different parts of Unicode is based on existing standards, which eases conversion. 0x0000 to 0x007F is ASCII, and 0x0080 to 0x00FF is the ISO-8859-1 extension of ASCII. Greek uses 0x0370 to 0x03FF, Cyrillic uses 0x0400 to 0x04FF, Armenian uses 0x0530 to 0x058F, and Hebrew uses 0x0590 to 0x05FF. Chinese, Japanese, and Korean ideographs (collectively called CJK) occupy 0x3000 to 0x9FFF. Because the byte 0x00 has a special meaning in C and in operating-system file names, text often needs to be saved as UTF-8, which produces no 0x00 byte for any character other than U+0000 itself. For example:
UTF-16: 0x0080 = 0000 0000 1000 0000
UTF-8: 0xC280 = 1100 0010 1000 0000
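Code points below U+10000 fit in one 16-bit unit, as in the 0x0080 example; anything above is split into a surrogate pair. The following is a minimal sketch of that rule, with `utf16_encode` as an illustrative name; it returns code units as numbers and leaves byte order to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one code point as one or two UTF-16 code units.
std::vector<std::uint16_t> utf16_encode(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        return { static_cast<std::uint16_t>(cp) };
    }
    cp -= 0x10000;   // 20 bits remain, split 10 + 10
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),      // high surrogate
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) };  // low surrogate
}
```

U+0080 comes out as the single unit 0x0080, and U+1F600 comes out as the pair 0xD83D 0xDE00.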
UTF-32: Uses 4 bytes for every character.
Advantages and Disadvantages
UTF-8, UTF-16, and UTF-32 can all represent every Unicode character in the valid code space (U+000000 to U+10FFFF).
With UTF-8, ASCII characters take only 1 byte each, so storage is efficient. It suits text that consists mostly of Latin characters.
For most non-Latin characters (such as Chinese and Japanese), UTF-16 needs the least storage: only 2 bytes per character.
The Windows NT kernel uses Unicode (UTF-16), so calling the system API with UTF-16 strings needs no conversion and is faster.
UTF-16 and UTF-32 come in big-endian and little-endian variants, whereas UTF-8 has no byte-order problem, so UTF-8 is well suited to transmission and communication.
UTF-32 uses 4 bytes for every character. Processing is relatively fast, but it wastes a lot of space and slows down transmission, so it is rarely used.
Four, How to determine the character set
1, Byte order
First, consider the effect of byte order on encoding. Byte order is either big-endian or little-endian, and different processors may differ, so the byte order of the encoded data has to be indicated when it is transmitted. In big-endian order the high-order byte is stored at the low address and the low-order byte at the high address; little-endian order is the opposite. For example, for 0x03AB,
Big-endian byte order is
0000: 03
0001: AB
Little-endian byte order is
0000: AB
0001: 03
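The two layouts of 0x03AB can be produced with shifts and masks, which give the same result on any processor. This is a minimal sketch; the helper names are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// High-order byte first: 0x03AB -> 03 AB.
std::array<std::uint8_t, 2> to_big_endian(std::uint16_t v) {
    return {{ static_cast<std::uint8_t>(v >> 8),
              static_cast<std::uint8_t>(v & 0xFF) }};
}

// Low-order byte first: 0x03AB -> AB 03.
std::array<std::uint8_t, 2> to_little_endian(std::uint16_t v) {
    return {{ static_cast<std::uint8_t>(v & 0xFF),
              static_cast<std::uint8_t>(v >> 8) }};
}

// Read a 16-bit value back from the first two bytes in the stated order.
std::uint16_t read_u16(const std::vector<std::uint8_t>& bytes, bool big_endian) {
    assert(bytes.size() >= 2);
    return big_endian
        ? static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1])
        : static_cast<std::uint16_t>((bytes[1] << 8) | bytes[0]);
}
```

Reading AB 03 as little-endian gives back 0x03AB; reading the same two bytes as big-endian gives 0xAB03, which is why the receiver has to know the order.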
2, Encoding recognition
For Unicode, the encoding form can be determined from the first few bytes of the data. This is called the byte order mark (BOM) method:
UTF-8: EFBBBF (a valid UTF-8 sequence, see above, but with no meaning as a character in UCS, that is, Unicode)
UTF-16 Big Endian: FEFF (no meaning in UCS-2)
UTF-16 Little Endian: FFFE (no meaning in UCS-2)
UTF-32 Big Endian: 0000FEFF (no meaning in UCS-4)
UTF-32 Little Endian: FFFE0000 (no meaning in UCS-4)
GB2312: the highest bit of both the high byte and the low byte is 1.
Big5, GBK, and GB18030: the highest bit of the high byte is 1. The operating system has a default encoding, often GBK, which can be downloaded and upgraded.
By checking the highest bit of the high byte, you can tell whether the data is ASCII or a Chinese character encoding.
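The BOM values listed above can be checked with a prefix comparison. The sketch below uses a hypothetical function name, `detect_bom`, and returns a label string. One detail matters: the UTF-32 little-endian mark FF FE 00 00 also begins with the UTF-16 little-endian mark FF FE, so the 4-byte marks are tested first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Return the encoding named by a leading BOM, or "" when there is none.
std::string detect_bom(const std::vector<std::uint8_t>& data) {
    auto starts_with = [&data](std::initializer_list<std::uint8_t> bom) {
        return data.size() >= bom.size() &&
               std::equal(bom.begin(), bom.end(), data.begin());
    };
    // 4-byte marks first: FF FE 00 00 would otherwise match UTF-16 LE.
    if (starts_with({0x00, 0x00, 0xFE, 0xFF})) return "UTF-32 BE";
    if (starts_with({0xFF, 0xFE, 0x00, 0x00})) return "UTF-32 LE";
    if (starts_with({0xEF, 0xBB, 0xBF}))       return "UTF-8";
    if (starts_with({0xFE, 0xFF}))             return "UTF-16 BE";
    if (starts_with({0xFF, 0xFE}))             return "UTF-16 LE";
    return "";
}
```

An empty result only means no BOM was found. The data may still be UTF-8 without a BOM, or one of the Chinese encodings, in which case the high-bit check described above is the remaining clue. A UTF-16 little-endian file whose first character is U+0000 is indistinguishable from UTF-32 little-endian by this test.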
Karlson,2009-07-25 13:39:57
// Windows only: WideCharToMultiByte and MultiByteToWideChar come from <windows.h>.
#include <windows.h>
#include <cstring>
#include <string>
using std::string;

class CChineseCode
{
public:
    static void UTF_8ToUnicode(wchar_t* pOut, char* pText);          // Convert UTF-8 to Unicode
    static void UnicodeToUTF_8(char* pOut, wchar_t* pText);          // Convert Unicode to UTF-8
    static void UnicodeToGB2312(char* pOut, wchar_t uData);          // Convert Unicode to GB2312
    static void Gb2312ToUnicode(wchar_t* pOut, char* gbBuffer);      // Convert GB2312 to Unicode
    static void GB2312ToUTF_8(string& pOut, char* pText, int pLen);  // Convert GB2312 to UTF-8
    static void UTF_8ToGB2312(string& pOut, char* pText, int pLen);  // Convert UTF-8 to GB2312
};

// Class implementation

// Decodes one three-byte UTF-8 sequence (U+0800 to U+FFFF) into a wchar_t.
// Assumes a 16-bit, little-endian wchar_t, as on Windows.
void CChineseCode::UTF_8ToUnicode(wchar_t* pOut, char* pText)
{
    char* uchar = (char*)pOut;
    uchar[1] = ((pText[0] & 0x0F) << 4) + ((pText[1] >> 2) & 0x0F);
    uchar[0] = ((pText[1] & 0x03) << 6) + (pText[2] & 0x3F);
}

// Always writes the three-byte form, which is only correct for
// U+0800 to U+FFFF (this range includes the Chinese characters).
void CChineseCode::UnicodeToUTF_8(char* pOut, wchar_t* pText)
{
    // Note the byte order of the wchar_t: low byte first, high byte second
    char* pchar = (char*)pText;
    pOut[0] = (0xE0 | ((pchar[1] & 0xF0) >> 4));
    pOut[1] = (0x80 | ((pchar[1] & 0x0F) << 2)) + ((pchar[0] & 0xC0) >> 6);
    pOut[2] = (0x80 | (pchar[0] & 0x3F));
}

// CP_ACP is the system ANSI code page, so these two functions produce
// GB2312/GBK only on a Simplified Chinese system.
void CChineseCode::UnicodeToGB2312(char* pOut, wchar_t uData)
{
    WideCharToMultiByte(CP_ACP, 0, &uData, 1, pOut, sizeof(wchar_t), NULL, NULL);
}

void CChineseCode::Gb2312ToUnicode(wchar_t* pOut, char* gbBuffer)
{
    ::MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, gbBuffer, 2, pOut, 1);
}

void CChineseCode::GB2312ToUTF_8(string& pOut, char* pText, int pLen)
{
    char buf[4];
    int nLength = pLen * 3 + 1;   // generous upper bound, plus room for '\0'
    char* rst = new char[nLength];
    memset(buf, 0, 4);
    memset(rst, 0, nLength);
    int i = 0;
    int j = 0;
    while (i < pLen)
    {
        // ASCII (highest bit clear) is copied through unchanged
        if ((unsigned char)pText[i] < 0x80)
        {
            rst[j++] = pText[i++];
        }
        else
        {
            // a two-byte GB2312 character becomes three UTF-8 bytes
            wchar_t pbuffer;
            Gb2312ToUnicode(&pbuffer, pText + i);
            UnicodeToUTF_8(buf, &pbuffer);
            rst[j] = buf[0];
            rst[j + 1] = buf[1];
            rst[j + 2] = buf[2];
            j += 3;
            i += 2;
        }
    }
    rst[j] = '\0';
    // return the result
    pOut = rst;
    delete[] rst;
}

void CChineseCode::UTF_8ToGB2312(string& pOut, char* pText, int pLen)
{
    char* newBuf = new char[pLen + 1];   // output is never longer than input; +1 for '\0'
    char ctemp[4];
    memset(ctemp, 0, 4);
    int i = 0;
    int j = 0;
    while (i < pLen)
    {
        // test the current byte, not the pointer
        if ((unsigned char)pText[i] < 0x80)
        {
            newBuf[j++] = pText[i++];
        }
        else
        {
            // every non-ASCII character is assumed to be a three-byte sequence
            wchar_t wtemp;
            UTF_8ToUnicode(&wtemp, pText + i);
            UnicodeToGB2312(ctemp, wtemp);
            newBuf[j] = ctemp[0];
            newBuf[j + 1] = ctemp[1];
            i += 3;
            j += 2;
        }
    }
    newBuf[j] = '\0';
    pOut = newBuf;
    delete[] newBuf;
}
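The class above treats every non-ASCII character as a three-byte sequence and depends on the Windows API. As a complement, here is a portable sketch of a decoder that handles one-byte to four-byte sequences. It is not part of the original class: `utf8_decode_next` is a hypothetical name, and returning U+FFFD for malformed input is one possible convention. It does not reject overlong forms or surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode the UTF-8 sequence that starts at s[i] and advance i past it.
// Returns the code point, or 0xFFFD (replacement character) on malformed input.
std::uint32_t utf8_decode_next(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    int extra = 0;            // number of continuation bytes expected
    std::uint32_t cp = 0;
    if (lead < 0x80) {
        return lead;                                              // 0xxxxxxx
    } else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F;  // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F;  // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07;  // 11110xxx
    } else {
        return 0xFFFD;        // stray continuation byte or invalid lead byte
    }
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return 0xFFFD;                 // truncated sequence
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;            // not 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}
```

Calling it in a loop until `i` reaches the end of the string yields the code points one by one, which can then be passed to whichever target encoder is needed.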
C++ encoding conversion