Character Sets
Classification of Common Character Sets
ASCII and Its Extended Character Set
Role: Represents English and Western European languages.
Size: ASCII uses 7 bits and can represent 128 characters; extended ASCII uses 8 bits and can represent 256 characters.
Range: ASCII from 00 to 7F; extended ASCII from 00 to FF.
ISO-8859-1 Character Set
Role: An extension of ASCII that covers Western European languages; other parts of the ISO-8859 family cover Greek and other scripts.
Size: 8 bits.
Range: From 00 to FF, compatible with the ASCII character set.
GB2312 Character Set
Role: The Chinese national standard character set for Simplified Chinese, compatible with ASCII.
Size: 2 bytes per character. It can represent 7,445 symbols, including 6,763 Chinese characters, which covers almost all high-frequency Chinese characters.
Range: High byte from A1 to F7, low byte from A1 to FE. The high and low bytes are obtained by adding 0xA0 to the character's row number and cell number, respectively.
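The "add 0xA0" rule can be sketched in a few lines of Java. The class name Gb2312QuWei and the method toBytes are hypothetical names chosen for this illustration, and the example assumes the row/cell position 16/1, which is where the character 啊 sits in GB2312. Decoding the result also assumes that the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper: converts a GB2312 row/cell position to its two-byte code.
final class Gb2312QuWei {
    static byte[] toBytes(int row, int cell) {
        if (row < 1 || row > 94 || cell < 1 || cell > 94) {
            throw new IllegalArgumentException("row and cell must be in 1..94");
        }
        // Each byte is the position number plus 0xA0.
        return new byte[] { (byte) (row + 0xA0), (byte) (cell + 0xA0) };
    }

    public static void main(String[] args) {
        byte[] code = toBytes(16, 1); // row 16, cell 1
        System.out.printf("%02X %02X%n", code[0] & 0xFF, code[1] & 0xFF);
        // Assumption: the GB2312 charset is available in this JDK.
        System.out.println(new String(code, Charset.forName("GB2312")));
    }
}
```

Because the smallest position number is 1, the smallest possible byte is A1, which matches the lower bound of the range given above.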
GBK Character Set
Role: An extension of GB2312 that adds support for Traditional Chinese characters while remaining compatible with GB2312.
Size: 2 bytes per character, representing 21,886 characters.
Range: High byte from 81 to FE, low byte from 40 to FE (7F is not used).
Unicode Character Set
Role: A unified encoding for 650 of the world's languages. Its first 256 code points are the same as ISO-8859-1.
Size: The Unicode character set can be encoded in several ways: UTF-8, UTF-16, and UTF-32.
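As a small illustration of how one Unicode character takes a different number of bytes under each encoding form, the sketch below encodes the single code point U+4E2D (中) three ways. The class name UnicodeEncodings and the method encodedLength are hypothetical; the big-endian variants are used so that no byte order mark is added to the output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: byte length of the same text under different Unicode encodings.
final class UnicodeEncodings {
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // one code point, U+4E2D
        System.out.println("UTF-8:  " + encodedLength(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16: " + encodedLength(text, StandardCharsets.UTF_16BE));
        System.out.println("UTF-32: " + encodedLength(text, Charset.forName("UTF-32BE")));
    }
}
```

For this character the three lengths are 3, 2, and 4 bytes respectively; an ASCII letter such as "A" would instead take 1, 2, and 4 bytes.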
BIG5 Character Set
Role: A unified encoding for Traditional Chinese characters.
Size: 2 bytes per character, representing 13,053 Chinese characters.
Range: High byte from A1 to F9, low byte from 40 to 7E and from A1 to FE.
GB18030 Character Set
Role: Covers the encoding of Chinese, Japanese, Korean, and other scripts, and is compatible with GBK.
Size: Variable length: 1 byte (ASCII), 2 bytes, or 4 bytes. It can represent 27,484 Chinese characters.
Range: 1 byte: from 00 to 7F. 2 bytes: high byte from 81 to FE, low byte from 40 to 7E and from 80 to FE. 4 bytes: first and third bytes from 81 to FE, second and fourth bytes from 30 to 39.
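The byte ranges above are enough to tell how long a GB18030 sequence is by looking at its first two bytes. The following is a minimal sketch of that check, not a full validator or decoder; the class name Gb18030Length and the method sequenceLength are hypothetical, and returning -1 for bytes outside the listed ranges is a choice made only for this example.

```java
// Hypothetical helper: classifies a GB18030 sequence as 1, 2, or 4 bytes long.
final class Gb18030Length {
    // Returns 1, 2, or 4, or -1 if the bytes do not fit the ranges listed above.
    static int sequenceLength(byte[] data, int offset) {
        int b0 = data[offset] & 0xFF;
        if (b0 <= 0x7F) {
            return 1; // single byte, same as ASCII
        }
        if (b0 < 0x81 || b0 > 0xFE || offset + 1 >= data.length) {
            return -1;
        }
        int b1 = data[offset + 1] & 0xFF;
        if ((b1 >= 0x40 && b1 <= 0x7E) || (b1 >= 0x80 && b1 <= 0xFE)) {
            return 2; // two-byte form, same layout as GBK
        }
        if (b1 >= 0x30 && b1 <= 0x39) {
            if (offset + 3 >= data.length) {
                return -1; // truncated four-byte sequence
            }
            int b2 = data[offset + 2] & 0xFF;
            int b3 = data[offset + 3] & 0xFF;
            boolean valid = b2 >= 0x81 && b2 <= 0xFE && b3 >= 0x30 && b3 <= 0x39;
            return valid ? 4 : -1;
        }
        return -1;
    }
}
```

The second byte is what separates the two multi-byte forms: 30 to 39 can only start a four-byte sequence, because the two-byte low-byte range begins at 40.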
UCS Character Set
Role: The international standard ISO 10646 defines the Universal Character Set (UCS). It is developed in coordination with the Unicode organization, and UCS-2 is compatible with Unicode.
Size: It has two formats, UCS-2 and UCS-4, which use 2 bytes and 4 bytes respectively.
Range: At present, the UCS-4 code of a character is simply its UCS-2 code with 0x0000 added in front.
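The relationship between the two formats can be checked with a short sketch. It assumes a character that fits in UCS-2 and is not a surrogate, and it uses Java's UTF-32BE charset as a stand-in for big-endian UCS-4. The class name Ucs2ToUcs4 and the method widen are hypothetical names for this example.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: builds the 4-byte form by prefixing the 2-byte code with 0x0000.
final class Ucs2ToUcs4 {
    static byte[] widen(char ucs2) {
        return new byte[] { 0x00, 0x00, (byte) (ucs2 >> 8), (byte) ucs2 };
    }

    public static void main(String[] args) {
        byte[] widened = widen('\u4E2D');
        // Assumption: UTF-32BE serves as the big-endian UCS-4 form here.
        byte[] utf32 = "\u4E2D".getBytes(Charset.forName("UTF-32BE"));
        System.out.println(Arrays.equals(widened, utf32));
    }
}
```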
Classification by the language represented
Language | Character set | Official name
English, Western European | ASCII, ISO-8859-1 | MBCS (multi-byte)
Chinese Simplified | GB2312 | MBCS (multi-byte)
Chinese Traditional | BIG5 | MBCS (multi-byte)
Simplified and Traditional Chinese | GBK | MBCS (multi-byte)
Chinese, Japanese and Korean | GB18030 | MBCS (multi-byte)
All national languages | Unicode, UCS | DBCS (wide byte)
Conversion between encodings:
Requirement: You must know both the current encoding of the content and the encoding it is to be converted to.
Example:
String username = request.getParameter("username").trim();
String password = request.getParameter("password").trim();
The String variables username and password obtained this way were decoded as ISO-8859-1.
To convert them to UTF-8 so that they do not appear garbled, the code is as follows:
String parameter = request.getParameter("username");
// Get the raw bytes that correspond to the parameter
byte[] temp = parameter.getBytes("ISO-8859-1");
// Manually decode those bytes as UTF-8
String param = new String(temp, "UTF-8");
Principle:
The same content always has the same binary representation in the computer. So when content is passed between different encodings, to avoid garbled text, first turn the content back into its byte sequence using the encoding it was originally read with. Then decode that byte sequence using the target encoding, and no garbled characters appear.
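The principle can be tried outside a servlet container with a self-contained sketch. It simulates the problem by decoding UTF-8 bytes as ISO-8859-1 and then reverses the mistake. The class name RecodeDemo and the method recode are hypothetical, and the sample text is an assumption chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: recover text that was decoded with the wrong charset.
final class RecodeDemo {
    // Turns the wrongly decoded string back into bytes, then decodes them as UTF-8.
    static String recode(String wronglyDecoded) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // sample text, two Chinese characters
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Simulate the mistake: UTF-8 bytes read as ISO-8859-1.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original));         // false
        System.out.println(recode(garbled).equals(original)); // true
    }
}
```

This works because ISO-8859-1 maps every byte value from 00 to FF to exactly one character, so the original bytes survive the wrong decoding intact. The same trick is not guaranteed to work when the wrong charset cannot represent every byte value.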
What the different forms of garbled text mean:
?????? ---> caused by a character encoding mismatch
Ÿ?� ---> indicates that no such encoding method is available
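Both kinds of symptom can be reproduced in Java. The sketch below shows one common way each arises and is not the only possible cause: question marks appear when text is encoded into a charset that has no mapping for its characters, and the replacement character U+FFFD appears when bytes are decoded with a charset for which they are not valid. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: two ways garbled output is produced.
final class GarbledForms {
    // Encoding into ISO-8859-1 replaces every unmappable character with '?'.
    static String questionMarks(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decoding bytes that are not valid UTF-8 yields the replacement character U+FFFD.
    static String replacementChars(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(questionMarks("\u4E2D\u6587"));
        byte[] gbkBytes = { (byte) 0xD6, (byte) 0xD0 }; // a two-byte GBK code
        System.out.println(replacementChars(gbkBytes).indexOf('\uFFFD') >= 0);
    }
}
```

Once characters have been turned into question marks, the original bytes are lost and cannot be recovered by re-decoding.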