Java Coded Character Set
@author Ixenos
- 1. Character Set
a) A character set (Charset) establishes a mapping between sequences of two-byte (16-bit) Unicode code units and byte sequences in a local character encoding.
b) For compatibility with other naming conventions, each character set has several aliases. The aliases() method of a Charset object returns a Set containing them.
I. Set<String> aliases = charset.aliases();
II. for (String alias : aliases) {...}
III. You can use the canonical name or any alias to obtain a Charset object: Charset charset = Charset.forName("UTF-8");
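The alias lookup described above can be sketched as a small program. The class name CharsetAliasDemo and the helper canonicalName are illustrative names, not part of the original notes. The exact set of aliases depends on the JDK in use, so no alias output is shown.

```java
import java.nio.charset.Charset;
import java.util.Set;

class CharsetAliasDemo {
    // Looks up a charset by its canonical name or any alias
    // and returns the canonical name.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        Charset charset = Charset.forName("UTF-8");
        Set<String> aliases = charset.aliases();
        // Every alias resolves back to the same canonical charset.
        for (String alias : aliases) {
            System.out.println(alias + " -> " + canonicalName(alias));
        }
    }
}
```

Charset.forName throws UnsupportedCharsetException for an unknown name, so Charset.isSupported(name) is a reasonable check when the name comes from user input.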
c) Character sets and character encodings (quoted answer)
Author: Jim Liu
Link: https://www.zhihu.com/question/50356029/answer/120608944
Source: Zhihu
Copyright belongs to the author; please contact the author for authorization before reprinting.
TL;DR
Character set: a collection of characters, each of which is assigned an index (a number).
Character encoding: the binary representation of those indices, in a format that a computer can process (usually measured in bytes).
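The distinction can be made concrete with a short sketch: one character has a single index in the Unicode character set, but different byte sequences under different encodings. The class name CodePointVsEncoding and the helper hexBytes are illustrative names only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsEncoding {
    // Formats the bytes of text in the given encoding as space-separated hex.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Chinese character U+4E2D, written as an escape so the
        // example does not depend on the source file encoding.
        String text = "\u4E2D";
        // Character set: the index (code point) of the character.
        System.out.printf("U+%04X%n", text.codePointAt(0));            // U+4E2D
        // Character encoding: how that index is laid out in bytes.
        System.out.println(hexBytes(text, StandardCharsets.UTF_8));    // E4 B8 AD
        System.out.println(hexBytes(text, StandardCharsets.UTF_16BE)); // 4E 2D
    }
}
```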
----Split Line----
GB2312 is both a character set and a character encoding; it contains several thousand characters.
GB18030 is likewise both a character set and an encoding. It is a huge extension of GB2312 and contains more than 70,000 characters.
Both of the above are GB (national) standards.
GBK is Microsoft's extension of GB2312. It is compatible with GB2312 (and GB18030 is not perfectly compatible with GBK). GBK is not a GB standard.
----Split Line----
Unicode is the "universal code" character set. It also defines a very simple encoding, but few programs use that encoding directly.
Most programs use a Unicode Transformation Format (UTF) encoding instead; the most common are UTF-8 and UTF-16.
----Split Line----
Why build GB18030 instead of simply supporting Unicode?
1. Unicode is meant for the whole world, is constrained by ISO, and makes many concessions to English. It is not necessarily the ideal encoding scheme for Chinese characters. For example, most Chinese characters take 3 bytes in UTF-8, but only 2 bytes in GB18030.
2. GB18030 is fully compatible with GB2312, which matters a great deal for legacy systems in Chinese environments. According to Wikipedia, Unicode only began to include Chinese characters in 1992, and Unicode 1.1 in 1993 included about as many as GB18030, whereas GB2312 was already spreading in the 1980s. It is not that we are incompatible with Unicode, but that Unicode is incompatible with us. GB18030 took on the role of extending GB2312; it was not simply reinventing the wheel for the sake of a divergent standard, so its existence is understandable. Windows introduced GBK in 1995, and it is fully compatible with GB2312. In that era the Internet was far less developed than it is today, so choosing a cost-effective encoding scheme for each language environment was also understandable.
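The byte-count comparison in point 1 can be checked with a short sketch. The class name EncodedLengthDemo and the helper encodedLength are illustrative names. GB18030 is an extended charset that a given Java runtime is not guaranteed to ship, so the sketch checks Charset.isSupported before using it.

```java
import java.nio.charset.Charset;

class EncodedLengthDemo {
    // Returns how many bytes the text occupies in the named encoding.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // Two common Chinese characters (U+4E2D, U+6587), written as
        // escapes so the example does not depend on the source file encoding.
        String text = "\u4E2D\u6587";
        System.out.println("UTF-8:   " + encodedLength(text, "UTF-8")); // 6 bytes
        // Assumption: GB18030 may be absent on a minimal runtime.
        if (Charset.isSupported("GB18030")) {
            System.out.println("GB18030: " + encodedLength(text, "GB18030")); // 4 bytes
        } else {
            System.out.println("GB18030 is not available on this runtime");
        }
    }
}
```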
d) A local encoding cannot represent every Unicode character. A character that cannot be represented is converted to "?".
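A minimal sketch of this replacement behavior, using ISO-8859-1 as a stand-in for a local encoding that has no Chinese characters. The class name UnmappableCharDemo and its helpers are illustrative names. String.getBytes(Charset) substitutes the charset's replacement bytes for unmappable characters, while CharsetEncoder.canEncode lets you detect the problem beforehand.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableCharDemo {
    // Encodes the text with the charset, then decodes the bytes again.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // True if every character in the text can be encoded by the charset.
    static boolean isRepresentable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "A\u4E2D"; // "A" followed by the Chinese character U+4E2D
        // ISO-8859-1 has no mapping for U+4E2D, so it is replaced by '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));        // A?
        System.out.println(isRepresentable(text, StandardCharsets.ISO_8859_1));  // false
        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```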
e) Once you have a Charset, you can convert between a Java string (Unicode code units) and an encoded sequence of bytes.
I. Encoding example: a Java string is translated into a byte array, in which each character is represented by one, two, three, or four bytes depending on the character encoding.
II. String str = "...";
III. ByteBuffer buffer = charset.encode(str); // encode the string with the given character set; returns a ByteBuffer
IV. byte[] bytes = buffer.array(); // get the backing byte array; only the first buffer.limit() bytes are valid
V. To decode a byte sequence, you need a ByteBuffer. The static method ByteBuffer.wrap turns a byte array into a ByteBuffer.
VI. byte[] bytes = ...;
VII. ByteBuffer bbuf = ByteBuffer.wrap(bytes, offset, length);
VIII. CharBuffer cbuf = charset.decode(bbuf); // returns a CharBuffer
IX. String str = cbuf.toString();
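The encode and decode steps above can be combined into one runnable sketch. The class name EncodeDecodeDemo and its two helper methods are illustrative names. One deliberate difference from the steps: the sketch copies exactly buffer.remaining() bytes instead of calling array(), because the backing array of the buffer returned by Charset.encode may be longer than the encoded content.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Encodes a string and returns exactly the encoded bytes.
    static byte[] encode(String str, Charset charset) {
        ByteBuffer buffer = charset.encode(str);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes); // copy only the valid region of the buffer
        return bytes;
    }

    // Decodes length bytes starting at offset back into a string.
    static String decode(byte[] bytes, int offset, int length, Charset charset) {
        ByteBuffer bbuf = ByteBuffer.wrap(bytes, offset, length);
        CharBuffer cbuf = charset.decode(bbuf);
        return cbuf.toString();
    }

    public static void main(String[] args) {
        Charset charset = StandardCharsets.UTF_8;
        String str = "h\u00E9llo \u4E2D"; // mixes 1-, 2-, and 3-byte characters
        byte[] bytes = encode(str, charset);
        System.out.println(bytes.length); // 10
        String decoded = decode(bytes, 0, bytes.length, charset);
        System.out.println(decoded.equals(str)); // true
    }
}
```

For simple cases, str.getBytes(charset) and new String(bytes, offset, length, charset) give the same results with less code; the buffer-based API is mainly useful when working with NIO channels.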