Unicode
Unicode (a uniform, universal, single character code) is an industry standard in computer science that covers a character set, encoding schemes, and more. Unicode was created to address the limitations of traditional character encoding schemes: it assigns a uniform, unique binary code to every character in every language, so that text can be converted and processed across languages and platforms (this is the role of Unicode). Research and development began in 1990, and the standard was officially announced in 1994.
Why Unicode came about
Before Unicode appeared, there were already many different standards, such as ASCII, ISO 8859-1, KOI-8, GB 18030, and Big5. This gave rise to two problems. First, a given code value could correspond to different characters under different encoding schemes. Second, encodings for languages with large character sets could vary in length: some commonly used characters were encoded in a single byte, while others required two or more bytes.
The designers of Unicode, building on traditional character encodings, assumed the world's characters would be few and simple. They did not anticipate how rich Han writing culture is: Chinese alone includes not only simplified characters but traditional characters as well.
When Unicode 1.0 was released in 1991, fewer than half of the 65,536 available code values were in use. Java was still in its infancy at the time, and it was designed around the Unicode character set, a major improvement over languages designed around 8-bit character sets.
The explosion of Unicode characters
But the optimistic assumption did not hold, and the situation we joked about above came to pass: Unicode grew beyond 65,536 characters, so a single 16-bit char can no longer describe every Unicode character.
We know that the characters stored in a computer depend on an encoding table for their translation. During encoding, the character is looked up in the table, the number corresponding to it is found, and that number is what gets stored in the computer. When we read the character back, the decoder looks up that number and returns the corresponding character; this is the decoding process (so encoding and decoding must use the same character set, otherwise garbled text results). The point to focus on here is the number that corresponds to a character in the code table: this number is the code point. Code point: the code value assigned to a character in an encoding table. (Understanding code points will help with the discussion of code points in Java later in this article.)
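To make the encode/decode round trip concrete, here is a minimal Java sketch (the class name and sample string are my own, and the exact garbled output depends on the platform) that encodes a string with one character set and then decodes it with a matching and with a mismatched one:

```java
import java.nio.charset.StandardCharsets;

public class EncodeDecodeDemo {
    public static void main(String[] args) {
        String text = "编码"; // "encoding" in Chinese, two characters

        // Encoding: look the characters up in the UTF-8 table and store their numbers as bytes
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding with the same character set returns the original characters
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));      // 编码

        // Decoding with a different character set produces garbled text (mojibake)
        System.out.println(new String(utf8Bytes, StandardCharsets.ISO_8859_1)); // garbled
    }
}
```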
In the Unicode standard, code points are written in hexadecimal with the prefix U+. For example, U+0041 is the code point of the Latin letter A. Because 65,536 values could no longer satisfy the demand for characters, the code points were divided into 17 code planes, ranging from 0x0000 to 0x10FFFF. Each plane holds 65,536 code points, for a total of 1,114,112. At present, however, only a few planes are actually used. The basic multilingual plane (BMP) holds the code points from U+0000 to U+FFFF, the classic Unicode code points; the remaining 16 planes hold the code points from U+10000 to U+10FFFF, which include the supplementary characters.
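As a rough illustration of code points and planes, the following Java sketch (the class name is made up for this example) prints the code point of 'A' from the basic multilingual plane and builds a string from the supplementary code point U+1D546:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+0041, the Latin letter A, lives in the basic multilingual plane
        System.out.println("U+" + Integer.toHexString("A".codePointAt(0)).toUpperCase()); // U+41

        // U+1D546 lies in a supplementary plane, beyond U+FFFF
        String doubleStruckO = new String(Character.toChars(0x1D546));

        // One code point, but two UTF-16 code units, so length() reports 2
        System.out.println(doubleStruckO.length());                                  // 2
        System.out.println(doubleStruckO.codePointCount(0, doubleStruckO.length())); // 1
    }
}
```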
A necessary clarification: UTF-16 is one way of implementing (encoding) Unicode. It represents each Unicode code point as a sequence of 16-bit code units, using either one or two code units per character.
In the basic multilingual plane, each character is represented by a single 16-bit code unit, while supplementary characters are encoded as a pair of consecutive code units. The values used for these pairs fall within 2,048 unused code points of the basic multilingual plane, commonly called the surrogate area: U+D800 to U+DBFF for the first code unit and U+DC00 to U+DFFF for the second. This is a clever design, because it lets you tell at a glance whether a code unit encodes a character on its own or is the first or second half of a supplementary character. For example, consider the following character:
U+D835 U+DD46 (the surrogate pair that encodes the supplementary character U+1D546, the mathematical double-struck capital O)
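Here is a small sketch of how that surrogate pair behaves in Java (using APIs available since Java 5; the class name is illustrative):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // The supplementary character U+1D546 is stored as the surrogate pair U+D835 U+DD46
        String s = "\uD835\uDD46";

        char high = s.charAt(0);
        char low  = s.charAt(1);

        System.out.println(Character.isHighSurrogate(high)); // true (U+D800..U+DBFF)
        System.out.println(Character.isLowSurrogate(low));   // true (U+DC00..U+DFFF)

        // Combining the pair recovers the original code point
        int codePoint = Character.toCodePoint(high, low);
        System.out.println("U+" + Integer.toHexString(codePoint).toUpperCase()); // U+1D546
    }
}
```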
Java uses Unicode as its built-in character set, and the char type describes a single code unit in the UTF-16 encoding (which is why a char requires two bytes of storage). In Java it is strongly recommended not to use the char type in your programs unless you really need to manipulate UTF-16 code units; it is better to treat strings as abstract data types.
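To show why iterating over char values can be misleading, here is a minimal sketch (assuming Java 8 or later for String.codePoints(); the class name is illustrative) comparing char-by-char iteration with code-point iteration:

```java
public class StringIterationDemo {
    public static void main(String[] args) {
        // 'A', the supplementary character U+1D546, then 'B': 3 characters but 4 chars
        String s = "A\uD835\uDD46B";

        // char-by-char iteration splits U+1D546 into its two surrogate code units
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("char[%d] = U+%04X%n", i, (int) s.charAt(i));
        }

        // code-point iteration treats each character as a whole
        s.codePoints().forEach(cp -> System.out.printf("code point U+%X%n", cp));
    }
}
```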