A computer can only handle binary data, so text must be represented in binary before the computer can recognize and process it.
A common practice is to assign an ID to each character (a letter, a kanji, and so on) and then store that ID in binary form, in memory or on disk. The computer reads the binary data to recover the ID, and from the ID it knows which character the data represents.
What Unicode does is assign such an ID, called a code point, to each character.
UTF-8, UTF-16, and UTF-32 are three ways of representing a Unicode code point in binary; each is called an encoding format.
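As a minimal sketch of the distinction between the ID and its binary form, the snippet below looks up a character's code point and then encodes it in all three formats using Python's built-in codecs. The helper name `encode_all` and the sample character are illustrative choices, not part of any standard.

```python
def encode_all(ch):
    """Return the code point of ch and its bytes (as hex) in three encoding formats."""
    # The "-be" variants fix the byte order and omit the byte order mark,
    # so the output shows only the encoded character itself.
    forms = ("utf-8", "utf-16-be", "utf-32-be")
    return ord(ch), {name: ch.encode(name).hex() for name in forms}

code_point, encoded = encode_all("\u4e2d")  # the character 中
print(f"U+{code_point:04X}")   # U+4E2D
for name, hex_bytes in encoded.items():
    print(name, hex_bytes)
# utf-8 e4b8ad
# utf-16-be 4e2d
# utf-32-be 00004e2d
```

The code point is the same in every case; only the bytes used to store it differ.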
Characters covered by the Unicode Standard
The Unicode Standard specifies code points for the characters of all the world's major written languages, including many languages of Western Europe, the Middle East, and East Asia. In addition, it covers punctuation, diacritical marks, mathematical symbols, technical symbols, arrows, decorative symbols, emoji, and more.
The most commonly used characters sit in the first 64K code points; this part of the codespace is called the Basic Multilingual Plane, or BMP. In addition, there are 16 supplementary planes available for other characters.
Unicode sets aside some code points for private use, so that vendors or individuals can assign their own characters or symbols to them. There are 6,400 private-use code points in the BMP and 131,068 in the supplementary planes.
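The plane layout and the private-use counts can be checked with a short sketch. It assumes that the plane number is the code point divided by 0x10000, and that private-use code points are the ones whose general category is `Co` in Python's `unicodedata` module. The helper names `plane_of` and `is_private_use` are made up for this example.

```python
import unicodedata

def plane_of(ch):
    """Plane number: 0 is the BMP, 1 to 16 are the supplementary planes."""
    return ord(ch) >> 16   # each plane holds 0x10000 (65,536) code points

def is_private_use(ch):
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

print(plane_of("A"))             # 0
print(plane_of("\U0001F600"))    # 1
print(is_private_use("\uE000"))  # True

# Count private-use code points in the BMP and in the supplementary planes.
bmp_private = sum(is_private_use(chr(cp)) for cp in range(0x10000))
supplementary_private = sum(is_private_use(chr(cp)) for cp in range(0x10000, 0x110000))
print(bmp_private)               # 6400
print(supplementary_private)     # 131068
```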
Encoding format
That is, the way a code point is encoded as binary data.
First, look at the concept of a code unit:
The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
- UTF-8
Each Unicode character is represented as a variable-length sequence of bytes, fully compatible with ASCII: ASCII characters take one byte, and all other characters take two to four bytes. The code unit is 8 bits long.
- UTF-16
Commonly used where storage space and efficient character access need to be balanced. The most commonly used characters are encoded as a single 16-bit code unit, and all others as a pair of 16-bit code units.
- UTF-32
Used where storage space is not a concern and fixed-width access, one code unit per character, is wanted. Each Unicode character is represented as a single 32-bit code unit.
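To make the trade-off between the three encoding forms concrete, here is a small sketch that counts how many code units a character needs in each. The function name `code_units` and the four sample characters are arbitrary choices for illustration.

```python
def code_units(ch):
    """Number of code units needed for ch in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }

# "A", "é", "中" and an emoji, written as escapes to avoid any ambiguity.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
# U+0041 {'UTF-8': 1, 'UTF-16': 1, 'UTF-32': 1}
# U+00E9 {'UTF-8': 2, 'UTF-16': 1, 'UTF-32': 1}
# U+4E2D {'UTF-8': 3, 'UTF-16': 1, 'UTF-32': 1}
# U+1F600 {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```

Only the last character lies outside the BMP, which is why it is the only one that needs two UTF-16 code units.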
Defining elements of text
What counts as an element of text differs from one scenario to another. For example, in traditional Spanish sorting, "ll" counts as a single element, but when the text is typed, "ll" is simply two separate "l" characters.
Unicode defines code elements ("characters"), which are the basic units used for computer text processing. In the example above, treating two "l" characters as a single "ll" is the job of text-processing software.
Character sequences
Sometimes a single text element is represented by more than one character; such a series of characters is called a combining character sequence.
For example, "â" can be represented as the combination of "a" and "^". Unicode defines the order of these combinations: the base character "a" comes first, followed by the non-spacing mark "^".
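The ordering rule can be inspected with Python's `unicodedata` module. This sketch assumes the accent in question is U+0302 (COMBINING CIRCUMFLEX ACCENT); the helper `describe` is a hypothetical name used only here.

```python
import unicodedata

def describe(seq):
    """List (code point, name, general category) for each character in seq."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
            for ch in seq]

sequence = "a\u0302"   # base letter first, then the combining mark
print(len(sequence))   # 2
for entry in describe(sequence):
    print(entry)
# ('U+0061', 'LATIN SMALL LETTER A', 'Ll')
# ('U+0302', 'COMBINING CIRCUMFLEX ACCENT', 'Mn')
```

The category `Mn` (nonspacing mark) is what identifies the second character as a non-spacing mark that attaches to the base character before it.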
Some such sequences can also be represented by a single character, called a precomposed character (also a composite or decomposable character). For example, "ü" has its own code point, U+00FC, but it can also be represented as the base letter "u" (U+0075) followed by the non-spacing mark "¨" (U+0308). The decomposed form can make sorting easier, because in some cases the accent on a letter does not affect its sort order.
In other words, one character can have several representations, and Unicode provides a way to decide whether two representations are equal.
Programs should always compare canonical-equivalent Unicode strings as equal (for the details of this requirement, see Section 3.2, Conformance Requirements, and Section 3.7, Decomposition, in the Unicode Standard). One of the easiest ways to do this is to use a normalized form for the strings: if strings are transformed into their normalized forms, then canonical-equivalent ones will also have precisely the same binary representation. The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
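A minimal sketch of that advice, using `unicodedata.normalize` from Python's standard library. The helper `canonically_equal` is a made-up name, and choosing NFC over NFD for the comparison is arbitrary; either form works as long as both strings use the same one.

```python
import unicodedata

precomposed = "\u00fc"   # "ü" as a single code point
decomposed = "u\u0308"   # "u" followed by the combining diaeresis

# A plain comparison looks at code points, so the two forms differ.
print(precomposed == decomposed)   # False

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonically_equal(precomposed, decomposed))               # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```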
Reference
The Unicode® Standard: A Technical Introduction
Unicode Equivalence Wiki
Canonical Equivalence in Applications
FAQ Normalization
Introduction to Unicode