3. Character encoding model
Programmers often face complex problems, and the simplest way to reduce complexity is to divide and conquer. In his article "Character set encoding basics: Understanding character set encodings and legacy encodings", Peter Constable describes a four-layer model of character encoding. I think this model shows clearly what happens in character encoding, so I introduce it here.
3.1 Character repertoire (Abstract character repertoire)
The first layer of character encoding determines the character repertoire, that is, the set of characters to be supported. Some encodings have a fixed repertoire, such as ASCII and the ISO 8859 series. Others have an open repertoire. For example, the Unicode repertoire is intended to cover all characters in the world.
3.2 Coded character set
The second layer of character encoding maps characters to numbers. This layer can be understood as character encoding as seen by mathematicians: from a mathematical perspective, each character is simply a non-negative integer. For example, in Unicode the number corresponding to the Chinese character "字" is 23383, and a rarer Chinese character has the number 134192.
When writing an HTML file, you can insert the character "字" by typing "&#23383;". When designing character encodings, however, we customarily write these numbers in hexadecimal: 23383 is written as 0x5B57 and 134192 as 0x20C30.
3.3 Character encoding form
The third layer of character encoding represents characters with the basic data types of a programming language. This layer can be understood as character encoding as seen by programmers. Unicode offers several ways to express a number such as 23383 as program data, including UTF-8, UTF-16, and UTF-32. UTF is the abbreviation of "UCS Transformation Format", that is, a format that specifies how to convert the numbers defined by Unicode into program data. For example, the numbers corresponding to the Chinese word "汉字" are 0x6c49 and 0x5b57, and the encoded program data is:
BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8 encoding
WORD data_utf16[] = {0x6c49, 0x5b57}; // UTF-16 encoding
DWORD data_utf32[] = {0x6c49, 0x5b57}; // UTF-32 encoding
Here BYTE, WORD, and DWORD denote unsigned 8-bit, 16-bit, and 32-bit integers respectively. UTF-8, UTF-16, and UTF-32 use BYTE, WORD, and DWORD as their respective code units.
The UTF-8 encoding of "汉字" takes 6 bytes. The UTF-16 encoding takes two WORDs, 4 bytes in total. The UTF-32 encoding takes two DWORDs, 8 bytes in total. Section 4.2 describes the rules for mapping numbers to the UTF encodings.
3.4 Character encoding scheme
The fourth layer of character encoding is the character as seen by the computer, that is, the byte stream in a file or in memory. For example, the UTF-32 code unit for "字" is 0x5b57. In little-endian order the byte stream is "57 5b 00 00"; in big-endian order it is "00 00 5b 57".
The third layer specifies the code units used to represent a character. The fourth layer builds on the third by also taking into account the byte order within each code unit. The code unit of UTF-8 is a byte, so UTF-8 is not affected by byte order. UTF-16 and UTF-32 depend on byte order, which gives rise to four encoding schemes: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. LE and BE are short for Little Endian and Big Endian respectively.
3.5 Summary
With the four-layer model we have sorted out the concepts involved in character encoding. In fact, most code pages do not need the complete four-layer model. For example, GB18030 uses bytes as its code unit and directly defines the mapping between byte sequences and characters, so it skips the second layer and does not need the fourth.
4. Unicode again
Unicode is a character encoding scheme developed by international organizations to accommodate all of the world's scripts and symbols. Unicode maps these characters to the numbers 0-0x10FFFF, so it can hold up to 1114112 characters; in other words, it has 1114112 code points. A code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encoding schemes that convert these numbers into program data.
The Unicode character set can be abbreviated as UCS (in ISO/IEC 10646 the abbreviation stands for Universal Character Set). Early Unicode standards spoke of UCS-2 and UCS-4. UCS-2 encodes each character in two bytes, and UCS-4 in four bytes. UCS-4 is divided into 2^7 = 128 groups according to its highest byte, whose top bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells. Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). Removing the two leading zero bytes from the UCS-4 codes of the BMP yields UCS-2.
The Unicode standard plans to use 17 planes of group 0: from the BMP (plane 0) to plane 16, that is, the numbers 0-0x10FFFF. Beyond the encoding itself, this article describes Unicode more fully and views it from several perspectives. The description is based on Unicode 5.0.0.
4.1 Browsing Unicode
First, some numbers: each plane has 2^16 = 65536 code points. Unicode plans to use 17 planes, for a total of 17 * 65536 = 1114112 code points. In fact, only 238605 code points are currently defined, distributed over planes 0, 1, 2, 14, 15, and 16. Of these, planes 15 and 16 only define two Private Use Areas of 65534 code points each, namely 0xF0000-0xFFFFD and 0x100000-0x10FFFD. A Private Use Area, abbreviated PUA, is a region reserved for user-defined characters.
Plane 0 also has a Private Use Area: 0xE000-0xF8FF, with 6400 code points. The range 0xD800-0xDFFF of plane 0, 2048 code points in total, is a special area called the surrogate area. Its purpose is described in section 4.2.
238605 - 65534*2 - 6400 - 2048 = 99089. The remaining 99089 defined code points are distributed over planes 0, 1, 2, and 14. They correspond to the 99089 characters currently defined by Unicode, including 71226 Chinese characters. Planes 0, 1, 2, and 14 contain 52080, 3419, 43253, and 337 characters respectively. All 43253 characters in plane 2 are Chinese characters, and plane 0 defines 27973 Chinese characters.
Before getting to know more about Unicode characters, let's take a look at UCD.
4.1.1 What is UCD
UCD is the abbreviation of Unicode Character Database. UCD consists of plain text and HTML files that describe the properties of Unicode characters and the relationships between them. The latest UCD version is available on the Unicode Consortium website.
Most of the text files in UCD contain Unicode-related data suitable for processing by programs. The HTML files explain the organization of the database and the format and meaning of the data. The largest file in UCD is Unihan.txt, which describes Chinese characters. In UCD 5.0.0, Unihan.txt is 28,221 KB in size. Unihan.txt contains many useful indexes, such as the radicals, stroke counts, Pinyin, usage frequency, and four-corner codes of Chinese characters. These indexes are based on authoritative dictionaries, but most of them cover only part of the Chinese characters.
I introduce UCD here in order to use two of its concepts: Block and Script.
4.1.2 Block
Blocks.txt in UCD divides the Unicode code points into contiguous blocks and describes the purpose of each block:
Start code point | End code point | Block name (English) | Block name (translated from Chinese)
0000 | 007F | Basic Latin | Basic Latin letters
0080 | 00FF | Latin-1 Supplement | Latin letters supplement-1
0100 | 017F | Latin Extended-A | Latin letters extended-A
0180 | 024F | Latin Extended-B | Latin letters extended-B
0250 | 02AF | IPA Extensions | International Phonetic Alphabet extensions
02B0 | 02FF | Spacing Modifier Letters | Spacing modifier letters
0300 | 036F | Combining Diacritical Marks | Combining diacritical marks
0370 | 03FF | Greek and Coptic | Greek and Coptic letters
0400 | 04FF | Cyrillic | Cyrillic letters
0500 | 052F | Cyrillic Supplement | Cyrillic letters supplement
0530 | 058F | Armenian | Armenian letters
0590 | 05FF | Hebrew | Hebrew
0600 | 06FF | Arabic | Basic Arabic
0700 | 074F | Syriac | Syriac
0750 | 077F | Arabic Supplement | Arabic supplement
0780 | 07BF | Thaana | Thaana script
07C0 | 07FF | NKo | N'Ko alphabet
0900 | 097F | Devanagari | Devanagari (Sanskrit)
0980 | 09FF | Bengali | Bengali
0A00 | 0A7F | Gurmukhi | Gurmukhi (Sikh script)
0A80 | 0AFF | Gujarati | Gujarati
0B00 | 0B7F | Oriya | Oriya
0B80 | 0BFF | Tamil | Tamil
0C00 | 0C7F | Telugu | Telugu
0C80 | 0CFF | Kannada | Kannada
0D00 | 0D7F | Malayalam | Malayalam (a Dravidian script)
0D80 | 0DFF | Sinhala | Sinhala
0E00 | 0E7F | Thai | Thai
0E80 | 0EFF | Lao | Lao
0F00 | 0FFF | Tibetan | Tibetan
1000 | 109F | Myanmar | Burmese
10A0 | 10FF | Georgian | Georgian
1100 | 11FF | Hangul Jamo | Korean letters (jamo)
1200 | 137F | Ethiopic | Ethiopic
1380 | 139F | Ethiopic Supplement | Ethiopic supplement
13A0 | 13FF | Cherokee | Cherokee
1400 | 167F | Unified Canadian Aboriginal Syllabics | Unified Canadian Aboriginal syllabics
1680 | 169F | Ogham | Ogham
16A0 | 16FF | Runic | Runic (ancient Nordic script)
1700 | 171F | Tagalog | Tagalog
1720 | 173F | Hanunoo | Hanunoo
1740 | 175F | Buhid | Buhid
1760 | 177F | Tagbanwa | Tagbanwa
1780 | 17FF | Khmer | Khmer
1800 | 18AF | Mongolian | Mongolian
1900 | 194F | Limbu | Limbu
1950 | 197F | Tai Le | Tai Le (Dehong Dai)
1980 | 19DF | New Tai Lue | New Tai Lue
19E0 | 19FF | Khmer Symbols | Khmer symbols
1A00 | 1A1F | Buginese | Buginese
1B00 | 1B7F | Balinese | Balinese
1D00 | 1D7F | Phonetic Extensions | Phonetic extensions
1D80 | 1DBF | Phonetic Extensions Supplement | Phonetic extensions supplement
1DC0 | 1DFF | Combining Diacritical Marks Supplement | Combining diacritical marks supplement
1E00 | 1EFF | Latin Extended Additional | Latin letters extended additional
1F00 | 1FFF | Greek Extended | Greek extended
2000 | 206F | General Punctuation | General punctuation
2070 | 209F | Superscripts and Subscripts | Superscripts and subscripts
20A0 | 20CF | Currency Symbols | Currency symbols
20D0 | 20FF | Combining Diacritical Marks for Symbols | Combining diacritical marks for symbols
2100 | 214F | Letterlike Symbols | Letterlike symbols
2150 | 218F | Number Forms | Number forms
2190 | 21FF | Arrows | Arrow symbols
2200 | 22FF | Mathematical Operators | Mathematical operators
2300 | 23FF | Miscellaneous Technical | Miscellaneous technical symbols
2400 | 243F | Control Pictures | Control pictures
2440 | 245F | Optical Character Recognition | Optical character recognition
2460 | 24FF | Enclosed Alphanumerics | Enclosed letters and digits
2500 | 257F | Box Drawing | Box-drawing characters
2580 | 259F | Block Elements | Block elements
25A0 | 25FF | Geometric Shapes | Geometric shapes
2600 | 26FF | Miscellaneous Symbols | Miscellaneous symbols
2700 | 27BF | Dingbats | Dingbats (ornamental symbols)
27C0 | 27EF | Miscellaneous Mathematical Symbols-A | Miscellaneous mathematical symbols-A
27F0 | 27FF | Supplemental Arrows-A | Supplemental arrows-A
2800 | 28FF | Braille Patterns | Braille
2900 | 297F | Supplemental Arrows-B | Supplemental arrows-B
2980 | 29FF | Miscellaneous Mathematical Symbols-B | Miscellaneous mathematical symbols-B
2A00 | 2AFF | Supplemental Mathematical Operators | Supplemental mathematical operators
2B00 | 2BFF | Miscellaneous Symbols and Arrows | Miscellaneous symbols and arrows
2C00 | 2C5F | Glagolitic | Glagolitic alphabet
2C60 | 2C7F | Latin Extended-C | Latin letters extended-C
2C80 | 2CFF | Coptic | Coptic
2D00 | 2D2F | Georgian Supplement | Georgian supplement
2D30 | 2D7F | Tifinagh | Tifinagh alphabet
2D80 | 2DDF | Ethiopic Extended | Ethiopic extended
2E00 | 2E7F | Supplemental Punctuation | Supplemental punctuation
2E80 | 2EFF | CJK Radicals Supplement | CJK radicals supplement
2F00 | 2FDF | Kangxi Radicals | Kangxi Dictionary radicals
2FF0 | 2FFF | Ideographic Description Characters | Chinese character structure description characters
3000 | 303F | CJK Symbols and Punctuation | CJK symbols and punctuation
3040 | 309F | Hiragana | Japanese hiragana
30A0 | 30FF | Katakana | Japanese katakana
3100 | 312F | Bopomofo | Zhuyin phonetic symbols
3130 | 318F | Hangul Compatibility Jamo | Korean compatibility letters
3190 | 319F | Kanbun | Annotation marks for Chinese text in Japanese
31A0 | 31BF | Bopomofo Extended | Zhuyin phonetic symbols extended
31C0 | 31EF | CJK Strokes | CJK strokes
31F0 | 31FF | Katakana Phonetic Extensions | Katakana phonetic extensions
3200 | 32FF | Enclosed CJK Letters and Months | Enclosed CJK letters and months
3300 | 33FF | CJK Compatibility | CJK compatibility characters
3400 | 4DBF | CJK Unified Ideographs Extension A | CJK unified ideographs extension A
4DC0 | 4DFF | Yijing Hexagram Symbols | Yijing sixty-four hexagram symbols
4E00 | 9FFF | CJK Unified Ideographs | CJK unified ideographs
A000 | A48F | Yi Syllables | Yi syllables
A490 | A4CF | Yi Radicals | Yi radicals
A700 | A71F | Modifier Tone Letters | Modifier tone letters
A720 | A7FF | Latin Extended-D | Latin letters extended-D
A800 | A82F | Syloti Nagri | Syloti Nagri alphabet
A840 | A87F | Phags-pa | Phags-pa alphabet
AC00 | D7AF | Hangul Syllables | Korean syllables
D800 | DB7F | High Surrogates | High surrogates
DB80 | DBFF | High Private Use Surrogates | High private use surrogates
DC00 | DFFF | Low Surrogates | Low surrogates
E000 | F8FF | Private Use Area | Private use area
F900 | FAFF | CJK Compatibility Ideographs | CJK compatibility ideographs
FB00 | FB4F | Alphabetic Presentation Forms | Alphabetic presentation forms
FB50 | FDFF | Arabic Presentation Forms-A | Arabic presentation forms-A
FE00 | FE0F | Variation Selectors | Variation selectors
FE10 | FE1F | Vertical Forms | Vertical punctuation forms
FE20 | FE2F | Combining Half Marks | Combining half marks
FE30 | FE4F | CJK Compatibility Forms | CJK compatibility forms
FE50 | FE6F | Small Form Variants | Small form variants
FE70 | FEFF | Arabic Presentation Forms-B | Arabic presentation forms-B
FF00 | FFEF | Halfwidth and Fullwidth Forms | Halfwidth and fullwidth forms
FFF0 | FFFF | Specials | Specials
10000 | 1007F | Linear B Syllabary | Linear B syllabary
10080 | 100FF | Linear B Ideograms | Linear B ideograms
10100 | 1013F | Aegean Numbers | Aegean numbers
10140 | 1018F | Ancient Greek Numbers | Ancient Greek numerals
10300 | 1032F | Old Italic | Old Italic
10330 | 1034F | Gothic | Gothic
10380 | 1039F | Ugaritic | Ugaritic cuneiform
103A0 | 103DF | Old Persian | Old Persian cuneiform
10400 | 1044F | Deseret | Deseret alphabet
10450 | 1047F | Shavian | Shavian alphabet
10480 | 104AF | Osmanya | Osmanya alphabet
10800 | 1083F | Cypriot Syllabary | Cypriot syllabary
10900 | 1091F | Phoenician | Phoenician alphabet
10A00 | 10A5F | Kharoshthi | Kharoshthi
12000 | 123FF | Cuneiform | Cuneiform
12400 | 1247F | Cuneiform Numbers and Punctuation | Cuneiform numbers and punctuation
1D000 | 1D0FF | Byzantine Musical Symbols | Byzantine musical symbols
1D100 | 1D1FF | Musical Symbols | Musical symbols
1D200 | 1D24F | Ancient Greek Musical Notation | Ancient Greek musical notation
1D300 | 1D35F | Tai Xuan Jing Symbols | Tai Xuan Jing symbols
1D360 | 1D37F | Counting Rod Numerals | Counting rod numerals
1D400 | 1D7FF | Mathematical Alphanumeric Symbols | Mathematical alphanumeric symbols
20000 | 2A6DF | CJK Unified Ideographs Extension B | CJK unified ideographs extension B
2F800 | 2FA1F | CJK Compatibility Ideographs Supplement | CJK compatibility ideographs supplement
E0000 | E007F | Tags | Tags
E0100 | E01EF | Variation Selectors Supplement | Variation selectors supplement
F0000 | FFFFF | Supplementary Private Use Area-A | Supplementary private use area-A
100000 | 10FFFF | Supplementary Private Use Area-B | Supplementary private use area-B
Block is a property of Unicode characters. Characters that belong to the same block serve similar purposes. The start and end code points in the block table only delimit a region; there may be many undefined code points between them. With UniToy you can browse Unicode characters by block, either as a list or with the details of each character displayed.