Discussion on text encoding and Unicode (Part 2)

Last Update:2018-12-05 Source: Internet

Author: User

3. Character encoding model
Programmers often face complex problems, and the simplest way to reduce complexity is to divide and conquer. In his article "Character set encoding basics: Understanding character set encodings and legacy encodings", Peter Constable describes a four-layer model of character encoding. I think this model shows clearly what happens in character encoding, so I introduce it here.
3.1 Abstract character repertoire
The first layer of character encoding determines the character range (the repertoire), that is, the set of characters to be supported. Some encodings have a fixed repertoire, such as ASCII and the ISO 8859 series. Others have an open repertoire. Unicode, for example, aims to cover all the characters in the world.
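
As a rough illustration of a fixed versus an open repertoire, the Python sketch below asks whether a codec can represent a character at all. The helper name in_repertoire is hypothetical, and Python's built-in codecs are assumed to stand in for the encodings named above.

```python
def in_repertoire(ch: str, codec: str) -> bool:
    """Return True if the codec is able to encode the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# ASCII has a fixed repertoire of 128 characters.
print(in_repertoire("A", "ascii"))        # True
print(in_repertoire("é", "ascii"))        # False
# ISO 8859-1 adds Western European letters but no Chinese characters.
print(in_repertoire("é", "iso-8859-1"))   # True
print(in_repertoire("字", "iso-8859-1"))  # False
# A Unicode encoding such as UTF-8 covers all of them.
print(in_repertoire("字", "utf-8"))       # True
```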
3.2 Coded character set
The second layer of character encoding assigns a number to each character. This layer can be understood as the character encoding seen by mathematicians: from a mathematical point of view, each character is simply a non-negative integer. For example, in Unicode the number corresponding to the Chinese character "字" ("word") is 23383, and the number corresponding to the Chinese character at U+20C30 is 134192.

When writing an HTML file, you can insert the character "字" by entering "&#23383;". When designing character encodings, however, numbers are customarily written in hexadecimal: 23383 is written as 0x5B57 and 134192 as 0x20C30.
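
The mapping between characters and numbers can be checked directly in Python, whose ord and chr built-ins work on Unicode code points. This is only a sketch; numeric_reference is a hypothetical helper introduced here for illustration.

```python
import html

# Character -> number and number -> character.
print(ord("字"))        # 23383
print(hex(ord("字")))   # 0x5b57
print(chr(0x5B57))      # 字

# A character outside plane 0 is still identified by a single number.
supplementary = chr(134192)
print(hex(ord(supplementary)))   # 0x20c30

def numeric_reference(ch: str) -> str:
    """Build an HTML decimal numeric character reference (hypothetical helper)."""
    return f"&#{ord(ch)};"

print(numeric_reference("字"))     # &#23383;
print(html.unescape("&#23383;"))   # 字
```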
3.3 Character encoding form
The third layer of character encoding expresses characters with the basic data types of a programming language. This layer can be understood as the character encoding seen by programmers. In Unicode there are several ways to express a number such as 23383 as program data, including UTF-8, UTF-16, and UTF-32. UTF is the abbreviation of "UCS Transformation Format", that is, a rule for converting the numbers defined by Unicode into program data. For example, the numbers corresponding to the two characters of "汉字" ("Chinese characters") are 0x6C49 and 0x5B57, and the encoded program data is:

BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8 encoding
WORD data_utf16[] = {0x6C49, 0x5B57}; // UTF-16 encoding
DWORD data_utf32[] = {0x6C49, 0x5B57}; // UTF-32 encoding

Here BYTE, WORD, and DWORD denote unsigned 8-bit, 16-bit, and 32-bit integers respectively. UTF-8, UTF-16, and UTF-32 use BYTE, WORD, and DWORD respectively as their code unit.

The UTF-8 encoding of "汉字" (the two Chinese characters above) takes 6 BYTEs, that is, 6 bytes. The UTF-16 encoding takes two WORDs, 4 bytes. The UTF-32 encoding takes two DWORDs, 8 bytes. Section 4.2 describes the rules for mapping numbers to the UTF encodings.
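
The same code units can be reproduced with Python's built-in codecs. In this sketch the helper code_units is hypothetical; the little-endian codecs are used only as a convenient way to read out the unit values, which do not themselves depend on byte order.

```python
import struct

def code_units(text: str, form: str) -> list:
    """Return the code units of text in UTF-8, UTF-16 or UTF-32."""
    if form == "utf-8":
        return list(text.encode("utf-8"))                       # 8-bit units
    if form == "utf-16":
        raw = text.encode("utf-16-le")
        return list(struct.unpack(f"<{len(raw) // 2}H", raw))   # 16-bit units
    if form == "utf-32":
        raw = text.encode("utf-32-le")
        return list(struct.unpack(f"<{len(raw) // 4}I", raw))   # 32-bit units
    raise ValueError(f"unsupported encoding form: {form}")

UNIT_SIZE = {"utf-8": 1, "utf-16": 2, "utf-32": 4}  # bytes per code unit

for form in ("utf-8", "utf-16", "utf-32"):
    units = code_units("汉字", form)
    size = len(units) * UNIT_SIZE[form]
    print(form, [hex(u) for u in units], f"{size} bytes")
```

Running it lists six 8-bit units, two 16-bit units, and two 32-bit units, matching the 6, 4, and 8 bytes counted above.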
3.4 Character encoding scheme
The fourth layer of character encoding is the character as seen by the computer, that is, the byte stream in a file or in memory. For example, the UTF-32 encoding of "字" is 0x5B57. In little-endian order the byte stream is "57 5b 00 00"; in big-endian order it is "00 00 5b 57".

The third layer specifies the code units used to represent a character. The fourth layer builds on the third by also fixing the byte order within each code unit. The code unit of UTF-8 is a byte, so UTF-8 is not affected by byte order. UTF-16 and UTF-32 each come in two byte orders, which yields four encoding schemes: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. LE and BE are short for little endian and big endian respectively.
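
To see the effect of byte order, the sketch below prints the byte stream of one character under several encoding schemes. It assumes Python 3.8 or later (for the separator argument of bytes.hex); byte_stream is a hypothetical helper.

```python
def byte_stream(text: str, scheme: str) -> str:
    """Encode text with the given scheme and show the bytes as hex pairs."""
    return text.encode(scheme).hex(" ")

for scheme in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{scheme:10} {byte_stream('字', scheme)}")
```

UTF-8 yields the same three bytes regardless of platform, while the LE and BE variants of UTF-16 and UTF-32 differ only in the order of bytes inside each code unit.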
3.5 Conclusion
With the four-layer model we have sorted out what goes on in character encoding. In practice, most code pages do not need the complete four-layer model. For example, GB18030 uses the byte as its code unit and directly defines the mapping between byte sequences and characters; it skips the second layer and needs no fourth layer.
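
As a small sketch of a byte-oriented code page, the snippet below encodes three characters with Python's built-in gb18030 codec, which is assumed here to be a faithful implementation of the standard. The helper gb18030_bytes is hypothetical.

```python
def gb18030_bytes(ch: str) -> bytes:
    """Encode one character with Python's built-in gb18030 codec."""
    return ch.encode("gb18030")

for ch in ("A", "字", chr(0x20C30)):
    data = gb18030_bytes(ch)
    # The byte sequence is the complete encoding: no code unit is wider
    # than a byte, so there are no LE or BE variants to choose from.
    print(f"U+{ord(ch):04X} -> {data.hex(' ')} ({len(data)} bytes)")
    assert data.decode("gb18030") == ch
```

The three characters come out as 1, 2, and 4 bytes, each decoded straight from the byte sequence.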
4. Unicode again
Unicode is a character encoding developed by international organizations to accommodate all the scripts and symbols in the world. Unicode maps these characters to the numbers 0-0x10FFFF, so it can hold at most 1114112 characters, or in other words it has 1114112 code points. A code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encodings that convert these numbers into program data.
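
The size of the code space is easy to verify. A minimal sketch, in which MAX_CODE_POINT and is_code_point are hypothetical names:

```python
import sys

MAX_CODE_POINT = 0x10FFFF

def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode code space 0..0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT

print(MAX_CODE_POINT + 1)                  # 1114112 code points in total
print(sys.maxunicode == MAX_CODE_POINT)    # True on Python 3.3 and later
print(is_code_point(0x20C30))              # True
print(is_code_point(0x110000))             # False
```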

The Unicode character set can be abbreviated as UCS (in ISO/IEC 10646 the abbreviation stands for Universal Character Set). Early Unicode standards spoke of UCS-2 and UCS-4. UCS-2 encodes a character in two bytes, and UCS-4 in four bytes. UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose highest bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells. Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). Removing the two leading zero bytes from the BMP of UCS-4 yields UCS-2.
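
The group, plane, row, and cell of a UCS-4 value are just its four bytes, so they can be extracted with shifts and masks. The function names below are hypothetical and the sketch only restates the layout described above.

```python
def ucs4_position(code: int) -> tuple:
    """Split a UCS-4 value into (group, plane, row, cell), one byte each."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("a UCS-4 value has its highest bit set to 0")
    group = (code >> 24) & 0x7F
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF
    return group, plane, row, cell

def in_bmp(code: int) -> bool:
    """True if the value lies in plane 0 of group 0."""
    group, plane, _, _ = ucs4_position(code)
    return group == 0 and plane == 0

print(ucs4_position(0x5B57))    # (0, 0, 91, 87)
print(ucs4_position(0x20C30))   # (0, 2, 12, 48)
print(in_bmp(0x5B57), in_bmp(0x20C30))   # True False
```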

The Unicode standard plans to use 17 planes of group 0, from the BMP (plane 0) to plane 16, that is, the numbers 0-0x10FFFF. This article looks at the Unicode encoding as a whole and views it from several perspectives. The description is based on Unicode 5.0.0.
4.1 Browsing Unicode
First, some numbers. Each plane has 2^16 = 65536 code points. Unicode plans to use 17 planes, for a total of 17 * 65536 = 1114112 code points. In fact, only 238605 code points are currently defined, distributed over planes 0, 1, 2, 14, 15, and 16. Planes 15 and 16 define nothing but two Private Use Areas of 65534 code points each, namely 0xF0000-0xFFFFD and 0x100000-0x10FFFD. A Private Use Area, abbreviated PUA, is a region reserved for user-defined characters.

Plane 0 also has a Private Use Area, 0xE000-0xF8FF, with 6400 code points. The range 0xD800-0xDFFF of plane 0, 2048 code points in total, is a special region called the surrogate area. Its purpose is described in section 4.2.

238605 - 65534*2 - 6400 - 2048 = 99089. The remaining 99089 defined code points are distributed over planes 0, 1, 2, and 14. They correspond to the 99089 characters currently defined by Unicode, including 71226 Chinese characters. Planes 0, 1, 2, and 14 contain 52080, 3419, 43253, and 337 characters respectively. The 43253 characters of plane 2 are all Chinese characters, and plane 0 defines 27973 Chinese characters.
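
The arithmetic in this section can be rechecked by computing each range from its endpoints. The sketch below takes the figure of 238605 defined code points and the per-plane counts from the text; everything else is derived, and the variable names are hypothetical.

```python
PLANES = 17
PLANE_SIZE = 2 ** 16

total_code_points = PLANES * PLANE_SIZE      # 1114112
defined_code_points = 238605                 # figure quoted in the text

pua_plane_15 = 0xFFFFD - 0xF0000 + 1         # 65534
pua_plane_16 = 0x10FFFD - 0x100000 + 1       # 65534
pua_bmp = 0xF8FF - 0xE000 + 1                # 6400
surrogates = 0xDFFF - 0xD800 + 1             # 2048

characters = (defined_code_points - pua_plane_15 - pua_plane_16
              - pua_bmp - surrogates)
print(characters)                            # 99089

# Per-plane character counts quoted in the text for planes 0, 1, 2 and 14.
per_plane = {0: 52080, 1: 3419, 2: 43253, 14: 337}
print(sum(per_plane.values()) == characters)   # True

# Chinese characters: all of plane 2 plus those defined in plane 0.
print(43253 + 27973)                         # 71226
```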

Before getting to know more about Unicode characters, let's take a look at UCD.
4.1.1 What is UCD
UCD is the abbreviation of Unicode Character Database. UCD consists of plain-text and HTML files that describe the properties of Unicode characters and the relationships between them. The latest version of UCD is available on the Unicode Consortium website.

Most of the text files in UCD contain Unicode data in a form suitable for processing by programs, while the HTML files explain the organization of the database and the format and meaning of the data. The largest file in UCD is Unihan.txt, which describes Chinese characters. In UCD 5.0.0, Unihan.txt is 28,221 KB in size. Unihan.txt contains many indexes of reference value, such as radicals, stroke counts, Pinyin, usage frequency, and four-corner codes. These indexes are based on authoritative dictionaries, but most of them cover only part of the Chinese characters.
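
Python's standard unicodedata module is built from UCD data, so it offers a quick way to look at a few character properties without parsing the files by hand. The version of UCD it reflects depends on the Python release and will normally be newer than the one described in this article. The helper describe is hypothetical.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few UCD-derived properties of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }

print(unicodedata.unidata_version)   # UCD version bundled with this Python
print(describe("A"))
print(describe("字"))
```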

I introduce UCD here in order to use two of its concepts: Block and Script.
4.1.2 Block
Blocks.txt in UCD divides the Unicode code points into a number of contiguous blocks and names each block:

Start code point | End code point | Block name (English) | Block name (translated from Chinese)
0000 | 007F | Basic Latin | Basic Latin letters
0080 | 00FF | Latin-1 Supplement | Latin-1 supplement
0100 | 017F | Latin Extended-A | Latin extended-A
0180 | 024F | Latin Extended-B | Latin extended-B
0250 | 02AF | IPA Extensions | International Phonetic Alphabet extensions
02B0 | 02FF | Spacing Modifier Letters | Spacing modifier letters
0300 | 036F | Combining Diacritical Marks | Combining diacritical marks
0370 | 03FF | Greek and Coptic | Greek and Coptic
0400 | 04FF | Cyrillic | Cyrillic
0500 | 052F | Cyrillic Supplement | Cyrillic supplement
0530 | 058F | Armenian | Armenian
0590 | 05FF | Hebrew | Hebrew
0600 | 06FF | Arabic | Basic Arabic
0700 | 074F | Syriac | Syriac
0750 | 077F | Arabic Supplement | Arabic supplement
0780 | 07BF | Thaana | Thaana
07C0 | 07FF | NKo | N'Ko alphabet
0900 | 097F | Devanagari | Devanagari (Sanskrit)
0980 | 09FF | Bengali | Bengali
0A00 | 0A7F | Gurmukhi | Gurmukhi (Sikh script)
0A80 | 0AFF | Gujarati | Gujarati
0B00 | 0B7F | Oriya | Oriya
0B80 | 0BFF | Tamil | Tamil
0C00 | 0C7F | Telugu | Telugu
0C80 | 0CFF | Kannada | Kannada
0D00 | 0D7F | Malayalam | Malayalam
0D80 | 0DFF | Sinhala | Sinhala
0E00 | 0E7F | Thai | Thai
0E80 | 0EFF | Lao | Lao
0F00 | 0FFF | Tibetan | Tibetan
1000 | 109F | Myanmar | Myanmar (Burmese)
10A0 | 10FF | Georgian | Georgian
1100 | 11FF | Hangul Jamo | Korean Hangul letters
1200 | 137F | Ethiopic | Ethiopic
1380 | 139F | Ethiopic Supplement | Ethiopic supplement
13A0 | 13FF | Cherokee | Cherokee
1400 | 167F | Unified Canadian Aboriginal Syllabics | Canadian Aboriginal syllabics
1680 | 169F | Ogham | Ogham
16A0 | 16FF | Runic | Runic (ancient Nordic script)
1700 | 171F | Tagalog | Tagalog
1720 | 173F | Hanunoo | Hanunoo
1740 | 175F | Buhid | Buhid
1760 | 177F | Tagbanwa | Tagbanwa
1780 | 17FF | Khmer | Khmer
1800 | 18AF | Mongolian | Mongolian
1900 | 194F | Limbu | Limbu
1950 | 197F | Tai Le | Tai Le (Dehong Dai)
1980 | 19DF | New Tai Lue | New Tai Lue
19E0 | 19FF | Khmer Symbols | Khmer symbols
1A00 | 1A1F | Buginese | Buginese
1B00 | 1B7F | Balinese | Balinese
1D00 | 1D7F | Phonetic Extensions | Phonetic extensions
1D80 | 1DBF | Phonetic Extensions Supplement | Phonetic extensions supplement
1DC0 | 1DFF | Combining Diacritical Marks Supplement | Combining diacritical marks supplement
1E00 | 1EFF | Latin Extended Additional | Latin extended additional
1F00 | 1FFF | Greek Extended | Greek extended
2000 | 206F | General Punctuation | General punctuation
2070 | 209F | Superscripts and Subscripts | Superscripts and subscripts
20A0 | 20CF | Currency Symbols | Currency symbols
20D0 | 20FF | Combining Diacritical Marks for Symbols | Combining diacritical marks for symbols
2100 | 214F | Letterlike Symbols | Letterlike symbols
2150 | 218F | Number Forms | Number forms
2190 | 21FF | Arrows | Arrows
2200 | 22FF | Mathematical Operators | Mathematical operators
2300 | 23FF | Miscellaneous Technical | Miscellaneous technical symbols
2400 | 243F | Control Pictures | Control pictures
2440 | 245F | Optical Character Recognition | Optical character recognition
2460 | 24FF | Enclosed Alphanumerics | Enclosed letters and digits
2500 | 257F | Box Drawing | Box drawing
2580 | 259F | Block Elements | Block elements
25A0 | 25FF | Geometric Shapes | Geometric shapes
2600 | 26FF | Miscellaneous Symbols | Miscellaneous symbols
2700 | 27BF | Dingbats | Dingbats
27C0 | 27EF | Miscellaneous Mathematical Symbols-A | Miscellaneous mathematical symbols-A
27F0 | 27FF | Supplemental Arrows-A | Supplemental arrows-A
2800 | 28FF | Braille Patterns | Braille patterns
2900 | 297F | Supplemental Arrows-B | Supplemental arrows-B
2980 | 29FF | Miscellaneous Mathematical Symbols-B | Miscellaneous mathematical symbols-B
2A00 | 2AFF | Supplemental Mathematical Operators | Supplemental mathematical operators
2B00 | 2BFF | Miscellaneous Symbols and Arrows | Miscellaneous symbols and arrows
2C00 | 2C5F | Glagolitic | Glagolitic
2C60 | 2C7F | Latin Extended-C | Latin extended-C
2C80 | 2CFF | Coptic | Coptic
2D00 | 2D2F | Georgian Supplement | Georgian supplement
2D30 | 2D7F | Tifinagh | Tifinagh
2D80 | 2DDF | Ethiopic Extended | Ethiopic extended
2E00 | 2E7F | Supplemental Punctuation | Supplemental punctuation
2E80 | 2EFF | CJK Radicals Supplement | CJK radicals supplement
2F00 | 2FDF | Kangxi Radicals | Kangxi Dictionary radicals
2FF0 | 2FFF | Ideographic Description Characters | Chinese character structure description characters
3000 | 303F | CJK Symbols and Punctuation | CJK symbols and punctuation
3040 | 309F | Hiragana | Japanese hiragana
30A0 | 30FF | Katakana | Japanese katakana
3100 | 312F | Bopomofo | Bopomofo (phonetic symbols)
3130 | 318F | Hangul Compatibility Jamo | Korean compatibility letters
3190 | 319F | Kanbun | Kanbun (annotation marks for Chinese text in Japanese)
31A0 | 31BF | Bopomofo Extended | Bopomofo extended
31C0 | 31EF | CJK Strokes | CJK strokes
31F0 | 31FF | Katakana Phonetic Extensions | Katakana phonetic extensions
3200 | 32FF | Enclosed CJK Letters and Months | Enclosed CJK letters and months
3300 | 33FF | CJK Compatibility | CJK compatibility characters
3400 | 4DBF | CJK Unified Ideographs Extension A | CJK unified ideographs extension A
4DC0 | 4DFF | Yijing Hexagram Symbols | Yijing (I Ching) hexagram symbols
4E00 | 9FFF | CJK Unified Ideographs | CJK unified ideographs
A000 | A48F | Yi Syllables | Yi syllables
A490 | A4CF | Yi Radicals | Yi radicals
A700 | A71F | Modifier Tone Letters | Modifier tone letters
A720 | A7FF | Latin Extended-D | Latin extended-D
A800 | A82F | Syloti Nagri | Syloti Nagri alphabet
A840 | A87F | Phags-pa | Phags-pa alphabet
AC00 | D7AF | Hangul Syllables | Korean Hangul syllables
D800 | DB7F | High Surrogates | High surrogates
DB80 | DBFF | High Private Use Surrogates | High private use surrogates
DC00 | DFFF | Low Surrogates | Low surrogates
E000 | F8FF | Private Use Area | Private use area
F900 | FAFF | CJK Compatibility Ideographs | CJK compatibility ideographs
FB00 | FB4F | Alphabetic Presentation Forms | Alphabetic presentation forms
FB50 | FDFF | Arabic Presentation Forms-A | Arabic presentation forms-A
FE00 | FE0F | Variation Selectors | Variation selectors
FE10 | FE1F | Vertical Forms | Vertical punctuation forms
FE20 | FE2F | Combining Half Marks | Combining half marks
FE30 | FE4F | CJK Compatibility Forms | CJK compatibility forms
FE50 | FE6F | Small Form Variants | Small form variants
FE70 | FEFF | Arabic Presentation Forms-B | Arabic presentation forms-B
FF00 | FFEF | Halfwidth and Fullwidth Forms | Halfwidth and fullwidth characters
FFF0 | FFFF | Specials | Specials
10000 | 1007F | Linear B Syllabary | Linear B syllabary
10080 | 100FF | Linear B Ideograms | Linear B ideograms
10100 | 1013F | Aegean Numbers | Aegean numbers
10140 | 1018F | Ancient Greek Numbers | Ancient Greek numbers
10300 | 1032F | Old Italic | Old Italic
10330 | 1034F | Gothic | Gothic
10380 | 1039F | Ugaritic | Ugaritic cuneiform
103A0 | 103DF | Old Persian | Old Persian
10400 | 1044F | Deseret | Deseret
10450 | 1047F | Shavian | Shavian
10480 | 104AF | Osmanya | Osmanya alphabet
10800 | 1083F | Cypriot Syllabary | Cypriot syllabary
10900 | 1091F | Phoenician | Phoenician
10A00 | 10A5F | Kharoshthi | Kharoshthi
12000 | 123FF | Cuneiform | Cuneiform
12400 | 1247F | Cuneiform Numbers and Punctuation | Cuneiform numbers and punctuation
1D000 | 1D0FF | Byzantine Musical Symbols | Byzantine musical symbols
1D100 | 1D1FF | Musical Symbols | Musical symbols
1D200 | 1D24F | Ancient Greek Musical Notation | Ancient Greek musical notation
1D300 | 1D35F | Tai Xuan Jing Symbols | Tai Xuan Jing symbols
1D360 | 1D37F | Counting Rod Numerals | Counting rod numerals
1D400 | 1D7FF | Mathematical Alphanumeric Symbols | Mathematical letters and digits
20000 | 2A6DF | CJK Unified Ideographs Extension B | CJK unified ideographs extension B
2F800 | 2FA1F | CJK Compatibility Ideographs Supplement | CJK compatibility ideographs supplement
E0000 | E007F | Tags | Tags
E0100 | E01EF | Variation Selectors Supplement | Variation selectors supplement
F0000 | FFFFF | Supplementary Private Use Area-A | Supplementary private use area-A
100000 | 10FFFF | Supplementary Private Use Area-B | Supplementary private use area-B
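
A table like this one is straightforward to use from a program. The sketch below parses a few lines written in the "start..end; name" layout of Blocks.txt and looks up the block of a code point with a binary search. The sample data is a hand-picked subset, the function names are hypothetical, and "No_Block" is used here as the result for code points that fall in none of the listed blocks.

```python
import bisect

# A few lines in the Blocks.txt layout; not the full file.
SAMPLE = """\
# start..end; block name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
4E00..9FFF; CJK Unified Ideographs
20000..2A6DF; CJK Unified Ideographs Extension B
"""

def parse_blocks(text: str) -> list:
    """Parse Blocks.txt-style text into sorted (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return sorted(blocks)

def block_of(code: int, blocks: list) -> str:
    """Return the block name for a code point, or 'No_Block' if none matches."""
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, code) - 1
    if i >= 0 and blocks[i][0] <= code <= blocks[i][1]:
        return blocks[i][2]
    return "No_Block"

blocks = parse_blocks(SAMPLE)
print(block_of(ord("字"), blocks))   # CJK Unified Ideographs
print(block_of(0x20C30, blocks))     # CJK Unified Ideographs Extension B
print(block_of(0x0900, blocks))      # No_Block (not in this small sample)
```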

Block is a property of Unicode characters, and characters that belong to the same block have similar uses. The start and end code points in the block table only delimit a region; many code points between them may be undefined. With UniToy, you can browse Unicode characters by block, either as a list or with the details of each character.

This article is an English version of an article originally published in Chinese on aliyun.com.
