Character set (charset): defines which characters belong to a set, that is, which characters are in the set and which are not. Examples: ASCII, GBK, Unicode. Almost all other character sets contain the ASCII character set as a subset.
Encoding: defines how characters are stored as bytes. Examples: ASCII (also the name of an encoding), GBK (also the name of an encoding), Unicode (also used as the name of an encoding), UTF8, UTF16.
One character set can have one or more encodings, so the relationship is 1:n.
Sometimes a name represents both a character set and an encoding, such as GBK, ASCII, and Unicode.
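The difference between a character set and an encoding can be sketched in Python. This is only an illustration: the helper `hex_bytes` and the sample string are assumptions made for the example, not part of the original text. `ord` gives the number a character has in the Unicode character set, while `str.encode` gives the bytes a particular encoding stores for it.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)

text = "a中"

# Character set view: each character has one number (code point) in Unicode.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x61', '0x4e2d']

# Encoding view: the same two characters stored as bytes in three encodings.
encoded = {name: text.encode(name) for name in ("gbk", "utf-8", "utf-16-le")}
for name, data in encoded.items():
    print(name, hex_bytes(data))

# ASCII is contained in the other sets: 'a' is the single byte 61 in each.
ascii_compatible = {name: "a".encode(name) for name in ("ascii", "gbk", "utf-8")}
print(ascii_compatible)
```

The Unicode code points stay the same no matter which Unicode encoding is chosen; only the stored bytes change.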
The Unicode character set is commonly used with two encodings: "Unicode encoding" and UTF8 encoding. What is called "Unicode encoding" here is UTF-16, in which each commonly used character occupies 2 bytes (characters outside the Basic Multilingual Plane take 4 bytes). Many systems use this encoding for strings in memory, such as C# and Java. For example, in C#, new char[10] allocates 10 x 2 = 20 bytes of space for storing 10 Unicode-encoded characters.
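The 2-bytes-per-character arithmetic can be checked with a small sketch. Python does not store its own strings this way, so the example encodes to UTF-16 explicitly and measures the result; `utf16_size` and the sample strings are hypothetical names chosen for illustration.

```python
def utf16_size(text: str) -> int:
    """Bytes needed to store text as UTF-16 (little-endian, no file marker)."""
    return len(text.encode("utf-16-le"))

# 10 characters, English and Chinese mixed, all in the Basic Multilingual Plane.
ten_chars = "abcde中文字符集"
print(len(ten_chars), utf16_size(ten_chars))  # 10 20

# Caveat: a character outside the BMP needs two 2-byte units (a surrogate pair).
emoji = "\U0001F600"
print(len(emoji), utf16_size(emoji))  # 1 4
```

The second case is why "2 bytes per character" holds only for commonly used characters.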
UTF8 stores one English character in 1 byte and one Chinese character in 3 bytes, so English strings occupy little space and bandwidth during storage and network transmission.
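A rough sketch of the size difference, assuming two short sample strings picked only for illustration. It compares the UTF8 size with the 2-bytes-per-character encoding discussed earlier.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))

print(encoded_size("a", "utf-8"), encoded_size("中", "utf-8"))  # 1 3

english = "hello"
chinese = "中文"

# (UTF8 bytes, UTF-16 bytes) for each sample string.
sizes = {
    "english": (encoded_size(english, "utf-8"), encoded_size(english, "utf-16-le")),
    "chinese": (encoded_size(chinese, "utf-8"), encoded_size(chinese, "utf-16-le")),
}
print(sizes)  # {'english': (5, 10), 'chinese': (6, 4)}
```

English text is half the size in UTF8, while Chinese text is somewhat larger, which is the trade-off behind choosing UTF8 for storage and transmission of mostly English data.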
When the string "a中" (2 characters) is saved as a file:
Encoding | Bytes | Description |
GBK | 61 D6 D0 | 61 is 'a' encoded in GBK, and D6 D0 is '中' encoded in GBK |
Unicode | FF FE 61 00 2D 4E | FF FE is the marker at the start of a Unicode-encoded file (the byte order mark). 61 00 is 'a' in Unicode encoding, and 2D 4E is '中' in Unicode encoding |
UTF8 | EF BB BF 61 E4 B8 AD | EF BB BF is the marker at the start of a UTF8 file, 61 is 'a' encoded in UTF8, and E4 B8 AD is '中' encoded in UTF8 |
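The bytes in the table above can be reproduced with a short Python sketch. It assumes that "Unicode" in the table means little-endian UTF-16 with a leading marker, which is what the FF FE prefix indicates; `hex_bytes` and `file_contents` are hypothetical names used only for this example.

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)

text = "a中"

# The bytes an editor would write for each "save as" choice in the table.
file_contents = {
    "GBK": text.encode("gbk"),
    # Assumption: "Unicode" = UTF-16 little-endian preceded by its marker FF FE.
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    # The "utf-8-sig" codec prepends the marker EF BB BF when encoding.
    "UTF8": text.encode("utf-8-sig"),
}

for label, data in file_contents.items():
    print(label, "|", hex_bytes(data))
```

Writing any of these byte strings with `open(path, "wb")` would produce the corresponding file. Decoding with the same codec (for example `"utf-8-sig"`) strips the marker again.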