Character Sets and Encodings 01: charset vs. encoding
Statement: This article is reprinted from http://my.oschina.net/goldenshaw/blog/304493
Character set and encoding are often confused, but they are different things. As the first step toward a deeper understanding, we must make one point clear: a character set and a character set encoding are concepts at two different levels.
Charset is short for character set.
Encoding is short for charset encoding, that is, character set encoding.
Comparison with interface and interface implementation
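The relationship between the two resembles that between an interface and its implementation: the character set plays the role of the interface, and the encoding plays the role of the implementation.
From this we can see that an encoding depends on its character set, just as an implementation depends on the interface it implements, and that one character set can have several encodings, just as one interface can have several implementations.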
Examples and usage
Let's take a look at two examples. One is from an html file and uses charset:
<meta http-equiv="content-type" content="text/html;charset=utf-8">
The other is from an xml file, which uses encoding:
<?xml version="1.0" encoding="UTF-8"?>
Which usage is more rigorous? Clearly the latter, which matches the concepts of character set and encoding more accurately.
Writing "charset=utf-8" invites the misunderstanding that there is a character set called "UTF-8". In fact, UTF-8, UTF-16, and UTF-32 are just different encodings of the same character set, Unicode.
Why strictly distinguish between the two concepts of character set and encoding?
Character set and encoding: the one-to-one case
In many character encoding schemes, a character set has only one encoding implementation, so the two correspond one-to-one. GB2312 is an example. In this case it does not matter which name you use: "GB2312 encoding" and "GB2312 character set" refer to the same thing. The speaker may not be distinguishing the two at all, yet the statement is never wrong.
Why is one-to-one the common situation?
Take GB2312 as an example. GB stands for Guo Biao, meaning "national standard". A standard exists to unify things. If one standard came with N encodings, which one would you use?
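The one-to-one case can be sketched in Python, where a single codec name, "gb2312", stands for both the set of characters and their byte representation. This is only an illustration: the helper `in_gb2312` is a hypothetical name introduced here, and the sample characters are arbitrary choices.

```python
# In the one-to-one case, one name covers both "which characters exist"
# and "how they are stored as bytes".
text = "中"
gb_bytes = text.encode("gb2312")
print(gb_bytes.hex())  # d6d0


def in_gb2312(ch: str) -> bool:
    """Hypothetical helper: True if ch can be encoded with the gb2312 codec."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


print(in_gb2312("中"))   # the character belongs to the set
print(in_gb2312("😀"))  # an emoji is outside the set, so encoding fails
```

Because there is only one way to turn a GB2312 character into bytes here, asking "is it in the character set?" and "can it be encoded?" amount to the same question.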
Character set and encoding: the one-to-many case
This is seen mainly in Unicode. The Unicode character set corresponds to three encodings: UTF-8, UTF-16, and UTF-32. If we kept naming things as loosely as before, confusion would be easy.
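A short Python sketch makes the one-to-many relationship visible. The character "中" is an arbitrary sample, and the big-endian codec variants are chosen here only so that no byte order mark appears in the output.

```python
# One character, one code point in the Unicode character set,
# three different byte sequences depending on the encoding.
ch = "中"

code_point = ord(ch)              # the abstract number assigned by Unicode
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")    # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")    # big-endian, no byte order mark

print(f"U+{code_point:04X}")      # U+4E2D
print(utf8.hex())                 # e4b8ad
print(utf16.hex())                # 4e2d
print(utf32.hex())                # 00004e2d

# Decoding any of the three byte sequences yields the same character.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == utf32.decode("utf-32-be")
```

The code point stays the same throughout; only the byte-level representation changes with the encoding.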
Why is Unicode so special?
People create new character set standards for one basic reason: the old character set does not have enough characters.
The goal of Unicode is to unify all character sets and include every character. Once character sets have developed to this point, they have reached the end of the road: creating yet another character set is neither necessary nor worthwhile.
But what if someone finds the current encoding scheme unsatisfactory? Since no new character set will appear, the only room for change is in the encoding. That is how multiple implementations came about, breaking the traditional one-to-one correspondence.
This is why we strictly distinguish character sets from encodings.
Once the encoding is specified, the corresponding character set is naturally specified too, so the encoding is what we ultimately need to care about.
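To illustrate why the encoding is the thing to specify, the following sketch decodes the same bytes under two different encoding names. The choice of Latin-1 as the "wrong" encoding is an assumption made for the example, picked because it accepts any byte sequence without raising an error.

```python
# Bytes carry no label: the encoding name tells us how to read them.
data = "中".encode("utf-8")      # three bytes

right = data.decode("utf-8")     # read with the encoding that produced them
wrong = data.decode("latin-1")   # read with a different encoding

print(repr(right))   # the original character
print(len(wrong))    # 3: each byte was misread as a separate character
```

Knowing only that the text is "Unicode" would not be enough to read `data` correctly; knowing that it is UTF-8 is.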
Unicode comparison
Let's look at a chart that shows some differences between early and present-day Unicode:
Note: For historical reasons, you will also see "Unicode" and "UTF-8" listed side by side in many places. In such cases "Unicode" usually means UTF-16 or the earlier UCS-2 encoding. Later chapters analyze this further.
The following, taken from the save dialog of the Notepad program, is a non-standard use of "Unicode"; here "Unicode" refers to UTF-16:
We have mentioned Unicode many times. For various reasons, we must admit that the word "Unicode" means different things in different contexts. It may refer to:
Unicode Standard
Unicode Character Set
Unicode abstract encoding (a number), that is, the code point
A specific Unicode encoding implementation, usually variable-length UTF-16 (16 or 32 bits) or the earlier fixed-length 16-bit UCS-2
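The difference between a code point and a specific encoding of it, and the variable length of UTF-16, can be seen in a small Python sketch. The two sample characters are arbitrary: one lies inside the 16-bit range, the other beyond it.

```python
# A code point is an abstract number; UTF-16 stores it in one or two 16-bit units.
bmp = "中"       # code point U+4E2D, fits in 16 bits
astral = "😀"    # code point U+1F600, does not fit in 16 bits

bmp_utf16 = bmp.encode("utf-16-be")
astral_utf16 = astral.encode("utf-16-be")

print(f"U+{ord(bmp):04X}", bmp_utf16.hex())        # U+4E2D 4e2d
print(f"U+{ord(astral):04X}", astral_utf16.hex())  # U+1F600 d83dde00

print(len(bmp_utf16))     # 2 bytes: one 16-bit unit
print(len(astral_utf16))  # 4 bytes: two 16-bit units (a surrogate pair)
```

A fixed 16-bit scheme such as UCS-2 has no way to represent the second character, which is one reason the distinction between "Unicode" as a character set and "Unicode" as a particular encoding matters.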
These topics will be further discussed in the subsequent chapters.