The Master says: garbled text is a kind of longing, and longing is a kind of disease. I believe many web developers have wrestled with garbled text: it may show up on the page, in a form submission, in the database, in an API, in a crawler... In short, anywhere characters are input or output, you have probably run into it.
To explain and solve the garbled-text problem, and to clear up some common misunderstandings, I plan to write a series introducing character encoding; three parts are planned so far.
The content leans toward popular science, so I hope the experts will go easy on me. I believe this is important basic knowledge: if you read the whole series, it will guide your thinking the next time you hit garbled text and help you find the cause quickly.
Character
Our written language is built around characters. A character (often abbreviated char) is in most cases the smallest unit of text (note: only "in most cases", because the world is a wonderful place).
Characters are not limited to the written words of a language: phonetic symbols, mathematical symbols, decorative symbols, special symbols, box-drawing symbols, and even emoji are all characters.
Character Set
A character set is a collection of characters to be used. For example, "the English alphabet" is a character set, although put that way it means nothing to a computer.
Generally speaking, a character set is a specification that includes a number of characters and assigns each one a number as its index (to avoid confusion with the concept of encoding introduced later, I will call it a "number" for now).
ASCII
ASCII is the most classic character set in today's computing world. It includes the English letters, a handful of punctuation marks, and some control characters meant for computers rather than for people to read.
GB Series
GB2312 is a common Chinese character set. "GB" stands for "guobiao" (national standard); many of China's industry standards carry code names in this style. It contains several thousand Chinese characters and several hundred Western characters.
GBK is an extension of GB2312 that Microsoft first implemented in Windows 95. It adds many traditional Chinese characters and Western characters, for a total of about 20,000 characters. The "K" comes from the pinyin "kuozhan" (extension).
GB18030 is the national-standard upgrade of GB2312 (there were other upgrades too, but most were swept away by history). It includes more than 70,000 characters; most of the additions are traditional Chinese characters, newly coined characters, rare characters, minority-script characters, and Japanese and Korean characters.
All three GB character sets were designed for Chinese, with extensions for other East Asian scripts (the so-called CJK characters: Chinese, Japanese, Korean), since these neighboring scripts are also common in China.
Big5
Also known as the "Big Five code", this is the character set commonly used in Traditional Chinese regions such as Taiwan, Hong Kong, and Macao. It includes somewhat more than 10,000 characters, mostly traditional Chinese. It became the de facto standard because Windows adopted it as the default encoding of its Traditional Chinese edition.
UCS
As with Chinese, nearly every language had to design a character set of its own, and that is a problem.
Recognizing this problem, ISO designed the Universal Character Set (UCS), intended to hold every character in the world (even the aliens') in one single character set.
UCS turned out to be a success, because the Internet grew so fast: every day, people in every country browse content in different languages from all over the world, so naturally we want one character set that covers all of the world's characters.
Character Encoding
Many people confuse the concepts of character set and character encoding, which is not really surprising, because the two are often defined together as a bundle.
A character encoding encodes each character of a character set according to certain technical requirements, such as 8-bit units, so that text can be stored on computers and transmitted over networks.
Put simply, it turns the number of each character in the character set into a format the computer can understand.
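As a rough illustration of the difference, here is a minimal Python 3 sketch. It assumes Python's built-in codecs and uses Unicode as the character set; the sample character is an arbitrary choice. `ord` returns the character's number, while `str.encode` returns the bytes of one particular encoding.

```python
# The number (code point) belongs to the character set;
# the bytes belong to one particular encoding.
ch = "中"
number = ord(ch)                    # the character's number in Unicode
utf8_bytes = ch.encode("utf-8")     # one way to turn the character into bytes
gb_bytes = ch.encode("gb2312")      # a different character set/encoding, different bytes

print(hex(number))       # 0x4e2d
print(utf8_bytes.hex())  # e4b8ad
print(gb_bytes.hex())    # d6d0
```

The same character yields different bytes under different encodings, which is exactly why reading bytes with the wrong encoding produces garbled text.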
Many character sets came with their own encoding scheme when they were drawn up, for example ASCII, the GB series, and Big5. For these combined character set/encodings the naming is blurry, but in a technical context it generally causes no misunderstanding.
ASCII
Standard ASCII contains only 128 characters and can be encoded perfectly in 7 bits. For example, the ASCII encoding of the English letter A is hexadecimal 0x41. The remaining 1 bit of the byte is unused and can serve as a parity bit.
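A small sketch of the spare bit in use. The helper `with_even_parity` is hypothetical, and the choice of even parity is an assumption made for illustration; the text above only says the bit can carry parity.

```python
def with_even_parity(ch: str) -> int:
    """Return the 8-bit value of a 7-bit ASCII character plus an even-parity top bit."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a standard (7-bit) ASCII character")
    parity = bin(code).count("1") % 2   # 1 if the number of 1-bits is odd
    return code | (parity << 7)         # store the parity in the unused 8th bit

print(hex(ord("A")))               # 0x41
print(hex(with_even_parity("A")))  # 0x41 (two 1-bits, parity bit stays 0)
print(hex(with_even_parity("C")))  # 0xc3 (three 1-bits, parity bit is set)
```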
Later, ASCII was extended to 8 bits with 256 characters, which fit exactly in 8 bits (1 byte) and remain fully compatible with the 7-bit version.
Standard ASCII is an international standard, while extended ASCII is not. Wherever "ASCII-compatible" appears below, it means compatible with standard 7-bit ASCII.
GB2312
GB2312 uses a variable-length encoding of 1 or 2 bytes. The single-byte part is compatible with ASCII, and the other several thousand characters are encoded as two bytes.
GB2312's encoding uses a "row-cell" (quwei) concept. When I was a kid we had a row-cell code table at home, which went with the "row-cell input method" of ancient Windows.
GBK
GBK's encoding scheme is a superset of GB2312's: it is fully compatible with GB2312 and additionally uses the code space that GB2312 left undefined.
GB18030
GB18030's encoding scheme is slightly more complex: it is a variable-length encoding of 1, 2, or 4 bytes. It is fully compatible with GB2312 and basically compatible with GBK.
Big5
Big5 uses a fixed two-byte encoding. Its first byte avoids the ASCII range, so in practice it is approximately compatible with ASCII. Because its second byte can fall in the ASCII range, though, the compatibility is not perfect. See Wikipedia for the details; it is quite interesting.
Unicode
Unicode has a very grand Chinese name, "Wanguo Ma" (literally "code of ten thousand nations"); hehe, the name really reeks of rustic heavy metal. It is in fact also a character set, and it bears a subtle yet strong resemblance to UCS: the organizations on both sides realized that a split would do no good and reached close agreement with each other. Although they are indeed two different standards, most of the time it does no harm to treat them as the same thing.
Encoded directly, Unicode is fixed-length: depending on the version, each character takes 2 bytes (corresponding to UCS-2) or 4 bytes (corresponding to UCS-4).
Because it is fixed-length, this encoding is too crude. For example, English text sent as 4-byte Unicode takes four times the space it needs in ASCII. The 2-byte version is not comfortable either: first, its capacity is small, and second, it is still wasteful for English text. So optimized implementations were devised, known as Unicode Transformation Formats, our familiar UTF.
UTF-32
UTF-32 is the simplest way to implement UCS-4: it just uses a fixed length of 4 bytes per character.
The downside is obvious: it wastes space.
It has some advantages too. First, converting it to Unicode is the simplest of all; second, random access such as "the i-th character" is easy to compute: dividing the byte offset by 4 gives the character index.
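Both points can be seen in a short sketch, assuming Python's `utf-32-be` codec (big-endian, no byte-order mark). The helper `char_at` is a hypothetical name for the direct-offset lookup.

```python
text = "hello, 世界"
data = text.encode("utf-32-be")

def char_at(buf: bytes, i: int) -> str:
    """Return the i-th character of a UTF-32-BE buffer by direct offset."""
    return buf[4 * i:4 * i + 4].decode("utf-32-be")

print(len(text), len(data))  # 9 36  (always 4 bytes per character)
print(char_at(data, 7))      # 世
```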
But because of combining characters (as in Vietnamese, or the trick used to build those very long "streaming tears" emoticons), a single UTF-32 code unit (4 bytes) is not strictly one text-editing unit, so in that case UTF-32 offers little advantage to a typesetting system.
UTF-16
UTF-16 is a variable-length encoding of UCS-4 that uses 2 or 4 bytes per character.
Because the characters in common use mostly fall within the first 65,536, a character in UTF-16 usually takes only 2 bytes, which saves nearly half the space compared with UTF-32, and parsing it is not too cumbersome.
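The 2-or-4-byte rule can be sketched like this. The helper `utf16_units` is hypothetical; the text above does not spell out the 4-byte case, so the surrogate-pair arithmetic here is the standard UTF-16 rule added for illustration.

```python
def utf16_units(ch: str) -> list:
    """Return the UTF-16 code units (as ints) for a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                    # within the first 65,536: one 2-byte unit
    cp -= 0x10000
    return [0xD800 + (cp >> 10),       # high surrogate
            0xDC00 + (cp & 0x3FF)]     # low surrogate

for ch in ("A", "中", "\U0001F600"):
    units = utf16_units(ch)
    print([hex(u) for u in units], len(ch.encode("utf-16-be")), "bytes")
```

The first two characters need one unit (2 bytes) each, while the emoji U+1F600 needs two units (4 bytes).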
Fixed-length encodings have a big advantage for computer programs: string processing is much easier, especially for implementing regular expressions. UTF-16 behaves like a fixed-length encoding most of the time, which is why the string internals of many modern languages, such as C# and Java, use it: it balances efficiency and size.
UTF-8
UTF-8 is probably the most widely used universal encoding on the Internet today.
It is a variable-length encoding of 1 to 4 bytes (originally 1 to 6 bytes, but since the longer forms fall outside what Unicode defines, it was later restricted to 1 to 4 bytes). The single-byte case is compatible with ASCII, a very nice property on today's English-dominated Internet: ASCII text needs no conversion at all, which saves a lot of effort.
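A quick sketch of the variable length and the ASCII compatibility, assuming Python's built-in `utf-8` codec; the sample characters are arbitrary picks, one for each length.

```python
samples = ["A", "\u00e9", "中", "\U0001F600"]   # 1-, 2-, 3- and 4-byte cases
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex()}")

# ASCII-only text produces identical bytes under both encodings.
assert "hello".encode("utf-8") == "hello".encode("ascii")
```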
But its shortcomings are also fairly obvious: the algorithm for converting UTF-8 to Unicode is more complex and less efficient.
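To give a feel for that extra work, here is a simplified decoding sketch. `decode_utf8_char` is a hypothetical helper, not how any particular library does it, and it deliberately skips checks for overlong forms, surrogates, and truncated input.

```python
def decode_utf8_char(data: bytes, i: int = 0):
    """Decode one code point starting at data[i]; return (code_point, next_index)."""
    first = data[i]
    if first < 0x80:                  # 0xxxxxxx: 1 byte, same as ASCII
        return first, i + 1
    if first >> 5 == 0b110:           # 110xxxxx: 2 bytes
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:        # 1110xxxx: 3 bytes
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:       # 11110xxx: 4 bytes
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    for b in data[i + 1:i + length]:
        if b >> 6 != 0b10:            # continuation bytes look like 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)   # append 6 payload bits
    return cp, i + length

print(hex(decode_utf8_char("中".encode("utf-8"))[0]))  # 0x4e2d
```

Compared with UTF-32, where the 4 bytes are the number itself, every character here costs a branch on the lead byte plus bit shuffling.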
UTF-8 is also at a disadvantage in Chinese environments, because most Chinese characters take 3 bytes in UTF-8, which wastes more space than the GB series or UTF-16.
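The size difference is easy to measure, assuming Python's built-in codecs; the four-character sample string is an arbitrary choice of common Chinese characters.

```python
text = "字符编码"  # four common Chinese characters
sizes = {enc: len(text.encode(enc)) for enc in ("gbk", "utf-16-le", "utf-8")}
print(sizes)  # {'gbk': 8, 'utf-16-le': 8, 'utf-8': 12}
```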
UTF-8 does not encode anything above 0x10FFFF, so strictly speaking it covers only a subset of UCS-4. Fortunately, the missing part is of no concern to UCS/Unicode itself, so worrying about it would be splitting hairs.
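A small check of that upper bound, assuming Python's strict `utf-8` decoder. The second byte string is hand-built for illustration: it follows the 4-byte bit pattern but would decode to 0x110000.

```python
highest = b"\xf4\x8f\xbf\xbf"    # UTF-8 bytes for U+10FFFF
too_high = b"\xf4\x90\x80\x80"   # would be 0x110000 if it were allowed

print(hex(ord(highest.decode("utf-8"))))  # 0x10ffff

try:
    too_high.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)  # False
```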
I think a large part of why UTF-8 became the Internet mainstream is that its single-byte form is compatible with ASCII.
Summary So Far
Character Set
A collection of many characters, each given a number; meant for people.
Encoding
An implementation of a character set that turns its numbers into binary according to certain rules; meant for computers.
GB Series
China's national-standard character sets/encodings. GB2312 and GBK are basically outdated. If you need to support Japanese and Korean but cannot escape GBK's clutches (for example, because of legacy code), consider upgrading to GB18030, the latest and most advanced version of GB.
UCS/Unicode
A single character set that gathers all of the world's characters, whether you have seen them or not; accepted worldwide as an international standard.
UTF
The transformation formats of UCS/Unicode: practical encoding schemes for computer implementation.
UTF-16
An encoding of UCS/Unicode that strikes a compromise between processing efficiency and storage space; often used as the internal string encoding in modern languages.
UTF-8
An encoding of UCS/Unicode that favors saving space. Because it is compatible with ASCII, it is very friendly to English text and has become the mainstream of today's Internet (even the de facto standard).
If your site has no historical baggage, go straight to UTF-8, no discussion needed.
If your site does have some historical baggage, move to UTF-8 wherever you can, and convert encodings at the interfaces to the legacy parts.
Coming Up Next
Heh, although this article is titled "File Encoding", the content above has not actually strayed off topic.
Some readers probably cannot take it any longer: "Please, I have no interest in the pile of theory you are telling me. All I actually want to know is why my web page is garbled, teacher."
To that question I have just four words: see the next part, "File Encoding: Web Chapter".