In this post I'll briefly go over the character-encoding problems that come up in day-to-day front-end HTML and JavaScript work.
Inside a computer, everything we store is ultimately binary. The conversion between the English letters, Chinese characters and other symbols shown on screen and the binary data that is actually stored is what we call encoding.
There are two basic concepts to keep apart, the character set (charset) and the character encoding:
CharSet (character set): a table that maps characters to numbers. It is the charset that decides that 107 is the 'k' in koubei and 21475 is the 口 ("mouth") in 口碑 (koubei, word-of-mouth); different tables define different mappings, for example ASCII, GB2312 and Unicode. With such a number-to-character table, we can turn a number stored in binary back into a character.
Character encoding: the way those numbers are actually written out. For the same 口 with number 21475, do we express it as \u53e3 or as %E5%8F%A3? That is decided by the character encoding.
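As a minimal sketch (assuming a modern browser console or Node.js), the two concepts look like this in JavaScript:

// Character set: the mapping between numbers (code points) and characters.
String.fromCharCode(107);     // 'k'  -- 107 is the 'k' in koubei
String.fromCharCode(21475);   // '口' -- 21475 (0x53E3) is 口
'口'.charCodeAt(0);           // 21475

// Character encoding: how the same character is written out.
'\u53e3';                     // '口' -- Unicode escape, 0x53E3 === 21475
encodeURIComponent('口');     // '%E5%8F%A3' -- the UTF-8 bytes, percent-encoded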
For strings like 'koubei.com', which are made of American characters, the Americans devised a character set called ASCII (American Standard Code for Information Interchange). It uses the 128 numbers 0–127 (2^7, 0x00–0x7F) to represent the 128 commonly used characters such as 1, 2, 3, A, B, C. That is 7 bits in total; add one more high-order bit (which in a signed byte is the sign bit, with two's complement used for negative numbers) and you get the 8 bits that make up a byte. If the Americans had not been so stingy and had designed the byte with more bits from the start, the world would have far fewer encoding problems; but at the time they figured 8 bits, enough for 128 different characters, would do.
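A quick way to see that 7-bit limit, again just a console sketch:

// Every ASCII character has a code point below 128, so it fits in 7 bits.
'k'.charCodeAt(0);                                    // 107
(107).toString(2);                                    // '1101011' -- 7 bits
[...'koubei.com'].every(c => c.charCodeAt(0) < 128);  // true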
Computers were made by Americans, so getting their own symbols encoded was easy and everything looked fine. But once computers went international the problems surfaced. Take China as an example: there are a good tens of thousands of Chinese characters, so what do you do?
The existing system of 8 bits to a byte is the foundation; it cannot be broken or changed to 16 bits or the like, because the change would be too big. The only other way is to use several bytes (several "ASCII characters") to represent one character, which is MBCS (Multi-Byte Character System).
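To see why one byte is not enough, here is a tiny back-of-the-envelope check in JavaScript:

// One byte (8 bits) can only distinguish 256 values, but 口 is number 21475.
(21475).toString(2);          // '101001111100011' -- 15 bits
(21475).toString(16);         // '53e3' -- needs two bytes, 0x53 and 0xE3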
With the MBCS idea we can express many more characters: with 2 bytes (16 bits) there are in theory 2^16 = 65,536 possible characters. But how are these codes assigned to characters? For example, who decided that the 口 of 口碑 gets the Unicode code 21475? The character set, i.e. the charset introduced above. ASCII is the most basic character set; on top of it there are character sets such as GB2312 and Big5, which are MBCS charsets for Simplified and Traditional Chinese respectively, and so on.

Finally, an organization called the Unicode Consortium decided to make one standard that includes all characters (UCS, the Universal Character Set) together with the corresponding encoding methods, namely Unicode. Starting in 1991 it released the first edition of the Unicode international standard (ISBN 0-321-18578-1), and ISO also took part in this standardization with ISO/IEC 10646: the Universal Character Set. In short, Unicode is a character standard that covers essentially every symbol in use on Earth, and it is used more and more widely; the ECMA standard also specifies that JavaScript uses Unicode for its characters internally (which means JavaScript variable names and function names are allowed to be Chinese!).
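As a quick illustration of that last point (the identifier names here are made up for the example), the following runs in any modern JavaScript engine:

// ECMAScript source text is Unicode, so identifiers may contain Chinese characters.
const 口碑 = 'koubei.com';
function 打招呼(名字) {
  return '你好, ' + 名字;
}
console.log(打招呼(口碑));   // '你好, koubei.com'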
Developers in China often describe the problems they hit as converting between GBK, GB2312 and UTF-8. Strictly speaking, that is not quite accurate: GBK and GB2312 are character sets (charsets), while UTF-8 is an encoding (character encoding), a way of encoding the UCS character set of the Unicode standard. Because web pages using the Unicode character set are mostly encoded in UTF-8, the names are often put side by side, even though they are not parallel concepts.
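To make the distinction concrete, here is a sketch using the standard TextEncoder/TextDecoder APIs (assuming a modern browser or Node.js; TextEncoder only produces UTF-8, while TextDecoder also accepts legacy charset labels such as 'gbk' for decoding):

// UTF-8 is one way of turning Unicode code points into bytes.
const utf8Bytes = new TextEncoder().encode('口');   // Uint8Array [0xE5, 0x8F, 0xA3]
new TextDecoder('utf-8').decode(utf8Bytes);         // '口'

// GBK names a different byte mapping: decoding the same bytes with the
// 'gbk' label yields garbled or replacement characters instead of 口.
new TextDecoder('gbk').decode(utf8Bytes);           // not '口'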
With Unicode we have a master key, at least until human civilization runs into aliens, so use it. The most widely used encoding of Unicode today is UTF-8 (8-bit UCS/Unicode Transformation Format), which has a couple of particularly good points:
1. It encodes the UCS character set, so it is universal worldwide.
2. It is a variable-length encoding (variable-length character encoding) and is compatible with ASCII.
The 2nd point is a big advantage: systems that previously used pure ASCII stay compatible, and no extra storage is wasted (suppose instead a fixed-length encoding in which every character is made of 2 bytes; ASCII characters would then take up twice the storage space). The sketch below shows the byte lengths.
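For example, a minimal check with the standard TextEncoder (which emits UTF-8):

// UTF-8 is variable length: ASCII stays at 1 byte, CJK characters take more.
new TextEncoder().encode('k').length;           // 1 -- same as plain ASCII
new TextEncoder().encode('口').length;          // 3 -- this character needs 3 bytes
new TextEncoder().encode('koubei.com').length;  // 10 -- no overhead for ASCII text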