Reprint: http://www.qianxingzhem.com/post-1499.html
Web page encoding: that's all there is to it
Encoding has always been a headache for beginners. The differences among GBK, GB2312, and UTF-8, the three most common page encodings, leave many beginners disoriented, and no amount of explanation seems to make them clear. But encoding is very important, especially for web pages. If what you type is not garbled but the web page is, the cause is most likely the encoding. Besides garbled text, encoding can cause other problems (for example, the IE6 CSS loading problem). Stalker M wrote this article to explain the encoding problem thoroughly. If you have run into a similar problem, read this post closely.

ANSI, GBK, GB2312, UTF-8, GB18030, and UNICODE
These encoding names come up often. Although I list them together, that does not mean they are all the same kind of thing. This part is quoted from the Internet with slight modifications; I do not know the original source, so I cannot credit it.
A long time ago, a group of people decided to use 8 transistors, each of which could be switched on or off, and combine their states to represent everything in the world. They called this a "byte". Later they built machines that could process these bytes. Once started, such a machine could combine bytes into many states, and the states could change. They called this machine a "computer".
At first, computers were used only in the United States. An eight-bit byte can take 256 (2 to the 8th power) different states. They gave the first 32 states, numbered from 0, special purposes: whenever a terminal or printer received one of these agreed-upon bytes, it had to perform an agreed-upon action. On receiving 0x0A, the terminal starts a new line; on receiving 0x07, the terminal beeps at you; on receiving 0x1B, the printer prints in reverse video, or the terminal displays letters in color. They thought this worked well, so they called the byte states below 0x20 "control codes".
They also assigned consecutive byte states to the space, punctuation marks, digits, and uppercase and lowercase letters, numbering them up to 127, so that computers could store English text using different bytes. Everyone thought this was good, so they called the scheme the ANSI "ASCII" code (American Standard Code for Information Interchange). At that time, all the computers in the world used the same ASCII scheme to store English text.
Later, computers spread more and more widely. To store their own scripts, countries around the world decided to use the vacant positions after 127 for new letters and symbols, and also added many horizontal lines, vertical lines, crosses, and other shapes needed for drawing tables, numbering all the way to the last state, 255. The characters from 128 to 255 are called the "extended character set". After that, the original numbering scheme had no room left for more codes.
By the time the Chinese got computers, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used Chinese characters had to be stored. So the Chinese worked out their own solution and simply dropped the odd symbols after 127. The rule: a byte less than 127 keeps its original meaning, but two bytes greater than 127 in a row represent one Chinese character. The first byte (called the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which allows about 7,000 Simplified Chinese characters to be encoded. These codes also include mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation, and letters that already exist in ASCII were all given new two-byte codes; these are the so-called "full-width" characters, while the original ones below 127 are called "half-width" characters.
The Chinese thought this was very good, so they called the scheme "GB2312". GB2312 is a Chinese extension of ASCII.
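The byte rule above can be checked with a short Python sketch. It relies on Python's built-in `gb2312` codec; the helper name `gb2312_info` is made up for this illustration and is not part of the original article.

```python
def gb2312_info(ch: str) -> tuple[int, str]:
    """Return (number of bytes, hex string) for one character encoded in GB2312."""
    raw = ch.encode("gb2312")
    return len(raw), " ".join(f"{b:02X}" for b in raw)

# "A" is a half-width ASCII letter, "汉" is a Chinese character,
# and "Ａ" is the full-width form of the letter A.
for ch in ["A", "汉", "Ａ"]:
    length, hex_bytes = gb2312_info(ch)
    kind = "half-width (same as ASCII)" if length == 1 else "two bytes, both above 127"
    print(f"{ch}: {hex_bytes}  {kind}")
```

The half-width letter keeps its single ASCII byte, while both the Chinese character and the full-width letter take two bytes in the 0xA1 to 0xFE range.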
But there are too many Chinese characters, and before long this was still not enough. So the requirement that the low byte also be a code above 127 was dropped: as long as the first byte is greater than 127, it marks the start of a Chinese character, whether or not the byte that follows is in the extended character set range. The resulting expanded scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities also needed to use computers, so the scheme was expanded again with several thousand new minority-script characters, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be passed on in the computer age.
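A rough way to see that each standard extends the previous one is to try encoding the same characters with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. The helper `can_encode` and the sample characters are chosen only for this sketch.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

simplified = "汉"   # simplified form, present in GB2312
traditional = "漢"  # traditional form, added by GBK

for codec in ("gb2312", "gbk", "gb18030"):
    print(codec, can_encode(simplified, codec), can_encode(traditional, codec))

# For characters that both standards share, GBK keeps the GB2312 byte values.
print(simplified.encode("gbk") == simplified.encode("gb2312"))
```

The traditional character fails only under `gb2312`, which matches the description of GBK as a superset.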
At that time, every country did what China did and devised its own encoding standard. As a result, nobody understood anybody else's encoding and nobody supported anybody else's encoding. Back then, to make a computer display Chinese characters, one had to install a "Chinese character system" dedicated to handling the display and input of Chinese characters; install the wrong character system and the display turned into a mess. What could be done? At this point, an international organization called ISO (International Organization for Standardization) decided to tackle the problem. Their approach was simple: scrap all the regional encoding schemes and create a new code that includes every culture, letter, and symbol on Earth. They planned to call it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "Unicode".
When Unicode was being drawn up, computer memory capacity had grown greatly and space was no longer a problem. So ISO stipulated directly that every character must be represented uniformly by two bytes, that is, 16 bits. For the "half-width" characters in ASCII, Unicode keeps the original code values unchanged and only extends their length from 8 bits to 16 bits, while the characters of other cultures and languages are all re-encoded. Since a "half-width" English symbol needs only the low 8 bits, its high 8 bits are always 0, so this generous scheme takes twice as much space to store English text.
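The doubled size is easy to see in Python. This sketch uses the `utf-16-be` codec as a stand-in for the fixed two-byte scheme described above (for these characters the bytes are the same); the sample text is arbitrary.

```python
text = "Hi"

as_ascii = text.encode("ascii")
as_16bit = text.encode("utf-16-be")  # big-endian, no byte order mark

# Each ASCII letter keeps its value in the low byte; the high byte is zero.
print(as_ascii.hex(" "))   # 48 69
print(as_16bit.hex(" "))   # 00 48 00 69
print(len(as_ascii), len(as_16bit))
```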
However, Unicode was not designed to be compatible with any existing encoding scheme. As a result, GBK and Unicode assign completely different codes to Chinese characters, and there is no simple arithmetic that converts text from Unicode to another encoding; the conversion must be done by table lookup. Unicode represents one character in two bytes, so it can express 65,536 different characters in total, which may already cover the symbols of all the world's cultures.
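To make the "no simple arithmetic" point concrete, the sketch below prints the Unicode code point and the GBK bytes of two characters side by side. Python's codecs do the table lookup internally; the two sample characters are just an illustration.

```python
for ch in "汉字":
    code_point = ord(ch)      # the character's Unicode number
    gbk = ch.encode("gbk")    # table lookup performed by the codec
    print(f"{ch}: Unicode U+{code_point:04X}, GBK {gbk.hex(' ').upper()}")

# The two orderings do not even agree: "汉" comes after "字" in Unicode
# but before it in GBK, so no fixed offset can map one to the other.
print(ord("汉") > ord("字"), "汉".encode("gbk") < "字".encode("gbk"))

# Number of distinct values that fit in two bytes
print(2 ** 16)
```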
Unicode arrived together with the rise of computer networks, so how to transmit Unicode over a network also had to be considered. Many UTF (UCS Transformation Format) standards for transmission appeared. As the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For the sake of reliable transmission, the mapping from Unicode to UTF is not a direct correspondence; it goes through certain algorithms and rules.
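After reading this, I believe you understand the relationships among these encodings more clearly. Let me summarize briefly: ASCII came first; GB2312, GBK, and GB18030 are successive Chinese extensions of it; Unicode is the character set meant to cover every script; and UTF-8 and UTF-16 are rules for turning Unicode into bytes.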
Create a new text document, type the two characters "联通" (Unicom) into it, and save it. When you open it again, the "联通" you typed turns into two garbled characters.
The problem is a collision between the GB2312 encoding and the UTF-8 encoding. Here are the conversion rules from Unicode to UTF-8, taken from the Internet:
0000 - 007F    0xxxxxxx
0080 - 07FF    110xxxxx 10xxxxxx
0800 - FFFF    1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of the character "汉" (Han) is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 1100 0100 1001. Split this bit string the way the three-byte template requires, into 0110 110001 001001, and substitute the pieces for the x's in the template in order. The result is 1110-0110 10-110001 10-001001, that is, E6 B1 89, which is the character's UTF-8 encoding.
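The template substitution can be written out as a few bit operations. This is a minimal sketch of the standard UTF-8 rules for code points up to FFFF; the function name `utf8_encode` is invented here, and the sketch ignores details such as surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a code point in the range 0000..FFFF using the UTF-8 templates."""
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code points above FFFF are outside this sketch")

encoded = utf8_encode(0x6C49)
print(encoded.hex(" ").upper())            # E6 B1 89
print(encoded == "汉".encode("utf-8"))     # agrees with Python's own encoder
```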
When you create a new text file, Notepad's default encoding is ANSI. If you type Chinese characters under the ANSI encoding, what is actually used is the GB family of encodings, in which the internal code of "联通" is:
C1 1100 0001
AA 1010 1010
CD 1100 1101
A8 1010 1000
Notice anything? The first and second bytes begin with "110" and "10", and so do the third and fourth bytes, which exactly matches the two-byte template in the UTF-8 rules. So when Notepad opens the file again, it mistakes it for a UTF-8 encoded file. Remove the 110 from the first byte and the 10 from the second byte and we get "00001 101010"; align the bits and pad with leading zeros to get "0000 0000 0110 1010". Sorry, that is Unicode 006A, the lowercase letter "j". The next two bytes decode as UTF-8 to 0368, which is not a character that displays anything meaningful. This is why a file containing only the two characters "联通" cannot be displayed correctly in Notepad.
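The misreading described above can be reproduced by hand in Python. The helper `decode_two_byte` is a made-up name that applies the two-byte template literally, the way a naive decoder would; it is a sketch, not how Notepad is actually implemented.

```python
raw = "联通".encode("gbk")
print(raw.hex(" ").upper())   # C1 AA CD A8

def decode_two_byte(first: int, second: int) -> int:
    """Strip the 110 and 10 prefixes and join the remaining 5 + 6 bits."""
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("bytes do not match the two-byte UTF-8 template")
    return ((first & 0b11111) << 6) | (second & 0b111111)

print(hex(decode_two_byte(raw[0], raw[1])))   # 0x6a, the letter "j"
print(hex(decode_two_byte(raw[2], raw[3])))   # 0x368

# A strict UTF-8 decoder, such as Python's, refuses these bytes instead.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("strict UTF-8 decoding rejects this byte sequence")
```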
Many other problems follow from this one. A common one: I saved the file in encoding XX, so why does it open as encoding YY every time? The reason is that, although you saved it as XX, the system misidentifies it as YY when reading it, so it is still displayed as YY. To avoid this problem, Microsoft came up with something called the BOM header.

Questions about the document BOM header
When you save a UTF-8 encoded file with software such as Windows Notepad, three invisible bytes (0xEF 0xBB 0xBF, the BOM) are inserted at the beginning of the file. This hidden byte sequence lets editors such as Notepad recognize that the file is encoded in UTF-8, which avoids the problem above. For ordinary files, this causes no trouble.
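Checking for these three bytes, and dropping them for tools that do not skip them, takes only a few lines of Python. The helper names and the sample data are invented for this sketch; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data

sample = codecs.BOM_UTF8 + "<html>".encode("utf-8")  # stand-in for a saved file
print(has_utf8_bom(sample))
print(strip_utf8_bom(sample))
# The utf-8-sig codec drops a leading BOM while decoding.
print(sample.decode("utf-8-sig"))
```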
But it also has drawbacks, especially in web pages. PHP does not ignore the BOM, so when it reads, includes, or references such a file, the BOM is treated as part of the text at the beginning of the file. Given how an embedded language works, this string of characters is output directly. As a result, even if the page's top padding is set to 0, the page cannot sit flush against the top of the browser window, because there are 3 characters at the beginning of the HTML. If you find unexplained blank space in a web page, a BOM header in a file is a likely cause. If you run into this problem, save the file without the BOM header!

How to view and modify the encoding of a document
1. View and modify directly with Notepad. Open the file with Notepad and click "Save As" at the top left; a save dialog pops up. Select the desired encoding at the bottom and click Save.
But this method offers very few options and is mostly useful for quickly checking what encoding a file is in. I recommend the following method instead.
2. Use another text editor (for example, Notepad++) to view and modify it. Almost all mature text editors (for example, Dreamweaver, EmEditor, and so on) can quickly view or modify a file's encoding, and Notepad++ is especially good at it.
After you open a file, the encoding of the current file is shown in the lower-right corner.
Click "Encoding" in the menu bar at the top to convert the current document to another encoding.

IE6 bug when loading CSS files
When the encoding of an HTML file differs from that of the CSS file it loads, IE6 cannot read the CSS file, so the HTML page has no styling. In my observation, this problem has never appeared in other browsers, only in IE6. Just save the CSS file in the same encoding as the HTML file.
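If many files need to be brought to the same encoding, re-saving them one by one in an editor gets tedious. The following is a minimal Python sketch of the same operation, assuming you already know the file's current encoding; the function name `convert_encoding` and the file name `style.css` are hypothetical.

```python
from pathlib import Path
import tempfile

def convert_encoding(path: Path, source: str, target: str) -> None:
    """Rewrite a text file in place: decode it as `source`, encode it as `target`."""
    text = path.read_bytes().decode(source)
    path.write_bytes(text.encode(target))

# Demonstration on a temporary GBK-encoded stylesheet
with tempfile.TemporaryDirectory() as tmp:
    css = Path(tmp) / "style.css"
    css.write_bytes("/* 样式 */ body { margin: 0; }".encode("gbk"))
    convert_encoding(css, "gbk", "utf-8")
    print(css.read_bytes().decode("utf-8"))
```

Decoding with the wrong source encoding either raises an error or silently produces garbled text, so it is worth checking the current encoding in an editor first.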