Character encoding has always been a headache for new users. The differences between GBK, GB2312, and UTF-8, the three encodings most often met on the web, leave many beginners confused no matter how they are explained. But encoding matters, especially on web pages: if the text you typed is fine yet the page shows garbled characters, the cause is almost always the encoding. Besides garbled text, encodings cause other problems too (such as the CSS loading problem in IE6). The purpose of this article is to explain the encoding problem thoroughly. If you run into a similar problem, read this article carefully.
ANSI, GBK, GB2312, UTF-8, GB18030, and UNICODE
These encoding keywords are quite common. Although I list them together, that does not mean they form a hierarchy. The content of this section is lightly adapted from material found online; I cannot credit the author because the source of the original article is unknown.
A long time ago, a group of people decided to use combinations of eight transistors, each switched on or off, to represent everything in the world. They called this a "byte". Later, they built machines that could process these bytes; such a machine could combine many bytes and change their states, and they called it a "computer".
At first, computers were used only in the United States. An eight-bit byte can represent a total of 256 (2 to the 8th power) different states. The 32 states numbered from 0 were reserved for special purposes: whenever a terminal or printer received one of these agreed-upon bytes, it had to perform some action. When it met 0x0A, the terminal would start a new line; when it met 0x07, it would beep at people; when it met 0x1B, the printer would print inverted text, or the terminal would display letters in color. Everyone thought this worked very well, so the byte states below 0x20 were called "control codes".
All the spaces, punctuation marks, digits, and upper- and lower-case letters were then assigned to consecutive byte states, numbered up to 127. In this way, a computer could store English text in bytes. Everyone thought this was a fine idea, so the scheme was named ASCII (American Standard Code for Information Interchange). At that time, all the computers in the world used the same ASCII scheme to save English text.
Later, computers spread more and more widely. In order to store their own texts on computers, countries around the world decided to use the states after 127 to represent their new letters and symbols, and they also added shapes such as horizontal lines, vertical lines, and crosses for drawing tables, numbering states all the way up to 255. The character set from 128 to 255 is called the "extended character set". After that, however, the original numbering scheme had no more room for new encodings.
When Chinese people got computers, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be stored. So the Chinese devised their own scheme: the odd symbols after 127 were simply dropped, and the rule was that a character smaller than 127 keeps its original meaning, but when two bytes both greater than 127 appear together, they represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE, which allows more than 7,000 simplified Chinese characters to be combined. These codes also include mathematical symbols, Roman and Greek letters, and Japanese kana; even the digits, punctuation marks, and letters that already existed in ASCII were re-encoded as two-byte codes. These are what we usually call "full-width" characters, while the original ones below 128 are called "half-width" characters.
The Chinese thought this scheme was very good, so they called it "GB2312". GB2312 is a Chinese extension of ASCII.
However, there are too many Chinese characters, and the scheme soon ran out of room. So the requirement that the low byte must also be a code above 127 was dropped: as long as the first byte is greater than 127, it marks the beginning of a Chinese character, no matter whether the following byte falls in the extended character set or not. This extended encoding scheme is called GBK. GBK includes all the contents of GB2312 and adds nearly 20,000 new characters (including traditional Chinese characters) and symbols. Later, ethnic minorities also needed to use computers, so the scheme was extended again, adding several thousand new minority-script characters, and GBK grew into GB18030. Since then, Chinese culture could be carried on into the computer age.
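As a quick illustration, here is a minimal Python sketch of how the GB family grows more inclusive; the two characters are just examples, and any simplified/traditional pair would do. The simplified character encodes in all three encodings, while the traditional one is rejected by GB2312 but accepted by GBK and GB18030.

```python
# A minimal sketch of how the GB family grows: GB2312 -> GBK -> GB18030.
# The two characters are only examples; "们" is simplified, "們" is traditional.
simplified = "们"
traditional = "們"

for codec in ("gb2312", "gbk", "gb18030"):
    for ch in (simplified, traditional):
        try:
            print(f"{ch!r} in {codec}: {ch.encode(codec).hex()}")
        except UnicodeEncodeError:
            print(f"{ch!r} in {codec}: not representable")
```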
At that time, every country developed its own encoding standard the way China did; no one knew anyone else's encodings, and no one supported them either. Even to make computers display Chinese, the Chinese had to install a "Chinese character system" to handle the display and input of Chinese characters, and installing the wrong character system turned the screen into a mess. What to do? At this point ISO, the International Organization for Standardization, decided to tackle the problem. Their approach was simple: discard all the regional encoding schemes and create a new encoding that covers every culture, every letter, and every symbol on earth! They planned to call it the "Universal Multiple-Octet Coded Character Set", or "UCS", commonly known as "UNICODE".
By the time UNICODE was formulated, computer memory had grown enormously and space was no longer a problem, so ISO simply required that every character be expressed in two bytes, that is, 16 bits. For the "half-width" characters from ASCII, UNICODE keeps the original code values unchanged and only extends their length from 8 bits to 16 bits, while the characters of all other cultures and languages are re-encoded from scratch. Because a "half-width" English symbol only needs the low 8 bits, its high 8 bits are always 0, so this grand scheme doubles the storage needed for plain English text.
However, UNICODE was not designed to be compatible with any existing encoding scheme, so the code values of Chinese characters in GBK and in UNICODE are completely different. There is no simple arithmetic that converts text from GBK to UNICODE or back; the conversion must go through lookup tables. UNICODE expresses one character in two bytes, which gives 65,536 possible characters in total, probably enough to cover the symbols of every culture in the world.
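A minimal sketch (the character is just an example) makes the point concrete: the UNICODE code point and the GBK bytes of the same character bear no arithmetic relationship, and Python's codecs convert between them through built-in mapping tables.

```python
# A minimal sketch: the same character has unrelated numbers in UNICODE and GBK.
ch = "汉"
print(hex(ord(ch)))            # the UNICODE code point, 0x6c49
print(ch.encode("gbk").hex())  # the GBK byte sequence for the same character
# The codec converts between the two using built-in mapping tables,
# not an arithmetic formula.
print(ch.encode("gbk").decode("gbk") == ch)
```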
UNICODE arrived together with the rise of computer networks, and how to transmit UNICODE over a network had to be considered as well. So the transmission-oriented UTF (UCS Transfer Format) standards appeared. As the name suggests, UTF-8 transmits data 8 bits at a time, and UTF-16 transmits it 16 bits at a time. For the sake of reliability during transmission, however, the mapping from UNICODE to UTF is not a direct one-to-one copy; it goes through certain algorithms and rules.
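Here is a small sketch of what that means in practice (the sample strings are arbitrary): the same text takes a different number of bytes under UTF-8 and UTF-16.

```python
# A minimal sketch: the same text serialized as UTF-8 and UTF-16.
for text in ("A", "汉", "Hello, 世界"):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, so no BOM is added
    print(f"{text!r}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")
```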
After reading this, I believe you have a clear understanding of these encoding relationships. Here is a brief summary:
● The Chinese extended ASCII for Chinese and produced the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.
● There are too many Chinese characters (plus traditional characters and assorted symbols), so the GBK encoding was created; it includes everything in GB2312 and extends it greatly.
● China is a multi-ethnic country, and almost every ethnic group has its own writing system. To represent those characters, GBK was extended further into the GB18030 encoding.
● Every country, like China, used its own encoding for its own language, so all kinds of encodings appeared; without the corresponding encoding installed, you cannot interpret the content that encoding is meant to express.
● Finally, an organization called ISO could not stand it any longer and created a single encoding, UNICODE, which is large enough to hold every character and symbol in the world. As long as a computer has a UNICODE-capable system, no matter what script the text is in, you only need to save the file in a UNICODE encoding and other computers can interpret it correctly.
● For transmitting UNICODE over a network there are two common standards, UTF-8 and UTF-16, which transmit 8 bits and 16 bits at a time respectively.
So some people will wonder: since UTF-8 can store so many characters and symbols, why do so many people in China still use GBK? Because UTF-8 encoded Chinese text is larger and takes up more space, and if most of your users are Chinese, GBK works well enough. For today's computers, though, hard disks are dirt cheap and performance is more than enough to ignore this overhead. Therefore, the recommendation is to use one uniform encoding for all web pages: UTF-8.
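A minimal sketch of the size difference mentioned above (the sample sentence is arbitrary): most Chinese characters take 2 bytes in GBK but 3 bytes in UTF-8.

```python
# A minimal sketch: byte counts of the same Chinese text in GBK and UTF-8.
text = "编码问题其实并不难" * 100   # an arbitrary sample sentence, repeated
print(len(text.encode("gbk")))     # common Chinese characters: 2 bytes each
print(len(text.encode("utf-8")))   # the same characters: 3 bytes each
```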
The problem that Notepad cannot save the word "Unicom" (联通) on its own
Create a new text document, type the word "Unicom" (联通) in it, and save it. When you open it again, the original "Unicom" has turned into two garbled characters.
This problem is caused by a collision between GB2312 encoding and UTF-8 encoding. Here is the conversion rule from UNICODE to UTF-8, taken from the Internet:
UNICODE (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F              0xxxxxxx
0080 - 07FF              110xxxxx 10xxxxxx
0800 - FFFF              1110xxxx 10xxxxxx 10xxxxxx
For example, the UNICODE code of the Chinese character "汉" is 6C49. 6C49 falls between 0800 and FFFF, so the three-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 1100 0100 1001. Split this bit stream according to the three-byte template into 0110 110001 001001, and fill the x's of the template in order: 11100110 10110001 10001001, that is, E6 B1 89. This is the UTF-8 encoding of "汉".
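You can verify this worked example with a couple of lines of Python (a minimal sketch, nothing assumed beyond the standard library):

```python
# A minimal sketch: verify the manual UTF-8 derivation of "汉" (U+6C49).
ch = "汉"
print(hex(ord(ch)))              # 0x6c49
print(ch.encode("utf-8").hex())  # e6b189, matching the hand computation
```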
When you create a new text file, Notepad's default encoding is ANSI, and under the Chinese ANSI code page, typing Chinese characters actually produces the GB-series encoding. Under that encoding, the bytes of "联通" (Unicom) are:
C1   1100 0001
AA   1010 1010
CD   1100 1101
A8   1010 1000
Have you noticed? The first and second bytes, and the third and fourth bytes, start with "110" and "10" respectively, which matches the two-byte template in the UTF-8 rule exactly. So when you open the file again, Notepad mistakes it for a UTF-8 encoded file. Strip the "110" from the first byte and the "10" from the second byte and you get "00001 101010"; put the bits together and pad with leading zeros to get "0000 0000 0110 1010". Unfortunately, this is UNICODE 006A, the lowercase letter "j", and decoding the second pair of bytes as UTF-8 in the same way yields a code point that displays as nothing meaningful. This is why a file containing only the word "Unicom" cannot be displayed properly in Notepad.
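A small Python sketch makes the collision visible; note that a strict modern UTF-8 decoder would actually reject these bytes, so the misreading comes from Notepad's detection heuristics rather than from a legal UTF-8 decode.

```python
# A minimal sketch: the GBK bytes of "联通" happen to fit the UTF-8 bit templates.
gbk_bytes = "联通".encode("gbk")   # b'\xc1\xaa\xcd\xa8'
for b in gbk_bytes:
    print(f"0x{b:02X} -> {b:08b}")
# 0xC1 -> 11000001  looks like a two-byte UTF-8 lead byte (110xxxxx)
# 0xAA -> 10101010  looks like a continuation byte        (10xxxxxx)
# 0xCD -> 11001101
# 0xA8 -> 10101000
# A strict UTF-8 decoder rejects this sequence, but per the explanation above,
# Notepad's encoding detection sees the pattern and guesses UTF-8.
```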
This kind of misidentification causes many problems. A common complaint is: "I saved the file as XX encoding, so why does it come back as YY encoding?!" The reason is that although you saved it as XX, the system mis-detected it as YY, so it is displayed as YY. To avoid this problem, Microsoft introduced the BOM header.
About the BOM header of a file
When software such as Windows Notepad saves a file in UTF-8 encoding, it inserts three invisible characters (0xEF 0xBB 0xBF) at the beginning of the file, called the BOM (Byte Order Mark). This hidden marker lets editors such as Notepad recognize that the file is UTF-8 encoded, which avoids the misidentification problem above. For ordinary files this causes no trouble.
But it also has drawbacks, especially on web pages. PHP does not ignore the BOM, so when such files are read, included, or referenced, the BOM is treated as part of the content at the beginning of the file, and because of how the embedded language works, these characters are output (displayed) directly. The result is that even if the page's top padding is set to 0, the page cannot sit flush against the top of the browser window, because the HTML output starts with those three characters. If you find unexplained blank space on a web page, it may well be that some file has a BOM header. In that case, save the file without the BOM header!
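Here is a minimal Python sketch (the file name is just an example) that checks for a UTF-8 BOM and rewrites the file without it; Python's "utf-8-sig" codec also handles the marker automatically when reading.

```python
# A minimal sketch: detect and strip a UTF-8 BOM.
# "page.php" is only an example path.
path = "page.php"

with open(path, "rb") as f:
    raw = f.read()

if raw.startswith(b"\xef\xbb\xbf"):
    print("BOM found, stripping it")
    with open(path, "wb") as f:
        f.write(raw[3:])

# Reading with the "utf-8-sig" codec drops the BOM automatically as well:
text = raw.decode("utf-8-sig")
```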
How to view and modify the encoding of a document
1. View and modify it with Notepad. Open the file in Notepad and click "File" → "Save As" in the upper left corner. A save dialog appears; choose the encoding at the bottom of the dialog and click Save.
However, this method offers only a few encoding options, so it is mostly useful for quickly checking a file's encoding. I recommend the following method instead.
2. View and modify it with another text editor (for example, Notepad++). Almost all mature text editors (such as Dreamweaver and EmEditor) can quickly view or modify a file's encoding; Notepad++ does this especially well.
After opening a file, the encoding of the current file is displayed in the lower right corner.
Click "encoding" in the top menu bar to convert the current document to another encoding.
The CSS file loading BUG in IE6
When the encoding of an HTML file differs from the encoding of the CSS file it loads, IE6 may fail to read the CSS file, so the HTML page appears without any styles. As far as I have observed, this problem only appears in IE6, never in other browsers. The fix is simple: save the CSS file in the same encoding as the HTML file.
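If you have many CSS files to convert, a small Python sketch like the following can re-save a file in the target encoding; the file name and the source/target encodings are just example values, so adjust them to match your HTML pages.

```python
# A minimal sketch: re-save a CSS file in the same encoding as the HTML page.
# "style.css", "gbk" and "utf-8" are example values.
path, src_enc, dst_enc = "style.css", "gbk", "utf-8"

with open(path, "r", encoding=src_enc) as f:
    css = f.read()

with open(path, "w", encoding=dst_enc) as f:
    f.write(css)
```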