[Original] The root cause of garbled characters on web pages

Last Update:2017-10-11 Source: Internet

Author: User

Tags: coding standards, control characters


First look at the code segment:

<!DOCTYPE html>

In the HTML code, <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> specifies that the webpage is encoded as UTF-8.
Web page encoding involves many knowledge points. Broadly speaking, it is a historical issue.
The first computer (ENIAC) was born in the United States in February 1946. At that time the United States only considered its own needs, and in the years after the birth of the computer it developed ASCII (American Standard Code for Information Interchange). ASCII is a computer coding system based on the Latin alphabet and is mainly used to represent modern English; its later 8-bit extensions also cover other Western European languages.
Standard ASCII uses 7-bit binary numbers to represent 128 possible characters (2 to the power of 7 = 128), including uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English. Each character is stored in 1 byte, and the 8-bit extended variants use the spare bit to reach 256 characters (2 to the power of 8 = 256). The ASCII code table is encoded as follows:
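A few rows of the ASCII table can be regenerated with a short Python sketch. The helper name ascii_row is made up for this illustration; it simply prints the decimal code, the 7-bit binary pattern, and the character.

```python
# Minimal sketch: print a few rows of the ASCII table.
# ascii_row is a hypothetical helper written for this illustration.

def ascii_row(code: int) -> str:
    """Format one table row (decimal, 7-bit binary, character) for a code in 0..127."""
    if not 0 <= code <= 127:
        raise ValueError("standard ASCII only defines codes 0..127")
    ch = chr(code)
    shown = ch if ch.isprintable() else "(control)"
    return f"{code:3d}  {code:07b}  {shown}"

for code in (10, 48, 60, 65, 75, 97):
    print(ascii_row(code))

# Every ASCII character is stored in a single byte.
one_byte = len("K".encode("ascii"))
```

Codes 60 and 75 are included because they are the positions of "<" and "K" in the table.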

HTML provides escape characters (character entities). For example, the escape for the symbol "<" is "&lt;" or "&#60;"; the number 60 is the character's position in the ASCII code table. Similarly, the uppercase letter "K" can be escaped as "&#75;".
Let's use escape characters for an experiment:
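One way to try this without a browser is Python's standard html module, which converts character entities to characters and back. This is only a sketch of the idea; a browser does the same conversion when it renders the page.

```python
import html

# Named and numeric references both resolve to the same character.
named_lt = html.unescape("&lt;")
numeric_lt = html.unescape("&#60;")
numeric_k = html.unescape("&#75;")
print(named_lt, numeric_lt, numeric_k)

# Going the other way: a numeric reference is built from the character's code.
reference_for_k = f"&#{ord('K')};"
escaped_markup = html.escape("<K>")
print(reference_for_k, escaped_markup)
```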

The American ASCII code could express every character needed in computing, but only from the point of view of the United States. After all, every English word can be spelled from the 26 English letters, so a table of 128 characters (256 with the 8-bit extensions) was indeed enough for the United States.
Later, computers came into use all over the world. Many countries do not use English, and their scripts were not included in the ASCII code table. Take China as an example: there are nearly 100,000 Chinese characters, which cannot possibly fit into the ASCII code table. China therefore extended the ASCII approach and created its own standard. In this standard one Chinese character occupies two bytes, so the new code table can in theory hold 65,536 values. At the beginning, however, the table was far from full: it contained only 6,000-odd commonly used Chinese characters plus English letters and other symbols. This standard is known as GB2312 (Chinese character coded character set for information interchange; GB abbreviates "Guo Biao", the pinyin for "national standard", and 2312 is the standard's serial number). Later, an extended standard containing more than 20,000 Chinese characters was developed, known as GBK (Chinese character internal code extension specification; K is the first letter of "Kuo zhan", the pinyin for "extension").
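The one-byte versus two-byte behaviour can be checked with Python's built-in gb2312 and gbk codecs. The following is a minimal sketch; the helper names byte_length and can_encode are made up for this illustration, and the traditional character 龍 is used only as an example of a character that GBK covers but GB2312 does not.

```python
# Minimal sketch using Python's built-in "gb2312" and "gbk" codecs.

def byte_length(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

def can_encode(text: str, encoding: str) -> bool:
    """True if every character of the text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(byte_length("A", "gb2312"))    # an ASCII letter keeps its single byte
print(byte_length("中", "gb2312"))   # a common Chinese character takes two bytes
print("中".encode("gb2312").hex())   # the two bytes in hexadecimal

# GBK extends GB2312, so it accepts characters GB2312 lacks.
print(can_encode("龍", "gb2312"), can_encode("龍", "gbk"))
```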
In GB2312 and GBK, many punctuation marks were encoded again using two bytes. These two-byte punctuation marks are called "full-width" characters, while the one-byte punctuation marks of the original ASCII code table are called "half-width" characters. The full-width comma, parentheses, period, and so on look different from the half-width ones:
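The difference is also visible in the bytes. The sketch below compares the half-width comma with the full-width comma (U+FF0C) using Python's unicodedata module and the gbk codec; it prints names and hexadecimal bytes rather than the characters themselves so that it runs on any console.

```python
import unicodedata

half = ","        # half-width comma from the ASCII table
full = "\uff0c"   # full-width comma used by Chinese input methods

for ch in (half, full):
    print(
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),   # "Na" = narrow, "F" = full-width
        ch.encode("gbk").hex(),
    )

# Compatibility normalization folds the full-width form back to the half-width one.
folded = unicodedata.normalize("NFKC", full)
```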

In a Chinese input method, punctuation is full-width by default; in an English input method, punctuation is half-width.
Let's continue with the story: as computers spread, more and more countries developed their own encoding standards. As a result, computers in different countries could not support or recognize each other's text. For example, to display Chinese characters on an American computer, you had to install a Chinese character system; otherwise, Chinese files appeared garbled on that computer.
To address the encoding problems between countries, the International Organization for Standardization (ISO) took up the issue. ISO produced a unified encoding scheme, the Universal Multiple-Octet Coded Character Set (UCS), which shares its character repertoire with Unicode (also translated as "uniform code", "universal code", or "single code") and is intended to include all the scripts and symbols on earth. Unicode code points are divided into 17 groups, each called a plane. Each plane has 65,536 code points, for a total capacity of 1,114,112 (about 1.11 million) code points. In its original fixed-width form (UCS-2), a single character occupies 2 bytes.
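The plane arithmetic can be verified in a few lines of Python. This is a small sketch; the helper plane_of is made up for this illustration and simply drops the low 16 bits of a code point.

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16   # 65,536

total_code_points = PLANES * CODE_POINTS_PER_PLANE

def plane_of(ch: str) -> int:
    """Plane number of a single character: the code point without its low 16 bits."""
    return ord(ch) >> 16

print(total_code_points)
print(hex(sys.maxunicode))         # the highest code point Python supports
print(plane_of("\u4e2d"))          # a common Chinese character (U+4E2D)
print(plane_of("\U0001F600"))      # an emoji (U+1F600)
```

Plane 0 holds the most commonly used characters, which is why the early 2-byte form could represent them directly.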
However, Unicode was not widely adopted for a long time. Only with the emergence of the Internet did data transmission and exchange make a unified encoding across countries an urgent need. Hard disks and network traffic were very expensive in the early days, and each character in the fixed-width Unicode encoding occupied 2 bytes. To save disk space when storing files and network traffic when transmitting text, several Unicode-based, transmission-oriented standards were developed. These standards are collectively referred to as UTF (Unicode Transformation Format). A Unicode code point and its UTF encoding do not correspond byte for byte; one is converted to the other by a defined algorithm. The relationship is that Unicode is the foundation, the character set itself, while a UTF is only a means of implementing Unicode.
The common UTF formats are UTF-8, UTF-16, and UTF-32. Among them, UTF-8 is the most widely used Unicode implementation on the Internet, and it was designed for transmission. Because UTF-8 is a Unicode-based encoding, it removes national borders from text: any country's script can be displayed properly in a browser on any country's computer. The most notable feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes to represent one symbol, and the length depends on the symbol. A symbol that fits in one byte uses one byte, a symbol that needs two bytes uses two, and so on up to four bytes, which saves disk space and network traffic.
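The variable-length behaviour is easy to observe. The sketch below encodes four sample characters (chosen for this illustration, one for each UTF-8 length) and compares their sizes in UTF-8, UTF-16, and UTF-32; the helper encoded_sizes is hypothetical.

```python
# Sample characters: a Latin letter, an accented letter, a Chinese character, an emoji.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in three UTF formats (no byte order mark)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ") if hasattr(bytes, "hex") else "", encoded_sizes(ch))
```

For text that is mostly ASCII, such as HTML markup, UTF-8 is therefore the most compact of the three.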
Therefore, if our website uses GB2312 or GBK encoding, and a visitor's computer in another country does not support Chinese character encodings, the visitor will see garbled characters. If the website uses UTF-8 encoding, the browser on any country's computer converts the content to Unicode when the page is opened, and because today's computers support Unicode, any text is displayed normally!
However, many websites in China still use GB2312 or GBK encoding. Such websites generally serve only domestic users, for whom they display without problems. If visitors from other countries open them, however, there is a good chance the pages will be garbled.
For high compatibility and internationalization, it is recommended that a site use UTF-8 encoding instead of GB2312 or GBK.
The tags that declare a web page as UTF-8, GB2312, and GBK are, respectively:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta http-equiv="Content-Type" content="text/html; charset=gbk">
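To see what a program reads from such a tag, here is a rough Python sketch that extracts the declared charset with the standard html.parser module. The class MetaCharsetParser and the function declared_charset are made up for this illustration, and the logic is a simplification: real browsers follow a more involved procedure and also take the HTTP Content-Type header into account.

```python
import re
from html.parser import HTMLParser

class MetaCharsetParser(HTMLParser):
    """Remember the charset declared by the first Content-Type meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        match = re.search(r"charset\s*=\s*([\w-]+)", attributes.get("content") or "", re.I)
        if match:
            self.charset = match.group(1).lower()

def declared_charset(markup: str):
    """Return the declared charset in lowercase, or None if there is no declaration."""
    parser = MetaCharsetParser()
    parser.feed(markup)
    return parser.charset

print(declared_charset('<meta http-equiv="Content-Type" content="text/html; charset=gbk">'))
```

Note that the tag only declares an encoding; it says nothing about how the file's bytes were actually saved.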

This raises a question: what is the difference between the various encodings of a webpage? Is it only the setting in this one line of meta tag? Is it only the "small difference" of replacing the five characters "UTF-8" with the six characters "gb2312"?
No, the difference is not just those few characters. When the meta tag of a webpage specifies UTF-8, DreamWeaver automatically saves the file in UTF-8 encoding (the binary data is in UTF-8 format); the UTF-8 declaration in the meta tag only tells the browser: this webpage uses UTF-8, so please parse and display it as UTF-8. If the meta tag specifies gb2312, DreamWeaver automatically saves the file in gb2312 encoding (the binary data is in gb2312 format); likewise, the gb2312 declaration in the meta tag only tells the browser: this webpage uses gb2312, so please parse and display it as gb2312.
Let's do a test. Save one text file in UTF-8 format (open Notepad, create a new text file, enter some content, choose File → Save As, and select UTF-8 as the encoding) and another in gb2312 format (select ANSI as the encoding instead; ANSI stands for the default encoding of the current operating system: in Simplified Chinese Windows it is GBK, in Traditional Chinese Windows it is Big5, in Japanese Windows it is Shift_JIS, and so on). Then we use the UltraEdit-32 file editor to view each text file in hexadecimal, that is, to view the file's binary data in hexadecimal notation:
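The same test can be reproduced without Notepad or UltraEdit. The following Python sketch saves the same two-character text to a temporary file in each encoding and prints the file's bytes in hexadecimal; the helper saved_bytes_hex and the sample text are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path

def saved_bytes_hex(text: str, encoding: str) -> str:
    """Save text to a temporary file in the given encoding and return its bytes as hex."""
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "sample.txt"
        path.write_text(text, encoding=encoding)
        return " ".join(f"{byte:02x}" for byte in path.read_bytes())

sample = "\u4e2d\u6587"   # the two characters of the word "Chinese" (中文)

print("utf-8:", saved_bytes_hex(sample, "utf-8"))
print("gbk:  ", saved_bytes_hex(sample, "gbk"))

# Plain ASCII text produces identical bytes in both encodings.
print(saved_bytes_hex("html", "utf-8") == saved_bytes_hex("html", "gbk"))
```

The Chinese text occupies six bytes in UTF-8 and four in GBK, and not a single byte is shared.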

We can see that files saved with UTF-8 encoding and with gb2312 encoding contain different binary data. When Notepad opens a text file, it tries to detect the file's encoding and then parses and displays the text accordingly. So whether the text was saved as UTF-8 or as gb2312, Notepad can generally recognize and display it correctly, and nothing needs to be recorded in the file to tell Notepad which encoding it uses. Many programs, however, cannot detect the encoding of a text file so intelligently, so some extra data has to be written when the file is saved to declare its encoding.
The Unicode specification has the concept of a BOM (Byte Order Mark): for UTF-8, three bytes (ef bb bf) are written at the very beginning of the file to indicate that the file is UTF-8 encoded. But the BOM raises a new problem: not all software supports it, that is, not every program recognizes the three bytes (ef bb bf) at the start of the file. A program that does not recognize them treats them as part of the file's actual content. Early versions of Firefox did not recognize the BOM and displayed these three bytes as odd garbled characters. To this day PHP does not handle the BOM: if a PHP file is saved as UTF-8 with a BOM, PHP treats the BOM as part of the file's content, which causes errors!
In DreamWeaver, choose Modify → Page Properties from the top menu (or press the shortcut Ctrl+J) and click "Title/Encoding" in the page properties panel that pops up to see the available encodings. This is the usual way to change a webpage's encoding. For example:
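The BOM behaviour described above can be observed with Python's codecs. In this sketch, the "utf-8-sig" codec writes and removes the three-byte mark, while the plain "utf-8" codec stands in for a program that does not recognize it; the helper strip_utf8_bom is made up for this illustration.

```python
import codecs

source = "<?php echo 1;"

with_bom = source.encode("utf-8-sig")   # UTF-8 with the three-byte BOM in front
without_bom = source.encode("utf-8")    # UTF-8 without a BOM

print(with_bom[:3].hex(" "))

# A BOM-unaware reader keeps the mark as an invisible first character (U+FEFF).
naive_text = with_bom.decode("utf-8")
# A BOM-aware reader removes it.
aware_text = with_bom.decode("utf-8-sig")

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if present, before handing the bytes to other tools."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(ascii(naive_text[0]), strip_utf8_bom(with_bom) == without_bom)
```

Saving files as UTF-8 without a BOM sidesteps the problem for tools that do not expect one.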

Therefore, when the meta tag is set to UTF-8, the webpage file must also be saved in UTF-8 so that the browser displays the page normally rather than as garbled characters. If the meta tag declares UTF-8 but the file is saved as gbk or another format, then when the page is opened the browser follows the declaration in the meta tag and parses and displays the page as UTF-8, while the page's binary data is actually gbk or some other encoding, so the result is garbled! This is like a blind date where the matchmaker's information is wrong. The matchmaker tells the man that the woman speaks English (the meta tag declares UTF-8), but in fact the woman does not understand English (the file is not UTF-8 encoded). The man says "Hello", and the woman has no idea what he means (garbled characters).
Let's experiment with a webpage whose meta tag declares UTF-8 but whose file is saved in gbk format: first edit and save a webpage in UTF-8 format with DreamWeaver, then open it in Notepad, choose Save As, and select ANSI as the encoding.

<!DOCTYPE html>

The execution result in the browser is as follows:
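The mismatch can also be reproduced outside the browser. In the Python sketch below, the page text is an assumption made for this illustration: it is saved as GBK bytes, then decoded as UTF-8 the way a browser trusting the meta tag would, and finally decoded with the encoding that was really used.

```python
page_text = "Hello, \u4e16\u754c"   # "Hello, " followed by the Chinese word for "world"

saved_bytes = page_text.encode("gbk")   # how the file was actually saved

# The browser believes the meta tag and decodes as UTF-8; invalid byte
# sequences are shown as the replacement character U+FFFD.
misread = saved_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was really used recovers the text.
correct = saved_bytes.decode("gbk")

print(ascii(misread))
print(correct == page_text)
```

The ASCII part survives because GBK and UTF-8 agree on those bytes, which is why the markup of a mis-declared page still works while its Chinese text is garbled.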

To sum up: when developing a webpage, use the UTF-8 encoding format where possible and save the file as UTF-8. (When DreamWeaver saves a webpage file, it automatically saves it in the encoding specified by <meta http-equiv="Content-Type" content="text/html; charset=encoding">. If you use another editor, such as Notepad or EditPlus, take care to select the correct encoding when saving the file.)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
