A brief analysis of coding

Source: Internet
Author: User
Tags: coding standards

Character encoding has always been a headache for beginners. In particular, the differences between GBK, GB2312, and UTF-8, the three encodings most commonly seen on web pages, leave many novices disoriented, and no amount of explanation seems to make them clear. Yet encoding matters enormously, especially on the web: if a page renders as garbled text, the cause is almost always an encoding mismatch. Besides garbled text, encoding problems can trigger other issues as well (for example, a CSS-loading bug in IE6). The purpose of this article is to explain the encoding problem thoroughly. If you have run into a similar problem, read this post carefully.

ANSI, GBK, GB2312, UTF-8, GB18030, and UNICODE

These encoding keywords come up often. Although I discuss them together, that does not mean they are parallel alternatives to one another. This part of the content is quoted from the Internet with slight modifications; the original source is unknown, so it cannot be credited.

A long time ago, a group of people decided to use eight transistors that could be switched on and off, combined into different states, to represent everything in the world. They called this unit a "byte". Later they built machines that could process these bytes; once powered on, a machine could use bytes to assemble and transform a great many states. They called the machine a "computer".

At first, computers were used only in the United States. The eight bits of a byte can be combined into 256 (2^8) different states. The designers reserved the 32 states numbered from 0 for special purposes: when a terminal or printer received one of these agreed-upon bytes, it had to perform some agreed-upon action. On receiving 0x0A, a terminal would move to a new line; on 0x07, it would beep; on 0x1B, a printer would print inverse (white-on-black) text, or a terminal would display colored letters. Everyone thought this worked well, so the byte states below 0x20 were named "control codes".

They then assigned all the spaces, punctuation marks, digits, and uppercase and lowercase letters to consecutive byte states, up through number 127, so that computers could store English text in bytes. Everyone agreed this was a good plan, and it was named "ASCII" (American Standard Code for Information Interchange). All the computers in the world used the same ASCII scheme to save English text.
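The ASCII layout described above is easy to check from any language's character functions; here is a quick illustration in Python (the specific characters chosen are just examples):

```python
# A quick check of the ASCII layout: control codes below 0x20,
# printable characters from 0x20 up to 127.
print(ord("A"), ord("a"), ord("0"))   # 65 97 48
assert ord(" ") == 0x20               # space is the first printable character
assert ord("\x07") < 0x20             # BEL (the terminal beep) is a control code
assert all(ord(c) <= 127 for c in "Hello, ASCII!")
```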

Later, as computers spread around the world, countries wanted to store their own scripts in them, so they used the vacant states after 127 to represent their new letters and symbols. They also added many shapes needed for drawing tables: horizontal lines, vertical lines, crosses, and so on, numbering states all the way up to 255. The characters from 128 to 255 on this code page are called the "extended character set". But with that, the original numbering scheme had no room left for any more encodings.

When computers reached China, there were no byte states left to represent Chinese characters, yet more than 6,000 commonly used characters needed to be stored. So the Chinese devised their own scheme: they simply dropped the odd symbols after position 127 and made a new rule. A byte below 128 means the same as before, but two consecutive bytes greater than 127 together represent one Chinese character: the first byte (called the high byte) ranges from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE. This yields roughly 7,000 combinations for Simplified Chinese characters. The scheme also encoded mathematical symbols, the Roman and Greek alphabets, and Japanese kana, and even re-encoded the digits, punctuation, and letters already present in ASCII as two-byte-long codes. These are the often-mentioned "full-width" characters, while the original characters below 128 are called "half-width" characters.
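These byte ranges are easy to verify. A minimal sketch in Python, assuming the standard gb2312 codec (the character 汉 is just an example):

```python
# Encode one Chinese character with GB2312 and check the byte ranges
# described above (high byte 0xA1-0xF7, low byte 0xA1-0xFE).
encoded = "汉".encode("gb2312")
print(encoded.hex())              # baba
high, low = encoded
assert 0xA1 <= high <= 0xF7       # high byte range
assert 0xA1 <= low <= 0xFE        # low byte range
assert high > 127 and low > 127   # both bytes sit above the ASCII range
```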

GB2312 is the Chinese extension of ASCII.

But Chinese has far too many characters, and GB2312 soon proved insufficient. So the requirement that the low byte also be an after-127 inner code was dropped: as long as the first byte is greater than 127, it marks the start of a Chinese character, regardless of what follows. The resulting extended encoding scheme is known as the GBK standard. GBK includes all of GB2312 and adds nearly 20,000 new characters (including Traditional Chinese characters) and symbols. Later, as ethnic minorities began using computers too, the scheme was extended again with thousands of characters for minority scripts, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be carried forward in the computer age.

The ISO (International Organization for Standardization) decided to tackle the problem. Its approach was simple: scrap all the regional coding schemes and create a single code covering every culture, letter, and symbol on Earth. They called it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as Unicode.

When Unicode was being developed, computer memory had grown dramatically, and space was no longer a problem. The standard therefore simply stipulated that every character be represented uniformly by two bytes, that is, 16 bits. For the "half-width" characters of ASCII, Unicode kept the original code values unchanged and merely widened them from 8 to 16 bits, while the characters of all other scripts were re-encoded. Since a "half-width" English character needs only the low 8 bits, its high 8 bits are always zero, so this generous scheme wastes half the space when storing English text.

However, Unicode was not designed to be compatible with any existing encoding scheme, so GBK and Unicode assign completely different codes to the same Chinese characters. There is no simple arithmetic formula for converting text from Unicode to another encoding; the conversion must be done by table lookup. Unicode represents a character in two bytes, so it can encode 65,536 different characters in total, which was thought to cover all the world's written symbols.
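The table-lookup point can be demonstrated with any language's codec library, which carries those tables internally. A minimal Python sketch (汉 is again just an example character):

```python
# The Unicode code point and the GBK code of the same character are
# numerically unrelated; conversion goes through a lookup table that
# the codec implements internally.
ch = "汉"
print(hex(ord(ch)))            # 0x6c49 -- the Unicode code point
print(ch.encode("gbk").hex())  # baba   -- the GBK bytes; no formula relates them
# Round-tripping through the table is lossless:
assert ch.encode("gbk").decode("gbk") == ch
```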

When Unicode arrived alongside computer networks, how to transmit it over the network became an unavoidable question, and a family of UTF (UCS Transfer Format) standards for transmission appeared. As the names imply, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For the sake of transmission reliability, the mapping from Unicode to UTF is not a direct one-to-one copy; it goes through certain algorithms and rules.

Having read this far, you should now see the relationships among these encodings much more clearly. Let me summarize briefly:

By extending ASCII to cover Chinese, China produced the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.

There are too many Chinese characters, including traditional and rare ones, so the GBK encoding was produced; it includes everything in GB2312 and expands it considerably.

China is a multi-ethnic country, and almost every ethnic group has its own writing system. To represent those characters, GBK was extended further into the GB18030 encoding.

Every country, like China, encoded its own language, so a multitude of encodings arose; without the right code table installed, you cannot interpret what text in another encoding is trying to say.

Finally, an organization called ISO could no longer stand by. Together they created Unicode, a single encoding large enough to accommodate any character or symbol in the world. As long as a computer supports Unicode, text in any of the world's scripts can simply be saved in a Unicode encoding and interpreted correctly by other computers.

For transmitting Unicode over a network there are two common standards, UTF-8 and UTF-16, which transmit 8 bits and 16 bits at a time respectively.

A natural question follows: since UTF-8 can hold so many characters and symbols, why do so many people still use GBK and other encodings? Because UTF-8 text is comparatively bulky and takes more space; if nearly all of your users are Chinese, GBK works too. But on today's computers, with hard disks as cheap as cabbage and performance to spare, this overhead is negligible. So I recommend that all web pages use one unified encoding: UTF-8.
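The size difference is easy to measure. A small Python comparison (the sample string is arbitrary):

```python
# GBK stores a Chinese character in 2 bytes; UTF-8 needs 3.
text = "中文编码" * 1000            # 4,000 Chinese characters
gbk_size = len(text.encode("gbk"))
utf8_size = len(text.encode("utf-8"))
print(gbk_size, utf8_size)         # 8000 12000
assert utf8_size == gbk_size * 3 // 2
```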

Why Notepad cannot save the word "Unicom" (联通) by itself

Create a new text document, type only the two characters 联通 ("Unicom") into it, and save. When you open the file again, the two characters you typed have turned into garbled text.

The problem is a collision between the GB2312 encoding and the UTF-8 encoding. Here are the conversion rules from Unicode to UTF-8, taken from the Internet:

Unicode range    UTF-8 byte template
0000–007F        0xxxxxxx
0080–07FF        110xxxxx 10xxxxxx
0800–FFFF        1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code point of the character 汉 ("Han") is 6C49. Since 6C49 falls between 0800 and FFFF, the three-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 1100 0100 1001. Splitting this bit stream according to the three-byte template yields 0110 110001 001001; substituting these groups for the x's in turn gives 1110-0110 10-110001 10-001001, that is, E6 B1 89. This is the UTF-8 encoding of 汉.
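The template substitution can be written out as code. Here is a sketch of the three templates for code points up to FFFF in Python (a hand-rolled encoder for illustration only; real code should use the built-in codec):

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode a code point in 0000-FFFF using the three templates above."""
    if cp <= 0x7F:                              # 0000-007F: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                             # 0080-07FF: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    return bytes([0xE0 | (cp >> 12),            # 0800-FFFF: 1110xxxx
                  0x80 | ((cp >> 6) & 0x3F),    #            10xxxxxx
                  0x80 | (cp & 0x3F)])          #            10xxxxxx

print(utf8_encode_bmp(0x6C49).hex())            # e6b189
assert utf8_encode_bmp(0x6C49) == "汉".encode("utf-8")
```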

When you create a new text file, Notepad's default encoding is ANSI. If you type Chinese characters under ANSI, the text is actually stored in the GB-series encoding, in which the internal code of 联通 is:

C1  1100 0001
AA  1010 1010
CD  1100 1101
A8  1010 1000

Notice anything? The first and third bytes begin with "110", and the second and fourth begin with "10", which exactly matches the two-byte template in the UTF-8 rules. So when Notepad opens the file again, it mistakenly assumes it is a UTF-8 encoded file. Stripping the leading "110" from the first byte and the "10" from the second leaves "00001 101010"; aligning and padding with leading zeros gives "0000 0000 0110 1010". Unfortunately, that is Unicode 006A, the lowercase letter "j". Decoding the next two bytes as UTF-8 yields 0368, which has no visible glyph at all. This is why a file containing only the two characters 联通 cannot be displayed correctly in Notepad.
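The collision is easy to reproduce. A Python sketch checking the bit patterns (assuming the standard gbk codec):

```python
# The GBK bytes of the two characters 联通 happen to match UTF-8's
# lead-byte and continuation-byte patterns, which fools encoding sniffers.
gbk_bytes = "联通".encode("gbk")
assert gbk_bytes.hex() == "c1aacda8"
assert gbk_bytes[0] >> 5 == 0b110   # C1 looks like a 2-byte UTF-8 lead byte
assert gbk_bytes[1] >> 6 == 0b10    # AA looks like a continuation byte
assert gbk_bytes[2] >> 5 == 0b110   # CD: another apparent lead byte
assert gbk_bytes[3] >> 6 == 0b10    # A8: another apparent continuation
```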

This problem radiates into many others. A common one is: "I saved the file as encoding XX; why does it open as encoding YY every time?" The reason is that although you saved it as XX, the system's encoding detection misidentified it as YY and displays it accordingly. To avoid this problem, Microsoft introduced something called the BOM header.

About the document's BOM header

When you save a UTF-8 encoded file with Windows software such as Notepad, three invisible bytes (0xEF 0xBB 0xBF, the BOM) are inserted at the beginning of the file. This hidden marker lets editors such as Notepad recognize that the file is UTF-8 encoded, which avoids the misdetection problem. For ordinary files, it causes no trouble.

But it has drawbacks too, especially for web pages. PHP does not ignore the BOM, so when you read, include, or reference such a file, the BOM is treated as part of the text at the start of the file and, given how the embedded language works, is emitted directly into the output. As a result, even with the page's top padding set to 0, the page cannot sit flush against the top of the browser window, because there are three extra bytes at the start of the HTML. If you find unexplained blank space on a web page, it is quite likely caused by a BOM header in one of the files. If you run into this problem, re-save the file without the BOM header.
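If you need to handle such files programmatically, stripping the BOM is straightforward. A minimal Python sketch (the file content is a made-up example):

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = codecs.BOM_UTF8 + b"<html>"
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert strip_utf8_bom(raw) == b"<html>"
assert strip_utf8_bom(b"<html>") == b"<html>"   # no BOM: unchanged
```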

How to view and modify the encoding of a document

1. View and modify the encoding directly with Notepad. Open the file in Notepad and click "Save As"; a save dialog pops up. Select the desired encoding in the drop-down at the bottom, then click Save.

But this offers only a few encodings to choose from, and is mostly useful for quickly checking what encoding a file uses. I recommend the following method instead.

2. Use another text editor (for example, Notepad++) to view and change the encoding. Almost every mature text editor (Dreamweaver, EmEditor, and so on) can quickly view or modify a file's encoding; Notepad++ is especially convenient.

After you open a file, the encoding of the current file is displayed in the lower-right corner.

Click "Encoding" in the menu bar above to convert the current document to another encoding.

The IE6 CSS file-loading bug

When the encoding of an HTML file differs from that of the CSS file it loads, IE6 fails to read the CSS file, leaving the HTML page unstyled. In my observation this problem appears only in IE6, never in other browsers. The fix is simply to save the CSS file and the HTML file in the same encoding.
