I. ASCII code
We know that inside a computer, all information is ultimately represented as a binary string. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte. That is to say, a single byte can represent 256 different states, and each state can correspond to one symbol, giving 256 symbols from 00000000 to 11111111.
In the 1960s, the United States developed a character code that defines the mapping between English characters and binary values. This is the ASCII code, which is still in use today. ASCII defines a total of 128 characters. For example, the space is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the last seven bits of a byte; the first bit is always set to 0.
In C#, if you want to see the ASCII code of a letter, you can use Encoding, the character encoding class in System.Text. The code is as follows:
string s = "a";
byte[] ascii = Encoding.ASCII.GetBytes(s);
In the debugger we can see that ascii holds the value 97; that is, the ASCII code of a is 97 (binary 1100001).
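As a small sketch of the same idea without a debugger, the snippet below prints each byte together with its 8-bit binary form. The class name AsciiBits is illustrative only and not part of the original article.

```csharp
using System;
using System.Text;

class AsciiBits
{
    // Format one byte as an 8-digit binary string, e.g. 65 -> "01000001".
    public static string ToBits(byte b)
    {
        return Convert.ToString(b, 2).PadLeft(8, '0');
    }

    static void Main()
    {
        // "A", space, "a": the three characters mentioned in the text.
        byte[] ascii = Encoding.ASCII.GetBytes("A a");
        foreach (byte b in ascii)
        {
            Console.WriteLine(b + " " + ToBits(b));
        }
        // Prints:
        // 65 01000001
        // 32 00100000
        // 97 01100001
    }
}
```

In every line the leading bit is 0, which matches the statement that ASCII uses only the last seven bits of the byte.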
II. Non-ASCII Encoding
128 symbols are enough to encode English, but not enough to represent other languages. For example, French places accent marks above some letters, and these cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries can represent a maximum of 256 symbols.
However, this creates new problems. Different countries have different letters, so even though they all use 256-symbol encodings, the symbols they represent differ. For example, 130 represents é in the French encoding, but the letter gimel in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, 0-127 represent the same symbols; the differences are only in 128-255.
Asian languages use even more characters; there are about 100,000 Chinese characters. A single byte can represent only 256 symbols, which is far from enough, so multiple bytes must be used to express one symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent at most 256 x 256 = 65536 characters. In C#, if you want to see the GB2312 encoding of a Chinese character, you can use the following code:
string s = "梁";
System.Text.Encoding gb2312 = System.Text.Encoding.GetEncoding("gb2312");
byte[] gb = gb2312.GetBytes(s);
At this point gb holds two numbers: 193 (11000001) and 186 (10111010).
III. Unicode
As mentioned above, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you read it with the wrong encoding, garbled characters appear. Why do emails often contain garbled characters? Because the sender and the receiver use different encodings.
Now imagine an encoding that includes all the symbols in the world and gives every symbol a unique code. The garbling problem would disappear. This is Unicode: as its name implies, it is an encoding for all symbols.
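The garbling described above can be reproduced in a few lines. This is a minimal sketch, assuming the text is written as UTF-8 and then read back as ISO-8859-1 (Latin-1); the helper name DecodeWith and the choice of these two encodings are assumptions made for illustration. The character used is 梁 (U+6881).

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    // Encode text with one encoding, then decode the same bytes with another.
    public static string DecodeWith(string text, Encoding writer, Encoding reader)
    {
        byte[] bytes = writer.GetBytes(text);
        return reader.GetString(bytes);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "\u6881"; // the Chinese character 梁

        string wrong = DecodeWith(original, Encoding.UTF8, latin1);
        string right = DecodeWith(original, Encoding.UTF8, Encoding.UTF8);

        Console.WriteLine(wrong == original); // False: three unrelated characters
        Console.WriteLine(wrong.Length);      // 3: one character per UTF-8 byte
        Console.WriteLine(right == original); // True
    }
}
```

The bytes never change; only the interpretation does, which is why knowing the encoding of a file or message is essential.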
Unicode is, of course, a very large collection; its current size has room for more than 1 million symbols, and each symbol has its own code. If you want to see the Unicode encoding of a Chinese character in C#, you can use the following code:
string s = "梁";
byte[] unicode = Encoding.Unicode.GetBytes(s);
At this point unicode holds two numbers: 129 (10000001) and 104 (1101000).
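The order of the two bytes depends on which UTF-16 variant is used. As a hedged illustration, the sketch below compares Encoding.Unicode (little-endian UTF-16) with Encoding.BigEndianUnicode for the character 梁 (U+6881); the class name ByteOrderDemo is illustrative.

```csharp
using System;
using System.Text;

class ByteOrderDemo
{
    // Join the bytes of a string in the given encoding as "b1,b2,...".
    public static string BytesOf(string s, Encoding enc)
    {
        return string.Join(",", enc.GetBytes(s));
    }

    static void Main()
    {
        string s = "\u6881"; // 梁, code 0x6881
        Console.WriteLine(BytesOf(s, Encoding.Unicode));          // 129,104 (0x81, 0x68)
        Console.WriteLine(BytesOf(s, Encoding.BigEndianUnicode)); // 104,129 (0x68, 0x81)
    }
}
```

Encoding.Unicode stores the low-order byte first, which is why 129 (hex 81) comes before 104 (hex 68).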
IV. Unicode Problems
It should be noted that Unicode is only a collection of symbols. It only specifies the binary code of the symbol, but does not specify how the binary code should be stored.
For example, the Unicode code of the Chinese character "梁" is hexadecimal 6881, which is 15 bits in binary (110100010000001), so representing this symbol requires at least two bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
There are two serious problems here. The first is: how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough for an English letter. If Unicode uniformly required three or four bytes per symbol, the two or three leading bytes of every English letter would all be 0, which is a huge waste of storage: a text file would become two or three times larger, which is unacceptable.
The results were: 1) multiple ways of storing Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the emergence of the Internet.
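The storage cost can be checked directly. The following sketch, with an illustrative class name, counts the bytes needed for the same English word in a one-byte encoding (ASCII), a two-byte one (UTF-16, Encoding.Unicode in .NET), and a fixed four-byte one (UTF-32).

```csharp
using System;
using System.Text;

class StorageCost
{
    static void Main()
    {
        string english = "hello";
        Console.WriteLine(Encoding.ASCII.GetByteCount(english));   // 5
        Console.WriteLine(Encoding.Unicode.GetByteCount(english)); // 10
        Console.WriteLine(Encoding.UTF32.GetByteCount(english));   // 20
    }
}
```

With a fixed four-byte format, three of every four bytes of English text are zero, which is the waste described above.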
V. UTF-8
With the spread of the Internet, a unified encoding became strongly needed. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is this: UTF-8 is one of the ways of implementing Unicode.
The most important feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes varies with the symbol.
The UTF-8 encoding rules are very simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining seven bits are the Unicode code of the symbol. Therefore, for English letters, the UTF-8 encoding and the ASCII code are the same.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits are filled with the Unicode code of the symbol.
Unicode code range (hexadecimal) | UTF-8 byte stream (binary)
U-00000000 - U-0000007F | 0xxxxxxx
U-00000080 - U-000007FF | 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The 5-byte and 6-byte forms come from the original definition of UTF-8. The current standard (RFC 3629) stops at U+10FFFF, so in practice at most four bytes are used.
For example, take the following code:
string s = "梁";
byte[] unicode = Encoding.Unicode.GetBytes(s);
byte[] utf8 = Encoding.UTF8.GetBytes(s);
The contents of both arrays can be inspected in the debugger.
Here the unicode array stores the low-order byte first: 129 is 81 in hexadecimal and 104 is 68 in hexadecimal, so the Unicode code of "梁" is 6881 in hexadecimal, or 110100010000001 in binary. From the table above, 6881 falls in the third row (0800 - FFFF), so the UTF-8 encoding of "梁" requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of "梁", fill in the x positions of the format from back to front, and pad the remaining positions with 0. The UTF-8 encoding of "梁" is therefore "11100110 10100010 10000001". Converting each group of 8 bits to decimal gives 230, 162, 129, which is exactly the content of utf8.
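The fill-in procedure just described can be sketched by hand for the 1- to 4-byte rows of the table. This is an illustrative implementation, not the one used inside .NET: the class and method names are hypothetical, and for brevity it does not reject the surrogate range D800 - DFFF as a strict encoder would.

```csharp
using System;
using System.Text;

class Utf8ByHand
{
    // Encode one Unicode code (up to U+10FFFF) using the bit patterns in the table.
    public static byte[] EncodeCodePoint(int cp)
    {
        if (cp < 0 || cp > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(cp));
        if (cp <= 0x7F)
            return new[] { (byte)cp };                    // 0xxxxxxx
        if (cp <= 0x7FF)
            return new[]
            {
                (byte)(0xC0 | (cp >> 6)),                 // 110xxxxx
                (byte)(0x80 | (cp & 0x3F))                // 10xxxxxx
            };
        if (cp <= 0xFFFF)
            return new[]
            {
                (byte)(0xE0 | (cp >> 12)),                // 1110xxxx
                (byte)(0x80 | ((cp >> 6) & 0x3F)),        // 10xxxxxx
                (byte)(0x80 | (cp & 0x3F))                // 10xxxxxx
            };
        return new[]
        {
            (byte)(0xF0 | (cp >> 18)),                    // 11110xxx
            (byte)(0x80 | ((cp >> 12) & 0x3F)),           // 10xxxxxx
            (byte)(0x80 | ((cp >> 6) & 0x3F)),            // 10xxxxxx
            (byte)(0x80 | (cp & 0x3F))                    // 10xxxxxx
        };
    }

    static void Main()
    {
        // 0x6881 is the code of 梁; both lines print 230,162,129.
        Console.WriteLine(string.Join(",", EncodeCodePoint(0x6881)));
        Console.WriteLine(string.Join(",", Encoding.UTF8.GetBytes("\u6881")));
    }
}
```

Each branch takes six bits at a time from the low end of the code and prefixes them with 10, leaving the leftover high bits for the first byte.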
VI. C# UTF-8 to GB2312
.NET strings in memory are Unicode, so a test program is not easy to write in a console application. Write your own based on the following code:
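// Requires: using System; using System.Text;
// Response is the ASP.NET page's HttpResponse, so this method belongs in a page class.
public string Utf8ToGb2312(string str)
{
    try
    {
        Encoding utf8 = Encoding.GetEncoding(65001); // code page 65001 is UTF-8
        Encoding gb2312 = Encoding.GetEncoding("gb2312"); // Encoding.Default, 936
        byte[] temp = utf8.GetBytes(str); // string -> UTF-8 bytes
        byte[] temp1 = Encoding.Convert(utf8, gb2312, temp); // UTF-8 bytes -> GB2312 bytes
        string result = gb2312.GetString(temp1); // GB2312 bytes -> string
        return result;
    }
    catch (Exception ex) // (UnsupportedEncodingException ex)
    {
        Response.Write(ex.ToString());
        return null;
    }
}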
VII. Advantages of UTF-8
UTF-8 is an encoding common to all the world's languages. If users whose operating system is in another language visit a website encoded in GB2312, they need to download a language pack. So for the sake of a site's universality, UTF-8 is the better choice. By comparison, however, GB2312 produces less data than UTF-8 for Chinese text.
VIII. Garbled text
If there is a string in memory, in a file, or in an email, you must know which encoding scheme it uses; otherwise it cannot be correctly interpreted or displayed to the user. If the encoding scheme being used has no equivalent for some content, a small question mark "?" or a box is usually displayed. In .NET, strings in memory are Unicode, and ASP.NET programs use UTF-8 by default. When some strings appear garbled, we first need to determine whether we are interpreting them with the wrong encoding.
Summary of encoding problems in some software commonly used in website projects:
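The question mark can be seen with a short sketch: encoding a Chinese character as ASCII, which has no equivalent for it, makes .NET substitute "?" (byte 63). The class name ReplacementDemo is illustrative.

```csharp
using System;
using System.Text;

class ReplacementDemo
{
    static void Main()
    {
        // "a" has an ASCII code; 梁 (U+6881) does not.
        byte[] bytes = Encoding.ASCII.GetBytes("a\u6881");
        Console.WriteLine(string.Join(",", bytes));         // 97,63
        Console.WriteLine(Encoding.ASCII.GetString(bytes)); // a?
    }
}
```

Once the "?" has been written, the original character cannot be recovered from the bytes, so the loss happens at encoding time rather than at display time.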
I. Dreamweaver
1. After the Simplified Chinese version is installed, the default encoding is GB2312; this can be changed in Preferences.
2. When a document is created, the encoding in use is automatically written into the document's charset declaration. This is where many people get confused. When saving, Dreamweaver automatically stores the file according to the declared encoding: if charset=gb2312 is declared, the file is still saved as GB2312 even if UTF-8 is currently selected in Preferences. Because the software decides this automatically, the declaration and the stored encoding do not conflict. In many cases, however, people do not want to wait for the software to open, so they simply open the file in Notepad and change the declaration. Notepad makes no such automatic judgment, so the stored encoding no longer matches the one in the declaration. This is where the problem arises.
Therefore, as described above, what matters is not only what is declared but also the encoding in which the file is actually stored.
Using UTF-8 everywhere is best; it avoids most of these conflicts and the garbled text they cause.
II. ASP.NET
After ASP.NET is installed, it defaults to UTF-8. This is why problems arise in many projects where the front end is built in Dreamweaver and the back end adds ASP.NET code: everyone uses the defaults, and GB2312 and UTF-8 naturally conflict.
You can change the ASP. NET encoding in the web. config configuration file.
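As a hedged illustration of that setting, the fragment below uses the globalization element of web.config to set the request, response, and file encodings. The values shown are only an example; choose the encoding that matches how the project's files are actually stored.

```xml
<configuration>
  <system.web>
    <!-- Example values: use the same encoding the pages are saved in -->
    <globalization requestEncoding="utf-8" responseEncoding="utf-8" fileEncoding="utf-8" />
  </system.web>
</configuration>
```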
III. Flash
Starting with Flash MX, Flash supports Unicode, and Unicode is the default. This is close enough to UTF-8 that, for now, we will simply say the default is UTF-8. There is no need to go deeper here; we only need to solve the problem.
When Flash loads external text files, pages, or XML files, it has to interpret them before displaying them.
1. If the external file is encoded in UTF-8, nothing needs to change in Flash; the default works.
2. If the external file uses another encoding, such as GB2312, you must write the following in the first action:
System.useCodepage = true;
// Use the traditional code page of the operating system running the player to interpret external text files
URL and HTML Encoding
When presenting HTML pages, you sometimes need to display special characters such as "<" and "&". Because they have special meaning in HTML, this takes a little care. For example, to display AT&T, you must write AT&amp;T in the code. The characters =, &, / and others are likewise special in URLs, so if you need to use them inside a URL parameter, you must encode them as well: replace each conflicting character with "%" followed by its hexadecimal code. For example, a space is replaced with %20.
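Both kinds of escaping can be tried outside a web page. The sketch below uses WebUtility.HtmlEncode and Uri.EscapeDataString from the .NET base class library; these are chosen here for illustration and are not necessarily the functions the original article had in mind. The class name EscapeDemo and the sample strings are illustrative.

```csharp
using System;
using System.Net;

class EscapeDemo
{
    static void Main()
    {
        // HTML: "&" becomes the entity &amp;
        Console.WriteLine(WebUtility.HtmlEncode("AT&T"));    // AT&amp;T

        // URL parameter values: each special character becomes % plus its hex code
        Console.WriteLine(Uri.EscapeDataString("a b"));      // a%20b
        Console.WriteLine(Uri.EscapeDataString("x=1&y=/z")); // x%3D1%26y%3D%2Fz
    }
}
```

Uri.EscapeDataString is meant for a single parameter value, not for a whole URL, because it also escapes separators such as "/", "&" and "=".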
The following example shows how to display these special characters in HTML and URL, and how to display the source code of the current page.
Using the Server.HtmlEncode() function
In the first example, we take one string (note that it contains an HTML tag that the browser can recognize):
"This HTML code <span class=\"redbolditalic\">should appear</span> with the tags invisible"
We display it on the web page in two ways, with and without HtmlEncode, to see the effect. The result shows that if the string is not encoded, the browser renders any HTML tag it recognizes as HTML, whether or not the tag is part of a string.
This default behavior of the browser is sometimes not what we want. Suppose we want the browser to display <span class="redbolditalic">should appear</span> literally. What should we do? It is very simple: we only need to encode the string with the HtmlEncode function.
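As a minimal sketch of what the encoding step produces, the snippet below runs the same string through WebUtility.HtmlEncode. This is a stand-in, assumed here so that the example runs outside an ASP.NET page; inside a page the article's Server.HtmlEncode would be used instead. The class and method names are illustrative.

```csharp
using System;
using System.Net;

class HtmlEncodeDemo
{
    // Encode a string so that the browser shows its tags as text.
    public static string Encode(string raw)
    {
        return WebUtility.HtmlEncode(raw);
    }

    static void Main()
    {
        string raw = "This HTML code <span class=\"redbolditalic\">should appear</span> with the tags invisible";
        Console.WriteLine(Encode(raw));
        // Prints:
        // This HTML code &lt;span class=&quot;redbolditalic&quot;&gt;should appear&lt;/span&gt; with the tags invisible
    }
}
```

Because "<", ">" and the quotation marks are replaced by entities, the browser no longer sees a tag and displays the markup as plain text.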