ASCII, GB2312, GBK, UTF-8 encoding:
Original address: HTTP://HI.BAIDU.COM/PHPEASE/ITEM/F450B5CAEC143014505058FC
Variable naming rules:
1. Variable names are case-sensitive.
2. Must begin with a letter or underscore. Variable names can consist of letters, numbers, and underscores.
Seeing this, many people may wonder why a name written in Chinese characters can also be a variable name. In PHP, Chinese characters can be used in variable names (it works, but do not do this in real projects). The reason is that "letters" here means a-z, A-Z, and the extended bytes from 127 to 255, which in hexadecimal is 0x7f-0xff (newer versions of the PHP manual give this range as 128 to 255, 0x80-0xff). The bytes that encode Chinese characters all fall within this range.
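As a minimal sketch of the point above, the snippet below uses a variable whose name is written in Chinese. The name `$变量` ("variable") is a made-up example, and the sketch assumes the source file is saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8. The variable name is hypothetical.
$变量 = 'hello';
echo $变量, "\n";

// Show the raw bytes that make up the name; each one is 0x80 or above,
// which is why PHP accepts them as "letters".
foreach (str_split('变量') as $byte) {
    printf("0x%02X ", ord($byte));
}
echo "\n";
```

This only shows that the parser accepts such names; as noted above, ASCII names remain the sensible choice in real projects.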
ASCII, GB2312, GBK, UTF-8 encoding:
ASCII contains 128 characters, numbered 0 to 127 in decimal. That's right, 0 to 127 makes 128 characters, and each number represents one character. Take a look at an ASCII table.
Look at the decimal (DEC) column first. 0 corresponds to NUL (the string terminator), and decimal 9 corresponds to the tab character, which we use all the time in development. Then look at 48: its character is 0. Indeed, ASCII 48 to 57 represent the digits 0 to 9, and ASCII 97 to 122 represent the lowercase letters a to z. For example, when we see the letter a, the computer has no idea what the letter a is; it only knows 97. It stores a as ASCII 97. Let's use PHP to play with ASCII.
Here we get to know two functions:
ord: Return ASCII value of character. Returns the ASCII value of the first character of a string.
chr: Return a specific character. Returns the character corresponding to the given ASCII code.
Here is an example:
<?php echo ord('a'); ?>
That's right, the ASCII value of lowercase a is 97. To print the character that corresponds to ASCII 97:
<?php echo chr(97); ?>
Good. After reading this you should understand the basics. ASCII covers the uppercase and lowercase letters, the digits, and some commonly used control characters. That is basically enough for English-speaking countries. The computer stores the ASCII codes; what people see are the corresponding characters.
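To make the ranges mentioned above concrete, here is a small sketch that rebuilds the digits and the lowercase letters from their ASCII codes with chr(), and goes back the other way with ord(). The variable names are only illustrative.

```php
<?php
// Build strings from ASCII code ranges with chr().
$digits = implode('', array_map('chr', range(48, 57)));
$lower  = implode('', array_map('chr', range(97, 122)));

echo $digits, "\n"; // 0123456789
echo $lower, "\n";  // abcdefghijklmnopqrstuvwxyz

// ord() goes the other way: from character to code.
echo ord('0'), ' ', ord('a'), ' ', ord("\t"), "\n"; // 48 97 9
```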
But English is not the only language in the world. China uses Chinese, Japan uses Japanese, Korea uses Korean, and these languages are nothing like English. Do you see any number in the ASCII table that corresponds to a Chinese character? No, there is none. That is why, in addition to the ASCII table, there is also a GB2312 code table. Link: http://wenku.baidu.com/view/244e2d2ce2bd960590c677a6.html Open it and have a look. Does it seem a bit chaotic, with no obvious thread to follow? Once you understand the principle, it is easy to read.
In GB2312, a character is stored and represented in two bytes. Recall that a character in ASCII needs only one byte, so storing a character in GB2312 takes twice the space. So which numbers go into these two bytes to represent the letter a? We know that ASCII encodes a in one byte, with the value 97. GB2312 encoding is a bit more complicated than ASCII.
To understand the GB2312 code table, we should first study the "location code".
Location code concept:
GB2312 partitions Chinese characters and other characters (letters, digits, etc.) into areas.
Areas 01-09 hold special symbols (digits, letters, etc.).
Areas 16-55 hold level-1 Chinese characters, sorted by pinyin.
Areas 56-87 hold level-2 Chinese characters, sorted by radical/stroke.
What is a partition? Think of it this way: I am from Guangxi, you are from Henan, he is from Guangdong. In other words, every character must live in some area. This way of identifying a character is called its "location code". The location code is simply the area code plus the bit number (the position the character occupies within that area). Want to know the location code of any Chinese character? Simple. Here is a link to look it up: http://www.jscj.com/index/gb2312.php.
Let's look up the location code of the character "啊" (ah). It is 1601: the area code is 16 and the bit number is 01. If you remember, area 16 belongs to the level-1 Chinese characters. What does "level-1" mean? I reckon it means commonly used characters, but I am not sure; that is how the domestic experts defined it. The bit number 01 says where the character ranks within its area. With an X axis (area code) and a Y axis (bit number) there is naturally an intersection, and through that intersection the corresponding character can be found in the GB2312 code table.
If you go and look at the GB2312 code table now, you probably still won't understand it. The location code does let you locate a character, but the GB2312 code table is not quite that simple. Read on.
So far we have covered the location code of GB2312: the area code and the bit number. As mentioned earlier, a GB2312 character is expressed in two bytes: (high byte, low byte). The first byte is called the "high byte" and the second the "low byte". (PS: presumably it is called the high byte because the higher-order part is conventionally written on the left.)
The algorithm is as follows: GB2312 character = (0xA0 + area code, 0xA0 + bit number). With this formula in hand, look at the GB2312 code table again and everything will fall into place.
What does 0xA0 mean? Why does the high byte equal the area code plus 0xA0, and why does the low byte equal the bit number plus 0xA0? Because these two bytes combined represent one GB2312 character. Yes, it is that simple. 0xA0 is a hexadecimal number; converted to decimal it equals 160. The high byte equals 160 plus the area code, so you can think of GB2312 byte values as starting just above 160, in the same way that ASCII runs from 0 to 127.
Let's look at the algorithm again: GB2312 character = (0xA0 + area code, 0xA0 + bit number).
As you can see, as long as we know the location code (area code and bit number), we can work out the GB2312 encoding of a character. The location code for the letter a is 0365, that is, area code 03 and bit number 65. Apply the formula: 0xA0 in decimal is 160, so we get (160 + 03, 160 + 65) = (163, 225), which in hexadecimal (code tables are generally written in hexadecimal) is (A3, E1). That is the GB2312 code of the letter a. Take A3E1 to the GB2312 code table and find the character for that hexadecimal value; if you did it right, it is the (full-width) letter a.
So just remember the formula above, find a tool that gives you the location code of a Chinese character, and plug it into the formula to get the GB2312 value of that character. Try working out the GB2312 value of the Chinese character "啊" yourself.
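The formula above can be sketched in a few lines of PHP. The function name `locationCodeToGb2312` is made up for illustration, and the 1-94 bounds check is an assumption based on GB2312 being a 94 x 94 grid.

```php
<?php
// Hypothetical helper: turn a location code (area code, bit number)
// into the two GB2312 bytes using the formula (0xA0 + area, 0xA0 + bit).
function locationCodeToGb2312(int $area, int $bit): string
{
    // Assumption: both parts of a location code run from 1 to 94.
    if ($area < 1 || $area > 94 || $bit < 1 || $bit > 94) {
        throw new InvalidArgumentException('area code and bit number must be 1-94');
    }
    return chr(0xA0 + $area) . chr(0xA0 + $bit);
}

// "啊" has location code 1601; the letter a has 0365.
echo strtoupper(bin2hex(locationCodeToGb2312(16, 1))), "\n"; // B0A1
echo strtoupper(bin2hex(locationCodeToGb2312(3, 65))), "\n"; // A3E1
```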
A quick summary to remember:
ASCII range: decimal 0-127, hexadecimal 0x00-0x7F.
GB2312 range: high byte decimal 161-247, hexadecimal 0xA1-0xF7; low byte decimal 161-254, hexadecimal 0xA1-0xFE.
GBK is an extension built on top of GB2312. It includes all of GB2312 and expands it so that more characters can be represented. GB2312 and GBK work on the same principle; they differ only in the range of byte values, and GBK's range is larger.
GB2312 value range: high byte from A1 to F7, low byte from A1 to FE.
GBK value range: high byte from 81 to FE, low byte from 40 to FE.
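As a rough sketch, the two ranges above can be turned into a check on a byte pair. The function `classifyDoubleByte` is hypothetical, and it relies only on the ranges quoted above, so it says whether a pair falls inside a range, not whether a character is actually assigned there.

```php
<?php
// Hypothetical helper: classify a (high, low) byte pair using only the
// value ranges listed above. It does not check that a character is
// actually assigned at that position in either table.
function classifyDoubleByte(int $high, int $low): string
{
    if ($high >= 0xA1 && $high <= 0xF7 && $low >= 0xA1 && $low <= 0xFE) {
        return 'GB2312';   // inside GB2312, and therefore inside GBK too
    }
    if ($high >= 0x81 && $high <= 0xFE && $low >= 0x40 && $low <= 0xFE) {
        return 'GBK only'; // inside the GBK range but outside GB2312
    }
    return 'neither';
}

echo classifyDoubleByte(0xB0, 0xA1), "\n"; // GB2312
echo classifyDoubleByte(0x81, 0x40), "\n"; // GBK only
echo classifyDoubleByte(0x61, 0x62), "\n"; // neither
```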
As the ranges show, GBK is much larger than GB2312. Even so, most projects today are encoded in UTF-8, so next we turn to how UTF-8 encoding works.
There are many countries in the world, and each has its own language. One moment there is ASCII, the next there is GBK, and then yet another encoding turns up. It is maddening. Is it possible to invent one encoding that represents all languages well? This is how Unicode came about. Here we only discuss one way of implementing Unicode, UTF-8. There are other implementations, of course, but they are not commonly used in web development.
ASCII already represents letters, digits, and so on well, so UTF-8 extends it. By convention we would first look at the Unicode code table (UTF-8 code table), but there isn't one here. What we need to learn is how to convert from Unicode to UTF-8.
Goals of this section:
Master the method of converting from Unicode to UTF-8, and learn to determine the number of bytes a character occupies in UTF-8. Look at the table below:
Unicode byte table

Unicode range (decimal)        Unicode range (hex)               UTF-8 byte template (binary, first-byte hex range)      Bytes
0 - 127                        000000 - 00007F                   0xxxxxxx (00-7F)                                        1
128 - 2047                     000080 - 0007FF                   110xxxxx (C2-DF) 10xxxxxx                               2
2048 - 55295, 57344 - 65535    000800 - 00D7FF, 00E000 - 00FFFF  1110xxxx (E0-EF) 10xxxxxx 10xxxxxx                      3
65536 - 1114111                010000 - 10FFFF                   11110xxx (F0-F4) 10xxxxxx 10xxxxxx 10xxxxxx             4
This table is very important. Once you have it memorized, you basically understand how UTF-8 works. UTF-8 uses between one and four bytes per character, but three bytes are enough for most characters in everyday use.
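The first and last columns of the table can be sketched as a lookup from a Unicode code point to its UTF-8 byte count. The function name `utf8BytesForCodePoint` is made up for illustration.

```php
<?php
// Hypothetical helper: how many bytes UTF-8 needs for a Unicode code point,
// following the ranges in the table above.
function utf8BytesForCodePoint(int $cp): int
{
    // 55296-57343 (D800-DFFF) is the gap in the three-byte row of the table.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not an encodable code point');
    }
    if ($cp <= 0x7F) {
        return 1;
    }
    if ($cp <= 0x7FF) {
        return 2;
    }
    if ($cp <= 0xFFFF) {
        return 3;
    }
    return 4;
}

echo utf8BytesForCodePoint(0x61), "\n";    // 1 (the letter a)
echo utf8BytesForCodePoint(0x3B1), "\n";   // 2 (Greek alpha)
echo utf8BytesForCodePoint(0x554A), "\n";  // 3 (the Chinese character "啊")
echo utf8BytesForCodePoint(0x10000), "\n"; // 4
```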
One byte equals 8 bits, as everyone knows. A byte ranges over 00000000-11111111, which is 0-255 in decimal. With that understood, let's go on.
Let's keep looking at the table and go through it slowly:
UTF-8 in one byte:
UTF-8 keeps the ASCII encoding as it is and extends beyond it. A one-byte character is still a letter, digit, or the like, with exactly the same value as in ASCII, so the range is again 0-127.
Some readers will wonder why it stops at 127 when a byte can hold up to 255. The reason is that the first bit of the byte is borrowed: it is fixed to 0. Look at the first row of the "UTF-8 byte template" column and this becomes clear. Only 7 bits are actually used to represent the character, and 7 bits can only express 0-127. This is the same as ASCII; if you want to know which character each of 0-127 corresponds to, look at the ASCII table.
UTF-8 in two bytes:
One byte is 8 bits, so two bytes are 16 bits. The larger the value, the more characters can be represented, so the Greek alphabet, extended Latin letters, and so on can be expressed in two bytes. Look at the second row of the "UTF-8 byte template" column: 110xxxxx 10xxxxxx, 16 bits in total, 8 bits per byte. As you know, in the one-byte case the first bit is not available. Two bytes are a little different: the top three bits of the first byte are borrowed, and the top two bits of the second byte are borrowed. The one thing to understand here is the range of the first byte when a character takes two bytes. The first byte is 110xxxxx, whose bit pattern runs from 11000000 to 11011111, that is, C0-DF in hexadecimal. C0 and C1 could only encode values below 128, which already fit in one byte, so the valid range is C2-DF. That is all you need. You will use this later when writing substring and length functions for UTF-8 strings.
UTF-8 in three bytes:
Three bytes is the case we meet most often, because we write in Chinese. Pay attention to how the bits are borrowed here. Look at the table again: 1110xxxx (E0-EF) 10xxxxxx 10xxxxxx. If it is not clear yet, go back and read from the beginning. When a UTF-8 character takes three bytes, what is the range of the first byte? This must be clear: it runs from 11100000 to 11101111, which in hexadecimal is E0-EF.
UTF-8 in four bytes:
You will not run into this very often, and by now you understand the principle, so I won't go through it.
All right. We have accomplished one goal: determining the number of bytes a character occupies in UTF-8. Suppose that later in development you run into this:
Given the text "Reverse snow cold of php supplement", count its length in characters under UTF-8 and take a substring of it. There is no need to panic. Of course, you may say that counting characters and cutting Chinese strings is trivial: mb_strlen and mb_substr do the job. Yes, they do. But I think we should also know why they work. Our goal is product-level PHP development, not enterprise-website-level PHP development -_-!
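To show what lies behind mb_strlen and mb_substr, here is a minimal sketch that counts and cuts characters using only the first-byte ranges from the table. The function names are made up, the input is assumed to be valid UTF-8, and this is not how the mbstring extension is actually implemented.

```php
<?php
// Hypothetical helper: byte length of a UTF-8 character, from its first byte.
function utf8CharBytes(int $lead): int
{
    if ($lead <= 0x7F) { return 1; }
    if ($lead >= 0xC2 && $lead <= 0xDF) { return 2; }
    if ($lead >= 0xE0 && $lead <= 0xEF) { return 3; }
    if ($lead >= 0xF0 && $lead <= 0xF4) { return 4; }
    throw new InvalidArgumentException('not a valid UTF-8 first byte');
}

// Count characters by hopping from first byte to first byte.
function utf8Length(string $s): int
{
    $count = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i += utf8CharBytes(ord($s[$i]))) {
        $count++;
    }
    return $count;
}

// Take $length characters starting at character index $start.
function utf8Substr(string $s, int $start, int $length): string
{
    $out = '';
    $index = 0; // index of the current character, not the current byte
    for ($i = 0, $n = strlen($s); $i < $n && $index < $start + $length; $index++) {
        $bytes = utf8CharBytes(ord($s[$i]));
        if ($index >= $start) {
            $out .= substr($s, $i, $bytes);
        }
        $i += $bytes;
    }
    return $out;
}

$text = "a\xE5\x95\x8Ab";                    // "a", the three-byte "啊", "b"
echo strlen($text), "\n";                    // 5 bytes
echo utf8Length($text), "\n";                // 3 characters
echo bin2hex(utf8Substr($text, 1, 1)), "\n"; // e5958a
```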
Next, the other goal: mastering the method of converting from Unicode to UTF-8.
Let's go back to the Unicode byte table and look at the first column, the Unicode range. There are four possible byte counts, so there are four ranges. Take the first one: 000000-00007F, decimal 0 to 127. This is the range of Unicode values that are encoded in one byte. The other rows are read the same way.
Now that the above is clear, here is the process of converting a Unicode value to UTF-8:
Take the character "啊" (ah) as an example. Its Unicode code point is U+554A (how do we know? Look it up in a Unicode code table). Now let's convert it to UTF-8:
U+554A in decimal is 21834. Checking the first column of the Unicode byte table, 21834 falls within the three-byte range 2048-55295 (000800-00D7FF), so "啊" takes three bytes in UTF-8.
The three-byte UTF-8 template (see the Unicode byte table) is 1110xxxx 10xxxxxx 10xxxxxx.
U+554A in binary, grouped to match the template, is: 0101 010101 001010
Fill these bits in order into the x positions of the three-byte template. The value has only 15 significant bits while the template has 16 free positions, so the top bit is padded with 0. The result is 11100101 10010101 10001010.
The final result: the three bytes 0xE5 0x95 0x8A are the UTF-8 encoding of "啊".
<?php
// The three UTF-8 bytes of "啊", written as hexadecimal strings.
$v1 = 'E5';
$v2 = '95';
$v3 = '8A';
// Convert each byte from hexadecimal to binary.
$v1 = base_convert($v1, 16, 2);
$v2 = base_convert($v2, 16, 2);
$v3 = base_convert($v3, 16, 2);
echo $v1 . $v2 . $v3 . "<br>";
$sss = $v1 . $v2 . $v3; // 111001011001010110001010
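The template-filling steps walked through above can also be sketched with bit operations. The function name `codePointToUtf8` is made up for illustration; it simply shifts the code point's bits into the x positions of the matching template.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 by filling
// the byte templates from the table (0xxxxxxx, 110xxxxx 10xxxxxx, ...).
function codePointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not an encodable code point');
    }
    if ($cp <= 0x7F) {
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// U+554A is the character "啊" from the walkthrough.
echo strtoupper(bin2hex(codePointToUtf8(0x554A))), "\n"; // E5958A
echo codePointToUtf8(97), "\n";                          // a
```

The 0x3F mask keeps the low six bits that go into each 10xxxxxx continuation byte, and the shifts pick out which six bits those are.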