Some simple thoughts on character coding

Last Update:2014-12-10 Source: Internet

Author: User

The previous article mentioned the BOM header, which is really a text-encoding issue. The BOM header appears when a file is written with a text editor under Windows and saved in UTF-8 format. We usually save PHP scripts in UTF-8 format as well, and when such a file is read as UTF-8 we cannot see the BOM header. If we read the same PHP script file according to the GBK encoding, however, the BOM header is easy to spot. The reason is simple.

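Let us first look at the encoding rules of UTF-8. UTF-8 is a variable-length encoding: it re-encodes Unicode (the universal character set) code points as sequences of bytes. For characters in the range 0x00-0x7F, the UTF-8 encoding is exactly the same as the ASCII encoding. A UTF-8 sequence is at most 4 bytes long. In general, the rules are: a first byte of the form 0xxxxxxx is a complete one-byte character; first bytes of the form 110xxxxx, 1110xxxx and 11110xxx start sequences of 2, 3 and 4 bytes respectively; and every following byte of a sequence has the form 10xxxxxx.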
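The first-byte rule can be sketched in a few lines. This is only an illustration written in Python rather than PHP, and the function name utf8_sequence_length is made up for this example.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: ASCII, one byte
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: two bytes
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: three bytes
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte and 11111xxx is never used in UTF-8.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


# Compare the predicted length with the length Python's own encoder produces.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(encoded.hex(), len(encoded), utf8_sequence_length(encoded[0]))
```

The BOM's first byte, 0xEF, falls into the 1110xxxx case, which is why the three bytes EF BB BF are read as one character.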

From these rules we can see that the BOM header 0xEF 0xBB 0xBF is a valid UTF-8 encoding of a single character. What does the BOM header look like under the GBK encoding rules? First, the GBK encoding rules:
"GBK also uses a double-byte representation. The overall encoding range is 8140-FEFE: the first byte is between 81 and FE, and the tail byte is between 40 and FE, excluding the xx7F line. There are 23,940 code positions in total, containing 21,886 Chinese characters and graphic symbols: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols." (source: Baidu Encyclopedia)
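The figure of 23,940 code positions follows directly from the byte ranges in the quotation. The short Python sketch below checks that arithmetic. The helper is_gbk_double_byte is a hypothetical name for this example, and it tests only the byte ranges, not whether a character is actually assigned to the position.

```python
# First byte: 0x81..0xFE. Tail byte: 0x40..0xFE, with 0x7F excluded.
lead_bytes = range(0x81, 0xFE + 1)
trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]

code_positions = len(lead_bytes) * len(trail_bytes)
print(len(lead_bytes), len(trail_bytes), code_positions)  # 126 190 23940


def is_gbk_double_byte(lead: int, trail: int) -> bool:
    """Range check only: True if (lead, trail) lies inside the GBK code space."""
    return 0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F


print(is_gbk_double_byte(0xEF, 0xBB))  # first two bytes of the BOM
print(is_gbk_double_byte(0xBF, 0x3C))  # third BOM byte followed by "<"
```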

Take a string that starts with a BOM header, <?xml version, and look at it with xxd. The hexadecimal dump is: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e. Decoding according to UTF-8, the first hex digit is e, binary "1110", which means this character occupies 3 bytes, so ef bb bf is parsed as one character. The next hex digit is 3, binary "0011", which matches the ASCII rule, so 3c is looked up as the character "<" (and the following 3f is "?"). Parsing continues in this way through the rest of the text according to the UTF-8 rules.
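The same walk-through can be reproduced on the bytes from the dump. The following Python sketch is only an illustration. It decodes the bytes as UTF-8 and shows that the BOM becomes a single invisible character, U+FEFF, in front of the text.

```python
import codecs

# The bytes from the hex dump above.
raw = bytes.fromhex("efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e")

# The file starts with the three-byte UTF-8 BOM.
has_bom = raw.startswith(codecs.BOM_UTF8)
print(has_bom)

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
text = raw.decode("utf-8")
print(len(raw), len(text))  # 16 bytes become 14 characters

# The "utf-8-sig" codec decodes the same bytes and drops a leading BOM.
clean = raw.decode("utf-8-sig")
print(clean)
```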

How are the same bytes interpreted under the GBK rules? According to the rules above, efbb is a valid GBK code and corresponds to the Chinese character "锘" (nobelium). The next byte, bf, is within the GBK first-byte range, but the byte after it, 3c, is clearly outside the GBK tail-byte range, so the pair cannot be decoded as a valid character.
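A quick way to check this reading is to hand the same bytes to a GBK decoder. The sketch below uses Python's built-in gbk codec as a stand-in. Other GBK implementations may report the error differently.

```python
raw = bytes.fromhex("efbbbf3c3f786d6c")  # BOM followed by "<?xml"

# The first two bytes form a valid GBK pair and decode to one Chinese character.
first_char = raw[:2].decode("gbk")
print(first_char)

# The next pair is bf 3c: bf is a legal first byte, but 3c is below 0x40,
# so a strict GBK decoder rejects the sequence.
try:
    raw.decode("gbk")
    strict_decode_failed = False
except UnicodeDecodeError as exc:
    strict_decode_failed = True
    print("GBK decoding failed:", exc.reason)

# With replacement enabled, the undecodable byte shows up as visible garbage,
# which is why the BOM is easy to spot when the file is read as GBK.
print(raw.decode("gbk", errors="replace"))
```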

We have also found that if all the code we write is required to be saved in GBK encoding, the BOM header does not appear.

Second, the relationship between the encoding a PHP script is stored in and the encoding of its output

The previous part dealt with the encoding in which code is stored, such as UTF-8, GBK or UTF-16. But code also has to perform operations such as reading input and producing output, and this raises several questions:
1, When code stored as UTF-8 reads GBK-encoded input, how is the input decoded?
2, When code stored as UTF-8 outputs a string, what encoding is the output string in?

Experiments on these two questions show that if the PHP script is stored as UTF-8, the strings it outputs are UTF-8 encoded, and if the PHP script is stored as GBK, the strings it outputs are GBK encoded. This explains one thing: if a PHP script outputs XML, the encoding of the output has nothing to do with whether the XML declaration says encoding="UTF-8" or encoding="GBK". The declaration only tells the caller which encoding to use when decoding. So note that when the encoding in which the script is stored differs from the encoding declared in the XML, the output has to be transcoded.
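The mismatch and the transcoding step can be sketched as follows. The example is in Python for illustration only, and the XML content is made up. In PHP the corresponding conversion would be done with a function such as iconv or mb_convert_encoding.

```python
# An XML document whose declaration promises GBK.
xml_text = '<?xml version="1.0" encoding="GBK"?><name>中文</name>'

# What a script stored as UTF-8 would emit: the declaration says GBK,
# but the bytes are UTF-8.
utf8_bytes = xml_text.encode("utf-8")

# Transcoding so the bytes match the declaration.
gbk_bytes = utf8_bytes.decode("utf-8").encode("gbk")

# The two Chinese characters take 3 bytes each in UTF-8 and 2 bytes each in GBK.
print(len(utf8_bytes), len(gbk_bytes))

# A caller that trusts the declaration decodes the transcoded bytes correctly...
print(gbk_bytes.decode("gbk") == xml_text)

# ...but gets garbled text from the untranscoded UTF-8 bytes.
print(utf8_bytes.decode("gbk", errors="replace"))
```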

Third, problems with using substr in PHP scripts

substr in PHP is mainly used to cut out part of a string. It counts bytes, so if the string is English there is generally no problem: one character is one byte and you can simply count characters. If the string is Chinese, you need to be very careful. First determine the encoding of the input: in GBK a Chinese character is two bytes, while in UTF-8 a Chinese character is 3 to 4 bytes. Cutting Chinese input at character boundaries is therefore troublesome. If you have to do it, you can try the following steps:
1, First convert the input characters to GBK format.
2, Read the string from the beginning and check whether the current byte is greater than 0x7F. If it is, read two bytes; otherwise, read one byte.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.