Some simple thoughts on character encoding

Source: Internet
Author: User

The BOM header mentioned in the previous article is really a text-encoding issue. The BOM appears when a text editor on Windows saves a file in UTF-8 format. We usually edit and save PHP scripts as UTF-8 too, so as long as the file is read as UTF-8 the BOM goes unnoticed. If the same file is treated as GBK, however, the BOM is easy to spot, and the reason is simple.

Let us first look at the encoding rules of UTF-8. UTF-8 is a variable-length encoding: it re-encodes Unicode code points as sequences of bytes. For characters in the range 0x00-0x7F, the UTF-8 encoding is exactly the same as ASCII. A UTF-8 sequence is at most 4 bytes long. The general rules are:

    Code point range        Byte sequence (binary)
    U+0000  - U+007F        0xxxxxxx
    U+0080  - U+07FF        110xxxxx 10xxxxxx
    U+0800  - U+FFFF        1110xxxx 10xxxxxx 10xxxxxx
    U+10000 - U+10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
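As a rough illustration of these rules, the Python sketch below derives the length of a UTF-8 sequence from its first byte. The function name utf8_sequence_length and the sample characters are assumptions made for this example, and the check looks only at the bit pattern, not at every validity rule of UTF-8.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if first_byte < 0x80:             # 0xxxxxxx: identical to ASCII
        return 1
    if 0xC0 <= first_byte < 0xE0:     # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= first_byte < 0xF0:     # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= first_byte < 0xF8:     # 11110xxx: four-byte sequence
        return 4
    raise ValueError("0x%02X cannot start a UTF-8 sequence" % first_byte)


# Hypothetical samples: an ASCII letter, an accented letter,
# a Chinese character, and a character outside the basic plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    first = ch.encode("utf-8")[0]
    print("U+%04X first byte 0x%02X length %d" % (ord(ch), first, n))
```

The first byte of the BOM, 0xEF, falls in the 1110xxxx group, which is why a UTF-8 decoder consumes three bytes for it.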


From these rules we can see that the BOM header 0xEF 0xBB 0xBF is a well-formed three-byte UTF-8 sequence, so a UTF-8 decoder reads it as a single character (U+FEFF). What happens if we read the BOM according to the GBK rules instead? First, the GBK encoding rules:
"GBK also uses a double-byte representation. The overall encoding range is 8140-FEFE: the first byte lies in 81-FE and the tail byte in 40-FE, with the xx7F column removed. This gives 23,940 code positions in total, of which 21,886 are assigned to Chinese characters and graphic symbols: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols." (Source: Baidu Encyclopedia)
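The figure of 23,940 code positions follows directly from the byte ranges in the quote. The Python sketch below redoes the arithmetic and wraps the range test in a helper; the name is_gbk_pair is an assumption made for this example, and the helper checks only the byte ranges, not whether a character is actually assigned to the position.

```python
# First byte: 0x81-0xFE. Tail byte: 0x40-0xFE, excluding 0x7F.
lead_bytes = range(0x81, 0xFE + 1)
tail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]

code_positions = len(lead_bytes) * len(tail_bytes)
print(len(lead_bytes), len(tail_bytes), code_positions)


def is_gbk_pair(first, second):
    """Check whether two byte values fall inside the GBK double-byte ranges."""
    return 0x81 <= first <= 0xFE and 0x40 <= second <= 0xFE and second != 0x7F
```

With 126 possible first bytes and 190 possible tail bytes, the product is 23,940, matching the quoted total.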


Take a string that starts with a BOM header, here one beginning with <?xml version. Dumped with xxd, its bytes in hexadecimal are:

    efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e

Decoding as UTF-8, the first hex digit is E, binary "1110", which means this character occupies 3 bytes, so EF BB BF is parsed as one character. The next byte starts with 3, binary "0011"; its high bit is 0, so it follows the ASCII rule, and 3C corresponds to the character "<". Continuing in this way, the whole text parses correctly under the UTF-8 rules.
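The walk through the dump can be reproduced with a short Python sketch. The helper split_utf8 is a name invented for this example; it splits the bytes into sequences using only the first-byte bit patterns and does not validate continuation bytes.

```python
# The hex dump from the text, as printed by xxd.
dump = "efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e"
data = bytes.fromhex(dump.replace(" ", ""))


def split_utf8(raw):
    """Split bytes into UTF-8 sequences by inspecting each first byte."""
    sequences = []
    i = 0
    while i < len(raw):
        first = raw[i]
        if first < 0x80:               # 0xxxxxxx
            n = 1
        elif first >> 5 == 0b110:      # 110xxxxx
            n = 2
        elif first >> 4 == 0b1110:     # 1110xxxx
            n = 3
        elif first >> 3 == 0b11110:    # 11110xxx
            n = 4
        else:
            raise ValueError("unexpected first byte 0x%02X" % first)
        sequences.append(raw[i:i + n])
        i += n
    return sequences


parts = split_utf8(data)
text = data.decode("utf-8")

print(parts[0].hex(), parts[1])   # the BOM bytes, then the "<" byte
print(repr(text))
```

The first sequence is the three-byte BOM, and every sequence after it is a single ASCII byte.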


 

How would the same bytes be interpreted under the GBK rules? By the rules above, EF BB is a valid GBK pair and decodes to the Chinese character "锘" (nobelium). The next byte, BF, falls in the GBK first-byte range, but the byte after it, 3C, is outside the tail-byte range (40-FE), so BF 3C cannot be decoded as a GBK character and the text does not parse correctly.
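The same experiment can be sketched in Python, whose built-in "gbk" codec is used here as a stand-in for whatever GBK decoder an editor or browser applies. The variable names are assumptions made for this example.

```python
# BOM followed by "<?xml", the start of the dump discussed above.
data = bytes.fromhex("efbbbf3c3f786d6c")

# The first two bytes form a valid GBK pair.
first_pair = data[:2].decode("gbk")
print(first_pair)

# Strict decoding of the whole buffer fails at the BF 3C pair.
try:
    data.decode("gbk")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("GBK decoding failed:", exc)

# A lenient decoder substitutes a replacement character instead of failing,
# which is the kind of garbage a GBK viewer shows in front of the text.
lenient = data.decode("gbk", errors="replace")
print(lenient)
```

This is why the BOM goes unnoticed when the file is read as UTF-8 but shows up as stray characters when it is read as GBK.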

We have found, however, that if we uniformly require all code to be saved in GBK encoding, the BOM header does not appear.

Second, the relationship between the encoding a PHP script is stored in and the encoding of its output

The previous section dealt with the encoding a script is stored in, such as UTF-8, GBK, or UTF-16. But code also performs operations such as reading input and writing output, which raises several questions:
1. When a script stored as UTF-8 reads GBK-encoded input, how is that input decoded?
2. When a script stored as UTF-8 outputs a string, what encoding is the output string in?

Experiments on these two questions show that if the PHP script is stored as UTF-8, the strings it outputs are UTF-8 encoded, and if the script is stored as GBK, the output strings are GBK encoded, because PHP emits the bytes of a string literal unchanged. This explains one point: when a PHP script outputs XML, the actual encoding of the output has nothing to do with the encoding="UTF-8" or encoding="GBK" attribute in the XML declaration. That attribute only tells the caller which encoding to decode with. So note that if the encoding the script is stored in differs from the encoding declared in the XML, the output must be transcoded.
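The effect can be imitated in Python, where the bytes of a literal under each storage encoding are made explicit. This is a sketch under assumptions: the sample text, the transcode helper, and the msg element are invented for this example, and in PHP the conversion step would typically be done with iconv() or mb_convert_encoding().

```python
# The same two-character Chinese text as it would sit in a script file
# saved as UTF-8 versus one saved as GBK.
text = "\u4e2d\u6587"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(utf8_bytes.hex(), gbk_bytes.hex())


def transcode(raw, source, target):
    """Re-encode bytes from the source encoding into the target encoding."""
    return raw.decode(source).encode(target)


# A GBK-stored script that declares UTF-8 in its XML output has to convert
# its strings first, otherwise the declaration and the bytes disagree.
declaration = b'<?xml version="1.0" encoding="UTF-8"?>'
wrong_payload = declaration + b"<msg>" + gbk_bytes + b"</msg>"
right_payload = declaration + b"<msg>" + transcode(gbk_bytes, "gbk", "utf-8") + b"</msg>"

print(right_payload.decode("utf-8"))
```

A caller that trusts the declaration can decode right_payload, whereas wrong_payload contains GBK bytes that are not valid UTF-8.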


Third, problems with using substr in PHP scripts

substr in PHP is mainly used to cut out part of a string. If the string is English text there is generally no problem, because counting bytes is the same as counting characters. Chinese text needs much more care. First determine the encoding of the input: in GBK a Chinese character takes two bytes, while in UTF-8 a Chinese character takes 3 to 4 bytes. Working out where the Chinese characters are in the input is therefore troublesome. If you have to do it, you can try the following steps:
1. First convert the input string to GBK.
2. Read the string from the beginning and check whether the current byte is greater than 0x7F. If it is, read two bytes; otherwise read one byte.
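The two steps can be sketched as follows. The sketch is written in Python on byte strings, which behave like PHP strings in that slicing counts bytes rather than characters; the function name gbk_substr and the sample text are assumptions made for this example.

```python
def gbk_substr(raw, max_chars):
    """Return the first max_chars characters of GBK bytes without splitting a pair."""
    i = 0
    chars = 0
    while i < len(raw) and chars < max_chars:
        # Step 2: a byte above 0x7F starts a two-byte character.
        step = 2 if raw[i] > 0x7F else 1
        if i + step > len(raw):   # dangling first byte at the end: stop
            break
        i += step
        chars += 1
    return raw[:i]


# Step 1: get the input into GBK (here the sample is encoded directly).
raw = "PHP\u4e2d\u6587abc".encode("gbk")

naive = raw[:4]             # byte-wise cut, comparable to substr($s, 0, 4)
safe = gbk_substr(raw, 4)   # character-wise cut

print(naive)
print(safe.decode("gbk"))
```

The byte-wise cut ends in the middle of a Chinese character, while the character-wise cut keeps the fourth character whole.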




