Automatically detecting a string's character set and transcoding it in PHP

Last Update:2014-04-11 Source: Internet

Author: User


The principle is simple. In GB2312/GBK a Chinese character occupies two bytes, each within a known value range, while in UTF-8 a Chinese character occupies three bytes, each of which also has a known value range. English (ASCII) characters occupy a single byte in either encoding, full-width forms excepted.
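To make the byte counts concrete, the following minimal sketch prints the bytes of the character 中 (U+4E2D) in both encodings. The byte strings are written as hex escapes so the result does not depend on the encoding of the script file itself. The helper name describeBytes is only illustrative and is not part of the original article.

```php
<?php
// Illustrative helper (not from the original article): returns the bytes
// of a string as space-separated upper-case hex pairs.
function describeBytes(string $bytes): string
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

$utf8 = "\xE4\xB8\xAD"; // 中 encoded as UTF-8: three bytes
$gbk  = "\xD6\xD0";     // 中 encoded as GB2312/GBK: two bytes

echo strlen($utf8), ' bytes: ', describeBytes($utf8), "\n";
echo strlen($gbk), ' bytes: ', describeBytes($gbk), "\n";
```

Every byte of the Chinese character has its high bit set in both encodings, which is why the number of bytes and their ranges, rather than any single byte, are what distinguish the two.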

To check a file's encoding, you can also look directly for the UTF-8 BOM. Without further ado, the function below detects a string's encoding and transcodes it.
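The BOM check mentioned above might look like the sketch below. The function names hasUtf8Bom and fileHasUtf8Bom are hypothetical. Note that the absence of a BOM does not prove a file is not UTF-8, since many UTF-8 files are written without one.

```php
<?php
// Hypothetical helper: true when $data starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(string $data): bool
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: reads only the first three bytes of a file.
function fileHasUtf8Bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false; // unreadable file: treat as "no BOM found"
    }
    $head = fread($handle, 3);
    fclose($handle);
    return $head !== false && hasUtf8Bom($head);
}
```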

<?php
// Guess whether $string is UTF-8 or GB2312 and convert it to $outEncoding.
// This is a heuristic for Chinese text, not a general encoding detector.
function safeEncoding($string, $outEncoding = 'UTF-8')
{
    $encoding = 'UTF-8';
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($string[$i]);
        if ($byte < 128) {
            continue; // ASCII bytes are the same in both encodings
        }
        // UTF-8: lead byte 1110xxxx followed by two continuation bytes 10xxxxxx
        if (($byte & 0xF0) == 0xE0 && $i + 2 < $length
            && (ord($string[$i + 1]) & 0xC0) == 0x80
            && (ord($string[$i + 2]) & 0xC0) == 0x80) {
            $encoding = 'UTF-8';
            break;
        }
        // GB2312: both bytes of a character fall in the range 0xA1-0xFE
        if ($byte >= 0xA1 && $i + 1 < $length && ord($string[$i + 1]) >= 0xA1) {
            $encoding = 'GB2312';
            break;
        }
    }
    if (strtoupper($encoding) == strtoupper($outEncoding)) {
        return $string;
    }
    return iconv($encoding, $outEncoding, $string);
}
?>
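A stricter alternative to hand-written bit masks, offered here only as a sketch and not as the article's method, is to test whether the whole string is valid UTF-8 and fall back to GBK otherwise. The names guessEncoding and toUtf8 and the GBK fallback are assumptions made for illustration. The check relies on preg_match() failing on a malformed subject when the /u modifier is used.

```php
<?php
// Hypothetical helper: returns 'UTF-8' for valid UTF-8 input, else 'GBK'.
function guessEncoding(string $string): string
{
    // With the /u modifier, PCRE validates the subject as UTF-8 and
    // preg_match() returns false (not 0 or 1) when it is malformed.
    if (preg_match('//u', $string) === 1) {
        return 'UTF-8';
    }
    return 'GBK'; // assumption: any non-UTF-8 input is legacy Chinese text
}

// Hypothetical helper: converts to UTF-8 only when needed.
function toUtf8(string $string): string
{
    $encoding = guessEncoding($string);
    if ($encoding === 'UTF-8') {
        return $string;
    }
    $converted = iconv($encoding, 'UTF-8', $string);
    return $converted === false ? $string : $converted;
}
```

Validating the entire string avoids deciding on the first multi-byte character alone, at the cost of assuming that anything which is not UTF-8 must be GBK.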

About BOM

The byte order mark (BOM) is the Unicode character at code point U+FEFF ("zero width no-break space"). When text is encoded as UTF-16 or UTF-32, this character indicates the byte order. It is also often used as a signature marking a file as UTF-8, UTF-16, or UTF-32.

In most character encodings, the BOM is a byte pattern that is unlikely to appear in other files (it usually looks like a run of garbled control codes). If a BOM is misinterpreted as an ordinary character in Unicode text, it is invisible, because it is a zero-width no-break space. As of Unicode 3.2, using U+FEFF for anything other than a BOM is deprecated (U+2060 is used for that purpose instead), which means U+FEFF should serve only as a byte order mark.

In UTF-16, the BOM is placed as the first character of a file or stream to indicate the endianness (byte order) of all the 16-bit units in that file or stream.

If the 16-bit units are stored in big-endian order, the BOM appears in the byte sequence as 0xFE followed by 0xFF (the 0x prefix marks hexadecimal).
If the 16-bit units are stored in little-endian order, the BOM appears as 0xFF followed by 0xFE.

In Unicode, the code point U+FFFE is guaranteed never to be assigned to a character. This means that the bytes 0xFF, 0xFE can only be interpreted as U+FEFF in little-endian order (because they cannot be U+FFFE in big-endian order).
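This reasoning can be checked directly. The sketch below reads the two bytes FF FE as a 16-bit unit in each byte order using PHP's unpack(), where the format 'n' is big-endian and 'v' is little-endian. The variable names are illustrative.

```php
<?php
$bytes = "\xFF\xFE";

// unpack() returns an array indexed from 1 when no element name is given.
$asBigEndian    = unpack('n', $bytes)[1]; // 0xFFFE: never a character
$asLittleEndian = unpack('v', $bytes)[1]; // 0xFEFF: the byte order mark

printf("big-endian:    U+%04X\n", $asBigEndian);
printf("little-endian: U+%04X\n", $asLittleEndian);
```

Since U+FFFE can never occur in valid text, only the little-endian reading is plausible, and the stream can be decoded accordingly.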

UTF-8 has no byte order issue. A BOM encoded in UTF-8 only signals that the file is UTF-8; it says nothing about byte order. [1] Many Windows programs (including Notepad) add a BOM to UTF-8 files. On Unix-like systems, however, which rely heavily on plain text files for file formats and inter-process communication, this practice is not recommended, because it interferes with the correct handling of important markers such as the shebang at the start of an interpreter script. It also affects programming languages that do not recognize it; for example, gcc reports that the source file begins with unrecognized characters. In PHP, if output buffering is not enabled, the BOM has the subtle effect of causing page content to be sent to the browser (that is, the HTTP headers have already been sent), so the PHP script can no longer set HTTP headers. In UTF-8 the BOM is the byte sequence EF BB BF, which text editors and web browsers that are not prepared to handle UTF-8 typically display as the ISO-8859-1 characters ï»¿.
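One common way to avoid these problems when reading UTF-8 text is to drop a leading BOM before using the data. The sketch below shows the idea; the function name stripUtf8Bom is hypothetical.

```php
<?php
// Hypothetical helper: removes a single leading UTF-8 BOM, if present.
function stripUtf8Bom(string $text): string
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Usage: clean the text before echoing it or parsing it further.
echo stripUtf8Bom("\xEF\xBB\xBFhello"), "\n";
```

This only helps for data the script reads. A BOM at the start of the PHP source file itself has to be removed by saving the file without one.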

Although a BOM can also be used with UTF-32, this encoding is rarely used for transmission, and its rules are the same as for UTF-16. For the IANA-registered character sets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, a BOM must not be used. A U+FEFF at the start of such a document is interpreted as a (deprecated) "zero width no-break space", because the names of these character sets already determine the byte order. For the registered character sets UTF-16 and UTF-32, an initial U+FEFF indicates the byte order.
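Putting the signatures together, a BOM sniffer for data labeled simply UTF-8, UTF-16, or UTF-32 could be sketched as follows. The function name detectBom is hypothetical, and the returned strings name the byte order that was found. One caveat: UTF-16 little-endian text that starts with a BOM followed by U+0000 is indistinguishable from the UTF-32 little-endian BOM.

```php
<?php
// Hypothetical helper: returns the encoding implied by a leading BOM,
// or null when no BOM is found.
function detectBom(string $data): ?string
{
    // UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    $signatures = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($signatures as $encoding => $bom) {
        // strncmp() is binary-safe, so the NUL bytes compare correctly.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```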

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
