Automatically detecting a string's character set and transcoding it in PHP

Source: Internet
Author: User

The principle is simple. In GB2312/GBK a Chinese character occupies two bytes, and each byte falls within a known value range. In UTF-8 a commonly used Chinese character occupies three bytes, and each of those bytes also falls within a known range. English (ASCII) characters, excluding full-width forms, occupy a single byte in either encoding.

For file encoding checks, you can also look directly for the UTF-8 BOM. Without further ado, the function below checks a string's encoding and transcodes it.
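To make the byte counts concrete, here is a small sketch that prints the bytes of one character in both encodings. It assumes the iconv extension is available and that it knows the name GBK; hexBytes is a hypothetical helper introduced only for this illustration.

```php
<?php
// Hypothetical helper: render a byte string as space-separated hex pairs.
function hexBytes(string $bytes): string
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

// The character U+4E2D written as explicit UTF-8 bytes, so the result
// does not depend on how this source file itself is saved.
$utf8 = "\xE4\xB8\xAD";
$gbk  = iconv('UTF-8', 'GBK', $utf8);

echo strlen($utf8), ' bytes in UTF-8: ', hexBytes($utf8), "\n"; // 3 bytes: E4 B8 AD
echo strlen($gbk), ' bytes in GBK: ', hexBytes($gbk), "\n";     // 2 bytes: D6 D0
echo strlen('A'), " byte for ASCII 'A' in either encoding\n";
```

Every byte of a multi-byte character has its high bit set in both encodings, which is why the lead byte and the bytes that follow it have to be examined together.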

<?php
// Guess whether $string is UTF-8 or GB2312/GBK from its first multi-byte
// sequence, then convert it to $outEncoding if needed. This is a heuristic:
// some GB2312 byte runs also look like valid UTF-8.
function safeEncoding($string, $outEncoding = 'UTF-8')
{
    $encoding = 'UTF-8';
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($string[$i]);
        if ($byte < 128) {
            continue; // ASCII is the same in both encodings
        }
        // UTF-8: lead byte 1110xxxx followed by two 10xxxxxx bytes
        if (($byte & 0xF0) == 0xE0 && $i + 2 < $length
            && (ord($string[$i + 1]) & 0xC0) == 0x80
            && (ord($string[$i + 2]) & 0xC0) == 0x80) {
            $encoding = 'UTF-8';
            break;
        }
        // GB2312/GBK: a high lead byte followed by a second high byte
        if ($i + 1 < $length && (ord($string[$i + 1]) & 0x80) == 0x80) {
            $encoding = 'GB2312';
            break;
        }
    }
    if (strtoupper($encoding) == strtoupper($outEncoding)) {
        return $string;
    }
    return iconv($encoding, $outEncoding, $string);
}
?>

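A different technique from the byte-by-byte check above is to validate the whole string as UTF-8 and fall back to GBK only when validation fails. The following is a minimal sketch of that alternative, not the original author's method. It assumes the mbstring and iconv extensions are loaded and that the input can only be UTF-8 or GB2312/GBK; guessAndConvert is a hypothetical name.

```php
<?php
// Assumption: the input is either UTF-8 or GB2312/GBK, nothing else.
function guessAndConvert(string $string, string $outEncoding = 'UTF-8')
{
    // mb_check_encoding() validates every byte sequence in the string,
    // not just the first multi-byte character.
    $encoding = mb_check_encoding($string, 'UTF-8') ? 'UTF-8' : 'GBK';
    if (strtoupper($encoding) === strtoupper($outEncoding)) {
        return $string;
    }
    // iconv() returns false if the conversion fails.
    return iconv($encoding, $outEncoding, $string);
}

// The GB2312 bytes D6 D0 CE C4 are not valid UTF-8, so they are converted.
$converted = guessAndConvert("\xD6\xD0\xCE\xC4");
echo bin2hex($converted), "\n"; // e4b8ade69687
```

Validating the full string reduces false positives, but short GB2312 strings whose bytes happen to form valid UTF-8 would still be misclassified.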
About BOM

The byte-order mark (BOM) is the Unicode character "zero width no-break space" located at code point U+FEFF. When a string is encoded as UTF-16 or UTF-32, this character is used to indicate the byte order. It is also often used as a sign that a file is encoded as UTF-8, UTF-16, or UTF-32.

In most character encodings, the byte-order mark is a pattern that is unlikely to appear in other files (it usually looks like a run of garbled control codes). If a byte-order mark is misinterpreted as an actual character in Unicode text, it is invisible, because it is a zero-width no-break space. Unicode 3.2 deprecated the use of U+FEFF for anything other than the byte-order mark (U+2060 is used for that purpose instead), meaning that U+FEFF should be used only as a byte-order mark.

In UTF-16, the byte-order mark is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit units in that file or stream.

  • If the 16-bit units are stored in big-endian order, the byte-order mark appears as the byte 0xFE followed by 0xFF (0x denotes hexadecimal).
  • If the 16-bit units are stored in little-endian order, the byte-order mark appears as the byte 0xFF followed by 0xFE.

In Unicode, the code point U+FFFE is guaranteed never to be assigned to a character. This means that the bytes 0xFF, 0xFE can only be interpreted as U+FEFF in little-endian order (they cannot be U+FFFE in big-endian order).
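The two byte orders can be reproduced with PHP's built-in pack() and unpack(), where the format code n means a 16-bit big-endian unsigned integer and v means 16-bit little-endian. This is only an illustrative sketch; the variable names are made up for the example.

```php
<?php
// Serialise the code point U+FEFF as one 16-bit unit in each byte order.
$bigEndianBom    = pack('n', 0xFEFF); // bytes FE FF
$littleEndianBom = pack('v', 0xFEFF); // bytes FF FE

echo strtoupper(bin2hex($bigEndianBom)), "\n";    // FEFF
echo strtoupper(bin2hex($littleEndianBom)), "\n"; // FFFE

// Reading the little-endian bytes with the wrong (big-endian) order
// yields 0xFFFE, a value Unicode never assigns to a character, so the
// reader knows the data must be little-endian.
$misread = unpack('n', $littleEndianBom)[1];
printf("%04X\n", $misread); // FFFE
```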

UTF-8 has no byte-order issue. A byte-order mark encoded in UTF-8 serves only to mark the file as UTF-8, not to describe byte order. [1] Many Windows programs (including Notepad) add a byte-order mark to UTF-8 files. On Unix-like systems, however, which rely heavily on plain text files for file formats and for inter-process communication, this practice is not recommended, because it interferes with the correct handling of important markers such as the shebang at the start of an interpreter script. It also affects programming languages that do not recognize it; for example, gcc reports that the source file starts with unrecognized characters. In PHP, if output buffering is not enabled, the byte-order mark is sent to the browser as page content, which means the HTTP headers have already been sent and the PHP script can no longer set them. In UTF-8 the byte-order mark is the byte sequence EF BB BF, which most text editors and web browsers that are not prepared to handle UTF-8 display as the ISO-8859-1 characters ï»¿.
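One way to avoid these problems when reading text that may carry a UTF-8 byte-order mark is to remove the three bytes before further processing. The sketch below is one possible approach; UTF8_BOM and stripUtf8Bom are hypothetical names chosen for this example.

```php
<?php
// The UTF-8 encoding of U+FEFF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Return $text without a leading UTF-8 byte-order mark, if one is present.
function stripUtf8Bom(string $text): string
{
    if (strncmp($text, UTF8_BOM, 3) === 0) {
        // The cast covers PHP versions where substr() can return false.
        return (string) substr($text, 3);
    }
    return $text;
}

echo stripUtf8Bom("\xEF\xBB\xBFhello"), "\n"; // hello
```

For a PHP script file saved with a byte-order mark, the mark sits before the opening tag and is emitted as output, so the fix there is to re-save the file without it rather than to strip it at run time.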

Although a byte-order mark can also be used with UTF-32, that encoding is rarely used for transmission, and its rules are the same as for UTF-16. For the IANA-registered character sets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, a byte-order mark must not be used: a U+FEFF at the start of the document is interpreted as a (deprecated) zero-width no-break space, because the names of these character sets already determine the byte order. For the registered character sets UTF-16 and UTF-32, a leading U+FEFF indicates the byte order.
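Putting the signatures together, a file's leading bytes can be compared against each known byte-order mark. The sketch below is a hypothetical helper (detectBom and its return labels are not from any library). It checks the four-byte UTF-32 marks first, because the UTF-32 little-endian mark FF FE 00 00 begins with the same two bytes as the UTF-16 little-endian mark.

```php
<?php
// Return a label for the byte-order mark at the start of $bytes,
// or null if none of the known signatures matches.
function detectBom(string $bytes): ?string
{
    // Longest signatures first, so UTF-32 is not mistaken for UTF-16.
    $signatures = [
        "\x00\x00\xFE\xFF" => 'UTF-32, big-endian',
        "\xFF\xFE\x00\x00" => 'UTF-32, little-endian',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16, big-endian',
        "\xFF\xFE"         => 'UTF-16, little-endian',
    ];
    foreach ($signatures as $bom => $label) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $label;
        }
    }
    return null;
}

echo detectBom("\xFF\xFEh\x00i\x00"), "\n"; // UTF-16, little-endian
```

The ordering is a design choice with a known limitation: UTF-16 little-endian text whose first character after the mark is U+0000 would be reported as UTF-32.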
