The principle is simple. In GB2312/GBK a Chinese character takes two bytes, and each of those bytes falls within a fixed range of values. In UTF-8 a common Chinese character takes three bytes, and each byte likewise has its own value range. English (ASCII) characters have values below 128 and take only one byte in either encoding (full-width forms excepted).
When checking the encoding of a file rather than a string, you can also look directly for the UTF-8 BOM at the start of the file. Without further ado: the function built on this principle checks a string's encoding and transcodes it.
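The byte-range idea can be sketched as follows. This is a minimal illustration in Python, not the article's original function: the names `looks_like_utf8`, `looks_like_gbk`, and `to_utf8` are hypothetical, and the UTF-8 check is simplified (it validates lead and continuation byte ranges only). The GBK ranges assumed are a lead byte of 0x81-0xFE followed by a trail byte of 0x40-0xFE other than 0x7F.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Simplified check: every lead byte is followed by the right number
    of continuation bytes in the range 0x80-0xBF."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:              # ASCII, one byte
            extra = 0
        elif 0xC2 <= lead <= 0xDF:   # two-byte sequence
            extra = 1
        elif 0xE0 <= lead <= 0xEF:   # three-byte sequence (common Chinese characters)
            extra = 2
        elif 0xF0 <= lead <= 0xF4:   # four-byte sequence
            extra = 3
        else:
            return False
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any(not 0x80 <= t <= 0xBF for t in tail):
            return False
        i += 1 + extra
    return True


def looks_like_gbk(data: bytes) -> bool:
    """Check that every non-ASCII byte starts a two-byte GBK pair."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:              # ASCII, one byte
            i += 1
            continue
        if not 0x81 <= lead <= 0xFE or i + 1 >= n:
            return False
        trail = data[i + 1]
        if not 0x40 <= trail <= 0xFE or trail == 0x7F:
            return False
        i += 2
    return True


def to_utf8(data: bytes) -> bytes:
    """Return data as UTF-8 bytes, transcoding from GBK when needed."""
    if looks_like_utf8(data):        # test UTF-8 first: its rules are stricter
        return data
    if looks_like_gbk(data):
        return data.decode("gbk", errors="replace").encode("utf-8")
    raise ValueError("input is neither UTF-8 nor GBK by these checks")
```

UTF-8 is tested first because its byte rules are stricter. The check remains a heuristic: some short GBK byte sequences are also well-formed UTF-8, so a result on very short input should be treated with caution.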
About BOM
The byte order mark (BOM) is the Unicode character "zero-width no-break space", located at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character is used to indicate the byte order. It is also often used as a signature showing that a file is encoded in UTF-8, UTF-16, or UTF-32.
In most character encodings, the byte order mark is a pattern that is unlikely to appear in other files (it usually looks like a sequence of meaningless control codes). If a byte order mark is misinterpreted as a real character in a Unicode file, it is not visible, because it is a zero-width no-break space. Since Unicode 3.2, the use of U+FEFF for anything other than a byte order mark has been deprecated (U+2060 is to be used for that purpose instead), so that U+FEFF carries only the byte-order-mark semantics.
In UTF-16, the byte order mark is placed as the first character of a file or text stream to indicate the endianness (byte order) of all the 16-bit units in that file or stream.
- If the 16-bit units are in big-endian order, the byte order mark appears in the byte sequence as 0xFE followed by 0xFF (where 0x indicates hexadecimal).
- If the 16-bit units are in little-endian order, the byte sequence is 0xFF followed by 0xFE.
In Unicode, the code point U+FFFE is guaranteed never to be assigned as a character. This means that the bytes 0xFF, 0xFE can only be interpreted as U+FEFF in little-endian order (because they cannot be U+FFFE in big-endian order).
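The two UTF-16 byte sequences described here can be told apart with a prefix check. The following is a small sketch in Python; `utf16_byte_order` is a hypothetical helper name, while `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are constants from the standard library.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None according to a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return None                                # no BOM: byte order unknown


# U+FEFF serialized in each byte order yields exactly those two sequences.
bom_big = "\ufeff".encode("utf-16-be")
bom_little = "\ufeff".encode("utf-16-le")
```

Note that a little-endian UTF-32 byte order mark (FF FE 00 00) also begins with FF FE, so a detector that must handle UTF-32 as well would need to test for the longer marks first.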
UTF-8 has no byte-order issue. A UTF-8 encoded byte order mark only marks the file as UTF-8; it does not describe the byte order. [1] Many Windows programs (including Notepad) add a byte order mark to UTF-8 files. On Unix-like systems, which make heavy use of plain text files for file formats and inter-process communication, this practice is not recommended, because it interferes with the correct handling of significant leading bytes such as the shebang at the start of an interpreter script. It also affects programming languages that do not recognize it: GCC, for example, reports unrecognized characters at the beginning of the source file. In PHP, if output buffering is not enabled, the mark causes page content to start being sent to the browser (that is, the HTTP headers have already been sent), so the PHP script can no longer set HTTP headers. The byte order mark is represented in UTF-8 as the byte sequence EF BB BF; in most text editors and web browsers that are not prepared to handle UTF-8, it is displayed as ï»¿ in an ISO-8859-1 environment.
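As an illustration of the EF BB BF sequence, the sketch below removes a leading UTF-8 byte order mark so that content such as a shebang line becomes the first bytes again. It is written in Python; `strip_utf8_bom` is a hypothetical helper name, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"
cleaned = strip_utf8_bom(script)

# The "utf-8-sig" codec performs the same removal while decoding to text.
text = script.decode("utf-8-sig")

# Misreading the three BOM bytes as ISO-8859-1 yields three visible characters.
mojibake = codecs.BOM_UTF8.decode("iso-8859-1")
```

The same prefix check applies in other languages: compare the first three bytes of the file against EF BB BF before any output is produced.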
Although the byte order mark can also be used with UTF-32, that encoding is seldom used for transmission, and its rules are similar to those of UTF-16. A byte order mark is not permitted for the IANA-registered character sets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. Because the names of these character sets already determine the byte order, a U+FEFF at the beginning of the document is interpreted as a (deprecated) "zero-width no-break space". For the registered character sets UTF-16 and UTF-32, a U+FEFF at the beginning indicates the byte order.
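Python's standard codecs happen to follow this distinction, which makes it easy to observe. The sketch below decodes the same bytes under the generic `utf-16` label and under the explicit `utf-16-be` label; the variable names are illustrative only.

```python
# "A" preceded by U+FEFF, serialized big-endian: bytes FE FF 00 41
data = "\ufeffA".encode("utf-16-be")

# Generic label: the leading U+FEFF is consumed as a byte order mark.
as_utf16 = data.decode("utf-16")

# Explicit byte order: the leading U+FEFF is kept as an ordinary character.
as_utf16be = data.decode("utf-16-be")

# UTF-32 behaves the same way under its generic label.
data32 = "\ufeffA".encode("utf-32-be")
as_utf32 = data32.decode("utf-32")
```

Here `as_utf16` and `as_utf32` contain only the letter, while `as_utf16be` still carries the U+FEFF character in front of it.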