PHP's mb_detect_encoding can determine which encoding an input string uses. How does it make that judgment? UTF-8 and ASCII have different byte widths, and UTF-8 is variable-length on top of that, so how does it tell whether an extra byte starts the next character or is the last byte of the current one?
Reply content:
Rather than asking how to tell UTF-8 from ASCII, it is better to ask how ASCII itself is recognized. The largest ASCII value is 127 (0x7F), so when scanning, any byte greater than 127 can be assumed to belong to a multibyte encoding. Both GBK and UTF-8 are ASCII-compatible.
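The "greater than 127" test above can be sketched in a few lines of PHP. The function name isPureAscii is made up for this illustration; this is not how mbstring itself is implemented, only the rule described in the reply.

```php
<?php
// Hypothetical helper: true when every byte is in the ASCII range 0x00-0x7F.
function isPureAscii(string $bytes): bool
{
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        if (ord($bytes[$i]) > 0x7F) {
            // High bit set: this byte belongs to some multibyte encoding.
            return false;
        }
    }
    return true;
}

var_dump(isPureAscii("hello"));        // bool(true)
var_dump(isPureAscii("h\xC3\xA9llo")); // bool(false), 0xC3 is greater than 0x7F
```

A string that passes this check reads the same in ASCII, UTF-8 and GBK, which is why the three never need to be told apart for such input.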
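1. In UTF-8, the first byte of each character indicates the total number of bytes in that character (the number of leading 1 bits in the lead byte gives the count, and every following byte has the form 10xxxxxx). Nearly all variable-length data types work this way; a database varchar, for example, stores extra bytes for the length, so it will not be misread.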
2. GBK uses a fixed two bytes for every non-ASCII character: if a byte is not an ASCII character, read it together with the byte that follows.
3. As for how it guesses between UTF-8 and GBK, I do not know. Presumably it uses some algorithm that matches the bytes against each encoding's rules or code table. For more on this, see: http://blog.csdn.net/ecjtuync/article/details/1774429
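The three points above can be put together as a rough sketch. Everything here is an assumption for illustration: the function names are invented, the checks are structural only (overlong forms and code-table lookups are ignored), and nothing claims to reproduce what mb_detect_encoding does internally. The GBK ranges used are lead byte 0x81-0xFE and trail byte 0x40-0xFE excluding 0x7F.

```php
<?php
// Hypothetical helper: byte count announced by a UTF-8 lead byte, 0 if not a lead.
function utf8SequenceLength(int $lead): int
{
    if ($lead <= 0x7F) { return 1; }                  // 0xxxxxxx: ASCII
    if ($lead >= 0xC2 && $lead <= 0xDF) { return 2; } // 110xxxxx
    if ($lead >= 0xE0 && $lead <= 0xEF) { return 3; } // 1110xxxx
    if ($lead >= 0xF0 && $lead <= 0xF4) { return 4; } // 11110xxx
    return 0; // continuation byte (10xxxxxx) or a value never used as a lead
}

// Simplified structural check: each lead byte is followed by the announced
// number of 10xxxxxx continuation bytes.
function looksLikeUtf8(string $bytes): bool
{
    $n = strlen($bytes);
    for ($i = 0; $i < $n; ) {
        $len = utf8SequenceLength(ord($bytes[$i]));
        if ($len === 0 || $i + $len > $n) { return false; }
        for ($k = 1; $k < $len; $k++) {
            if ((ord($bytes[$i + $k]) & 0xC0) !== 0x80) { return false; }
        }
        $i += $len;
    }
    return true;
}

// Simplified structural check: every non-ASCII byte is read with the next byte.
function looksLikeGbk(string $bytes): bool
{
    $n = strlen($bytes);
    for ($i = 0; $i < $n; ) {
        $lead = ord($bytes[$i]);
        if ($lead <= 0x7F) { $i++; continue; }        // single-byte ASCII
        if ($lead < 0x81 || $lead > 0xFE || $i + 1 >= $n) { return false; }
        $trail = ord($bytes[$i + 1]);
        if ($trail < 0x40 || $trail > 0xFE || $trail === 0x7F) { return false; }
        $i += 2;
    }
    return true;
}

// Try the stricter encoding first, then fall back to GBK.
function guessEncoding(string $bytes): string
{
    if (looksLikeUtf8($bytes)) {
        return preg_match('/[\x80-\xFF]/', $bytes) ? 'UTF-8' : 'ASCII';
    }
    return looksLikeGbk($bytes) ? 'GBK' : 'unknown';
}

echo guessEncoding("hello"), "\n";                    // ASCII
echo guessEncoding("\xE4\xB8\xAD\xE6\x96\x87"), "\n"; // UTF-8
echo guessEncoding("\xD6\xD0\xCE\xC4"), "\n";         // GBK
```

The order of the checks matters in this sketch: the UTF-8 bytes in the second example also pass looksLikeGbk, because GBK's pairing rule accepts almost any two high bytes, while the GBK bytes in the third example fail the UTF-8 continuation-byte rule. In real code, mbstring's own mb_check_encoding($bytes, 'UTF-8') performs a stricter validity check than looksLikeUtf8 does.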