A brief look at Unicode and UTF-8 encoding in PHP
Revisiting Unicode and UTF-8 encoding
Until today, to be exact, I only knew that UTF-8 encoding and Unicode encoding are not the same thing; embarrassingly, I could not say what the difference actually was.
The two are related. Here are the differences between them:
A UTF-8 character has a variable length: 1, 2, or 3 bytes for the characters covered here (code points above U+FFFF take 4 bytes)
A Unicode character has a fixed length of 2 bytes (UCS-2)
UTF-8 and Unicode can be converted to each other
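A quick way to see the variable length is to encode a few characters and count the bytes. This is only a minimal sketch: it assumes an environment that provides the standard TextEncoder (modern browsers, Node.js), and the helper name byteLength is made up for illustration.

```javascript
// Hypothetical helper: number of bytes a string occupies in UTF-8.
// Assumes the standard TextEncoder API is available.
function byteLength(str) {
    return new TextEncoder().encode(str).length;
}

console.log(byteLength('a'));      // 1  (U+0061, range 0000-007F)
console.log(byteLength('\u00e9')); // 2  (U+00E9, range 0080-07FF)
console.log(byteLength('\u6c49')); // 3  (U+6C49, range 0800-FFFF)

// In JavaScript, each of these strings is a single 16-bit unit long.
console.log('\u6c49'.length);      // 1
```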
The relationship between Unicode and UTF-8
Unicode (hexadecimal)    UTF-8 (binary)
0000-007F                0xxxxxxx
0080-07FF                110xxxxx 10xxxxxx
0800-FFFF                1110xxxx 10xxxxxx 10xxxxxx
The table above tells us two things. The obvious one is the correspondence between Unicode ranges and UTF-8 byte formats; the other is how Unicode and UTF-8 are converted to each other:
First, the conversion from UTF-8 to Unicode.
Match the binary form of the UTF-8 encoding against the 3 formats above and remove the fixed bits (the positions in the table that are not x). Then group the remaining bits into 8-bit groups from right to left; if the leftmost group has fewer than 8 bits, pad it with leading zeros until there are 2 full bytes. The result is the Unicode value that corresponds to the UTF-8 encoding. Take a look at the following examples:
The text in the picture is encoded as UTF-8; you can view its hexadecimal representation with WinHex
Character = UTF-8 (hexadecimal) = UTF-8 (binary) = 16 bits with the fixed bits removed = hexadecimal
汉 = e6b189 = 11100110 10110001 10001001 = 01101100 01001001 = 6c49
字 = e5ad97 = 11100101 10101101 10010111 = 01011011 01010111 = 5b57
# The following is the result of running this in the Chrome console
'\u6c49'
"汉"
'\u5b57'
"字"
# At this point, converting from UTF-8 to Unicode is easy. Here is the pseudo-code of the conversion:
Read one byte: 11100101
Determine the format of the UTF-8 character: it matches the third format, 3 bytes
Read 2 more bytes to get 11100101 10101101 10010111
Remove the fixed bits according to the format: 0101 101101 010111
Regroup into 8-bit groups from right to left, padding with zeros on the left if there are fewer than 16 bits: 01011011 01010111 = 5b57
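The steps above can be sketched in JavaScript with bit masks. This is a minimal sketch under a few assumptions: the function name utf8ToCodePoint is hypothetical, it decodes only one character, it handles only the three formats from the table, and it does not check that the continuation bytes really start with 10.

```javascript
// Hypothetical helper: decode ONE UTF-8 sequence (1 to 3 bytes) into a
// Unicode value, following the three formats in the table.
function utf8ToCodePoint(bytes) {
    var b0 = bytes[0];
    if ((b0 & 0x80) === 0x00) {
        // 0xxxxxxx: 7 payload bits
        return b0;
    }
    if ((b0 & 0xE0) === 0xC0) {
        // 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return ((b0 & 0x1F) << 6) | (bytes[1] & 0x3F);
    }
    if ((b0 & 0xF0) === 0xE0) {
        // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return ((b0 & 0x0F) << 12) |
               ((bytes[1] & 0x3F) << 6) |
               (bytes[2] & 0x3F);
    }
    throw new Error('only 1- to 3-byte UTF-8 sequences are handled here');
}

var cp = utf8ToCodePoint([0xE5, 0xAD, 0x97]);
console.log(cp.toString(16));         // 5b57
console.log(String.fromCharCode(cp)); // 字
```

Masking with 0x0F, 0x1F and 0x3F is the "remove the fixed bits" step, and the shifts do the right-to-left regrouping.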
Now look at the conversion from Unicode to UTF-8.
Take 5b57 as the example
Find the Unicode range that contains 5b57: 0800 <= 5b57 <= FFFF, so the UTF-8 encoding of 5b57 has three bytes, in the form 1110xxxx 10xxxxxx 10xxxxxx
Get the binary representation of 5b57: 101101101010111
Fill the x positions of the format from right to left with the bits from the previous step, padding any x positions left over with 0, to get the UTF-8 encoding 11100101 10101101 10010111
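The reverse direction can be sketched the same way. Again this is only an illustration under stated assumptions: codePointToUtf8 is a made-up name, it covers only the range 0000-FFFF from the table, and it does not reject the surrogate range D800-DFFF.

```javascript
// Hypothetical helper: encode a Unicode value (0000-FFFF) as UTF-8 bytes.
function codePointToUtf8(cp) {
    if (cp < 0x0080) {
        // 0xxxxxxx
        return [cp];
    }
    if (cp < 0x0800) {
        // 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
    }
    if (cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return [
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)
        ];
    }
    throw new Error('only values up to FFFF are handled here');
}

var bytes = codePointToUtf8(0x5B57);
var hex = bytes.map(function (b) { return b.toString(16); }).join(' ');
console.log(hex); // e5 ad 97
```

The OR with 0xE0, 0xC0 and 0x80 puts the fixed bits back, and the shifts take the payload bits from right to left, 6 at a time.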
The problem
Here is what caused today's problem. The front end accepts a number of words as input, and each word may take up at most 30 bytes in UTF-8, so the length has to be validated on both the front end and the back end. JavaScript strings are Unicode (16-bit units), while the back-end program works with UTF-8. The current solution is this:
Front end
// Count the number of bytes str would occupy in UTF-8
function utf8_bytes(str) {
    var len = 0, unicode;
    for (var i = 0; i < str.length; i++) {
        unicode = str.charCodeAt(i); // 16-bit value of the character at i
        if (unicode < 0x0080) {
            ++len;
        } else if (unicode < 0x0800) {
            len += 2;
        } else if (unicode <= 0xFFFF) {
            len += 3;
        } else {
            throw 'characters must be UCS-2!!';
        }
    }
    return len;
}
// Examples
utf8_bytes('asdasdas') // 8
utf8_bytes('yrt燕燕燕') // 12
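One caveat about the front-end function: charCodeAt returns 16-bit units, so a character outside the UCS-2 range arrives as two surrogate units and would be counted as 3 + 3 bytes, while UTF-8 actually uses 4 bytes for it. If such input matters, a variant along the following lines could be used. It is only a sketch: utf8ByteLength is a hypothetical name, and it assumes the standard String.prototype.codePointAt is available.

```javascript
// Hypothetical variant: UTF-8 byte count that also handles characters
// above U+FFFF. Assumes String.prototype.codePointAt (ES2015) exists.
function utf8ByteLength(str) {
    var len = 0;
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp < 0x0080) {
            len += 1;
        } else if (cp < 0x0800) {
            len += 2;
        } else if (cp <= 0xFFFF) {
            len += 3;
        } else {
            len += 4;
            i++; // skip the second half of the surrogate pair
        }
    }
    return len;
}

console.log(utf8ByteLength('asdasdas'));     // 8
console.log(utf8ByteLength('\u6c49\u5b57')); // 6
console.log(utf8ByteLength('\ud83d\ude00')); // 4 (U+1F600)

// Applying the 30-byte limit from the problem description
console.log(utf8ByteLength('\u6c49\u5b57') <= 30); // true
```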
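Back end
# For a GBK string: convert to UTF-8 first, then count the bytes
$len = ceil(strlen(bin2hex(iconv('GBK', 'UTF-8', $word))) / 2);
# For a UTF-8 string: bin2hex() yields two hex digits per byte, so half the hex length is the byte count
$len = ceil(strlen(bin2hex($word)) / 2);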
That is all for this article; I hope you find it useful.