A brief introduction to Unicode and UTF-8 encoding in PHP
This article takes a fresh look at Unicode and UTF-8 encoding: how the two differ, how to convert between them, and how that affects length checks in JavaScript and PHP.
Understanding Unicode and UTF-8 encoding
It was only today, in fact only just now, that I learned that UTF-8 encoding and Unicode encoding are not the same thing.
They are different, but closely related. Let's look at the differences:
UTF-8 length: variable; 1, 2, or 3 bytes for the characters covered here (code points up to FFFF)
Unicode length: 2 bytes (UCS-2)
UTF-8 can be converted to Unicode, and Unicode to UTF-8
Unicode and UTF-8
Unicode (hexadecimal)    UTF-8 (binary)
0000-007F                0xxxxxxx
0080-07FF                110xxxxx 10xxxxxx
0800-FFFF                1110xxxx 10xxxxxx 10xxxxxx
The table above tells us two things. The obvious one is which Unicode range corresponds to which UTF-8 byte length; the other is how to convert between Unicode and UTF-8:
First, the conversion from UTF-8 to Unicode.
Match the binary form of the UTF-8 bytes against the three formats above and drop the fixed bits (the non-x positions in the table). Then group the remaining bits into 8-bit groups from right to left, padding the leftmost group with zeros until there are 2 bytes (16 bits). Those 16 bits are the Unicode encoding that corresponds to the UTF-8 character. Take a look at the following example:
The example text, the two characters 汉字, is saved as UTF-8; you can use WinHex to view its hexadecimal representation.
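If a hex editor is not at hand, the same bytes can be inspected from JavaScript. The following is only a sketch: `describeChar` is a hypothetical helper name, and it assumes a runtime that provides a global `TextEncoder` (modern browsers and recent Node.js versions).

```javascript
// Hypothetical helper: show the UTF-8 bytes and the Unicode code point
// of a single character, both as uppercase hexadecimal.
// Assumes a global TextEncoder is available.
function describeChar(ch) {
    var bytes = new TextEncoder().encode(ch);   // UTF-8 bytes as a Uint8Array
    var utf8Hex = Array.from(bytes, function (b) {
        return b.toString(16).toUpperCase().padStart(2, '0');
    }).join('');
    var unicodeHex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
    return { utf8: utf8Hex, unicode: unicodeHex };
}

console.log(describeChar('汉'));  // { utf8: 'E6B189', unicode: '6C49' }
console.log(describeChar('字'));  // { utf8: 'E5AD97', unicode: '5B57' }
```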
The code is as follows:
Character => UTF-8 => UTF-8 binary => remove the fixed bits and regroup into 16-bit binary => hexadecimal
汉 => E6B189 => 11100110 10110001 10001001 => 01101100 01001001 => 6C49
字 => E5AD97 => 11100101 10101101 10010111 => 01011011 01010111 => 5B57
# The following is the result of running in the Chrome console.
'\u6c49'
"汉"
'\u5B57'
"字"
# At this point, converting from UTF-8 to Unicode is easy. Here is the conversion in pseudocode:
Read one byte: 11100101
Determine the format of the UTF-8 character from that byte: the third format, 3 bytes
Read 2 more bytes to get 11100101 10101101 10010111
Remove the fixed bits according to the format: 0101 101101 010111
Regroup into 16 bits, padding with 0 on the left if there are fewer: 01011011 01010111 => 5B57
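The pseudocode above can be sketched in JavaScript roughly as follows. `utf8ToCodePoint` is a hypothetical name, the input is assumed to be an array of byte values, and only the three formats from the table are handled. Continuation bytes are not validated, so treat this as an illustration rather than a complete decoder.

```javascript
// Hypothetical helper: decode one UTF-8 sequence (1 to 3 bytes) that starts
// at `offset` in `bytes` and return the Unicode code point.
function utf8ToCodePoint(bytes, offset) {
    offset = offset || 0;
    var b0 = bytes[offset];
    if ((b0 & 0x80) === 0x00) {          // 0xxxxxxx
        return b0;
    }
    if ((b0 & 0xE0) === 0xC0) {          // 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (bytes[offset + 1] & 0x3F);
    }
    if ((b0 & 0xF0) === 0xE0) {          // 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) |
               ((bytes[offset + 1] & 0x3F) << 6) |
               (bytes[offset + 2] & 0x3F);
    }
    throw new Error('only 1- to 3-byte UTF-8 sequences are handled in this sketch');
}

// The masks (0x1F, 0x0F, 0x3F) strip the fixed bits; the shifts regroup the x bits.
console.log(utf8ToCodePoint([0xE5, 0xAD, 0x97]).toString(16));  // 5b57
console.log(utf8ToCodePoint([0xE6, 0xB1, 0x89]).toString(16));  // 6c49
```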
Now let's look at the conversion from Unicode to UTF-8.
The code is as follows:
5B57
Find the Unicode range of 5B57: 0800 <= 5B57 <= FFFF, so the UTF-8 form of 5B57 has three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx
Get the binary form of 5B57: 101101101010111
Fill the x positions of the format from right to left with those bits, padding any remaining x with 0: 11100101 10101101 10010111
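The reverse direction can be sketched the same way. `codePointToUtf8` is a hypothetical name, and the function covers only code points up to FFFF, matching the table. Filling the x positions from right to left corresponds to taking the low 6 bits for each continuation byte and shifting.

```javascript
// Hypothetical helper: encode a Unicode code point (0000-FFFF) as an array
// of UTF-8 byte values, following the three formats in the table.
function codePointToUtf8(cp) {
    if (cp < 0x0080) {                       // 0xxxxxxx
        return [cp];
    }
    if (cp < 0x0800) {                       // 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)];
    }
    if (cp <= 0xFFFF) {                      // 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)];
    }
    throw new Error('code points above FFFF are not handled in this sketch');
}

// Format the bytes as uppercase hex for comparison with the worked example.
var hex = codePointToUtf8(0x5B57).map(function (b) {
    return b.toString(16).toUpperCase();
}).join(' ');
console.log(hex);  // E5 AD 97
```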
The problem
Now for what prompted all this. The front end accepts a lot of words as input, and each word may be at most 30 bytes in UTF-8, so the length has to be validated on both the front end and the back end. JavaScript strings use Unicode encoding (16-bit code units), while the back-end program uses UTF-8 encoding. The current solution is:
Front end
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
Function utf8_bytes (str) { Var len = 0, unicode; For (var I = 0; I <str. length; I ++) { Unicode = str. charCodeAt (I ); If (unicode <0 x0080 ){ ++ Len; } Else if (unicode <0x0800 ){ Len + = 2; } Else if (unicode <= 0 xFFFF ){ Len + = 3; } Else { Throw "characters must be USC-2 !! " } } Return len; } # Example Utf8_bytes ('asdasdas ') 8 Utf8_bytes ('yrt Yan Ruitao ') 12 |
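As a cross-check, the byte count can also be obtained from the platform's own UTF-8 encoder instead of by hand. This is an alternative technique, not the one used above: it assumes a global `TextEncoder` is available, and `utf8ByteLength`, `isWithinByteLimit`, and `MAX_WORD_BYTES` are hypothetical names. Unlike the hand-written version, it counts characters outside UCS-2 as 4 bytes rather than rejecting them.

```javascript
// The 30-byte limit comes from the problem description; the name is illustrative.
var MAX_WORD_BYTES = 30;

// Count UTF-8 bytes by actually encoding the string.
function utf8ByteLength(str) {
    return new TextEncoder().encode(str).length;
}

// Front-end check: does the word fit within the UTF-8 byte limit?
function isWithinByteLimit(word, maxBytes) {
    return utf8ByteLength(word) <= maxBytes;
}

console.log(utf8ByteLength('asdasdas'));                                  // 8
console.log(isWithinByteLimit('汉字汉字汉字汉字汉字', MAX_WORD_BYTES));   // true  (10 x 3 = 30 bytes)
console.log(isWithinByteLimit('汉字汉字汉字汉字汉字a', MAX_WORD_BYTES));  // false (31 bytes)
```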
Back end
# bin2hex() produces two hex digits per byte, so half its length is the byte count
# For GBK strings
$len = ceil(strlen(bin2hex(iconv('gbk', 'utf-8', $word))) / 2);
# For UTF-8 strings
$len = ceil(strlen(bin2hex($word)) / 2);
That is all for this article. I hope you like it.