Re-understanding Unicode and UTF-8 encoding
To be exact, it was only today that I realized UTF-8 encoding and Unicode encoding are not the same thing, which is a little embarrassing.
The two are related, but here is how they differ:
The length of a UTF-8 encoded character is not fixed: it may be 1, 2, or 3 bytes (for characters in the UCS-2 range)
A Unicode character is always 2 bytes long (UCS-2)
UTF-8 and Unicode can be converted to and from each other
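To make the variable length concrete, here is a minimal sketch. It assumes an environment that provides the standard TextEncoder API (modern browsers and Node.js); the helper name utf8Length is made up for illustration and is not from the original article.

```javascript
// Count the bytes a string occupies once it is encoded as UTF-8.
// utf8Length is a hypothetical helper name used only for this sketch.
function utf8Length(str) {
    // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
    return new TextEncoder().encode(str).length;
}

console.log(utf8Length('A'));      // 1 (U+0041, range 0000-007F)
console.log(utf8Length('\u00e9')); // 2 (U+00E9, range 0080-07FF)
console.log(utf8Length('\u6c49')); // 3 (U+6C49, range 0800-FFFF)
console.log('\u6c49'.length);      // 1 (a single 2-byte code unit in JavaScript)
```

The same character is therefore one 2-byte unit inside a JavaScript string but three bytes once written out as UTF-8.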
The relationship between Unicode and UTF-8
Unicode (hexadecimal)    UTF-8 (binary)
0000-007F                0xxxxxxx
0080-07FF                110xxxxx 10xxxxxx
0800-FFFF                1110xxxx 10xxxxxx 10xxxxxx
The table above tells us two things. The obvious one is how Unicode ranges correspond to UTF-8 byte forms; the other is how Unicode and UTF-8 are converted to each other:
First, the conversion from UTF-8 to Unicode
Match the UTF-8 encoded bytes against the three forms above. Remove the fixed bits (the non-x positions in the table), then group the remaining bits into 8-bit groups from right to left, padding with zeros on the left when a group has fewer than 8 bits, up to 2 bytes in total. These bits are the Unicode code corresponding to the UTF-8 encoding. Take a look at the following examples:
The text in the picture above is encoded as UTF-8; you can view its hexadecimal representation with WinHex
Character => UTF-8 (hex) => UTF-8 binary => 16-bit binary after removing the fixed bits => hexadecimal
汉 => e6b189 => 11100110 10110001 10001001 => 01101100 01001001 => 6c49
字 => e5ad97 => 11100101 10101101 10010111 => 01011011 01010111 => 5b57
# The following are the results of running this in the Chrome console
'\u6c49'
"汉"
'\u5b57'
"字"
# At this point, converting from UTF-8 to Unicode is easy. Here is the pseudocode for the conversion:
Read one byte: 11100101
Determine the form of the UTF-8 character: it matches the third form, 3 bytes
Read 2 more bytes to get 11100101 10101101 10010111
Remove the fixed bits according to the form: 1011011 01010111
Fewer than 16 bits, so pad with zeros on the left: 01011011 01010111 => 5b57
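The steps above can be sketched in JavaScript roughly as follows. This is a minimal sketch limited to the three forms in the table; the name decodeUtf8Char is made up for illustration, and the code does not check that continuation bytes really start with 10.

```javascript
// Decode one UTF-8 sequence (1 to 3 bytes) starting at bytes[offset]
// and return the Unicode code as a number.
// decodeUtf8Char is a hypothetical name used only for this sketch.
function decodeUtf8Char(bytes, offset) {
    offset = offset || 0;
    var b0 = bytes[offset];
    if ((b0 & 0x80) === 0x00) {
        // 0xxxxxxx: the 7 low bits are the code itself
        return b0;
    }
    if ((b0 & 0xE0) === 0xC0) {
        // 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return ((b0 & 0x1F) << 6) | (bytes[offset + 1] & 0x3F);
    }
    if ((b0 & 0xF0) === 0xE0) {
        // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return ((b0 & 0x0F) << 12) |
               ((bytes[offset + 1] & 0x3F) << 6) |
               (bytes[offset + 2] & 0x3F);
    }
    throw new Error('this sketch only handles 1- to 3-byte sequences');
}

console.log(decodeUtf8Char([0xE5, 0xAD, 0x97]).toString(16)); // 5b57
console.log(decodeUtf8Char([0xE6, 0xB1, 0x89]).toString(16)); // 6c49
```

Masking with 0x1F, 0x0F and 0x3F is the "remove the fixed bits" step; the shifts put the remaining bits back together, which pads the result with zeros on the left automatically.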
Now look at the conversion from Unicode to UTF-8.
5b57
Find the Unicode range that 5b57 falls in: 0800 <= 5b57 <= FFFF, so 5b57 takes three bytes in the form 1110xxxx 10xxxxxx 10xxxxxx
Get the binary code of 5b57: 101101101010111
Fill the x positions of the form from right to left with the bits from the previous step (any x left over becomes 0), which gives the UTF-8 code 11100101 10101101 10010111
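The reverse direction can be sketched the same way. This is only an illustration for codes up to FFFF; encodeUtf8Char and toHex are hypothetical helper names, not part of the original article.

```javascript
// Encode one Unicode code (0000-FFFF) as an array of UTF-8 bytes.
// encodeUtf8Char is a hypothetical name used only for this sketch.
function encodeUtf8Char(code) {
    if (code < 0x80) {
        // 0xxxxxxx
        return [code];
    }
    if (code < 0x800) {
        // 110xxxxx 10xxxxxx
        return [0xC0 | (code >> 6), 0x80 | (code & 0x3F)];
    }
    if (code <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return [
            0xE0 | (code >> 12),
            0x80 | ((code >> 6) & 0x3F),
            0x80 | (code & 0x3F)
        ];
    }
    throw new Error('this sketch only handles codes up to FFFF');
}

// Format a byte array as a lowercase hex string, two digits per byte.
function toHex(bytes) {
    return bytes.map(function (b) {
        return ('0' + b.toString(16)).slice(-2);
    }).join('');
}

console.log(toHex(encodeUtf8Char(0x5B57))); // e5ad97
console.log(toHex(encodeUtf8Char(0x6C49))); // e6b189
```

Taking the low 6 bits first and shifting for the rest corresponds to filling the form from right to left.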
About the problem
Back to what caused today's problem: the front end accepts a lot of words as input, and each word may be at most 30 bytes in UTF-8, so the length has to be validated on both the front end and the back end. JavaScript strings use Unicode encoding while the back-end program uses UTF-8, so the current solution is this:
Front end
function utf8_bytes(str)
{
    var len = 0, unicode;
    for (var i = 0; i < str.length; i++)
    {
        // charCodeAt returns one 16-bit code unit (0x0000-0xFFFF)
        unicode = str.charCodeAt(i);
        if (unicode < 0x0080) {
            ++len;          // 1 byte in UTF-8
        } else if (unicode < 0x0800) {
            len += 2;       // 2 bytes in UTF-8
        } else if (unicode <= 0xFFFF) {
            len += 3;       // 3 bytes in UTF-8
        } else {
            throw "characters must be UCS-2!!";
        }
    }
    return len;
}
# Example
utf8_bytes('Asdasdas')
8
utf8_bytes (' Yrt Yan ')
12
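One caveat: charCodeAt returns 16-bit code units, so it never returns a value above 0xFFFF. A character outside the UCS-2 range is stored as two code units (a surrogate pair), and the function above counts it as 3 + 3 = 6 bytes, while UTF-8 uses 4. If such input is possible, a variant along these lines may be closer to what the back end sees. This is only a sketch; the name utf8BytesByCodePoint is made up, and it assumes an ES2015 environment that provides String.prototype.codePointAt.

```javascript
// Count UTF-8 bytes by walking the string one code point at a time.
// utf8BytesByCodePoint is a hypothetical name used only for this sketch.
function utf8BytesByCodePoint(str) {
    var len = 0;
    for (var i = 0; i < str.length; i++) {
        // codePointAt combines a surrogate pair into a single value
        var code = str.codePointAt(i);
        if (code < 0x0080) {
            len += 1;
        } else if (code < 0x0800) {
            len += 2;
        } else if (code <= 0xFFFF) {
            len += 3;
        } else {
            len += 4; // 4 bytes in UTF-8 for codes above FFFF
            i++;      // skip the second half of the surrogate pair
        }
    }
    return len;
}

console.log(utf8BytesByCodePoint('\u6c49\u5b57')); // 6
console.log(utf8BytesByCodePoint('\uD83D\uDE00')); // 4 (U+1F600)
```

Where TextEncoder is available, new TextEncoder().encode(str).length can serve as a cross-check for the same count.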
Back end
# For a GBK string
$len = ceil(strlen(bin2hex(iconv('GBK', 'UTF-8', $word))) / 2);
# For a UTF-8 string
$len = ceil(strlen(bin2hex($word)) / 2);
That is all for this article; I hope you find it useful.