A brief look at Unicode and UTF-8 encoding

Last Update:2017-01-19 Source: Internet

Author: User

Revisiting Unicode and UTF-8 encoding

Embarrassingly, it was only today that I realized UTF-8 encoding and Unicode encoding are not the same thing.
The two are related, but first look at how they differ:
The length of a UTF-8 character is not fixed: it may be 1, 2, or 3 bytes for characters in the UCS-2 range (characters beyond that range take 4 bytes)
A Unicode character, taken here to mean UCS-2, is always 2 bytes long
UTF-8 and Unicode can be converted to each other

The relationship between Unicode and UTF-8

Unicode (hexadecimal)    UTF-8 (binary)
0000-007F                0xxxxxxx
0080-07FF                110xxxxx 10xxxxxx
0800-FFFF                1110xxxx 10xxxxxx 10xxxxxx

The table above can be read in two ways. The obvious one is the correspondence between Unicode ranges and UTF-8 byte formats. The other is that it shows how Unicode and UTF-8 are converted to each other.
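As a rough illustration of the first reading of the table, the sketch below maps a UCS-2 code to the row it falls in. The function name `utf8Template` and the shape of its return value are made up for this example and are not part of the original article.

```javascript
// Hypothetical helper: find the table row for a UCS-2 code (0x0000-0xFFFF).
// Returns the number of UTF-8 bytes and the bit template of that row.
function utf8Template(code) {
  if (!Number.isInteger(code) || code < 0 || code > 0xFFFF) {
    throw new RangeError('code must be an integer in 0x0000-0xFFFF');
  }
  if (code <= 0x007F) {
    return { bytes: 1, template: '0xxxxxxx' };
  }
  if (code <= 0x07FF) {
    return { bytes: 2, template: '110xxxxx 10xxxxxx' };
  }
  return { bytes: 3, template: '1110xxxx 10xxxxxx 10xxxxxx' };
}

console.log(utf8Template(0x0041).bytes); // 1
console.log(utf8Template(0x00E9).bytes); // 2
console.log(utf8Template(0x6C49).bytes); // 3
```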

First, converting UTF-8 to Unicode

The binary form of a UTF-8 encoded character matches one of the 3 formats above. Once the format is matched, remove the fixed bits (the non-x positions in the table), then split the remaining bits into groups of 8 from right to left, padding on the left with zeros when a group has fewer than 8 bits, until 2 bytes are filled. These 16 bits are the Unicode code that corresponds to the UTF-8 character. Take a look at the following examples:

The text in the picture above is encoded as UTF-8; its hexadecimal representation can be viewed with WinHex.

Character => UTF-8 (hex) => UTF-8 (binary) => fixed bits removed, as 16-bit binary => Unicode (hex)

汉 => e6b189 => 11100110 10110001 10001001 => 01101100 01001001 => 6c49
字 => e5ad97 => 11100101 10101101 10010111 => 01011011 01010111 => 5b57

# The following is the result of running this in the Chrome console
'\u6c49'
"汉"
'\u5b57'
"字"
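The byte values in the examples above can also be checked without a hex editor. The sketch below uses the standard `TextEncoder` API, available in current browsers and in Node.js, to print the UTF-8 bytes of a string as hexadecimal. The helper name `utf8Hex` is made up for this example.

```javascript
// Return the UTF-8 bytes of a string as lowercase hex, e.g. "e6b189".
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

console.log(utf8Hex('\u6c49')); // e6b189
console.log(utf8Hex('\u5b57')); // e5ad97

// The Unicode (UCS-2) code of the same character, for comparison.
console.log('\u6c49'.charCodeAt(0).toString(16)); // 6c49
```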

At this point, converting from UTF-8 to Unicode is straightforward. Here is the pseudocode for the conversion:
Read one byte: 11100101
Determine the UTF-8 format of the character: it matches the third format, 3 bytes
Read 2 more bytes to get 11100101 10101101 10010111
Remove the fixed bits according to the format: 0101 101101 010111
Regroup the bits into groups of 8 from right to left, left-padding with zeros if there are fewer than 16 bits: 01011011 01010111 => 5b57
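The steps above can be sketched in JavaScript roughly as follows. This is a minimal illustration that handles only the three formats in the table; it does not reject overlong encodings or handle 4-byte sequences. The names `decodeUtf8Char` and `continuation` are invented for this example.

```javascript
// Decode one UTF-8 sequence (1 to 3 bytes) starting at bytes[offset].
// Returns the Unicode (UCS-2) code and the number of bytes consumed.
function decodeUtf8Char(bytes, offset = 0) {
  const b0 = bytes[offset];
  if (b0 === undefined) {
    throw new RangeError('no byte at the given offset');
  }
  if ((b0 & 0x80) === 0x00) { // 0xxxxxxx
    return { code: b0, length: 1 };
  }
  if ((b0 & 0xE0) === 0xC0) { // 110xxxxx 10xxxxxx
    const b1 = continuation(bytes, offset + 1);
    return { code: ((b0 & 0x1F) << 6) | b1, length: 2 };
  }
  if ((b0 & 0xF0) === 0xE0) { // 1110xxxx 10xxxxxx 10xxxxxx
    const b1 = continuation(bytes, offset + 1);
    const b2 = continuation(bytes, offset + 2);
    return { code: ((b0 & 0x0F) << 12) | (b1 << 6) | b2, length: 3 };
  }
  throw new RangeError('lead byte does not match a 1-3 byte UTF-8 format');
}

// Check the fixed "10" prefix of a continuation byte and return its 6 x bits.
function continuation(bytes, index) {
  const b = bytes[index];
  if (b === undefined || (b & 0xC0) !== 0x80) {
    throw new RangeError('expected a continuation byte of the form 10xxxxxx');
  }
  return b & 0x3F;
}

console.log(decodeUtf8Char([0xE5, 0xAD, 0x97]).code.toString(16)); // 5b57
console.log(decodeUtf8Char([0xE6, 0xB1, 0x89]).code.toString(16)); // 6c49
```

The bit masks do the "remove the fixed bits" step, and the shifts place the remaining bits right to left, so no explicit zero padding is needed.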

Next, converting Unicode to UTF-8

5b57
Find the Unicode range that contains 5b57: 0800 <= 5b57 <= FFFF, so 5b57 takes three bytes in the form 1110xxxx 10xxxxxx 10xxxxxx
Get the binary code of 5b57: 101101101010111 (0101101101010111 when padded to 16 bits)
Fill the x positions of the format with these bits from right to left to get the UTF-8 code 11100101 10101101 10010111
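The reverse direction can be sketched the same way. This is an illustrative version limited to the UCS-2 range covered by the table; it does not treat the surrogate range D800-DFFF specially, and the name `encodeUtf8Char` is invented for this example.

```javascript
// Encode one Unicode (UCS-2) code, 0x0000-0xFFFF, as an array of UTF-8 bytes.
function encodeUtf8Char(code) {
  if (!Number.isInteger(code) || code < 0 || code > 0xFFFF) {
    throw new RangeError('code must be an integer in 0x0000-0xFFFF');
  }
  if (code <= 0x007F) { // 0xxxxxxx
    return [code];
  }
  if (code <= 0x07FF) { // 110xxxxx 10xxxxxx
    return [0xC0 | (code >> 6), 0x80 | (code & 0x3F)];
  }
  // 1110xxxx 10xxxxxx 10xxxxxx
  return [
    0xE0 | (code >> 12),         // top 4 bits
    0x80 | ((code >> 6) & 0x3F), // middle 6 bits
    0x80 | (code & 0x3F),        // low 6 bits
  ];
}

// Print the bytes as hex for comparison with the worked example.
const hex = (bytes) => bytes.map((b) => b.toString(16)).join(' ');
console.log(hex(encodeUtf8Char(0x5B57))); // e5 ad 97
console.log(hex(encodeUtf8Char(0x6C49))); // e6 b1 89
```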

The problem

Back to what prompted all this today. The front end accepts a large number of words as input, and each word may take up at most 30 bytes in UTF-8, so the length has to be validated on both the front end and the back end. JavaScript strings are Unicode (UTF-16 code units, which match UCS-2 for codes up to FFFF), while the back-end program uses UTF-8. The current solution is as follows.

Front end

// Count the number of bytes a string occupies when encoded as UTF-8.
function utf8_bytes(str)
{
    var len = 0, unicode;
    for (var i = 0; i < str.length; i++)
    {
        // charCodeAt() returns a UTF-16 code unit in the range 0x0000-0xFFFF
        unicode = str.charCodeAt(i);
        if (unicode < 0x0080) {
            ++len;
        } else if (unicode < 0x0800) {
            len += 2;
        } else if (unicode <= 0xFFFF) {
            len += 3;
        } else {
            throw "characters must be UCS-2!!";
        }
    }
    return len;
}

# Examples
utf8_bytes('asdasdas')
8
utf8_bytes('yrt汉字汉')
12
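As a cross-check for the function above, the byte count can also be obtained from the standard `TextEncoder` API, which additionally handles characters outside UCS-2 (4 bytes in UTF-8). The sketch below is an alternative, not the approach used in this article; the names `utf8ByteLength`, `isWordWithinLimit`, and `MAX_WORD_BYTES` are made up for the example, and only the 30-byte limit comes from the text.

```javascript
// Limit per word, taken from the requirement described in the article.
const MAX_WORD_BYTES = 30;

// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// Hypothetical validation helper for a single word.
function isWordWithinLimit(word) {
  return utf8ByteLength(word) <= MAX_WORD_BYTES;
}

console.log(utf8ByteLength('asdasdas'));     // 8
console.log(utf8ByteLength('\u6c49\u5b57')); // 6
console.log(utf8ByteLength('\u{1F600}'));    // 4 (outside UCS-2)

console.log(isWordWithinLimit('\u6c49'.repeat(10))); // true  (30 bytes)
console.log(isWordWithinLimit('\u6c49'.repeat(11))); // false (33 bytes)
```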

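Back end

# For a GBK string: convert to UTF-8 first, then count the bytes
# (bin2hex() emits two hex digits per byte, so dividing by 2 gives the byte count)
$len = ceil(strlen(bin2hex(iconv('GBK', 'UTF-8', $word))) / 2);
# For a UTF-8 string
$len = ceil(strlen(bin2hex($word)) / 2);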

That is all for this article; I hope you find it useful.

This article is an English version of an article originally written in Chinese on aliyun.com.