What exactly is the relationship between UTF-8 and Unicode? What's the difference?

Source: Internet
Author: User
Tags: character set
UTF-8 = Unicode Transformation Format, 8-bit.
It is a transformation format for Unicode: it converts Unicode code points into a stream of bytes for storage or transmission.

UTF-8 encoding procedure:
Input: unsigned integer c, the code point of the character to be encoded (a Unicode value)
Output: bytes b1, b2, b3, b4, the encoded sequence of bytes (up to four bytes; unused bytes are null)
Algorithm:
if (c < 0x80)
    b1 = ((c >> 0) & 0x7F) | 0x00
    b2 = null
    b3 = null
    b4 = null
else if (c < 0x0800)
    b1 = ((c >> 6) & 0x1F) | 0xC0
    b2 = ((c >> 0) & 0x3F) | 0x80
    b3 = null
    b4 = null
else if (c < 0x010000)
    b1 = ((c >> 12) & 0x0F) | 0xE0
    b2 = ((c >> 6) & 0x3F) | 0x80
    b3 = ((c >> 0) & 0x3F) | 0x80
    b4 = null
else if (c < 0x110000)
    b1 = ((c >> 18) & 0x07) | 0xF0
    b2 = ((c >> 12) & 0x3F) | 0x80
    b3 = ((c >> 6) & 0x3F) | 0x80
    b4 = ((c >> 0) & 0x3F) | 0x80
end if
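
The procedure above can be tried out directly. The following is a minimal Python sketch of the same four branches, not a reference implementation. The function name utf8_encode is made up for illustration, and the rejection of surrogate code points (0xD800 to 0xDFFF) and out-of-range values is an addition that the pseudocode does not contain.

```python
def utf8_encode(c: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if c < 0:
        raise ValueError("code point must be non-negative")
    if c < 0x80:
        # One byte: 0xxxxxxx
        return bytes([c])
    if c < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        # Assumption not in the pseudocode: surrogates are rejected.
        if 0xD800 <= c <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (c >> 12),
            0x80 | ((c >> 6) & 0x3F),
            0x80 | (c & 0x3F),
        ])
    if c < 0x110000:
        # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (c >> 18),
            0x80 | ((c >> 12) & 0x3F),
            0x80 | ((c >> 6) & 0x3F),
            0x80 | (c & 0x3F),
        ])
    raise ValueError("code point out of Unicode range")


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
        print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

Comparing the result with Python's built-in str.encode("utf-8") for the same character is a quick way to check the sketch.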
=====================
Unicode is a coded character set: it assigns a code (a code point) to each character, for example to each Chinese character. In that respect it is similar to GB2312-1980, GB18030 and so on, but the set of characters it covers is different.
=====================
A Unicode code point is converted to a UTF-8 sequence of one, two, three, or four bytes, depending on its value. English letters have code points below 0x80, so each needs only one byte in UTF-8, which is faster to transmit than the two bytes per character of a 16-bit Unicode encoding.
UTF-8 is a re-encoding of Unicode that was designed for transmission.
To convert UTF-8 back to Unicode, run the procedure I gave above in reverse.
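
As a rough illustration of that reverse calculation, here is a Python sketch that decodes the bytes of a single character back to its code point. The name utf8_decode_one is hypothetical, and the sketch is deliberately incomplete: it checks the lead byte, the length, and the continuation bytes, but it does not reject overlong forms, surrogates, or values above 0x10FFFF, which a strict decoder would.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode a UTF-8 sequence holding exactly one character (hypothetical helper)."""
    if not data:
        raise ValueError("empty input")
    b1 = data[0]
    # The lead byte tells us the sequence length and carries the top bits.
    if b1 < 0x80:
        length, c = 1, b1
    elif 0xC0 <= b1 < 0xE0:
        length, c = 2, b1 & 0x1F
    elif 0xE0 <= b1 < 0xF0:
        length, c = 3, b1 & 0x0F
    elif 0xF0 <= b1 < 0xF8:
        length, c = 4, b1 & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    # Each continuation byte (10xxxxxx) contributes six more bits.
    for b in data[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        c = (c << 6) | (b & 0x3F)
    return c


if __name__ == "__main__":
    print(hex(utf8_decode_one(b"\xe4\xb8\xad")))
```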

UTF-8 is a transition solution for moving existing ASCII systems to Unicode. It first guarantees ASCII compatibility and then extends toward the large character set. It is a scheme recommended for Unicode. However, because it approaches the problem from a different angle, it is not a good solution for existing Chinese systems. The following link provides detailed background on UTF-8 encoding: http://www.acnis.com/modules.php?name=ArticlE&file=article&sid=102
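
The ASCII compatibility mentioned here can be checked in a few lines of Python. This is only an illustrative sketch; the helper names same_bytes_as_ascii and all_bytes_high are made up for the example.

```python
def same_bytes_as_ascii(text: str) -> bool:
    """True if pure-ASCII text has identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


def all_bytes_high(ch: str) -> bool:
    """True if every UTF-8 byte of ch has its top bit set (value >= 0x80)."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # ASCII text is unchanged when encoded as UTF-8.
    print(same_bytes_as_ascii("Hello, world"))
    # Bytes of a non-ASCII character never collide with ASCII values.
    print(all_bytes_high("\u4e2d"))
```

Because the bytes of non-ASCII characters all fall outside the ASCII range, software that only looks for ASCII bytes such as "/" or a newline does not find them inside a multi-byte character.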

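What is Unicode? The basic goal of Unicode is to unify all encodings, that is, to contain all character sets. A system then only needs to support Unicode in order to process any of those character sets. Unicode characters are commonly stored as two bytes each (as in UCS-2, or UTF-16 for characters in the Basic Multilingual Plane), although code points range up to 0x10FFFF. The Windows operating system now supports Unicode.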

What is UTF-8? UTF-8 is an encoding of Unicode: its coded character set is the same as Unicode's, but the way the codes are represented as bytes is different. English characters still take one byte each in UTF-8, as usual. Chinese characters, however, take three bytes each (in memory as well), at least for those in the Basic Multilingual Plane.
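
To see these sizes concretely, the sketch below uses Python's built-in codecs to compare how many bytes one character takes in UTF-8 and in UTF-16. The function name encoded_sizes is an illustrative choice, and "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_sizes(ch: str):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


if __name__ == "__main__":
    # An English letter, an accented letter, a Chinese character, an emoji.
    for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
        utf8_len, utf16_len = encoded_sizes(ch)
        print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```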

A disadvantage of UTF-8 compared with fixed-width Unicode storage is that characters have variable length, so in-memory operations such as locating and searching for characters seem to need more complex and less efficient algorithms.
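
One place where this cost shows up is finding the n-th character in a UTF-8 byte string: the position cannot be computed by multiplication, so the bytes have to be scanned. The following Python sketch assumes the input is valid UTF-8, and the name byte_offset_of_char is hypothetical.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset at which character number `index` starts.

    Assumes `data` is valid UTF-8. Runs in time proportional to the offset,
    unlike a fixed-width encoding where the offset is index * width.
    """
    seen = 0
    for offset, b in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character; any other byte starts one.
        if (b & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


if __name__ == "__main__":
    data = "a\u4e2db".encode("utf-8")  # 1 + 3 + 1 bytes
    print([byte_offset_of_char(data, i) for i in range(3)])
```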
