Basics of UTF-8 Character Set

Source: Internet
Author: User
Tags: control characters, printable characters

Brief character set history

Of all character sets, the best known is the 7-bit ASCII character set. ASCII stands for American Standard Code for Information Interchange, and it was designed for communication in American English. It consists of 128 characters: uppercase and lowercase letters, the digits 0-9, punctuation marks, and non-printable control characters such as line feed, tab, backspace, and bell.

However, because ASCII was designed for English, problems arise when handling European text that uses accented letters (diacritical marks, comparable to the tone marks in Chinese pinyin). A number of extended ASCII character sets with 256 characters were therefore created. One of the most common is the IBM character set, which uses the values 128-255 for box-drawing and line-drawing characters as well as some special European characters. Another 8-bit character set is ISO 8859-1, also known as ISO Latin-1. It uses values in the range 128-255 to encode the additional letters of languages written in the Latin alphabet.

European languages are not the only languages in the world, and 8-bit character sets cannot support Asian and African languages. Chinese writing alone has more than 80,000 characters. However, by unifying similar characters shared by Chinese, Japanese, and Vietnamese, so that one code stands for the corresponding character in each language, two bytes were considered enough to encode the text of almost every region of the world. This led to the creation of Unicode. It extends the ISO Latin-1 character set by adding a high byte; when the high byte is 0, the low byte is an ISO Latin-1 character. Unicode supports the scripts of Europe, Africa, the Middle East, and Asia (including the unified Han characters of East Asia and Korea). The original 16-bit Unicode did not yet support scripts such as Braille, Cherokee, Ethiopic, Khmer, Mongolian, Hmong, Tai Lu, and Tai Mau, nor ancient scripts such as Ahom, Akkadian, Aramaic, Babylonian cuneiform, Balti, Brahmi, Etruscan, Hittite, Javanese, Numidian, Old Persian cuneiform, and Syriac. Later versions of Unicode grew beyond 16 bits, and many of these scripts have since been added.

It turned out that 2-byte Unicode is inefficient for characters that can be expressed in ASCII: it takes twice the space of ASCII, and the high byte of every ASCII character is a useless 0. To solve this problem, several intermediate encodings emerged. They are called Unicode Transformation Formats (UTF). The existing UTF formats include UTF-7, UTF-7.5, UTF-8, UTF-16, and UTF-32. This article covers the basics of UTF-8.

UTF-8 Character Set

UTF-8 is a variable-length character encoding of Unicode, created in 1992 by Ken Thompson. It is now standardized as RFC 3629. As originally designed, UTF-8 encodes a Unicode character in 1 to 6 bytes; RFC 3629 restricts it to at most 4 bytes, which is enough for every code point up to U+10FFFF. A character that takes 2 bytes in Unicode may need 3 bytes in UTF-8, and a character that takes 4 bytes in Unicode may need up to 6 bytes in the original scheme. That may look like a lot of characters encoded in four or more bytes, but such characters are rarely encountered in practice.

The UTF-8 conversion table is as follows:

Unicode (hex)        UTF-8 (binary)
00000000-0000007F    0xxxxxxx
00000080-000007FF    110xxxxx 10xxxxxx
00000800-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
00010000-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000-03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000-7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Unicode characters in the ASCII range are encoded in 1 byte, and their UTF-8 representation is identical to their ASCII representation. All other Unicode characters need at least 2 bytes in UTF-8. Every byte of a multi-byte sequence starts with marker bits. The first byte starts with a unique marker made of N one-bits followed by a single zero-bit, where N is the number of bytes used to encode the character. Each following byte starts with the bits 10.
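The marker bits described above can be read back with a simple loop. The following is a minimal sketch, not a complete validator; the function name utf8SequenceLength is hypothetical and chosen only for this illustration. It counts the leading one-bits of the first byte to find the length of the sequence in the original 1-to-6-byte scheme.

```cpp
#include <cstdint>

// Returns the length of a UTF-8 sequence, judged from its first byte alone.
// Returns 0 for a continuation byte (10xxxxxx) and for 0xFE/0xFF, which
// never start a sequence. (Hypothetical helper for illustration.)
int utf8SequenceLength(std::uint8_t lead)
{
    if (lead < 0x80)
        return 1;                       // 0xxxxxxx: plain ASCII

    // Count the leading one-bits: N ones followed by a zero means N bytes.
    int n = 0;
    for (unsigned mask = 0x80; mask != 0 && (lead & mask); mask >>= 1)
        ++n;

    return (n >= 2 && n <= 6) ? n : 0;  // 1 one-bit is a continuation byte
}
```

For instance, the byte C3 (11000011) starts a 2-byte sequence and EF (11101111) starts a 3-byte sequence, while 8A (10001010) can only appear inside a sequence.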

Example

Encoding Unicode U+00CA (11001010) into UTF-8 takes 2 bytes:

U+00CA -> C3 8A

1100 1010
110xxxxx 10xxxxxx

1100 1010 -> 110xxxxx 10xxxxxx
          -> 110xxxxx 10xxxxx0
          -> 110xxxxx 10xxxx10
          -> 110xxxxx 10xxx010
          -> 110xxxxx 10xx1010
          -> 110xxxxx 10x01010
          -> 110xxxxx 10001010
          -> 110xxxx1 10001010
          -> 110xxx11 10001010
          -> 11000011 10001010
          -> C3 8A

Encoding Unicode U+F03F (11110000 00111111) into UTF-8 takes 3 bytes:

U+F03F -> EF 80 BF

1111 0000 0011 1111 -> 1110xxxx 10xxxxxx 10xxxxxx
                    -> 11101111 10000000 10111111
                    -> EF 80 BF

Note: as the analysis above shows, converting Unicode to UTF-8 means first determining the number of bytes the encoding needs, then filling the positions marked x with the bits of the Unicode value, from the lowest bit to the highest. Any x positions left over at the high end are filled with 0. This is my own understanding; if you find an error, corrections are welcome.
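The procedure in the note can be sketched for a single code point as follows. This is only an illustration under stated assumptions: the name encodeCodePoint is hypothetical, and the sketch follows the 4-byte limit of RFC 3629 rather than the original 6-byte scheme. It first picks the byte count, then fills the continuation bytes from the lowest 6 bits upward, and finally writes what remains into the first byte.

```cpp
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-8 (at most 4 bytes, as in RFC 3629).
// Returns an empty vector for values above U+10FFFF.
// (Hypothetical helper for illustration.)
std::vector<std::uint8_t> encodeCodePoint(std::uint32_t cp)
{
    // Step 1: determine how many bytes the encoding needs.
    int count;
    if (cp < 0x80)          count = 1;
    else if (cp < 0x800)    count = 2;
    else if (cp < 0x10000)  count = 3;
    else if (cp < 0x110000) count = 4;
    else return {};

    std::vector<std::uint8_t> out(count);
    if (count == 1) {
        out[0] = static_cast<std::uint8_t>(cp);   // 0xxxxxxx
        return out;
    }

    // Step 2: fill the x positions from the lowest bits to the highest,
    // 6 bits per continuation byte (10xxxxxx), last byte first.
    for (int i = count - 1; i > 0; --i) {
        out[i] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    // Step 3: the remaining high bits go into the first byte, after its
    // marker of `count` one-bits and a zero-bit (0xC0, 0xE0, 0xF0).
    static const std::uint8_t leadMarker[] = {0x00, 0x00, 0xC0, 0xE0, 0xF0};
    out[0] = static_cast<std::uint8_t>(leadMarker[count] | cp);
    return out;
}
```

Called with 0xCA this yields the bytes C3 8A, and with 0xF03F it yields EF 80 BF, matching the two worked examples.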

Advantages of UTF-8 encoding:

UTF-8 can be read and written quickly using bit masks and shift operations.
When comparing strings, strcmp() on UTF-8 text gives the same ordering as wcscmp() on the corresponding wide-character (code point) strings, which makes sorting easier.
The bytes FF and FE never appear in UTF-8 text, so they can be used to mark UTF-16 or UTF-32 text (see BOM).
UTF-8 is independent of byte order. Its byte order is the same on every system, so it does not actually need a BOM.
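The sorting property in the list above can be checked with a small experiment. The wrapper below is a hypothetical helper, named compareUtf8Bytes only for this sketch; it relies on the fact that strcmp() compares bytes as unsigned values, and simply normalizes the result to -1, 0, or 1.

```cpp
#include <cstring>

// Compares two NUL-terminated UTF-8 strings byte by byte.
// Returns -1, 0, or 1. Because of how UTF-8 lays out its marker bits,
// this byte order agrees with the order of the code points.
// (Hypothetical helper for illustration.)
int compareUtf8Bytes(const char* a, const char* b)
{
    int r = std::strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

For example, U+00E9 is encoded as C3 A9 and U+20AC as E2 82 AC; the byte comparison puts the first before the second, just as 0x00E9 is less than 0x20AC.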

Disadvantages of UTF-8 encoding:

The number of bytes in a UTF-8 text cannot be determined from the number of Unicode characters, because UTF-8 is a variable-length encoding.
Characters that take only one byte in an extended ASCII character set take two bytes in UTF-8.
ISO Latin-1 is a subset of Unicode, but its byte encoding is not a subset of UTF-8: the values 128-255 are encoded differently.
Because Internet mail was originally designed for 7-bit ASCII, the 8-bit bytes of UTF-8 may be filtered out by email gateways. UTF-7 was created for this reason.
UTF-8 uses bytes of the form 100xxxxx in its representations with a probability of over 50%, and existing implementations of ISO 2022, 4873, 6429, and 8859 treat such bytes as C1 control codes. UTF-7.5 was created for this reason.
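The first disadvantage means that the character count has to be computed by scanning the bytes. A minimal sketch, assuming well-formed input and using the hypothetical name countCodePoints: every character contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the number of characters.

```cpp
#include <cstddef>
#include <string>

// Counts the characters (code points) in well-formed UTF-8 text by
// counting every byte that is NOT a continuation byte (10xxxxxx).
// (Hypothetical helper for illustration; it does not validate the input.)
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

The text "A" followed by U+00CA and U+F03F occupies 6 bytes (41, C3 8A, EF 80 BF) but contains only 3 characters.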

Modified UTF-8:

Java uses UTF-16 to represent text internally and supports a non-standard modified UTF-8 encoding for string serialization. There are two differences between standard UTF-8 and modified UTF-8:
In modified UTF-8, the null character is encoded in 2 bytes (11000000 10000000) instead of the standard 1 byte (00000000), which guarantees that an encoded string never contains an embedded null byte. As a result, when such a string is processed in a C-like language, the text is not truncated at the first null character (C strings end with null).
In standard UTF-8, characters outside the Basic Multilingual Plane (BMP) are encoded in 4 bytes. In modified UTF-8 they are represented as surrogate pairs, and each surrogate of the pair is encoded separately as a 3-byte sequence. As a result, a character that needs 4 bytes in standard UTF-8 needs 6 bytes in modified UTF-8.
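The two differences can be sketched in code. This is an illustration of the modified (corrected) UTF-8 rules as described above, not Java's actual implementation; the names appendThreeBytes and encodeModifiedUtf8 are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Appends a 16-bit value as a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
void appendThreeBytes(std::vector<std::uint8_t>& out, std::uint32_t v)
{
    out.push_back(static_cast<std::uint8_t>(0xE0 | (v >> 12)));
    out.push_back(static_cast<std::uint8_t>(0x80 | ((v >> 6) & 0x3F)));
    out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
}

// Encodes one code point using the modified UTF-8 rules.
// Returns an empty vector for values above U+10FFFF.
std::vector<std::uint8_t> encodeModifiedUtf8(std::uint32_t cp)
{
    std::vector<std::uint8_t> out;
    if (cp == 0) {
        // Difference 1: the null character becomes C0 80, never a 00 byte.
        out.push_back(0xC0);
        out.push_back(0x80);
    } else if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        appendThreeBytes(out, cp);
    } else if (cp < 0x110000) {
        // Difference 2: split into a UTF-16 surrogate pair and encode each
        // surrogate as its own 3-byte sequence (6 bytes in total).
        std::uint32_t v = cp - 0x10000;
        appendThreeBytes(out, 0xD800 | (v >> 10));    // high surrogate
        appendThreeBytes(out, 0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

With this sketch, U+1F600 becomes the six bytes ED A0 BD ED B8 80, whereas standard UTF-8 encodes it as the four bytes F0 9F 98 80.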

Byte order mark (BOM)

The BOM (byte order mark) is a character placed at the start of Unicode text. It indicates the byte order (big endian or little endian) of UTF-16 and UTF-32 text, and it also identifies the encoding (UTF-8, UTF-16, or UTF-32; UTF-8 itself is independent of byte order).

As follows:

Encoding               Representation
UTF-8                  EF BB BF
UTF-16 big endian      FE FF
UTF-16 little endian   FF FE
UTF-32 big endian      00 00 FE FF
UTF-32 little endian   FF FE 00 00
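A reader of a file can use the table to guess the encoding from its first bytes. The following is a minimal sketch; the function name detectBom and the returned labels are assumptions made for this illustration. The UTF-32 marks are tested before the UTF-16 marks because the UTF-32 little endian mark begins with the same two bytes as the UTF-16 little endian mark.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Returns the encoding named by a BOM at the start of `data`, or "none".
// (Hypothetical helper for illustration.)
std::string detectBom(const std::vector<std::uint8_t>& data)
{
    // True when `data` begins with the given byte sequence.
    auto startsWith = [&data](std::initializer_list<std::uint8_t> bom) {
        return data.size() >= bom.size() &&
               std::equal(bom.begin(), bom.end(), data.begin());
    };

    // Longer marks first: FF FE 00 00 would otherwise be taken for UTF-16.
    // (It could also be UTF-16 little endian text starting with U+0000;
    // this sketch assumes that does not occur.)
    if (startsWith({0x00, 0x00, 0xFE, 0xFF})) return "UTF-32 big endian";
    if (startsWith({0xFF, 0xFE, 0x00, 0x00})) return "UTF-32 little endian";
    if (startsWith({0xEF, 0xBB, 0xBF}))       return "UTF-8";
    if (startsWith({0xFE, 0xFF}))             return "UTF-16 big endian";
    if (startsWith({0xFF, 0xFE}))             return "UTF-16 little endian";
    return "none";
}
```

Text without a BOM yields "none"; in that case the encoding has to be known from elsewhere.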

Example: UTF-8 encoding in C++

The following four C++ functions convert between 2-byte Unicode and UTF-8, and between 4-byte Unicode and UTF-8.

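#include <cstddef>
#include <vector>

#define maskbits   0x3f  // low 6 payload bits of a continuation byte
#define maskbyte   0x80  // marker of a continuation byte: 10xxxxxx
#define mask2bytes 0xc0  // marker of a 2-byte lead byte:  110xxxxx
#define mask3bytes 0xe0  // marker of a 3-byte lead byte:  1110xxxx
#define mask4bytes 0xf0  // marker of a 4-byte lead byte:  11110xxx
#define mask5bytes 0xf8  // marker of a 5-byte lead byte:  111110xx
#define mask6bytes 0xfc  // marker of a 6-byte lead byte:  1111110x

typedef unsigned char  byte;
typedef unsigned short unicode2bytes;
typedef unsigned int   unicode4bytes;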

// Encodes 2-byte Unicode values (0000-FFFF) as UTF-8.
// Surrogate values (D800-DFFF) are not treated specially.
void utf8encode2bytesunicode(const std::vector<unicode2bytes>& input,
                             std::vector<byte>& output)
{
    for (std::size_t i = 0; i < input.size(); i++)
    {
        // 0xxxxxxx
        if (input[i] < 0x80)
        {
            output.push_back((byte)input[i]);
        }
        // 110xxxxx 10xxxxxx
        else if (input[i] < 0x800)
        {
            output.push_back((byte)(mask2bytes | (input[i] >> 6)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
        // 1110xxxx 10xxxxxx 10xxxxxx
        // (every remaining 16-bit value is below 0x10000)
        else
        {
            output.push_back((byte)(mask3bytes | (input[i] >> 12)));
            output.push_back((byte)(maskbyte | ((input[i] >> 6) & maskbits)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
    }
}

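// Decodes UTF-8 into 2-byte Unicode values. Sequences longer than 3 bytes
// cannot be represented in 16 bits; their bytes, like any malformed or
// truncated input, are skipped one byte at a time.
void utf8decode2bytesunicode(const std::vector<byte>& input,
                             std::vector<unicode2bytes>& output)
{
    for (std::size_t i = 0; i < input.size();)
    {
        unicode2bytes ch;

        // 1110xxxx 10xxxxxx 10xxxxxx
        // (masking with the next longer marker tests for exactly 1110xxxx)
        if ((input[i] & mask4bytes) == mask3bytes && i + 2 < input.size())
        {
            ch = (unicode2bytes)(((input[i] & 0x0f) << 12)
                | ((input[i + 1] & maskbits) << 6)
                | (input[i + 2] & maskbits));
            i += 3;
        }
        // 110xxxxx 10xxxxxx
        else if ((input[i] & mask3bytes) == mask2bytes && i + 1 < input.size())
        {
            ch = (unicode2bytes)(((input[i] & 0x1f) << 6)
                | (input[i + 1] & maskbits));
            i += 2;
        }
        // 0xxxxxxx
        else if (input[i] < maskbyte)
        {
            ch = input[i];
            i += 1;
        }
        // Stray continuation byte, unsupported lead byte, or truncated
        // sequence: skip one byte so that the loop always advances.
        else
        {
            i += 1;
            continue;
        }
        output.push_back(ch);
    }
}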

// Encodes 4-byte Unicode values as UTF-8 using the original 1-to-6-byte
// scheme. RFC 3629 only permits the forms of up to 4 bytes (values up to
// 10FFFF); the 5- and 6-byte forms are kept here to match the table above.
// Values of 0x80000000 and above cannot be encoded and are skipped.
void utf8encode4bytesunicode(const std::vector<unicode4bytes>& input,
                             std::vector<byte>& output)
{
    for (std::size_t i = 0; i < input.size(); i++)
    {
        // 0xxxxxxx
        if (input[i] < 0x80)
        {
            output.push_back((byte)input[i]);
        }
        // 110xxxxx 10xxxxxx
        else if (input[i] < 0x800)
        {
            output.push_back((byte)(mask2bytes | (input[i] >> 6)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
        // 1110xxxx 10xxxxxx 10xxxxxx
        else if (input[i] < 0x10000)
        {
            output.push_back((byte)(mask3bytes | (input[i] >> 12)));
            output.push_back((byte)(maskbyte | ((input[i] >> 6) & maskbits)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if (input[i] < 0x200000)
        {
            output.push_back((byte)(mask4bytes | (input[i] >> 18)));
            output.push_back((byte)(maskbyte | ((input[i] >> 12) & maskbits)));
            output.push_back((byte)(maskbyte | ((input[i] >> 6) & maskbits)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
        // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if (input[i] < 0x4000000)
        {
            output.push_back((byte)(mask5bytes | (input[i] >> 24)));
            output.push_back((byte)(maskbyte | ((input[i] >> 18) & maskbits)));
            output.push_back((byte)(maskbyte | ((input[i] >> 12) & maskbits)));
            output.push_back((byte)(maskbyte | ((input[i] >> 6) & maskbits)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
        // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if (input[i] < 0x80000000)
        {
            output.push_back((byte)(mask6bytes | (input[i] >> 30)));
            output.push_back((byte)(maskbyte | ((input[i] >> 24) & maskbits)));
            output.push_back((byte)(maskbyte | ((input[i] >> 18) & maskbits)));
            output.push_back((byte)(maskbyte | ((input[i] >> 12) & maskbits)));
            output.push_back((byte)(maskbyte | ((input[i] >> 6) & maskbits)));
            output.push_back((byte)(maskbyte | (input[i] & maskbits)));
        }
    }
}

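// Decodes UTF-8 (original 1-to-6-byte scheme) into 4-byte Unicode values.
// Each lead byte is tested against its exact marker by masking with the
// next longer marker. Malformed or truncated input is skipped one byte
// at a time.
void utf8decode4bytesunicode(const std::vector<byte>& input,
                             std::vector<unicode4bytes>& output)
{
    for (std::size_t i = 0; i < input.size();)
    {
        unicode4bytes ch;

        // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        if ((input[i] & 0xfe) == mask6bytes && i + 5 < input.size())
        {
            ch = ((unicode4bytes)(input[i] & 0x01) << 30)
                | ((unicode4bytes)(input[i + 1] & maskbits) << 24)
                | ((unicode4bytes)(input[i + 2] & maskbits) << 18)
                | ((unicode4bytes)(input[i + 3] & maskbits) << 12)
                | ((unicode4bytes)(input[i + 4] & maskbits) << 6)
                | (unicode4bytes)(input[i + 5] & maskbits);
            i += 6;
        }
        // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if ((input[i] & mask6bytes) == mask5bytes && i + 4 < input.size())
        {
            ch = ((unicode4bytes)(input[i] & 0x03) << 24)
                | ((unicode4bytes)(input[i + 1] & maskbits) << 18)
                | ((unicode4bytes)(input[i + 2] & maskbits) << 12)
                | ((unicode4bytes)(input[i + 3] & maskbits) << 6)
                | (unicode4bytes)(input[i + 4] & maskbits);
            i += 5;
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if ((input[i] & mask5bytes) == mask4bytes && i + 3 < input.size())
        {
            ch = ((unicode4bytes)(input[i] & 0x07) << 18)
                | ((unicode4bytes)(input[i + 1] & maskbits) << 12)
                | ((unicode4bytes)(input[i + 2] & maskbits) << 6)
                | (unicode4bytes)(input[i + 3] & maskbits);
            i += 4;
        }
        // 1110xxxx 10xxxxxx 10xxxxxx
        else if ((input[i] & mask4bytes) == mask3bytes && i + 2 < input.size())
        {
            ch = ((unicode4bytes)(input[i] & 0x0f) << 12)
                | ((unicode4bytes)(input[i + 1] & maskbits) << 6)
                | (unicode4bytes)(input[i + 2] & maskbits);
            i += 3;
        }
        // 110xxxxx 10xxxxxx
        else if ((input[i] & mask3bytes) == mask2bytes && i + 1 < input.size())
        {
            ch = ((unicode4bytes)(input[i] & 0x1f) << 6)
                | (unicode4bytes)(input[i + 1] & maskbits);
            i += 2;
        }
        // 0xxxxxxx
        else if (input[i] < maskbyte)
        {
            ch = input[i];
            i += 1;
        }
        // Stray continuation byte, FE/FF, or truncated sequence:
        // skip one byte so that the loop always advances.
        else
        {
            i += 1;
            continue;
        }
        output.push_back(ch);
    }
}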

 
