Unicode-related Encoding

Source: Internet
Author: User

The following is reproduced from the Internet.

 

I. Unicode Origin

Unicode:
Unicode is currently the most widely used character encoding standard. It resolves the conflicts between the encodings used for different languages.

The earliest character encoding, ASCII, is a 7-bit code (stored in an 8-bit byte whose high bit is 0), so it can represent only 128 characters: English letters, digits, and some symbols. However, the world has many languages, and even extended ASCII, which also uses the byte values whose high bit is 1, offers only 256 characters.
Scripts such as Chinese, Japanese, Korean, and Arabic cannot be represented this way.
As a result, countries developed their own ASCII-compatible encodings, the various "ANSI" code pages. For example, GB2312 in China uses two bytes with the high bit set to represent one Chinese character. However, these ANSI encodings cannot be used at the same time, because their definitions overlap: the same byte values mean different characters in different encodings. To mix languages freely, a new encoding was needed that assigns codes to all scripts in a uniform way.
ISO (the International Organization for Standardization) and the Unicode Consortium (an association of software manufacturers) each started work on this: the ISO 10646 project and the Unicode project, respectively. Later, the two began to merge their results and now use the same character repertoire and code points. However, each project still publishes its own standard.

UCS (Universal Coded Character Set):
This is the name ISO uses for its counterpart of Unicode. It defines two encoding forms: UCS-2 (often loosely called "Unicode") represents a character in 2 bytes, and UCS-4 represents a character in 4 bytes. UCS-4 extends UCS-2 by adding 2 high-order bytes. Even the older UCS-2 can represent 2^16 = 65536 code points, enough for the commonly used characters of most languages, so UCS-2 was widely adopted. Characters with code points above 0xFFFF cannot be represented in UCS-2 and need UCS-4 (or surrogate pairs in UTF-16).

UTF (UCS Transformation Format):
UCS-2 uses two bytes for every character, while ASCII uses one, which causes many conflicts: code written to process ASCII would have to be rewritten. In addition, C uses the byte \0 as the string terminator, but many UCS-2 characters contain a 0x00 byte, so the C string functions cannot process such text correctly. To make Unicode practical, the UTF encodings emerged, the most common being UTF-8 and UTF-16.
UTF-16 encodes characters in the 2-byte range the same way as UCS-2, and UTF-32 is the same as UCS-4. Most importantly, UTF-8 is fully compatible with ASCII. UTF-8 is a variable-length encoding: the number of bytes per character is not fixed, and the leading bits of the first byte give the length. A first byte beginning with 0 means a 1-byte character, 110 means 2 bytes, 1110 means 3 bytes, and 11110 means 4 bytes. Every subsequent byte of a character begins with 10. Sequences therefore cannot be confused with one another, and single-byte English characters are still encoded exactly as in ASCII. The original UTF-8 design allowed sequences of up to 6 bytes, but Unicode code points now stop at 0x10FFFF, so UTF-8 in practice uses at most 4 bytes; characters up to 0xFFFF, which cover most everyday text, need at most 3.
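As a rough illustration of how the first byte determines the length of a UTF-8 sequence, here is a minimal sketch in C. The function name utf8_sequence_length is made up for this example, and the sketch assumes the modern limit of 4 bytes per character.

```c
#include <stddef.h>

/* Hypothetical helper: returns the total length in bytes of the UTF-8
 * sequence that starts with `lead`, or 0 if `lead` is a continuation byte
 * (10xxxxxx) or begins a 5- or 6-byte form. It classifies by bit pattern
 * only and does not reject invalid lead bytes such as 0xC0 or 0xF5. */
size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}
```

A decoder could call this on each lead byte to know how many continuation bytes to read next.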
 

II. Definition and conversion of UTF-8, UTF-16 and UTF-32

 

1. Definition
Unicode: the international universal character set (UCS) standard.
UTF-16BE: a concrete storage encoding of Unicode that stores each 16-bit unit in big-endian byte order.

UTF-16LE: a concrete storage encoding of Unicode that stores each 16-bit unit in little-endian byte order.
UTF-8: a concrete storage encoding of Unicode that uses 8-bit code units.

2. UTF-8 encoding standard:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
(The 5- and 6-byte forms belong to the original design. Current UTF-8 is limited to code points up to U+10FFFF and therefore to at most 4 bytes.)

3. Byte order mark (BOM) at the start of a file, and the charset/encoding it indicates
EF BB BF        UTF-8
FF FE           UTF-16/UCS-2, little-endian
FE FF           UTF-16/UCS-2, big-endian
FF FE 00 00     UTF-32/UCS-4, little-endian
00 00 FE FF     UTF-32/UCS-4, big-endian
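A sketch of how a program might check for these leading bytes follows. The function name detect_bom and the returned encoding labels are assumptions made for this example. Note that FF FE 00 00 starts with the same two bytes as the UTF-16 little-endian mark, so the 4-byte patterns are tested first; a UTF-16LE file whose first character is U+0000 would be misread by this simple check.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a static name for the encoding indicated by
 * a BOM at the start of `buf`, or NULL if none is present. Stores the size
 * of the BOM in *bom_len (0 when there is no BOM). */
const char *detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    static const struct {
        const char *name;
        unsigned char bytes[4];
        size_t size;
    } boms[] = {
        /* UTF-32 first: FF FE 00 00 begins with the UTF-16LE mark FF FE. */
        { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { "UTF-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { "UTF-16LE", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
        { "UTF-16BE", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
    };
    for (size_t i = 0; i < sizeof boms / sizeof boms[0]; i++) {
        if (len >= boms[i].size &&
            memcmp(buf, boms[i].bytes, boms[i].size) == 0) {
            *bom_len = boms[i].size;
            return boms[i].name;
        }
    }
    *bom_len = 0;
    return NULL;
}
```

The caller would typically skip bom_len bytes and then decode the rest of the buffer with the reported encoding.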

4. Chinese character encoding example, using the two characters "连通" (U+8FDE U+901A, "connected"):
Unicode code units (little-endian bytes): DE 8F 1A 90
UTF-8 encoding: E8 BF 9E E9 80 9A
UTF-16LE text (with BOM): FF FE DE 8F 1A 90
UTF-8 text (with BOM): EF BB BF E8 BF 9E E9 80 9A
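The bytes DE 8F 1A 90 in this example are the 16-bit values 0x8FDE and 0x901A written low byte first. The sketch below shows how the same code units produce different byte sequences in the two byte orders. The function name utf16_to_bytes is hypothetical, and the sketch does not write a BOM.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes `n` 16-bit code units to `out` (2*n bytes),
 * most significant byte first if `big_endian` is nonzero, least
 * significant byte first otherwise. */
void utf16_to_bytes(const uint16_t *units, size_t n, int big_endian,
                    unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)(units[i] >> 8);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        out[2 * i]     = big_endian ? hi : lo;
        out[2 * i + 1] = big_endian ? lo : hi;
    }
}
```

With the units 0x8FDE and 0x901A, the little-endian output is DE 8F 1A 90 and the big-endian output is 8F DE 90 1A.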

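5. How to convert between them: Unicode (UCS-2) to UTF-8
The following handles characters in the range 0x0800-0xFFFF, which become 3 bytes in UTF-8. Judging from the bit operations, pChar points to the 2 bytes of the character in little-endian order (pChar[0] is the low byte, pChar[1] is the high byte):
pOut[0] = (0xE0 | ((pChar[1] & 0xF0) >> 4));
pOut[1] = (0x80 | ((pChar[1] & 0x0F) << 2) | ((pChar[0] & 0xC0) >> 6));
pOut[2] = (0x80 | (pChar[0] & 0x3F));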
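The conversion above goes from a 2-byte Unicode value to UTF-8. For the reverse direction, a minimal sketch under the same assumption (3-byte sequences only) might look like this. The function name utf8_3byte_to_ucs2 is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: decodes one 3-byte UTF-8 sequence
 * (1110xxxx 10xxxxxx 10xxxxxx) into a 16-bit value. Returns 0 on success
 * and -1 if the bytes do not match the 3-byte pattern. Overlong forms and
 * surrogates are not rejected in this sketch. */
int utf8_3byte_to_ucs2(const unsigned char in[3], uint16_t *out)
{
    if ((in[0] & 0xF0) != 0xE0 ||
        (in[1] & 0xC0) != 0x80 ||
        (in[2] & 0xC0) != 0x80)
        return -1;

    /* 4 payload bits from the lead byte, 6 from each continuation byte. */
    *out = (uint16_t)(((in[0] & 0x0F) << 12) |
                      ((in[1] & 0x3F) << 6)  |
                       (in[2] & 0x3F));
    return 0;
}
```

Feeding it the bytes E8 BF 9E yields 0x8FDE, the value that the forward conversion started from.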

Pengpeng's personal notes:

Determining the encoding used by a file is a complicated process:

(1) Some file formats declare the encoding in a predefined header at the beginning of the file. Microsoft Office files (Word, Outlook, Excel, PowerPoint, and so on) are binary files of this kind. For example, FE FF indicates UTF-16BE.

(2) A standard byte stream begins with a special character that does not occur in normal text, the byte order mark (BOM), which identifies the encoding. For a UTF-16BE stream the first two bytes are FE FF; for a UTF-8 stream the first three bytes are EF BB BF. If no BOM is found, UTF-16BE is assumed by default.

(3) If FE FF, EF BB BF, or similar marker bytes appear in the middle of a byte stream, they can be ignored.

(4) Files on Microsoft platforms mostly use UTF-16LE. Is that because the Intel x86 platform is little-endian? That is an interesting point. By contrast, most files on Mac platforms use UTF-16BE.

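(5) Commonly used Chinese characters lie in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which falls in the 0800-FFFF range, so each one takes 3 bytes in UTF-8 (and 2 bytes in UTF-16), consistent with the example in section 4.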

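(6) UTF-16BE and UTF-16LE can be converted into each other by swapping bytes (the first macro is for 32-bit values, the second for 16-bit values):

/* x must be an unsigned lvalue; each macro reverses its bytes in place. */
#define convert_32(x) \
    ((x) = ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | \
            (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)))

#define convert_16(x) \
    ((x) = ((((x) & 0xff00) >> 8) | (((x) & 0x00ff) << 8)))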
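As an alternative to macros, the same byte swaps can be written as small functions, which avoids evaluating the argument several times and makes the operand types explicit. This is only a sketch; the names swap16 and swap32 are chosen for this example.

```c
#include <stdint.h>

/* Hypothetical helpers: return `v` with its bytes in reverse order. */
static inline uint16_t swap16(uint16_t v)
{
    return (uint16_t)(((unsigned)v >> 8) | (((unsigned)v & 0xFFu) << 8));
}

static inline uint32_t swap32(uint32_t v)
{
    return ((v & 0xFF000000u) >> 24) |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x000000FFu) << 24);
}
```

Applying swap16 to every 16-bit unit of a UTF-16BE buffer gives the UTF-16LE form and vice versa; swapping twice returns the original value.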

 
