Unicode, character, Character Set, encoding, UTF-8

Source: Internet
Author: User
Reposted from: http://www.utf.com.cn/article/s1383

 

These related concepts are not complicated, but they are very easy to confuse. Recently I have read a number of articles on the subject, and even those regarded as authoritative sources often contradict one another, use imprecise wording, and explain the concepts unclearly:

1. The character set and the encoding scheme are confused with each other. http://www.utf.com.cn/article/s320 says:

UTF-8 Character Set

UTF-8 is a variable-length character encoding of Unicode
The second sentence is right, but the first is wrong: UTF-8 is one possible encoding scheme for the Unicode character set; it is not a character set.

2. A character set only defines an abstract, computer-independent set of characters and specifies which characters belong to it. Each character is assigned a number. That number is not an encoding; in Unicode terms it is a code point. The characters do not have to be unique in the shape seen by the human eye: two different characters may look the same.

3. Unicode assigns a unique code point to each character. A code point is only a mathematical number. It should not be tied to any particular representation of that value in the computer; the representation is determined by the encoding scheme. In the Joel on Software article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", the author calls this abstract notion "platonic".
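To make the difference between a code point and its encoded form concrete, here is a minimal Python sketch. The sample character and the variable names are chosen only for illustration; the codecs are the ones shipped with Python.

```python
# One abstract code point, three different byte representations.
ch = "\u4e2d"  # the CJK character U+4E2D, picked as an arbitrary example

code_point = ord(ch)  # the abstract number, independent of any encoding
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}

print(f"code point: U+{code_point:04X}")
for name, data in encoded.items():
    # bytes.hex(sep) needs Python 3.8 or later
    print(f"{name:10} {data.hex(' ')}")
```

The code point stays the same in every case; only the bytes produced by each encoding scheme differ.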


4. Is the maximum length of a UTF-8 sequence 4 bytes or 6 bytes?

The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the Universal Character Set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003.
It was once 6 bytes and is now 4 bytes, so both answers were correct in their time. If an article says UTF-8 uses up to 6 bytes, you can generally conclude that it is a relatively early article. This is the reverse of what happened to IP addresses, which started at 4 bytes (IPv4) and were eventually extended to 16 bytes (IPv6).
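The 4-byte limit can be checked with a short Python sketch. The sample code points are the upper boundaries of each sequence length, and `rejects_legacy_form` is a hypothetical helper name used only here.

```python
# UTF-8 byte lengths at the upper boundary of each sequence length.
samples = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

for cp, n in lengths.items():
    print(f"U+{cp:04X} -> {n} byte(s)")


def rejects_legacy_form() -> bool:
    """Return True if the decoder refuses an old-style 5-byte sequence."""
    legacy = b"\xf8\x88\x80\x80\x80"  # a 5-byte form from the pre-RFC 3629 scheme
    try:
        legacy.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print("5-byte form rejected:", rejects_legacy_form())
```

Python's decoder follows the restricted definition, so the longer legacy forms are treated as invalid input.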

5. How to tell whether a text is ANSI or Unicode
http://www.cppblog.com/liangbo/archive/2006/04/23/6103.html says:

How can I determine whether a text file is ANSI or Unicode?
The test is: if the first two bytes of a text file are 0xFF and 0xFE, the file is Unicode; otherwise it is ANSI.

I can't quite bring myself to call this wrong, but it certainly cannot be called right, and it misleads people very easily. The BOM in the first two bytes is only a convention; it guarantees nothing about the rest of the file. The content might be neither ASCII nor Unicode. Couldn't it be GB2312-encoded? Couldn't it be a binary file? The BOM is just an agreement that makes sense only when both parties strictly abide by it, and of course the agreement can be used to deceive. The statement also ignores the fact that the BOM distinguishes little-endian from big-endian UTF-16: only one case is given, and the other falls into its "otherwise" branch. UTF-8 has no byte-order issue, but there is also a three-byte marker that by convention indicates UTF-8, and UTF-8 is of course also a Unicode encoding.
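A fuller BOM check than the quoted two-byte test might look like the following Python sketch. `sniff_bom` is a hypothetical helper name; the BOM constants come from the standard `codecs` module. As the paragraph above stresses, a match is only a hint that the convention was followed, not proof of the file's encoding.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: could be ASCII, GB2312, binary, or anything else


print(sniff_bom(b"\xff\xfeA\x00"))
print(sniff_bom(b"\xfe\xff\x00A"))
print(sniff_bom(b"\xef\xbb\xbfA"))
print(sniff_bom(b"plain bytes"))
```

Note that `None` here means "unknown", not "ANSI", which is exactly the gap in the quoted rule.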

The true answer to this question may be disappointing: there is no reliable way to tell. That is the theoretical position; in practice, few people are bored enough to deliberately craft a misleading file. Still, imagine that the server sends IE no Content-Type HTTP header, and the HTML code contains no <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> either. How does the browser know the encoding of your web page? By guessing! It guesses from the content and sometimes guesses wrong, in which case you have to select a different encoding manually and redisplay the page to get rid of the gibberish.
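Why guessing can fail is easy to show in Python: the same bytes may decode without error under more than one encoding. This is only a sketch; `try_decode` is a hypothetical helper and the sample text is arbitrary.

```python
# Two CJK characters (U+4E2D U+6587) stored as GB2312 bytes.
data = "\u4e2d\u6587".encode("gb2312")


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {name: try_decode(data, name) for name in ("utf-8", "gb2312", "latin-1")}

for name, text in results.items():
    print(f"{name:8} -> {text!r}")
```

UTF-8 rejects these bytes, but both GB2312 and Latin-1 accept them, and only one of the two results is the intended text. A guesser has to rely on statistics about the content, which is why it is sometimes wrong.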

6. Is Unicode UCS-2? Is Unicode 16-bit?
Unicode is a character set and has nothing to do with any specific number of bits used to represent it. Of course, the Unicode Standard also specifies encoding schemes for the character set, so you can regard UTF-8 and the others as equally part of the Unicode Standard.

http://www.cppblog.com/liangbo/archive/2006/04/23/6103.html says:

14. Differences between Unicode and DBCS
Unicode is called a "wide character set" (especially in the context of the C programming language). "Every character in Unicode is 16 bits wide, not 8 bits wide." In Unicode, an 8-bit value on its own has no meaning.

The above statement is ????

7. Is UTF-16 a 16-bit fixed-length encoding scheme?
Nooooooooooooo!
http://www.answers.com/topic/utf-8-1

UTF-16 is often mistaken to be constant-length, leading to code that works for most text but suddenly fails for non-BMP characters

Characters defined in the BMP can be encoded in 16 bits, that is, in a single UTF-16 word (one word, 2 bytes).
Plane 0 (0000-FFFF): Basic Multilingual Plane (BMP)
Therefore a single 16-bit unit in the Windows API, a wchar_t (which, from the language's perspective, may also be 4 bytes), or a char in Java/C# can only hold a BMP character.

Although UTF-16 is a variable-length encoding, it is unlike UTF-8, which can use 1, 2, 3, or 4 bytes per character: UTF-16 uses only 2 or 4 bytes.
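The 4-byte case comes from surrogate pairs. The following Python sketch shows the standard arithmetic for splitting a non-BMP code point into two 16-bit units; `to_surrogates` is a hypothetical helper name, and U+1F600 is just a convenient non-BMP example.

```python
def to_surrogates(cp: int):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# The built-in codec agrees on the sizes: 2 bytes inside the BMP, 4 outside.
bmp_size = len("A".encode("utf-16-le"))
non_bmp_size = len("\U0001F600".encode("utf-16-le"))
print(bmp_size, non_bmp_size)
```

Code that treats every 16-bit unit as one character works until it meets such a pair, which is the failure mode the quotation above describes.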

8. How many characters can Unicode contain? Is it double-byte?
Unicode was originally a "double-byte," or 16-digit, binary number (see numeration) code that could represent up to 65,536 items. No longer limited to 16 bits, it can now represent about one million code positions using three encoding forms called Unicode Transformation Formats (UTF).

Unicode reserves 1,114,112 (= 2^20 + 2^16, or 17 × 2^16, hexadecimal 110000) code points.
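The arithmetic in that figure can be checked in a few lines of Python. The variable and function names are illustrative only.

```python
# 16 supplementary planes (2**20 code points) plus the BMP (2**16).
total = 2**20 + 2**16
planes = 17 * 2**16  # 17 planes of 65,536 code points each

print(total, planes, hex(total))


def beyond_range_is_rejected() -> bool:
    """chr() accepts values up to 0x10FFFF and raises ValueError above that."""
    chr(0x10FFFF)  # the last valid code point, accepted
    try:
        chr(0x110000)
    except ValueError:
        return True
    return False


print(beyond_range_is_rejected())
```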

As of Unicode 5.0.0, 101,063 (9.1%) of these codepoints are assigned, with another 137,468 (12.3%) reserved for private use, leaving 875,441 (78.6%) unassigned

UCS-2: a 16-bit, fixed-width encoding that only supports the BMP, considered obsolete.

The BMP holds the most frequently used characters, so UCS-2, or Microsoft's limited Unicode support, may be enough in practice.

9. What Unicode encoding does Windows use internally?
UTF-16 is the standard format for the Windows API (though surrogate support is not enabled by default). Without surrogate support, the Windows API cannot handle every Unicode character.

UTF-16 is the native internal representation of text in the Microsoft Windows NT/2000/XP/CE and Qualcomm BREW operating systems; the Java and .NET bytecode environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.
UTF-16 is the most popular internal representation inside programs, because it brings processing-efficiency advantages, while UTF-8 is the most popular for transmission and storage.

Older Windows NT systems (prior to Windows 2000) only support UCS-2.
Does XP support UTF-16? And what about .NET and wchar_t on it? Define a _ucs4 macro?

10. GB18030
GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC).
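Python ships a `gb18030` codec, so the claim that GB18030 covers Unicode can be illustrated with a small round trip. This is a sketch; the two sample characters (one in the BMP, one outside it) are arbitrary choices.

```python
# U+4E2D is in the BMP; U+20000 is a CJK character outside the BMP.
text = "\u4e2d\U00020000"

encoded = text.encode("gb18030")
round_trip = encoded.decode("gb18030")

# Bytes used per character: GB18030 is a variable-length encoding too.
sizes = [len(ch.encode("gb18030")) for ch in text]
print(encoded.hex(" "), sizes, round_trip == text)


def gb2312_can_encode(s: str) -> bool:
    """The older GB2312 set is far smaller, so many characters are not encodable."""
    try:
        s.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True


print(gb2312_can_encode("\u4e2d"), gb2312_can_encode("\U00020000"))
```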

11. Conceptually, since UCS (Universal Character Set) is a character set and distinct from an encoding scheme, why do UCS-2 and UCS-4 carry the suffixes -2 and -4?
UCS denotes the character set, but UCS-2 is an encoding scheme.
The -2 and -4 do indeed count bytes (octets), but UCS-2 refers to an encoding scheme, an early encoding of the UCS that predates UTF-16. As I understand it, the 2 means that every character can in theory be expressed in 2 bytes. With an actual encoding scheme such as UTF-8 applied to the characters UCS-2 can represent, which cover the whole BMP, a character may need 3 bytes in UTF-8, but never 4.

There is indeed some confusion here: the number after UCS counts bytes, while the number after an encoding scheme name, as in UTF-16, counts bits. That bit count is the minimum number of bits used to encode a character, not the number of bits a character necessarily occupies once encoded.
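Both statements can be checked with Python's built-in codecs. This is a sketch with illustrative variable names: it measures the smallest encoded size of a character under each UTF and the largest UTF-8 size of any BMP character.

```python
# Smallest unit per scheme, in bits, measured on an ASCII character.
unit_bits = {
    name: len("A".encode(name)) * 8
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(unit_bits)

# Longest UTF-8 sequence needed for any BMP code point.
# Surrogates (D800-DFFF) are skipped: they are not characters and cannot be encoded.
bmp_max_utf8 = max(
    len(chr(cp).encode("utf-8"))
    for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF
)
print(bmp_max_utf8)
```

The numbers in the names UTF-8 and UTF-16 are minimums, as the paragraph above says: a character outside the ASCII range already takes more than 8 bits in UTF-8.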

12. Big-endian and little-endian
FE FF (in hexadecimal) for big-endian architectures, or FF FE for little-endian

Technically, with the UTF-16 scheme the BOM prefix is optional, but omitting it is not recommended as UTF-16LE or UTF-16BE should be used instead. If the BOM is missing, barring any indication of byte order from higher-level protocols, big endian is to be used or assumed. The BOM is not optional in the UCS-2 scheme.
In other words: technically the BOM is optional in the UTF-16 encoding scheme, but omitting it is not recommended; UTF-16LE or UTF-16BE should be specified explicitly instead. If there is no BOM and nothing else indicates the byte order, big-endian is assumed.
The BOM is not optional for UCS-2.
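The effect of byte order and the BOM can be seen with Python's codecs. This is a sketch of how Python's codecs behave, with illustrative variable names: the explicit `utf-16-be` and `utf-16-le` codecs write no BOM, while the generic `utf-16` codec writes one in the machine's native byte order and honors a BOM when decoding.

```python
import codecs

text = "A"  # U+0041

be = text.encode("utf-16-be")      # byte order stated by the name, no BOM
le = text.encode("utf-16-le")
with_bom = text.encode("utf-16")   # BOM in native byte order, then the data

print(be.hex(" "), "|", le.hex(" "), "|", with_bom.hex(" "))

# When decoding with the generic codec, the BOM selects the byte order.
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
print(from_be, from_le)
```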
