About the Unicode character set


(2011-10-20 20:54:03)

The original Unicode encoding was a fixed-length, 16-bit (2-byte) representation of a character, which can represent a total of 65,536 characters. That is obviously not enough to cover all the characters of the world's languages. The Unicode 4.0 specification takes this into account and defines a set of supplementary characters, each represented by two 16-bit code units, so that up to 1,048,576 supplementary characters can be encoded; as of Unicode 4.0, only 45,960 supplementary characters are actually defined.

Unicode itself is just an encoding specification. Three encodings of it are in actual use: UTF-8, UCS-2, and UTF-16, and text can be converted among the three according to the specification.


UTF-8 is an 8-bit, variable-length Unicode encoding and a strict superset of ASCII: every ASCII character has exactly the same encoding in UTF-8. In UTF-8 a character may be 1, 2, 3, or 4 bytes long. In general, European alphabetic characters take 1 to 2 bytes, most Asian characters take 3 bytes, and supplementary characters take 4 bytes.
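These length rules can be sketched as a small encoder in C. This is a simplified illustration of the bit patterns only; among other things, a production encoder would also reject surrogate code points:

```c
/* Encode one Unicode code point as UTF-8.  Returns the number of bytes
   written (1-4), or 0 for a code point outside the Unicode range. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                         /* ASCII: 1 byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                        /* most European letters: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                       /* most Asian characters: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* supplementary characters: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that the 1-byte case simply copies the value through, which is exactly why UTF-8 is a superset of ASCII.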

UTF-8 is universally supported on UNIX platforms, and HTML and most browsers support it as well, while Windows and Java use UCS-2.

Key Benefits of UTF-8:

    • Less storage space is required for European alphabetic characters.
    • Easy migration from ASCII character set to UTF-8.


UCS-2 is a fixed-length, 16-bit Unicode encoding: every character is 2 bytes. UCS-2 covers only Unicode 3.0, so supplementary characters are not supported.

Advantages of UCS-2:

    • Asian characters require less storage than in UTF-8, because each character is exactly 2 bytes.
    • Strings are processed faster than in UTF-8, because the encoding is fixed-length.
    • Support in Windows and Java is better.


UTF-16 is also a 16-bit encoding. In effect, UTF-16 is UCS-2 plus support for supplementary characters, that is, UCS-2 brought up to the Unicode 4.0 specification. UTF-16 is therefore a strict superset of UCS-2.

A character in UTF-16 is either 2 bytes or 4 bytes long. UTF-16 is used mainly on Windows 2000 and later.
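The 4-byte case works by subtracting 0x10000 from the supplementary code point and splitting the remaining 20 bits into a pair of 16-bit code units (a surrogate pair); a sketch in C:

```c
/* Split a supplementary code point (>= 0x10000) into a UTF-16
   surrogate pair: two 16-bit code units that together encode it. */
static void utf16_surrogates(unsigned long cp,
                             unsigned short *hi, unsigned short *lo)
{
    cp -= 0x10000;                                 /* 20 bits remain */
    *hi = (unsigned short)(0xD800 | (cp >> 10));   /* high surrogate */
    *lo = (unsigned short)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
}
```

Because the surrogate ranges 0xD800-0xDBFF and 0xDC00-0xDFFF are reserved and never encode characters by themselves, a UTF-16 stream can always be parsed unambiguously.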

The advantages of UTF-16 relative to UTF-8 are the same as those of UCS-2.

Oracle has supported Unicode since version 7.0. The main Unicode character sets provided by Oracle are:


AL32UTF8: a UTF-8 encoded character set that supports the current Unicode 4.0 standard. Characters are up to 3 bytes long, and supplementary characters are 4 bytes long.


UTF8: supports the UTF-8 encoding of Unicode 3.0. Because supplementary characters were introduced in Unicode 3.1, UTF8 does not support them. However, Unicode 3.0 had already reserved the encoding space for supplementary characters, so it is still possible to insert them into a UTF8 database; the database simply stores such a character as two separate pieces, taking up to 6 bytes. Therefore, if you need supplementary characters, it is recommended that you switch the database character set to the newer AL32UTF8.
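The 6-byte form described here encodes each half of the UTF-16 surrogate pair as its own 3-byte sequence (a storage form generally known as CESU-8). The following C sketch illustrates that layout; it is an illustration of the byte format only, not Oracle's actual code:

```c
/* Store a supplementary code point the way an older UTF8 database
   would: split it into a surrogate pair and encode each 16-bit unit
   with the 3-byte UTF-8 pattern, giving 6 bytes instead of 4. */
static int cesu8_encode_supplementary(unsigned long cp, unsigned char out[6])
{
    unsigned short units[2];
    int i;
    if (cp < 0x10000)
        return 0;                       /* only supplementary code points here */
    cp -= 0x10000;
    units[0] = (unsigned short)(0xD800 | (cp >> 10));   /* high surrogate */
    units[1] = (unsigned short)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    for (i = 0; i < 2; i++) {           /* 3-byte UTF-8 pattern per unit */
        out[i * 3 + 0] = (unsigned char)(0xE0 | (units[i] >> 12));
        out[i * 3 + 1] = (unsigned char)(0x80 | ((units[i] >> 6) & 0x3F));
        out[i * 3 + 2] = (unsigned char)(0x80 | (units[i] & 0x3F));
    }
    return 6;
}
```

A proper UTF-8 encoder would produce a single 4-byte sequence for the same character, which is why mixing the two forms in one database is problematic.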

UTF8 can be used both as the database character set and as the national character set.


UTFE is the Unicode character set for EBCDIC platforms, playing the role that UTF8 plays on ASCII platforms. The difference is that in UTFE each character may take 3 or 4 bytes, while a supplementary character requires two 4-byte pieces, that is, 8 bytes.


AL16UTF16 is a UTF-16 encoded Unicode character set, used in Oracle as a national character set.


AL24UTFFSS: this character set supports only the Unicode 1.1 specification; it was used in Oracle 7.2 through 8i and is now obsolete.

A note on CString: each character is 16 bits under Unicode and 8 bits under ANSI; after conversion to a char array, under what circumstances are the two the same?

The best practice is to write source files that compile correctly both as Unicode and as ANSI builds.

In Visual C++, open Project, then Settings, then C/C++, then Preprocessor, where you can define the identifiers UNICODE and _UNICODE; these determine whether the code is compiled for ANSI or for Unicode.

#include <tchar.h>

Change all char definitions to TCHAR; TCHAR is defined as char or WCHAR depending on that setting.
Wrap string literals in the TEXT macro, such as TEXT("Hello"); it expands to the ANSI or Unicode version depending on the compiler settings.
Most of the string functions also come in generic versions:

The maximum-length versions take one more parameter than the standard versions, giving the length of the buffer.
The versions with a v in the name take a pointer to an argument list instead of variable arguments; use the va_list type with the va_start and va_end macros.

C run-time string functions come in three forms: ASCII, wide-character, and generic.

1. Variable arguments:

Standard version      sprintf      swprintf      _stprintf
Max-length version    _snprintf    _snwprintf    _sntprintf
Windows version       wsprintfA    wsprintfW     wsprintf

2. Pointer to an argument list as parameter:

Standard version      vsprintf     vswprintf     _vstprintf
Max-length version    _vsnprintf   _vsnwprintf   _vsntprintf
Windows version       wvsprintfA   wvsprintfW    wvsprintf
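The relationship between the variable-argument and the "v" versions can be sketched in portable C. This sketch uses the standard C vsnprintf (an assumption; the Windows wsprintf family behaves similarly but has its own buffer limits):

```c
#include <stdarg.h>
#include <stdio.h>

/* A printf-style wrapper: the caller's variable arguments are captured
   into a va_list and forwarded to vsnprintf, the max-length,
   pointer-to-argument-list form.  Returns the formatted length. */
static int format_message(char *buf, size_t cch, const char *fmt, ...)
{
    va_list args;
    int n;
    va_start(args, fmt);              /* point at the first variable argument */
    n = vsnprintf(buf, cch, fmt, args);
    va_end(args);
    return n;
}
```

This is exactly why the "v" versions exist: they let you write your own variable-argument functions that delegate the actual formatting.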

The following is quoted from Programming Windows.

The American Standard

Early computer character codes evolved from the Hollerith punched card (famous from the phrase "do not fold, spindle, or mutilate"), invented by Herman Hollerith and first used in the 1890 United States census. The 6-bit character code BCDIC (Binary-Coded Decimal Interchange Code) originated from Hollerith's code, was gradually extended to the 8-bit EBCDIC during the 1960s, and has long been the standard on IBM mainframes, but it was not used anywhere else.

The American Standard Code for Information Interchange (ASCII) was begun in the late 1950s and finalized in 1967. During its development there was great controversy over whether the character length should be 6, 7, or 8 bits. Shift characters were ruled out for reliability reasons, so ASCII could not be a 6-bit code, and an 8-bit version was excluded on cost grounds (storage per bit was still expensive). The final code thus has 26 lowercase letters, 26 uppercase letters, 10 digits, 32 symbols, 33 control codes, and a space, for a total of 128 character codes. ASCII is today documented in ANSI X3.4-1986, "Coded Character Sets - 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII)", published by the American National Standards Institute. The ASCII character codes shown in Figure 2-1 are similar to the format of that ANSI document.

ASCII has many advantages. For example, the codes of the 26 letters are consecutive (which is not true of EBCDIC); uppercase and lowercase letters differ by a single bit; and the codes of the 10 digits are easily derived from the digit values themselves. (In BCDIC, the code for the character "0" came after the code for "9"!)

Best of all, ASCII is an extremely reliable standard. On keyboards, video display adapters, system hardware, printers, font files, operating systems, and the Internet, no other standard is as prevalent and as deeply ingrained as ASCII.


Figure 2-1 ASCII Character set

International aspects

The biggest problem with ASCII is indicated by the first letter of its acronym. ASCII is a truly American standard, and it is inadequate even for other English-speaking countries. For example, where is the British pound sign (£)?

English uses the Latin (or Roman) alphabet. Among written languages that use the Latin alphabet, English is unusual in requiring almost no accent marks (or diacritics). Even for the occasional English word that traditionally requires them, such as coöperate or résumé, spellings without the diacritics are entirely acceptable.

But in countries to the north and south of the United States and across the Atlantic, diacritics are common in languages that use the Latin alphabet; these marks were originally devised to adapt the Latin alphabet to the different sounds of those languages. Farther east and south of Western Europe, you encounter languages that do not use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and Russian (which uses the Cyrillic alphabet). And if you travel still farther east, you will find the ideographic Han characters of Chinese, which were also adopted in Japan and Korea.

The history of ASCII since 1967 has largely been one of attempts to overcome its limitations and make it more suitable for languages other than American English. In 1967, for example, the International Organization for Standardization (ISO) recommended a variant of ASCII in which codes 0x40, 0x5B, 0x5C, 0x5D, 0x7B, 0x7C, and 0x7D were "reserved for national use", while codes 0x5E, 0x60, and 0x7E could be "used for other graphical symbols when it is necessary to have 8, 9 or 10 positions for national use". This is obviously not an optimal international solution, because it does not guarantee consistency, but it shows how people tried to accommodate different languages.

Extended ASCII

By the early days of small computers, the 8-bit byte had been firmly established. Thus, if a byte is used to hold a character, 128 additional characters beyond ASCII become available. When the original IBM PC was launched in 1981, the video adapter ROM contained a 256-character set, which also became an important part of the IBM standard.

The original IBM extended character set included some accented characters and a lowercase Greek alphabet (useful in mathematical notation), as well as some block- and line-drawing characters. Additional characters were also assigned to the code positions of the ASCII control characters, because most control characters were never displayed.

The IBM extended character set was burned into the ROMs of countless video adapters and printers, and it was used by many applications to decorate their text-mode displays. However, this character set does not provide enough accented characters for all the Western European languages that use the Latin alphabet, and it is unsuitable for Windows: Windows has no need for graphics characters because it has a fully graphical interface.

In Windows 1.0 (released in November 1985), Microsoft did not completely abandon the IBM extended character set, but it was relegated to secondary importance. Because a draft ANSI/ISO standard was followed, the native Windows character set came to be called the "ANSI character set". The draft eventually became ANSI/ISO 8859-1-1987, "American National Standard for Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets - Part 1: Latin Alphabet No. 1", usually abbreviated "Latin 1".

The original version of the ANSI character set was printed in the Windows 1.0 Programmer's Reference, as shown in Figure 2-2.


Figure 2-2 Windows ANSI character set (based on Ansi/iso 8859-1)

An empty box indicates that no character is defined at that position; this is consistent with the final ANSI/ISO 8859-1 definition. ANSI/ISO 8859-1 specifies only graphic characters, not control characters, so DEL is not defined. In addition, code 0xA0 is defined as a no-break space (meaning a line will not be broken at that character during formatting), and code 0xAD is a soft hyphen (not displayed unless the word is broken at the end of a line). ANSI/ISO 8859-1 also defines code 0xD7 as the multiplication sign (×) and 0xF7 as the division sign (÷). Some Windows fonts also define characters from 0x80 through 0x9F, but these are not part of the ANSI/ISO 8859-1 standard.

MS-DOS 3.3 (released in April 1987) introduced the concept of the code page to IBM PC users, and Windows also uses this concept. A code page defines the mapping from codes to characters. The original IBM character set became known as code page 437, or "MS-DOS Latin US". Code page 850 is "MS-DOS Latin 1", which replaces some of the line-drawing characters with additional accented letters (but it is not the Latin 1 ISO/ANSI standard shown in Figure 2-2). Still other code pages were defined for other languages. The lowest 128 codes are always the same; the higher 128 codes depend on the language for which the code page is defined.

Under MS-DOS, if a user sets up a consistent code page for the PC's keyboard, video adapter, and printer, and then creates, edits, and prints files on that PC, everything stays consistent. Problems arise, however, when the user tries to exchange files with someone using a different code page, or when the code page on the machine is changed: character codes become associated with the wrong characters. Applications can save code page information along with their files to try to reduce the problem, but that strategy involves some work converting between code pages.

Although code pages originally provided only additional Latin characters beyond the unaccented ones, the upper 128 characters of later code pages came to include entire non-Latin alphabets, such as Hebrew, Greek, and Cyrillic. Naturally, such variety leads to code page confusion: if a few accented letters can come out wrong, whole passages of text can come out garbled and unreadable.

Code pages proliferated for all these reasons, but that is still not the whole story. The Cyrillic of MS-DOS code page 855 differs from the Cyrillic of Windows code page 1251 and from the Cyrillic of Macintosh code page 10007. Each environment's code pages are a modification of that environment's standard character set. IBM OS/2 also supports a variety of EBCDIC code pages.

But wait, as you will see, things get worse.

Double-byte Character set

So far, we have looked at character sets of 256 characters. But there are roughly 21,000 ideographs in Chinese, Japanese, and Korean. How can these languages be accommodated while still maintaining some compatibility with ASCII?

The solution (if that is the right word) is the double-byte character set (DBCS). A DBCS starts off with 256 codes, just like ASCII. As in any well-behaved code page, the first 128 codes are ASCII. However, some of the higher 128 codes always introduce a second byte. The two bytes together (called the lead byte and the trail byte) define a single character, usually a complex glyph.

Although Chinese, Japanese, and Korean share some of the same ideographs, the three languages are clearly different, and the same ideograph often means three different things in the three languages. Windows supports four different double-byte character sets: code page 932 (Japanese), 936 (Simplified Chinese), 949 (Korean), and 950 (Traditional Chinese). DBCS is supported only in the versions of Windows produced for those countries and regions.

The problem with a double-byte character set is not that characters are represented by two bytes; the problem is that some characters (in particular, the ASCII characters) are represented by one byte. This creates extra programming headaches. For example, the number of characters in a string cannot be determined from the number of bytes in the string: the string must be parsed, examining each byte to determine whether it is the lead byte of a double-byte character. And if you have a pointer into the middle of a DBCS string, what is the address of the preceding character? The customary solution is to parse the string starting from its beginning!
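The parse-from-the-start rule can be sketched in C. The lead-byte ranges below are those of Shift-JIS (code page 932) and serve only as an example; other DBCS code pages use different ranges:

```c
#include <stddef.h>

/* Shift-JIS style lead-byte test -- an example, not universal. */
static int is_dbcs_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a DBCS string by walking from the
   beginning, since only a forward scan can tell lead bytes apart. */
static size_t dbcs_strlen(const unsigned char *s)
{
    size_t count = 0;
    while (*s) {
        s += is_dbcs_lead(*s) ? 2 : 1;   /* a lead byte consumes its trail byte */
        count++;
    }
    return count;
}
```

Notice that there is no way to run this loop backwards, which is exactly the pointer-arithmetic problem the text describes.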

The Unicode Solution

The basic problem we face is that the world's written languages simply cannot be represented by 256 8-bit codes. The earlier solutions, code pages and DBCS, have proven unsatisfactory and clumsy. What, then, is the real solution?

As programmers, we have experienced this kind of problem before. If there are too many things to represent with an 8-bit value, we try a wider value, say a 16-bit value. And that, amusingly enough, is exactly the idea behind Unicode. Rather than the confusion of 256 different character mappings, or double-byte character sets mixing 1-byte and 2-byte codes, Unicode is a uniform 16-bit system, allowing 65,536 characters to be represented. That is sufficient for all the characters of the world's languages, including those that use ideographs, along with mathematical, technical, and currency symbols.

It is important to understand the difference between Unicode and DBCS. Unicode uses (particularly in the context of the C programming language) a "wide character set": each character in Unicode is 16 bits wide rather than 8 bits wide. In Unicode, an 8-bit value by itself has no meaning. In a double-byte character set, by contrast, we are still dealing with 8-bit values: some bytes define characters by themselves, while others indicate that a character is defined together with the following byte.

Working with DBCS strings is messy, but working with Unicode text is like working with ordinary text. You will be pleased to learn that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are the ASCII characters, and the next 128 (codes 0x0080 through 0x00FF) are the ISO 8859-1 extensions to ASCII. Characters in other parts of Unicode are likewise based on existing standards, for ease of conversion. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic uses 0x0400 through 0x04FF, Armenian uses 0x0530 through 0x058F, and Hebrew uses 0x0590 through 0x05FF. The Chinese, Japanese, and Korean ideographs (collectively called CJK) occupy codes 0x3000 through 0x9FFF.
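Because every block is a fixed range, classifying a character is a simple comparison chain; a sketch over just the ranges listed above:

```c
/* Map a BMP code point to one of the script blocks named in the text.
   Only the ranges mentioned above are covered; everything else falls
   through to "other". */
static const char *unicode_block(unsigned int cp)
{
    if (cp <= 0x007F)                  return "ASCII";
    if (cp <= 0x00FF)                  return "Latin-1 (ISO 8859-1)";
    if (cp >= 0x0370 && cp <= 0x03FF)  return "Greek";
    if (cp >= 0x0400 && cp <= 0x04FF)  return "Cyrillic";
    if (cp >= 0x0530 && cp <= 0x058F)  return "Armenian";
    if (cp >= 0x0590 && cp <= 0x05FF)  return "Hebrew";
    if (cp >= 0x3000 && cp <= 0x9FFF)  return "CJK";
    return "other";
}
```

Contrast this with code pages, where the same numeric value means a different character depending on which page is active.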

The biggest benefit of Unicode is that there is only one character set, with no ambiguity. Unicode is the result of cooperation among virtually every important company in the personal computer industry, and its codes correspond one-for-one with those of the ISO 10646-1 standard. The essential reference for Unicode is The Unicode Standard, Version 2.0 (Addison-Wesley, 1996). It is an extraordinary book, showing the richness and diversity of the world's written languages in a way few other documents do. In addition, it provides the rationale and details behind the development of Unicode.

Does Unicode have drawbacks? Of course. A Unicode string occupies twice as much memory as an ASCII string. (Compression, however, can greatly reduce the disk space a file occupies.) But perhaps the worst drawback is that people are not yet accustomed to working with Unicode. As programmers, that is our job.

Wide characters and C

To a C programmer, the idea of a 16-bit character is certainly unsettling: that a char is the same width as a byte is one of the few certainties. Few programmers are aware that ANSI/ISO 9899-1990, the "American National Standard for Programming Languages - C" (also known as "ANSI C"), supports character sets whose characters require more than one byte, through a concept called the "wide character". These wide characters coexist peacefully with ordinary characters.

ANSI C also supports multibyte character sets, such as those supported by the Chinese, Japanese, and Korean versions of Windows. A multibyte character set is treated as a string of single bytes, but some of those bytes alter the meaning of subsequent bytes. Multibyte character sets mostly affect the C run-time library functions. By contrast, wide characters are fatter than normal characters and can raise some compile-time issues.

A wide character is not necessarily Unicode; Unicode is just one possible wide character set. However, because the focus of this book is Windows rather than the theory of C implementations, I will treat wide characters and Unicode as synonymous.

The char data type

Presumably we are all quite familiar with using the char data type in C programs to define and store characters and strings. But to make it easier to understand how C handles wide characters, let us review the standard character definitions as they might appear in a Win32 program.

The following statement defines and initializes a variable that contains only one character:

char c = 'A';

The variable c requires 1 byte of storage and will be initialized with the hexadecimal value 0x41, the ASCII code for the letter A.

You can define a pointer to a string like this:

char * p;

Because Windows is a 32-bit operating system, the pointer variable p requires 4 bytes of storage. You can also initialize the pointer so that it points to a string:

char * p = "hello!";

As before, the variable p requires 4 bytes of storage. The string is stored in static memory and occupies 7 bytes: 6 bytes for the string itself and 1 byte for the terminating zero.

You can also define character arrays like this:

char a[10];

In this case, the compiler reserves 10 bytes of storage for the array; the expression sizeof(a) returns 10. If the array is a global variable (that is, defined outside all functions), you can initialize it with a statement like the following:

char a[] = "hello!";

If you define the array as a local variable within a function, it must be defined as a static variable, as follows:

static char a[] = "hello!";

In either case, the string is stored in static program memory with a zero appended at the end, requiring 7 bytes of storage.
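These sizes can be checked directly with sizeof and strlen; a minimal sketch:

```c
#include <string.h>

static char a[] = "hello!";   /* 6 characters plus a terminating zero */

/* sizeof counts the terminating zero; strlen does not. */
static size_t storage_bytes(void) { return sizeof(a); }  /* 7 */
static size_t visible_chars(void) { return strlen(a); }  /* 6 */
```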

Wide character

Unicode and wide characters do not change the meaning of the char data type in C. A char continues to denote 1 byte of storage, and sizeof(char) continues to return 1. In theory, a byte in C can be longer than 8 bits, but for most of us a byte (and hence a char) is 8 bits wide.

The wide character in C is based on the wchar_t data type, which is defined in several header files, including WCHAR.H, like this:

typedef unsigned short wchar_t;

Thus, the wchar_t data type is the same as an unsigned short integer: 16 bits wide.

To define a variable that contains a wide character, use the following statement:

wchar_t c = 'A';

The variable c holds the two-byte value 0x0041, the Unicode representation of the letter A. (Because Intel microprocessors store multibyte values with the least-significant byte first, the bytes are actually stored in memory in the order 0x41, 0x00. Keep this in mind if you examine a memory dump of Unicode text.)
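You can inspect this byte order directly by viewing the 16-bit value through a byte pointer. The order observed depends on the machine; on a little-endian processor such as Intel's, the low byte 0x41 comes first:

```c
/* Copy the in-memory bytes of the 16-bit character 'A' (0x0041)
   into out[], in the order the processor actually stores them. */
static void wide_char_bytes(unsigned char out[2])
{
    unsigned short c = 0x0041;                       /* Unicode 'A' */
    const unsigned char *p = (const unsigned char *)&c;
    out[0] = p[0];   /* first byte in memory */
    out[1] = p[1];   /* second byte in memory */
}
```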

You can also define pointers that point to wide strings:

wchar_t * p = L"hello!";

Note the capital L (for "long") immediately preceding the opening quotation mark. It tells the compiler that the string is composed of wide characters, each occupying 2 bytes. As usual, the pointer variable p requires 4 bytes; the string itself requires 14 bytes: 2 bytes per character, plus 2 bytes for the terminating zero.

Similarly, you can define a wide-character array with the following statement:

static wchar_t a[] = L"hello!";

This string likewise requires 14 bytes of storage, and sizeof(a) returns 14. Indexing the array a retrieves individual characters: the value of a[1] is the wide character 'e', or 0x0065.
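The figure of 14 bytes assumes the Windows definition of wchar_t as 2 bytes; on many other platforms wchar_t is 4 bytes, so a portable check expresses the size in units of sizeof(wchar_t):

```c
#include <wchar.h>

static wchar_t a[] = L"hello!";   /* 6 wide characters plus a terminating zero */

/* On Windows, where wchar_t is 2 bytes, this returns 14.  Portably,
   it is always 7 * sizeof(wchar_t). */
static size_t wide_storage_bytes(void) { return sizeof(a); }
```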

Although it looks like a mere typographic flourish, the L before the opening quotation mark is very important, and there must be no space between the two symbols. Only with the L does the compiler know to store the string with 2 bytes per character. Later, when we see wide strings used in places other than variable definitions, you will also encounter the L before the opening quotation mark. Fortunately, if you forget the L, the C compiler will usually give you a warning or an error message.

You can also use the L prefix in front of single character literals to indicate that they should be interpreted as wide characters, as shown below:

wchar_t c = L'A';

But this is usually unnecessary; the C compiler will widen the character automatically.
