The C language was originally designed in an English-speaking environment, where the dominant character set was 7-bit ASCII. Since then, the 8-bit byte has become the most common unit of character encoding. Internationalized software, however, must be able to represent many different characters, far too many to encode in a single byte each. For this reason, a variety of multibyte character encodings have been in use around the world for decades, for example to represent non-Latin and non-alphabetic writing systems such as Chinese, Japanese, and Korean. In 1994, with the adoption of "Normative Addendum 1," ISO C standardized two methods of representing larger character sets: wide characters, in which every character in the set has the same bit width, and multibyte characters, in which a character may occupy one or more bytes and the character value of a given byte sequence can depend on its context in a string or stream.
Note: Although C now provides abstract mechanisms for processing and converting different kinds of encodings, the language itself does not define or prescribe any encoding or any character set (other than the basic source character set and the basic execution character set described in the previous section). In other words, it is left to each implementation to specify how wide characters are encoded and which multibyte encoding schemes are supported.
Since the 1994 update, C provides not only the char type but also the wchar_t type (wide character), which is defined in the header file stddef.h. The wchar_t type is wide enough to represent any element of the implementation's extended character sets.
Although the C standard does not require support for Unicode, many implementations use the Unicode transformation formats UTF-16 and UTF-32 (see http://www.unicode.org) for wide characters. The Unicode standard is largely identical with ISO/IEC 10646 and is a superset of many existing character sets, including 7-bit ASCII. On implementations that follow the Unicode standard, the wchar_t type is at least 16 or 32 bits wide, and a value of type wchar_t represents one Unicode character. For example, the following definition initializes the variable wc with the Greek letter alpha:
wchar_t wc = L'\x3b1';
The escape sequence begins with \x, followed by a hexadecimal number, and gives the character the value that the number represents. In this case, the character is the lowercase letter alpha.
In multibyte character sets, characters are encoded with varying widths: a character may occupy one byte or several. Both the source character set and the execution character set may contain multibyte characters. If they do, then each character in the basic character sets still occupies only one byte, and no multibyte character other than the null character may contain a byte in which all bits are 0. Multibyte characters can be used in character constants, string literals, identifiers, comments, and header file names. Many multibyte character sets are designed to support a particular language, such as the Japanese Industrial Standard (JIS) character sets. The multibyte encoding UTF-8 is defined by the Unicode Consortium and can represent every character in the Unicode character set.
A UTF-8 character occupies from one to four bytes. The main difference between multibyte characters and wide characters (that is, wchar_t) is that wide characters all occupy the same number of bytes, whereas multibyte characters vary in size. This makes multibyte strings harder to process than wide strings.
For example, even though the character 'a' can be represented in a single byte, you cannot find it in a multibyte string by a simple byte-by-byte comparison: a matching byte at some position is not necessarily a character in its own right, because it may be part of a different, multibyte character. Multibyte characters are nonetheless well suited to storing text in files (see Chapter 13).
C provides standard functions for converting multibyte characters to wchar_t and for converting wide characters to multibyte characters. For example, if the implementation uses the Unicode encodings UTF-16 and UTF-8, the following call to the wctomb() function obtains the multibyte representation of the character alpha (wctomb stands for "wide character to multibyte"):
wchar_t wc = L'\x3b1';        // Lowercase Greek letter alpha, α
char mbstr[10] = "";          // Buffer for the multibyte result
int nbytes = 0;
nbytes = wctomb( mbstr, wc ); // wctomb() is declared in stdlib.h
After the function call, the array mbstr contains the multibyte character, which in this example is the sequence "\xCE\xB1". The return value of wctomb() is the number of bytes that the multibyte character requires. Here the value assigned to nbytes is 2, because the Greek lowercase alpha occupies two bytes in its multibyte representation.