Character Sets: Coding Practices and Underlying Exploration


Coding considerations:

1. When using string literals to construct string objects, use the L prefix for std::wstring; use the _T() macro for CString and other types whose character width follows the build configuration;

2. When calling a system function, prefer the API whose character width follows the build configuration, e.g. ::GetDirect(), together with configuration-width string types such as LPCTSTR; hard-coding ::GetDirectA() or ::GetDirectW() is not recommended;

3. When a value must be passed between a parameter whose width follows the build configuration and one whose width is fixed, we recommend separating the code with macros:

#ifdef _UNICODE
    m_img = new Gdiplus::Image(szFilename, FALSE);
#else
    m_img = new Gdiplus::Image(A2W(szFilename), FALSE);
#endif

4. When constructing a CString object from a literal, use _T(). Otherwise, although the code compiles (an implicit conversion is performed), CString("...") can display garbled characters on machines with a different system language when the program is built with UNICODE.
I like CString's efficiency, ease of use, and flexibility; its internal reference counting and its character width that follows the build configuration are both well designed.

5. Common find-and-replace regular expressions (old Visual Studio regex syntax, where {} tags a group and \1 references it):

_T\({"[^"]*"}\)
L\1

MessageBox\(L{"[^"]*"}
MessageBox(_T(\1)

MessageBox\({[^,]*}, L{"[^"]*"}
MessageBox(\1, _T(\2)

MessageBox\({[^,]*}, {[^,]*}, L{"[^"]*"}
MessageBox(\1, \2, _T(\3)


The C language was originally designed in an English-speaking environment, where the main character set was 7-bit ASCII. Since then, the 8-bit byte has become the most common character-encoding unit, but international software must be able to represent a wider range of characters. ISO C standardizes two ways of representing large character sets: wide characters (every character in the set occupies the same bit width) and multibyte characters (each character may occupy one or more bytes, and the character value of a given byte sequence depends on the context of the string or stream).
Note: Although C provides an abstraction mechanism for processing and converting different kinds of encodings, the language itself neither defines nor mandates any particular encoding or character set (other than the basic source character set and the basic execution character set). In other words, how wide characters are encoded, and which multibyte encodings are supported, is left to the individual implementation.
Since the 1994 amendment, C provides not only the char type but also the wchar_t type (wide character), defined in the stddef.h header. The wchar_t type is wide enough to represent any element of the implementation's extended character set. Although the C standard does not require the Unicode character set, many implementations use the Unicode transformation formats UTF-16 and UTF-32 for wide characters.
C provides standard functions for converting multibyte characters to wchar_t and wide characters back to multibyte form. For example, if the C implementation uses UTF-16 or UTF-32 for wide characters and UTF-8 for multibyte characters, the wctomb function yields the multibyte representation of a character (the Win32 analogue of wctomb is WideCharToMultiByte()).


string and wstring: both are instantiations of the basic_string class template.
A string can store Chinese text byte-for-byte (the only terminating byte value is '\0'; other bytes of the characters are non-zero), but correct display, character-level manipulation, and so on are not guaranteed!
string is the narrow type and can be thought of as built on char[]; indeed, its definition instantiates the template with _Elem = char. wstring instantiates the template with wchar_t, a wide character type that serves non-ASCII needs such as Unicode-encoded Chinese, Japanese, and Korean. For wchar_t, C++ provides wide counterparts of the char-based facilities, since both are generated from the same templates in the style shown above; hence there are also wcout, wcin, wcerr, and so on. A string can in fact hold Chinese text, but each Chinese character occupies two (or more) chars, so some function operations go wrong. If each Chinese character is instead treated as one wchar_t, it occupies exactly one element of a wstring, and the same holds for other non-English characters. Only then can string-manipulation requests, especially internationalized ones, truly be satisfied.


(From CSDN) Character-set configuration in VS2010: it determines whether _UNICODE or _MBCS is defined. Where appropriate, the linker entry point is also affected.
Any ASCII character can be expressed as a wide character by prefixing it with the letter L. For example, L'\0' is the wide (16-bit) NUL character. Similarly, any ASCII string can be expressed as a wide string by prefixing it with L (e.g. L"Hello").
The C runtime library has two kinds of internal code-page state: the locale code page and the multibyte code page. The current code page can be changed while the program runs (see the documentation for the setlocale and _setmbcp functions). In addition, the runtime library can obtain and use the operating-system code page. On Windows 2000, this is the "system default ANSI" code page, and it remains unchanged while the program runs.


"Internal code" refers to the character encoding used inside the operating system. The internal code of early operating systems was language-dependent, built on 7-bit ASCII. To handle Chinese, GBK was designed as the Windows internal code for Simplified Chinese. Today the Windows kernel supports the Unicode character set and uses code pages to accommodate the various languages. The notion of "internal code" is vague; Microsoft generally describes the encoding specified by the default code page as the internal code, i.e. the default encoding used to interpret characters. For example, suppose a text file opened in Windows Notepad contains the byte stream BA BA D7 D6. How should Windows interpret it: as Unicode, or as GBK?
The answer is that Windows interprets the byte stream in a text file according to the current default code page, which can be set through the Region options in Control Panel. The "ANSI" choice in Notepad's Save As dialog actually means saving with the encoding of the default code page.
Windows's internal code is Unicode, and it can technically support multiple code pages at the same time: as long as a file indicates which encoding it uses and the corresponding code page is installed, Windows displays it correctly. Unicode is itself a character encoding, one designed by international organizations to accommodate all of the world's languages and scripts.

The earliest idea behind Unicode encoding was to store every code point in two bytes, which is the source of a widespread misunderstanding; the very clever concept of UTF-8 was later introduced for storing the Unicode code points of strings.
UTF-8 is the 8-bit transformation format of the UCS (Universal Character Set).
Readers can use Notepad to test whether an encoding is correct. Note that UltraEdit automatically converts UTF-8 text files to UTF-16 on opening, which can cause confusion; this option can be disabled in its settings. A better tool is Hex Workshop.

 

In C, the term "character" has two levels of meaning: the characters used to write the source program, and the characters processed by the program.
For example, in printf("Hello, C!\n"); the string "Hello, C!\n" consists of characters to be processed by the program.
In a sense, an editor/compiler is software that accepts character input and outputs an executable file; the executable it generates, once loaded into memory as a program, usually needs to process characters as well.
The set of characters the editor/compiler handles when writing the C source program is called the source character set; the set of characters the application processes is called the execution character set. The two need not be the same, and they are distinct from the encoding-time character sets and configuration concepts discussed above.


From hlfkyo's column
