Comparison of Unicode character sets with multibyte character sets

Source: Internet
Author: User
Tags character set comparison linux

Today I needed to read directories and files on Windows. Fortunately I had done this work before (see "Methods for traversing directories under Linux and Windows and how to make the operation consistent", which wraps the directory- and file-reading functions for both Windows and Linux), so naturally I reused that code directly. I did not expect the following error when compiling in VS2012:

error C2664: 'FindFirstFileW' : cannot convert parameter 1 from 'char [a]' to 'LPCWSTR'

Locating the source of the error and hovering the pointer over it shows: argument of type "char *" is incompatible with parameter of type "LPCWSTR"

That is, the types char * and LPCWSTR are incompatible. I immediately looked up the cause, and it turned out to be a character set problem: the default character set of my VS project is the Unicode character set. With that setting the UNICODE macro is defined, so FindFirstFile expands to FindFirstFileW, which expects a wide string (LPCWSTR, that is, const wchar_t *) rather than char *. Changing the setting to the multi-byte character set fixes it. In VS the steps are as follows:

Right-click the project and select Properties > Configuration Properties > General. On the right there is a Character Set entry; change it from "Use Unicode Character Set" to "Use Multi-Byte Character Set".


The problem was solved, but its nature was still not very clear to me, so I looked up the differences and relationships between the Unicode character set and multi-byte character sets and found a clear explanation. The reprinted part (shown in blue in the original post) follows:

Characters are not usually stored as images in a computer. Each character is represented by a code, and which code is used for each character depends on which character set (charset) is in use.

At first there was only one character set in use: ANSI's ASCII character set. It uses 7 bits to represent one character, for a total of 128 characters, including English letters, digits, punctuation marks and so on. It was later extended to use 8 bits per character, for 256 characters, adding special symbols such as table-drawing (box-drawing) characters on top of the original 7-bit character set.

Later, as more countries joined in, ASCII could no longer meet the needs of information exchange. To represent their own languages, countries built their own character sets on top of ASCII. These character sets derived from the ANSI standard are commonly called ANSI character sets; their formal name is MBCS (Multi-Byte Character Set). What these derived character sets have in common is that they are compatible with the 128 ASCII codes (0 to 127), and they use byte values of 128 and above as a lead byte. The second (or even third) byte following the lead byte is combined with it to form the actual code. There are many such character sets, and the familiar GB-2312 is one of them.

For example, in the GB-2312 character set the encoding of "连通" ("connected") is C1 AC CD A8, where C1 and CD are lead bytes. The first 128 codes (0 to 127) are reserved for standard ASCII; for example, the code for "0" is 30H (hexadecimal 30). When software reads the text and sees 30H, it knows the value is less than 128, so it is standard ASCII and means "0". When it sees C1, which is greater than 128, it knows another byte follows, so C1 AC together form one complete code, which in the GB-2312 character set represents "连".
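
The lead-byte rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper name split_mbcs is hypothetical, and every byte of 128 or above is treated as the lead byte of a two-byte code, which is enough for the GB-2312 example (bytes C1 AC CD A8) but not for every MBCS.

```python
def split_mbcs(data):
    """Split a GB-2312 byte string into one byte group per character.

    Hypothetical helper for illustration: any byte >= 0x80 is assumed to be
    the lead byte of a two-byte code; bytes below 0x80 are plain ASCII.
    """
    groups = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Less than 128: standard ASCII, one byte per character
            groups.append(data[i:i + 1])
            i += 1
        else:
            # 128 or above: lead byte, so the next byte belongs to it
            groups.append(data[i:i + 2])
            i += 2
    return groups


sample = b"0\xc1\xac\xcd\xa8"   # "0" followed by the two-character example
groups = split_mbcs(sample)
decoded = [g.decode("gb2312") for g in groups]

print([g.hex() for g in groups])
print(decoded)
```

The byte 30H is kept on its own, while C1 and CD each pull in the byte that follows them, which is exactly how the reader described above walks through the text.
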
Because each language had its own character set, far too many character sets ended up existing, and constantly converting between them in international communication was very inconvenient. The Unicode character set was therefore proposed. It uses a fixed 16 bits (two bytes, one word) to represent one character, for a total of 65,536 characters. It includes the commonly used characters of almost all the world's languages, which makes information exchange easier. The standard form of Unicode is called UTF-16. Later, so that double-byte Unicode could be transmitted correctly over existing single-byte systems, UTF-8 appeared, which encodes Unicode in an MBCS-like manner. Note that UTF-8 is an encoding; it belongs to the Unicode character set. The Unicode character set has several encodings, whereas ASCII has only one, and most MBCS (including GB-2312) also have only one.

The initial goal of Unicode was to provide a mapping for more than 65,000 characters with a single 16-bit encoding. But that was not enough: it could not cover all historical scripts, and it did not solve the transmission problem (implementation headaches), especially in web-based applications. Existing software had to do a lot of work to handle 16-bit data. Unicode therefore used some reserved characters to define three encodings: UTF-8, UTF-16 and UTF-32. As the names suggest, in UTF-8 characters are encoded as sequences of 8-bit units, with one or several bytes representing one character. The greatest benefit of this approach is that UTF-8 keeps the ASCII encoding as a subset; for example, "A" is encoded as 0x41 in both UTF-8 and ASCII. UTF-16 and UTF-32 are the 16-bit and 32-bit encodings of Unicode respectively. For historical reasons, "Unicode" usually refers to UTF-16.
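
As a small illustration of the three encodings, the following Python sketch prints the bytes of a few characters using the standard codecs. The helper name encodings and the choice of sample characters are assumptions made for this example; the little-endian variants are used so that no byte order mark is added.

```python
def encodings(ch):
    """Return the hex bytes of one character in three Unicode encodings."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-le": ch.encode("utf-16-le").hex(),
        "utf-32-le": ch.encode("utf-32-le").hex(),
    }


# "A" is a single byte 0x41 in both ASCII and UTF-8
ascii_matches_utf8 = "A".encode("ascii") == "A".encode("utf-8")

for ch in ["A", "\u8fde", "\U00020000"]:
    # U+8FDE fits in 16 bits; U+20000 lies outside the first 65,536 code points
    print("U+%04X" % ord(ch), encodings(ch))
```

The last character needs two 16-bit units in UTF-16 (a surrogate pair), which shows why a single fixed 16-bit code was not enough for every script.
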
For example, the standard Unicode encoding (UTF-16) of the two characters "连通" is DE 8F 1A 90, and their UTF-8 encoding is E8 BF 9E E9 80 9A.

Finally, when software opens a text, the first thing it has to do is decide which character set and which encoding the text was saved in. Software has three ways to determine the character set and encoding of a text.

The most standard approach is to check the first few bytes of the text, as in the following table:

Opening bytes    Charset/encoding
EF BB BF         UTF-8
FE FF            UTF-16/UCS-2, big endian
FF FE            UTF-16/UCS-2, little endian
FF FE 00 00      UTF-32/UCS-4, little endian
00 00 FE FF      UTF-32/UCS-4, big endian

For example, with the mark inserted, the UTF-16 (little endian) and UTF-8 encodings of "连通" are, respectively:

FF FE DE 8F 1A 90
EF BB BF E8 BF 9E E9 80 9A

But MBCS text has no such character set mark at the beginning. Worse, some early software and some poorly designed software does not insert the mark at the beginning when saving Unicode text. Software therefore cannot rely on this approach alone.

A safer way to determine the character set and its encoding is to pop up a dialog box and ask the user. For example, if you drag the "连通" file into MS Word, Word pops up a dialog box.

If the software does not want to bother the user, or it is not convenient to ask, it can only "guess": it looks at the characteristics of the whole text and guesses which charset it probably belongs to, and the guess may well be wrong. This is what happens when Notepad opens the "连通" file.

We can prove this. After typing "连通" in Notepad, choose "Save As"; the last drop-down box shows "ANSI". Save the file. When you open the "连通" file again, it appears garbled. Click File > Save As once more and the last drop-down box now shows "UTF-8", which means Notepad thinks the text that is currently open is UTF-8 encoded.
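
The first and third approaches (checking the opening bytes, then guessing from the content) can be sketched in Python as follows. The function names detect_bom, looks_like_utf8 and guess_encoding are hypothetical, and the guessing rule is a deliberately loose bit-pattern check written for this illustration; it is not Notepad's actual algorithm.

```python
import codecs

# UTF-32 LE is listed before UTF-16 LE because FF FE 00 00 also starts with FF FE
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return the encoding named by the opening bytes, or None if there is no mark."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def looks_like_utf8(data):
    """Loose guess: only checks lead-byte and continuation-byte bit patterns."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            follow = 0            # 0xxxxxxx: single byte
        elif b >> 5 == 0b110:
            follow = 1            # 110xxxxx: one continuation byte expected
        elif b >> 4 == 0b1110:
            follow = 2            # 1110xxxx: two continuation bytes expected
        elif b >> 3 == 0b11110:
            follow = 3            # 11110xxx: three continuation bytes expected
        else:
            return False
        tail = data[i + 1:i + 1 + follow]
        if len(tail) != follow or any(t >> 6 != 0b10 for t in tail):
            return False          # continuation bytes must look like 10xxxxxx
        i += 1 + follow
    return True


def guess_encoding(data):
    """Mark first, then the loose guess, else fall back to the ANSI code page."""
    return detect_bom(data) or ("utf-8" if looks_like_utf8(data) else "ansi")


ansi_bytes = b"\xc1\xac\xcd\xa8"          # the example saved as ANSI (GB-2312), no mark
print(detect_bom(b"\xff\xfe\xde\x8f\x1a\x90"))
print(guess_encoding(ansi_bytes))

try:
    ansi_bytes.decode("utf-8")            # a strict decoder rejects the lead byte C1
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

With this loose rule the four ANSI bytes are guessed to be UTF-8, because C1 AC and CD A8 both fit the 110xxxxx 10xxxxxx pattern. A strict UTF-8 decoder refuses them, since C1 can never start a valid sequence.
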
Yet the file had just been saved with the ANSI character set. This shows that Notepad guessed the character set of the "连通" file and decided it looked more like UTF-8 encoded text. The reason is that the GB-2312 codes of the two characters "连通" happen to look like UTF-8 encoding. This is a coincidence and does not happen with all text. You can display the "连通" file correctly by using Notepad's Open dialog and selecting ANSI in the last drop-down box. Conversely, if you had saved the file as UTF-8 in the first place, opening it directly would cause no problem.

If you drop the "连通" file into MS Word, Word also thinks it is a UTF-8 encoded file, but it is not sure, so it pops up a dialog box to ask the user. If you then choose "Simplified Chinese (GB2312)", the file opens normally. Notepad handles this more simply, in keeping with the program's positioning. Note that some Windows 2000 fonts cannot display all Unicode characters. If you find that some characters are missing from a file, simply switch to a different font.

Big endian and little endian

Big endian and little endian are the different ways in which CPUs handle multi-byte numbers. For example, the Unicode code of the character "汉" ("Han") is 6C49. So when writing it to a file, do you write 6C first or 49 first? If 6C is written first, it is big endian. If 49 is written first, it is little endian.

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput arose over whether an egg should be cracked from the big end (Big-Endian) or from the little end (Little-Endian). Six rebellions broke out over this; one emperor lost his life and another lost his throne.

In Chinese, "endian" is generally translated as "byte order", and big endian and little endian are called "big tail" and "little tail".
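
The two byte orders can be shown with the code 6C49 from the text. This is a minimal Python sketch using only the standard struct and sys modules; the variable names are chosen for this example.

```python
import struct
import sys

code = 0x6C49                        # the code of the example character

big = struct.pack(">H", code)        # big endian: 6C is written first
little = struct.pack("<H", code)     # little endian: 49 is written first

print(big.hex(), little.hex())

# The UTF-16 codecs with an explicit byte order produce the same bytes
same_be = "\u6c49".encode("utf-16-be") == big
same_le = "\u6c49".encode("utf-16-le") == little
print(same_be, same_le)

# Byte order of the machine running this script ("little" on Intel processors)
print(sys.byteorder)
```
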
Unicode big endian: in a Unicode file created on a big-endian processor (such as an Apple Macintosh computer), the byte order of the text is the reverse of that in files created on an Intel processor. The most significant byte has the lowest address, and the larger end of each character is stored first. To let users of such computers access your files, you can choose the Unicode big endian format.
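
What choosing a big-endian format means at the byte level can be sketched in Python. The example assumes the text is written with the mark FE FF in front, and it builds the bytes in memory instead of saving a real file; the variable names are illustrative only.

```python
import codecs

text = "\u6c49"                                   # the example character, code 6C49

# Big-endian bytes preceded by the big-endian mark FE FF
payload = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(payload.hex())

# The generic "utf-16" decoder reads the mark and picks the byte order from it
round_trip = payload.decode("utf-16")

# Without the mark, a reader that assumes little endian gets a different character
wrong = text.encode("utf-16-be").decode("utf-16-le")

print(round_trip == text, wrong == text)
```

The mark is what lets a reader on a machine with the other byte order still recover the original character.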

It should be said that the Unicode character set is the more general choice. Reading the directory here needs the multi-byte character set only because my code passes char strings to functions from Windows.h; after all, Microsoft's things are not very compatible. So if you can, it is better to use the Unicode character set!
