I spent some time today studying the differences between ANSI and Unicode, and am writing down my findings for future reference.
ANSI encoding is most commonly encountered in Windows Notepad: when you create a new document, the default encoding for saving is ANSI. ANSI is best thought of as a variable-width encoding: a standard ASCII character is stored in a single byte, while a non-ASCII character (such as a Chinese character) is stored in two bytes. On a Simplified Chinese system, "ANSI" means the system code page, GBK. The Unicode standard has been adopted by many newer technologies in recent years, including Extensible Markup Language (XML), the Java programming language, and recent operating systems.
The following is an experimental study of the differences between the two:
Two tools are needed: first, UltraEdit, which is used to compare the files; second, a website for looking up character codes in network byte order: http://bm.kdd.cc/index.asp
To get to the point, create a Notepad document "new text document. txt" in a blank folder, enter "Arial ABC (Carriage return)" (without quotation marks, and finally enter a carriage return after ABC), save and close the document, copy and paste the file directly after selecting it, In the same folder, "Duplicate new text document. txt" is created, open "new text document. txt" again, select "File" in the menu, "Save As", and in the Save As dialog box, under "Encoding", select Unicode. Save, select Replace.
Then open UltraEdit and choose File, then Compare Files, from the menu (or press the shortcut Alt+F11). Select "new text document.txt" as the first file to compare and "Duplicate new text document.txt" as the second. Set the compare mode to "Binary", choose a two-way comparison, and under "Editor tile" select "Tile Vertically". Click "Compare": the program compares the two text files and displays them in hexadecimal, as shown in the following figure.
The meaning of each byte, as worked out by analysis, is shown in the figures below.
Text documents stored with Unicode encoding:
Text documents stored with ANSI encoding:
When Notepad saves text as Unicode, the first two bytes of the file are FF FE. This is the byte order mark, which identifies the document as Unicode (more precisely, UTF-16 in little-endian byte order). Let's look at the encoded content that follows it.
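As a rough illustration of how the leading FF FE can be used, the following Python sketch checks the first two bytes of some data for a UTF-16 byte order mark. The function name `detect_utf16_bom` and the sample bytes are assumptions made for this example; they are not captured from Notepad.

```python
import codecs


def detect_utf16_bom(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading byte order mark.

    Simplification: a UTF-32 little-endian mark (FF FE 00 00) also starts
    with FF FE and is not distinguished here.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return "unknown"


# Hypothetical sample: "ABC" plus CR LF, stored as Notepad's "Unicode" option would.
sample = codecs.BOM_UTF16_LE + "ABC\r\n".encode("utf-16-le")

print(sample[:2].hex(" "))          # ff fe
print(detect_utf16_bom(sample))     # utf-16-le
print(detect_utf16_bom(b"ABC\r\n")) # unknown (no mark, as in the ANSI file)
```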
A Chinese character is not an ASCII character, so a single byte cannot represent it; at least two bytes are needed, which makes Chinese characters double-byte characters. The following figures show the result of looking up the two characters "宋体" on http://bm.kdd.cc/index.asp: their hexadecimal values in Unicode and in ANSI encoding, respectively.
Unicode-encoded "宋体":
ANSI-encoded "宋体":
In Unicode, the character "宋" has the code 5B 8B. By the usual convention, 5B is the high eight bits and 8B is the low eight bits. In the Unicode-encoded text file shown above, however, the low byte 8B is stored first and the high byte 5B second. This is because Windows handles Unicode characters differently from other systems (such as Mac OS): Windows processes the low eight bits of a Unicode character first and then the high eight bits, while some systems process the high eight bits first and then the low eight bits. This is why "network byte order" has to be specified on the Internet. (Correction: the native byte order depends only on the CPU architecture and is independent of the operating system. I previously attributed the difference to Mac OS versus Windows because Macs used to have PowerPC processors, which are big-endian. Starting with Mac OS X 10.4, which added support for Intel x86 CPUs, Macs based on Intel x86 processors are little-endian. In addition, calling the encoding "Unicode" here is not rigorous; it should be called UTF-16. Hereby corrected.)
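The two byte orders can be reproduced with a short Python sketch. It assumes the character in question is 宋 (U+5B8B) and uses Python's built-in `utf-16-le` and `utf-16-be` codecs; the variable names are only illustrative.

```python
import sys

ch = "\u5b8b"  # the character 宋, code point U+5B8B

little = ch.encode("utf-16-le")  # low byte first, as in the file saved by Notepad
big = ch.encode("utf-16-be")     # high byte first, i.e. network byte order

print(little.hex(" "))  # 8b 5b
print(big.hex(" "))     # 5b 8b

# The native order is a property of the CPU, not of the operating system.
print(sys.byteorder)
```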
ANSI encoding has no such issue: the ANSI code for "宋" is CB CE, and the bytes are stored in exactly that order, the high eight bits first and the low eight bits second.
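A quick way to check the ANSI bytes, assuming that "ANSI" on the test machine means the Simplified Chinese code page (Python's `gbk` codec) and that the character is 宋:

```python
ch = "\u5b8b"  # the character 宋

# Assumption: the system ANSI code page is GBK (code page 936).
ansi = ch.encode("gbk")

print(ansi.hex(" "))  # cb ce

# A byte-oriented encoding has a single fixed order: lead byte, then trail byte.
lead, trail = ansi
print(hex(lead), hex(trail))  # 0xcb 0xce
```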
The above covers how Chinese characters behave under Unicode and ANSI encoding; next, let's look at how ASCII characters behave under the two encodings:
In Unicode, every character is stored in two bytes. (Correction, 2011-06-22: not every character is stored in two bytes in UTF-16. If only two bytes were available per character, the code space would be 65,536, which is not enough even for all Chinese characters. My earlier understanding was off: UTF-16 uses two bytes as its basic code unit, and a character that falls outside the range two bytes can represent takes a further two bytes.) An ASCII character needs only one byte, so the other byte is set to 00. The drawback of Unicode is that if a document is entirely in English, storing it in Unicode roughly doubles the storage space (plus the two-byte FF FE mark at the start of the file). The advantage is that Unicode suits documents that mix text in different languages, which is why it is widely used in XML and in multilingual software.
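The storage cost and the four-byte case can be seen in a small Python sketch. The character U+20000 is chosen only as an example of a code point beyond 65,536; it does not come from the experiment itself.

```python
english = "ABC"

ascii_bytes = english.encode("ascii")
utf16_bytes = english.encode("utf-16-le")  # without the 2-byte FF FE mark

print(len(ascii_bytes))      # 3
print(len(utf16_bytes))      # 6
print(utf16_bytes.hex(" "))  # 41 00 42 00 43 00

# A character outside the first 65,536 code points needs two 16-bit units
# (a surrogate pair), i.e. four bytes.
rare = "\U00020000"  # example only: a CJK Extension B ideograph
pair = rare.encode("utf-16-be")

print(len(pair))      # 4
print(pair.hex(" "))  # d8 40 dc 00
```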
In the second figure of this article, you can see that the uppercase letter A has the Unicode code 00 41 (stored in the file as 41 00 because, as explained earlier, the low eight bits come first and the high eight bits second). Because each character stored this way occupies two bytes, decoding reads two bytes at a time. If the high eight bits are not 00, the two bytes represent a non-ASCII character. Conversely, if the high eight bits are 00, the character is an ASCII character: the low eight bits are taken and the corresponding character is looked up in the ASCII table. Because eight bits are taken, the character space is 2 to the 8th power, or 256, so the ASCII characters represented this way belong to the extended ASCII character set.
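The two-bytes-at-a-time check described above might be sketched as follows. This is a simplified sketch under stated assumptions: the function name and the labels are hypothetical, the input is UTF-16 little-endian without the leading FF FE, and surrogate pairs (characters that need four bytes) are not handled.

```python
def classify_units(data: bytes) -> list:
    """Split little-endian UTF-16 data into 16-bit units and label each one."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    result = []
    for i in range(0, len(data), 2):
        low, high = data[i], data[i + 1]   # the low byte is stored first
        unit = (high << 8) | low
        if high != 0:
            label = "non-ASCII"
        elif low < 0x80:
            label = "ASCII"
        else:
            label = "extended (high byte 00, low byte 80-FF)"
        result.append((unit, label))
    return result


# Hypothetical sample: 宋 (8B 5B) followed by A (41 00).
for unit, label in classify_units(b"\x8b\x5b\x41\x00"):
    print(f"{unit:04X} {label}")
```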
As the ANSI interpretation in the second figure shows, the uppercase letter A takes only one byte, with the value 41. Hexadecimal 41 as an eight-bit binary number is 01000001, whose highest bit is 0. ANSI encoding stores ASCII characters using the traditional ASCII character set, which has 128 characters, exactly 2 to the 7th power, so the highest bit must be 0. The ANSI code for the Chinese character "宋" is CB CE; converting the two bytes to binary gives [11001011][11001110], where the highest bit of each byte is 1. From this we can infer how decoding works: read one byte at a time and check whether its highest bit is 1. If it is, hold that byte and read the next one, whose highest bit should also be 1, then combine the two bytes and look up the corresponding character. If the highest bit of the byte read is 0, look the byte up directly in the traditional ASCII table.
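The inferred decoding procedure can be sketched in Python as follows. The function name `split_ansi` is hypothetical, and the sketch assumes the ANSI code page is GBK. It tests only the highest bit of the first byte, because in full GBK the second byte of a pair may have its highest bit clear, even though both bits are set for the characters examined here.

```python
def split_ansi(data: bytes) -> list:
    """Split GBK-encoded bytes into characters by testing the lead byte's high bit."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:         # highest bit is 1: first half of a two-byte character
            chunk = data[i:i + 2]  # take this byte and the next one together
            i += 2
        else:                      # highest bit is 0: a plain ASCII byte
            chunk = data[i:i + 1]
            i += 1
        chars.append(chunk.decode("gbk"))
    return chars


# Hypothetical sample: 宋 (CB CE) followed by ABC.
print(split_ansi(b"\xcb\xceABC"))
```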
Finally, let's look at how Windows stores a line break. When preparing the text document for this experiment, I pressed Enter once after ABC. The analysis shows that the file stores not just a "carriage return" but also a "line feed", with the carriage return stored first and the line feed after it (see the ASCII table: 0D is carriage return, 0A is line feed). This differs from Linux/Unix, where a single 0A (line feed) is enough to end a line. If a text document written on Linux/Unix is copied directly to Windows (the simplest example is to view the source of Baidu's home page under Windows), Notepad shows the text run together with almost no line breaks, because the document does not store 0D (carriage return), even though the same file looks normal on Linux/Unix.
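For reference, Windows ends a line with the two bytes 0D 0A, while Linux/Unix uses the single byte 0A. The Python sketch below shows both forms for a hypothetical line "ABC", along with one possible way to convert Unix line endings to the Windows form; the helper name `to_crlf` is an assumption made for this example.

```python
windows_line = b"ABC\r\n"  # carriage return (0D) followed by line feed (0A)
unix_line = b"ABC\n"       # line feed (0A) only

print(windows_line.hex(" "))  # 41 42 43 0d 0a
print(unix_line.hex(" "))     # 41 42 43 0a


def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CR LF without doubling existing CR LF pairs."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


print(to_crlf(unix_line) == windows_line)  # True
```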