Character encoding

Source: Internet
Author: User


Every programmer eventually runs into character encoding problems, and web developers most of all: "garbled text" has always been a headache. Perhaps you have rarely seen garbled text yourself, but do you understand the principles behind the fixes? As a programmer I have hit plenty of encoding problems, stayed half-confused about the various encodings for a long time, and run into some very annoying encoding bugs at work. Over the past two days I collected a lot of material on the internet, and character encoding is now much clearer to me. Below I record the points I consider most important, partly so I can review them later and partly as a reference for anyone as confused as I was. If anything is wrong or inappropriate, please point it out.

Before going further, let's learn three useful concepts: "character set", "character encoding", and "internal code" (inner code).

1. Character set and character encoding

Characters are all kinds of letters and symbols, including the scripts of various countries, punctuation, graphic symbols, digits, and so on. A character set is a collection of characters. There are many character sets, each containing a different number of characters. Common ones are the ASCII, ISO 8859, GB2312, BIG5, GB18030, and Unicode character sets.

To process text from these character sets accurately, a computer needs a character encoding so that it can recognize and store the text. An encoding is not the same thing as a character set. A character set is only a collection of characters; it is not necessarily suitable for network transmission or processing, and sometimes it must be encoded before it can be used. Unicode, for example, can be encoded as UTF-8, UTF-16, UTF-32, and so on, depending on the need. A character encoding maps the characters of a character set to binary numbers, which makes it the technical basis of information exchange.

Put another way, a standard first decides which characters to use, that is, which Chinese characters, letters, and symbols are included in the standard. The collection of included "characters" is called the "character set". The rule that specifies whether each "character" is stored in one byte or several bytes, and which byte values are used, is called the "encoding". When countries and regions develop coding standards, the "set of characters" and the "encoding" are generally defined together. So what we usually call a "character set", such as GB2312, GBK, or JIS, means not only the "set of characters" but also the "encoding".
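To make the distinction concrete, here is a minimal Python sketch (Python is used only for illustration; the codec names are the ones shipped with Python's standard library). It takes one character, looks up its number in the Unicode character set, and then shows the different byte sequences that several encodings produce for that same character.

```python
# One character has one number in the Unicode character set,
# but each encoding turns it into a different byte sequence.
ch = "中"
code_point = ord(ch)  # position of the character in the Unicode character set

encodings = ["utf-8", "utf-16-be", "utf-32-be", "gb2312", "big5"]
encoded = {name: ch.encode(name) for name in encodings}

print(f"U+{code_point:04X}")
for name, data in encoded.items():
    # bytes.hex(sep) needs Python 3.8 or later
    print(f"{name:10} {data.hex(' ')}")
```

The character set answers "which number is this character?", while the encoding answers "which bytes store that number?".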
Note: the Unicode character set has several encodings, such as UTF-8 and UTF-16; ASCII has only one, and most MBCS character sets (including GB2312) have only one.

2, What is the internal code?

2.1 Explanation from Wikipedia

In computer science and related fields, the internal code is "the encoding that information takes after it has been stored in a particular storage device in a certain way". Different systems use different internal codes. In early English systems the internal code was ASCII. In Traditional Chinese systems, the commonly used internal code is Big5. In Simplified Chinese systems the internal code is the GB code (national standard code: the GB18030 standard is now mandatory; older computers still use GB2312). Unicode is another common internal code.

2.2 Explanation from Baidu Encyclopedia

The internal code is the binary character code used throughout a system. It is the exchange code between input, output, and the system platform, and it is what makes general, efficient transmission possible. For example, MS Word stores and retrieves internal codes rather than glyph images. English ASCII characters use a one-byte internal code. Chinese national standard character sets such as GB2312, GB12345, and GB13000 use a double-byte internal code. In GB18030 (27,533 Chinese characters), 20,902 Chinese characters use a double-byte internal code and the remaining 6,631 use a four-byte internal code.

3. Character encoding classification summary

The encodings are summarized below from the perspective of how computers came to support multiple languages.

3.1 ASCII encoding

The following is from Wikipedia: ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet.
It is mainly used to display modern English, and its extended version, EASCII, can just about display other Western European languages. It is the most widely used single-byte encoding system today (although Unicode shows signs of overtaking it) and is equivalent to ISO/IEC 646. ASCII was first published as a standard in 1963, was substantially revised in 1967, and was last updated in 1986. It defines 128 characters in total. Of these, 33 cannot be displayed (on current operating systems, that is; in DOS mode some of them show up as 8-bit symbols such as smiley faces and playing-card suits), and most of those 33 are obsolete control characters. The main purpose of control characters is to manipulate text that has already been processed. The other 95 characters are displayable, including the space produced by pressing the space bar, which counts as one displayable character.

ASCII table: see Http://

Disadvantage of ASCII: its biggest drawback is that it can only display the 26 basic Latin letters, Arabic numerals, and English punctuation, so it is only suitable for modern American English (and loanwords such as naïve, café, and élite have to lose their accents, even though that violates the spelling rules). EASCII solves the display problem for some Western European languages, but it can do nothing for most other languages. That is why Apple computers have now abandoned ASCII and switched to Unicode.

The earliest internal code, used by the English DOS operating system, was ASCII. At that stage the computer supported only English, and other languages could not be stored or displayed. Strings were single-byte strings that used one byte per character (SBCS, Single-Byte Character System). For example, "Bob123" occupies 6 bytes.
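The figures in this section can be checked with a short Python sketch. The helper name to_ascii_or_none is made up for this illustration; the split into control and printable codes follows the usual convention that codes 0 to 31 and 127 are control characters.

```python
text = "Bob123"
data = text.encode("ascii")  # single-byte string: one byte per character

# Codes 0-31 and 127 are control characters; 32-126 are displayable,
# and 32 is the space character.
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c <= 126]


def to_ascii_or_none(s):
    """Return the ASCII bytes of s, or None if s contains non-ASCII characters."""
    try:
        return s.encode("ascii")
    except UnicodeEncodeError:
        return None


print(len(data), len(control), len(printable))
print(to_ascii_or_none("cafe"), to_ascii_or_none("café"))
```

The accented "café" cannot be stored as ASCII at all, which is the limitation described above.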
3.2 ANSI Encoding

To let computers support more languages, two bytes in the range 0x80~0xFF are typically used to represent one character. For example, the Chinese character '中' is stored in a Chinese operating system as the two bytes [0xD6, 0xD0]. Different countries and regions set different standards, which gave rise to GB2312, BIG5, JIS, and other coding standards. These extended encodings, which use 2 bytes to represent one character, are collectively called ANSI encodings. On a Simplified Chinese system, "ANSI encoding" means GB2312; on a Japanese operating system it means JIS. Different ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two different languages cannot be stored in the same piece of ANSI-encoded text.

Chinese DOS and the Chinese and Japanese editions of Windows 95/98 used ANSI encodings as their internal code (the localization stage). In this multi-language stage, each character is represented by one or more bytes (MBCS, Multi-Byte Character System), so characters stored this way are also called multibyte characters. For example, "中文123" occupies 7 bytes in memory on Chinese Windows 95: 2 bytes for each Chinese character and 1 byte for each English letter or digit.

In a non-Unicode environment, the inconsistent character sets adopted by different countries and regions mean that some characters may well not be displayed properly. Microsoft addresses this with code page (Codepage) conversion tables: a non-Unicode encoding is converted, through the specified table, to the Unicode encoding used internally for the same characters. In the language and region settings you can choose one code page as the default for non-Unicode text, for example 936 for Simplified Chinese GBK or 950 for Traditional Chinese Big5 (this refers to use on a PC).
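A rough Python illustration of the byte counts and the code page idea, assuming that Python's cp936 and cp950 codecs are close enough stand-ins for the Windows code pages with the same numbers. It encodes "中文123" with code page 936 and then decodes the same bytes with code page 950 to show what a mismatched default code page does.

```python
text = "中文123"
data = text.encode("cp936")  # code page 936: Simplified Chinese GBK

# Each Chinese character takes 2 bytes, each ASCII character 1 byte.
print(len(data), data.hex(" "))

# Decoding through the code page table gives back the Unicode string.
right = data.decode("cp936")

# The same bytes read with the Traditional Chinese code page give
# different characters; errors="replace" avoids an exception if a byte
# pair happens to be unassigned in that code page.
wrong = data.decode("cp950", errors="replace")
print(right, wrong)
```

Nothing in the bytes themselves says which code page they belong to, which is why the system-wide default matters.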
With such a setting, software and documents written in some non-English European languages are likely to come out garbled. Switching the code page to the matching language then breaks Chinese text processing instead, and this cannot be avoided. Fundamentally, the solution is to adopt a unified encoding everywhere, but that is not yet possible. Code page technology is now widely used on many platforms. The code page for UTF-7 is 65000, and the code page for UTF-8 is 65001.

3.3 Unicode encoding

To make international information exchange easier, international organizations developed the Unicode character set, which assigns a single, unique number to every character in every language, to meet the needs of cross-language, cross-platform text conversion and processing.

The Unicode character set is also referred to as UCS (Universal Character Set, the name used by ISO/IEC 10646). Early versions of the standard speak of UCS-2 and UCS-4: UCS-2 encodes a character in two bytes, and UCS-4 encodes it in 4 bytes.

Once Unicode is adopted, a string in memory holds the number of each character in the Unicode character set instead of locale-specific bytes. Computers currently tend to use 2 bytes (16 bits) for each number (DBCS, Double Byte Character System), so characters stored this way are also called wide characters. For example, the string "中文123" on Windows 2000 is actually stored in memory as 5 numbers, 10 bytes in total.

The Unicode character set contains all the "characters" used in the various languages. There are many standards for encoding it, such as UTF-8, UTF-7, UTF-16, UnicodeLittle, and UnicodeBig.

4. Common coding rules

4.1 Single-byte character encoding

(1) Coding standard: ISO-8859-1.
(2) Description: the simplest encoding rule; each byte is taken directly as one Unicode character.
For example, when the two bytes [0xD6, 0xD0] are converted to a string through ISO-8859-1, they directly become the two Unicode characters [0x00D6, 0x00D0], that is, "ÖÐ". Conversely, when a Unicode string is converted to a byte string through ISO-8859-1, only characters in the range 0~255 can be converted correctly.

4.2 ANSI Encoding

(1) Coding standards: GB2312, BIG5, Shift_JIS, ISO-8859-2.
(2) Description: when a Unicode string is converted to a "byte string" through an ANSI encoding, one Unicode character may become one byte or several bytes, as the particular encoding specifies. Conversely, when a byte string is converted to a string, several bytes may become one character. For example, converting the two bytes [0xD6, 0xD0] to a string through GB2312 yields the single character [0x4E2D], the character '中'.

Features of "ANSI encodings":
(1) Each of these ANSI encoding standards can only handle the Unicode characters within its own language range.
(2) The relationship between "Unicode characters" and "converted bytes" is defined by convention rather than computed.

4.3 Unicode encoding

(1) Coding standards: UTF-8, UTF-16, UnicodeBig.
(2) Description: as with "ANSI encodings", when a string is converted to a "byte string" through a Unicode encoding, one Unicode character may become one byte or several bytes.

Unlike ANSI encodings:
(1) These "Unicode encodings" can handle all Unicode characters.
(2) The mapping between "Unicode characters" and "converted bytes" can be computed.

We do not really need to study exactly which bytes each encoding produces; it is enough to know that "encoding" means converting "characters" to "bytes". Because the "Unicode encodings" can be computed, though, in special situations it is useful to understand the rules of a particular one.
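The three rules in this section can be tried out with a small Python sketch. The helper can_encode is a hypothetical name used only here; the sample characters are my own choice.

```python
raw = bytes([0xD6, 0xD0])

# Single-byte rule: every byte becomes one Unicode character.
as_latin1 = raw.decode("iso-8859-1")  # two characters, U+00D6 and U+00D0

# ANSI rule: the two bytes together become one character.
as_gb2312 = raw.decode("gb2312")      # one character, U+4E2D


def can_encode(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(as_latin1, as_gb2312)
# ISO-8859-1 only covers 0~255, and GB2312 only covers its own language range,
# while UTF-8 can encode any Unicode character.
for sample in ["中", "한"]:
    print(sample, [can_encode(sample, e) for e in ("iso-8859-1", "gb2312", "utf-8")])
```

The same two bytes give two different strings depending on the rule applied, which is the root of most garbled text.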
5, The differences between the encodings

5.1 GB2312, GBK and GB18030

(1) GB2312

When computers arrived in China, there were no spare byte values left to represent Chinese characters, and more than 6,000 commonly used Chinese characters had to be stored. So the odd symbols after position 127 in the ASCII table were simply dropped, and a rule was made: a byte of 127 or less means the same as before, but two bytes above 127 placed next to each other represent one Chinese character. The first byte (the high byte) runs from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which gives room for about 7,000 Simplified Chinese characters. Mathematical symbols, Roman and Greek letters, and Japanese kana were put into these codes as well. Even the digits, punctuation, and letters that already existed in ASCII were encoded again as two-byte codes; these are the so-called "full-width" characters, while the originals below 128 are called "half-width" characters. This scheme for Chinese characters is called "GB2312". GB2312 is a Chinese extension of ASCII and is compatible with ASCII.

(2) GBK

But there are too many Chinese characters, and it soon turned out that many characters used in people's names could not be typed. The unused code points in GB2312 had to be found and put to use. Later even that was not enough, so the requirement that the low byte also be above 127 was dropped: as long as the first byte is greater than 127, it marks the start of a Chinese character, and the byte that follows belongs to the extended character set whatever its value. The resulting expanded scheme is called the "GBK" standard. GBK contains everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.

(3) GB18030

Later, ethnic minorities also began to use computers, so the standard was extended again with several thousand new characters for minority scripts, and GBK grew into GB18030.
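The lead-byte rule and the way each standard extends the previous one can be sketched in Python. The helpers split_dbcs and fits are hypothetical names for this illustration; split_dbcs only implements the simple "a byte above 127 starts a two-byte character" rule and does not handle the four-byte sequences of GB18030.

```python
def split_dbcs(data):
    """Split GB2312/GBK bytes into one chunk per character using the lead-byte rule."""
    chunks, i = [], 0
    while i < len(data):
        step = 2 if data[i] > 0x7F else 1  # a byte above 127 starts a 2-byte character
        chunks.append(data[i:i + step])
        i += step
    return chunks


def fits(text, encoding):
    """Report whether text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


data = "中文123".encode("gb2312")
chunks = split_dbcs(data)

half_width = "A".encode("gb2312")       # original ASCII letter, 1 byte
full_width = "\uff21".encode("gb2312")  # full-width letter A, 2 bytes

traditional = "\u7de8"  # a traditional character that GB2312 does not contain
print(chunks, len(half_width), len(full_width))
print([fits(traditional, e) for e in ("gb2312", "gbk", "gb18030")])
```

The traditional character is rejected by GB2312 but accepted by GBK and GB18030, matching the history told above.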
From then on, the culture of the Chinese nation could be passed on in the computer age.

Chinese programmers found this series of Chinese character coding standards good, so they are collectively called "DBCS" (Double Byte Character Set). The most distinctive feature of the DBCS standards is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. A program that supports Chinese therefore has to check the value of every byte in a string: if the value is greater than 127, it assumes that a character from the double-byte character set has started. In this environment, "one Chinese character counts as two English characters!" In a Unicode environment, however, this is not always the case.

5.1 Unicode and BigEndianUnicode

These two differ only in storage order. For example, "A" (U+0041) is stored as the bytes 41 00 in the Unicode encoding and as 00 41 in the BigEndianUnicode encoding.

5.2 UTF-7, UTF-8 and UTF-16

In Unicode, all characters are treated equally. A Chinese character is no longer "two extended ASCII characters" but "one Unicode character". Note that a Chinese character is now exactly one character, so problems such as splitting strings between characters and counting characters solve themselves.

The world is not ideal, however. Not every system could switch to Unicode overnight, so from the day it was born Unicode had to face a serious problem: incompatibility with the ASCII character set. An ASCII character is a single byte; for example, the ASCII code of "A" is 0x41 (decimal 65). Unicode is double-byte; for example, the Unicode code of "A" is 0x0041. This creates a very big problem: mechanisms that used to handle ASCII cannot be reused to handle Unicode. An even more serious problem is that the C language uses '\0' as the end-of-string marker, and in Unicode many characters contain a byte that is 0, so the C string functions cannot process Unicode text correctly.
That would only change if every C program in the world, and every library those programs use, were replaced.

So something even greater than Unicode was born. It is greater because it took Unicode off the paper and put it into all of our computers. That something is UTF.

UTF = UCS Transformation Format. It is the rule that maps Unicode code values to the bytes a computer actually stores. Two kinds of UTF are popular today: UTF-8 and UTF-16. Both are implementations of Unicode.

5.2.1 UTF-8

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000-007F                   0xxxxxxx
0080-07FF                   110xxxxx 10xxxxxx
0800-FFFF                   1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of "汉" is 6C49. 6C49 lies between 0800 and FFFF, so it must use the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting these bits for the x's in the template in order gives 11100110 10110001 10001001, that is, E6 B1 89.

As you can see, UTF-8 is a variable-length encoding. Code values 00000000-0000007F are represented by one byte, 00000080-000007FF by two bytes, and 00000800-0000FFFF by three bytes. Within the 16-bit range covered above, UTF-8 therefore needs at most 3 bytes per character. Code values above FFFF (up to 10FFFF) take 4 bytes; the original design of UTF-8 allowed sequences of up to 6 bytes, but the current definition (RFC 3629) stops at 4. UTF-8 is compatible with ASCII.

5.2.2 UTF-16 (what is usually just called "Unicode" encoding is UTF-16)

UTF-16 is consistent with the Unicode code values described above. UTF-16 encodes UCS in 16-bit units. For a UCS code smaller than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to that code. For UCS codes of 0x10000 and above, an algorithm is defined.
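Both the UTF-8 templates and the UTF-16 algorithm for codes of 0x10000 and above can be written out by hand. The following Python sketch does so; the function names utf8_encode and utf16_units are made up for this illustration, the 4-byte UTF-8 case follows RFC 3629 rather than the table above, and neither function validates its input (for example, the reserved surrogate range D800-DFFF is not rejected).

```python
def utf8_encode(cp):
    """Encode one code value with the UTF-8 bit templates."""
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (codes above FFFF, per RFC 3629)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp):
    """Return the list of 16-bit units that UTF-16 uses for one code value."""
    if cp < 0x10000:
        return [cp]  # the unit equals the code value itself
    v = cp - 0x10000  # 20 bits remain
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


print(utf8_encode(0x6C49).hex(" "))
print([hex(u) for u in utf16_units(0x6C49)])
print([hex(u) for u in utf16_units(0x1F600)])
```

Comparing the results with Python's built-in codecs is an easy way to check the hand-written rules.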
However, the codes actually in use in UCS-2, or in the BMP of UCS-4, are always smaller than 0x10000, so for those characters UTF-16 and UCS-2 can be regarded as basically the same. UCS-2 is only a coding scheme, though, while UTF-16 is used for actual transmission, so byte order has to be considered. UTF-16 is not compatible with ASCII.

5.2.3 UTF-7

UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that represents Unicode characters as a string of ASCII characters, for use in applications such as e-mail transmission. UTF-7 is not part of the Unicode standard. For more information, refer to the relevant materials.

6, Unicode and UTF

Unicode is the in-memory representation (the specification), and UTF is the solution for saving and transmitting Unicode (the implementation).

6.1 UTF byte order and BOM

6.1.1 Byte order

UTF-8 uses the byte as its coding unit, so it has no byte order problem. UTF-16 uses two bytes as its coding unit, so before interpreting UTF-16 text you must first know the byte order of each coding unit. For example, the Unicode code of "奎" is 594E and the Unicode code of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?

The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill of Material" BOM but the Byte Order Mark. The BOM is a rather clever idea: UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF, whereas FFFE is not a character in UCS and so should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the recipient then receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian.
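The ambiguity of the byte stream "594E", and the way a leading BOM resolves it, can be shown with a minimal Python sketch. The function decode_utf16_with_bom is a hypothetical helper written for this illustration; it handles only the two UTF-16 marks described above and simply raises an error when no mark is present.

```python
import codecs

raw = bytes([0x59, 0x4E])

# Without a byte order mark the same two bytes have two readings.
as_big_endian = raw.decode("utf-16-be")     # U+594E
as_little_endian = raw.decode("utf-16-le")  # U+4E59


def decode_utf16_with_bom(data):
    """Decode UTF-16 bytes, choosing the byte order from the leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF: big-endian
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE: little-endian
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 byte order mark found")


print(as_big_endian, as_little_endian)
print(decode_utf16_with_bom(b"\xfe\xff" + raw))
print(decode_utf16_with_bom(b"\xff\xfe" + raw))
```

With the mark in front, the receiver no longer has to guess which of the two characters was sent.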
For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream beginning with EF BB BF, it knows that the stream is UTF-8 encoded.

6.1.2 BOM

(1) Origin of the BOM

To identify Unicode files, Microsoft recommends that all Unicode files begin with a ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. It acts as a "signature" or "byte order mark (BOM)" that identifies the encoding and byte order used in the file.

(2) Support for the BOM on different systems

Some systems and programs do not support the BOM, so Unicode files with a BOM can sometimes cause problems.
① The Reader classes in JDK 1.5 and earlier cannot handle UTF-8 files that carry a BOM; parsing an XML file in this format throws the exception "Content is not allowed in prolog." (I will write a separate article to discuss the workaround.)
② Linux/UNIX does not use the BOM, because it would break the syntax conventions of existing ASCII files.
③ Different editing tools treat the BOM differently. When you save a file as UTF-8 with the Notepad that ships with Windows, Notepad automatically inserts a BOM at the beginning of the file (even though UTF-8 does not need one). Many other editors let you choose whether to write a BOM. This applies to both UTF-8 and UTF-16.

(3) BOM and XML

For reading an XML document, the W3C defines three rules:
① If the document has a BOM, the BOM determines the file encoding.
② If the document has no BOM, look at the encoding attribute in the XML declaration.
③ If neither is present, assume the XML document is UTF-8 encoded.

6.2 Determining the character set and encoding of text

Software typically has three ways to determine the character set and encoding of a piece of text.
(1) For Unicode text, the most standard way is to inspect the first few bytes of the text. For example:

Opening bytes    Charset/encoding
EF BB BF         UTF-8
FE FF            UTF-16/UCS-2, big endian (UTF-16BE)
FF FE            UTF-16/UCS-2, little endian (UTF-16LE)
FF FE 00 00      UTF-32/UCS-4, little endian
00 00 FE FF      UTF-32/UCS-4, big endian

However, MBCS (ANSI) text has no such marks at the beginning, and many programs that save text as Unicode let you choose whether to write them, so software should not rely on this method alone.

(2) A safer way to determine the character set and its encoding is to pop up a dialog box and ask the user.

(3) Guess. If the software does not want to bother the user, or asking the user is inconvenient, it can only guess: it examines the characteristics of the whole text and guesses which charset it probably belongs to, and the guess may well be wrong. This is what happens when you open the "联通" ("Unicom") file with Notepad (the file was saved in ANSI encoding but is treated as UTF-8; for a detailed explanation see: )

6.3 The encodings offered by Notepad

(1) ANSI encoding

Notepad's default encoding is ANSI, that is, the default internal code of the local operating system; on Simplified Chinese systems this is generally GB2312. How can this be verified? Save a file with Notepad, then open it with a text editor such as EmEditor, EditPlus, or UltraEdit. EmEditor is recommended: after opening the file, the lower corner shows the encoding as GB2312.

(2) Unicode encoding

Save with Notepad, choosing "Unicode" as the encoding, then open the file with EmEditor: the encoding is shown as UTF-16LE+BOM (with signature). Viewed in hexadecimal, the first two bytes are FF FE. This is the BOM.
(3) Unicode big endian

Save with Notepad, choosing "Unicode big endian" as the encoding, then open the file with EmEditor: the encoding is shown as UTF-16BE+BOM (with signature). Viewed in hexadecimal, the first two bytes are FE FF. This is the BOM.

(4) UTF-8

Finally, save with Notepad, choosing "UTF-8" as the encoding, then open the file with EmEditor: the encoding is shown as UTF-8 (with signature). Viewed in hexadecimal, the first three bytes are EF BB BF. This is the BOM.

7, Several misunderstandings, and the causes of and solutions to garbled text

7.1 Misunderstanding one

When converting a "byte string" to a "Unicode string", for example when reading a text file or receiving text over the network, it is easy to treat the "byte string" as a single-byte string and convert it as if every "one byte" were "one character". In fact, in a non-English environment the "byte string" should be treated as an ANSI string and converted with the appropriate encoding to obtain the Unicode string; "several bytes" may yield "one character". Developers who have always worked in an English environment are especially prone to this misunderstanding.

7.2 Misunderstanding two

In non-Unicode environments such as DOS and Windows 98, strings exist as ANSI-encoded bytes. To use such a byte-based string correctly, you must know which encoding it is in. This leaves us with a habit of thinking in terms of "the encoding of a string". Once Unicode is supported, a string in Java is stored as the "numbers" of its characters, not as "bytes in some encoding", so the notion of "the encoding of a string" no longer exists. Encoding only comes into play when a string is converted to or from a byte string, or when a "byte string" is treated as an ANSI string. Many people hold this misunderstanding.

7.3 Analysis and resolution

The first misunderstanding is the most frequent cause of garbled text. The second often turns a garbled-text problem that could have been fixed easily into a more complicated one.
Here we can see that "misunderstanding one", converting as if every "one byte" were "one character", is in fact equivalent to converting with ISO-8859-1. So we can often reverse the operation with bytes = string.getBytes("iso-8859-1") to get the original "byte string" back, and then apply the correct ANSI encoding, for example string = new String(bytes, "GB2312"), to obtain the correct "Unicode string".
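The repair described in 7.3 is not specific to Java. Here is the same round trip as a minimal Python sketch, under the assumption that the text was originally GB2312 bytes that some program wrongly decoded as ISO-8859-1; the variable names are illustrative only.

```python
original = "中文"
wire_bytes = original.encode("gb2312")  # what was actually stored or sent

# Misunderstanding one: every byte is treated as one character.
garbled = wire_bytes.decode("iso-8859-1")

# Repair, step 1: ISO-8859-1 maps every byte to exactly one character,
# so encoding with it again recovers the original bytes without loss.
recovered_bytes = garbled.encode("iso-8859-1")

# Repair, step 2: decode the bytes with the encoding they really use.
repaired = recovered_bytes.decode("gb2312")

print(repr(garbled), repaired)
```

The trick only works because the wrong decoding step was lossless; if the bytes had been decoded with an encoding that discards or replaces invalid sequences, the original bytes could not be recovered this way.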

This article is from the "Xu Neuhua polaris" blog, be sure to keep this source
