Character Encoding Summary

Source: Internet
Author: User
Tags: coding standards

Character encoding problems have given me quite a headache recently. The many encoding methods (GB2312, ANSI, UTF-8, Unicode...) are familiar yet strange, so I calmed down and studied them properly.

References:

http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

http://www.regexlab.com/zh/encoding.htm

 

 

Development of characters and encodings

Phase 1: ASCII

At the beginning, computers supported only English; other languages could not be stored or displayed. The ASCII code consists of 128 characters in total. For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) occupy only the last seven bits of one byte, with the first bit set to 0.

Phase 2: ANSI encoding (localization)

To let computers support more languages, 2 bytes in the 0x80-0xFF range are typically used to represent 1 character. For example, in a Simplified Chinese operating system the character "中" is stored as the bytes [0xD6, 0xD0]. Different countries and regions developed different standards, producing encodings such as GB2312, BIG5, and JIS. These double-byte extended encodings are collectively called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese system it means JIS. Different ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded file.

Phase 3: Unicode (internationalization)

To facilitate international information exchange, international organizations developed the Unicode character set, which assigns a unified, unique number to every character in every language, meeting the needs of cross-language and cross-platform text processing. Unicode is, of course, a large collection; it can currently hold more than 1 million characters, and every symbol has a different code. For example, U+0639 is the Arabic letter ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character "严" ("strict"). Note that Unicode is only a character set: it specifies each symbol's binary code point but not how that code should be stored.
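The difference between a localized ANSI encoding and the unified Unicode numbering can be seen in a few lines of Python (an illustrative sketch; Python's gb2312 codec stands in for the ANSI encoding of a Simplified Chinese system):

```python
# The same character maps to different byte sequences under different encodings,
# but it has exactly one Unicode code point.
text = "中"  # stored as [0xD6, 0xD0] under the Simplified Chinese ANSI encoding

gb_bytes = text.encode("gb2312")   # ANSI bytes on a Simplified Chinese system
utf8_bytes = text.encode("utf-8")  # one of the Unicode storage forms

print(gb_bytes.hex())    # d6d0
print(utf8_bytes.hex())  # e4b8ad
print(hex(ord(text)))    # 0x4e2d, the character's unique Unicode number
```

The code point 0x4E2D is the same everywhere; only the stored bytes differ per encoding.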

 

Before further discussion, we should first clarify several concepts:

1. A character is the visible, understandable symbol; bytes are the form in which the computer stores characters.

2. When a computer stores characters as bytes it must use some encoding method, and to turn those bytes back into characters it must decode them with that same encoding. So an encoding method is involved both when saving and when opening a file.

As the table above shows, encodings such as GB2312 belong to the ANSI family. In Windows Notepad we can choose the ANSI encoding when saving text (it is the default), but the meaning of ANSI differs across Windows language versions: on a Simplified Chinese system, Notepad's ANSI is GB2312. Consequently, if you save text containing Chinese characters with the default (ANSI) method on an English system, Notepad shows a warning; if you ignore it, the Chinese information is lost. When opening a file, Notepad tries to detect the file's encoding and decodes it accordingly to reproduce the characters; by default it assumes the ANSI encoding of the current system language.
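The mismatch described above can be reproduced directly (a Python sketch; latin-1 stands in for the ANSI encoding of a Western-language system):

```python
saved = "严".encode("gb2312")   # saved as ANSI on a Simplified Chinese system

# Opened on a system whose ANSI encoding is different (here Latin-1):
# the two bytes become two unrelated Latin characters, i.e. mojibake.
print(saved.decode("latin-1"))

# Opened with the same encoding it was saved with: the character is recovered.
print(saved.decode("gb2312"))   # 严
```

The round trip only works when the saving and opening encodings agree, which is exactly the problem with exchanging ANSI files across language regions.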

 

Unicode implementation --- UTF-8

With the popularity of the Internet, a unified encoding method became strongly desirable. UTF-8 is the most widely used Unicode implementation on the Internet; other implementations include UTF-16 and UTF-32, but they are rarely used on the Web. To repeat, the relationship is that UTF-8 is one of the implementation methods of Unicode.

The biggest feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, with the length varying by symbol.

The UTF-8 encoding rules are very simple; there are only two:

1) For a single-byte symbol, the first bit is set to 0 and the remaining seven bits hold the symbol's Unicode code. Therefore, for English letters, UTF-8 and ASCII are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and the (n+1)-th bit is set to 0; the first two bits of each of the remaining bytes are set to 10. All the remaining bits hold the symbol's Unicode code. (Some readers may wonder whether n can exceed 2: yes, a Unicode code point is not necessarily two bytes; it may be more.)

The following table summarizes the encoding rules; the letter x marks the bits available for the code point.

Unicode symbol range  | UTF-8 encoding method
(hexadecimal)         | (binary)
----------------------+---------------------------------------------
0000 0000 - 0000 007F | 0xxxxxxx
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Next, let us take the Chinese character "严" ("strict") as an example to show how UTF-8 encoding works.

The Unicode code of "严" is 4E25 (binary 100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" requires three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary digit of "严", fill the x positions from back to front, and pad the remaining high positions with 0. The result is "11100100 10111000 10100101", which in hexadecimal is E4B8A5.
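The bit-filling procedure can be verified in Python (a sketch that rebuilds the three-byte template by hand and compares it with the built-in codec):

```python
cp = ord("严")  # the Unicode code point, 0x4E25

# U+4E25 falls in 0000 0800 - 0000 FFFF, so the template is
# 1110xxxx 10xxxxxx 10xxxxxx: fill the 16 code-point bits from the back.
b1 = 0b11100000 | (cp >> 12)           # top 4 bits of the code point
b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)          # low 6 bits

manual = bytes([b1, b2, b3])
print(manual.hex())                 # e4b8a5
print("严".encode("utf-8").hex())   # e4b8a5, the codec agrees
```

The hand-built bytes match the standard codec's output exactly.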

 

The Windows Notepad program offers four encoding options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default. English text is effectively ASCII, while Simplified Chinese text is GB2312 (this applies only to Simplified Chinese versions of Windows; Traditional Chinese versions use BIG5).

2) The "Unicode" option refers to the UCS-2 encoding: each character's Unicode code is stored directly in two bytes (characters that need more than two bytes cannot be represented; UTF-16 extends UCS-2 to cover rarer characters, such as Manchu and Tibetan scripts, and the two are otherwise basically equivalent). This option uses the little-endian byte order.

3) "Unicode big endian" is the counterpart of the previous option. The next section explains what little endian and big endian mean.

4) UTF-8, the encoding described above.

 

Little endian and big endian

As mentioned in the previous section, a Unicode code can be stored directly in UCS-2 format. Take the Chinese character "严" as an example: its Unicode code is 4E25 and needs two bytes of storage, one byte 4E and the other 25. Storing 4E first and 25 second is the big-endian mode; storing 25 first and 4E second is the little-endian mode.
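A quick Python sketch shows the two byte orders (the utf-16-be and utf-16-le codecs emit the raw two-byte form without a BOM):

```python
be = "严".encode("utf-16-be")  # big endian: high byte 4E first
le = "严".encode("utf-16-le")  # little endian: low byte 25 first

print(be.hex())  # 4e25
print(le.hex())  # 254e
```

The same code point, the same two bytes, just in opposite order.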

 

 

Automatic encoding detection --- BOM

The Unicode specification defines that a character indicating the byte order may be placed at the very beginning of each file. That character is the "zero width no-break space", with code FEFF. This is exactly two bytes, and FF is 1 larger than FE. If the first two bytes of a text file are FE FF, the file is in big-endian mode; if they are FF FE, the file is in little-endian mode.

In Windows, a Unicode-encoded file starts with such a header indicating the byte order, also known as the BOM (byte-order mark); FF FE and FE FF are different BOMs.

For example:

1) Unicode: FF FE, indicating little-endian storage.

2) Unicode big endian: FE FF, indicating big-endian storage.

3) UTF-8: EF BB BF, indicating a UTF-8 encoded file.
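Notepad's detection logic can be sketched as a small Python function (illustrative; the function name and the "ansi" fallback label are my own):

```python
def sniff_bom(data):
    """Guess a file's encoding from its first bytes, as Notepad does."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # big endian (Notepad's "Unicode big endian")
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # little endian (Notepad's "Unicode")
    return "ansi"            # no BOM: fall back to the system code page

# utf-8-sig is Python's name for "UTF-8 with a BOM prepended".
print(sniff_bom("严".encode("utf-8-sig")))  # utf-8
```

Note the order of checks matters: FE FF and FF FE are distinct prefixes, and a file with no BOM at all falls through to the ANSI default.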

 

Windows codePage

A codePage is a number that identifies an encoding specification. Common Windows codePage values include:

  • 874-Thai
  • 932-Japan
  • 936-Chinese (simplified) (PRC, Singapore)
  • 949-Korean
  • 950-Chinese (traditional) (Taiwan, Hong Kong)
  • 1200-Unicode (BMP of ISO 10646, UTF-16LE)
  • 1201-Unicode (BMP of ISO 10646, UTF-16BE)
  • 1250-Latin (Central European languages)
  • 1251-Cyrillic
  • 1252-Latin (Western European languages)
  • 1253-Greek
  • 1254-Turkish
  • 1255-Hebrew
  • 1256-Arabic
  • 1257-Latin (Baltic languages)
  • 1258-Vietnamese
  • 65000-Unicode (BMP of ISO 10646, UTF-7)
  • 65001-Unicode (BMP of ISO 10646, UTF-8)

The Windows kernel uses Unicode internally. For non-Unicode applications, Windows needs to know which codePage (encoding) to use when rendering their strings in the GUI; this codePage can be set through the Control Panel. That is why some Chinese software shows garbled characters when installed on an English system: the codePage is probably not set to 936.
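Python exposes many of these Windows code pages as codecs named cp<number>, which makes the table easy to experiment with (an illustrative sketch; note that Python maps cp936 to its GBK codec, a superset of GB2312):

```python
import codecs

# Code page 936 is the ANSI encoding of a Simplified Chinese Windows system.
codec = codecs.lookup("cp936")
print(codec.name)  # the canonical name Python uses for code page 936

# Bytes produced under code page 936 only decode correctly with that code page.
data = "中文".encode("cp936")
print(data.decode("cp936"))  # 中文
```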


Some thoughts and experiences

1. UTF-8 is widely accepted probably because of its universality and its good "compression". First, it is a Unicode implementation, so it supports internationalization. Second, it stores English text in only one byte, whereas UTF-32 always uses 4 bytes and Unicode (UTF-16) uses 2, so UTF-8 saves space. For documents that are mostly Chinese, however, UTF-8 is less compact: many common Chinese characters take 3 bytes.

2. Unicode can serve as an intermediate encoding when converting between encodings. For more information, see Question 2.
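The space trade-off in point 1 is easy to measure (a Python sketch comparing byte counts per encoding):

```python
english = "Hello"
chinese = "中文编码"

# Bytes needed to store each string under three Unicode implementations.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(english.encode(enc)), len(chinese.encode(enc)))
```

On this sample, UTF-8 needs 5/12 bytes, UTF-16 needs 10/8, and UTF-32 needs 20/16, matching the observation that UTF-8 wins for English but loses to UTF-16 for Chinese.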

     

     

Programming related to character encoding in .NET

The Encoding class is used to handle encoding issues.

EncodingInfo[] Encoding.GetEncodings()

A static method that returns all the encodings available on the current system. Each entry includes the name, the display name, and the corresponding codePage value.

Encoding.GetEncoding(string name)

A static method (with several overloads) that returns an Encoding object for the given encoding name.

Encoding.Default, Encoding.Unicode, ...

Static properties that conveniently return commonly used Encoding objects. Encoding.Default returns the localized encoding configured on the current system; on a Simplified Chinese Windows system it is equivalent to GB2312.

byte[] encoding.GetBytes(string s)

An instance method. If the instance corresponds to UTF-8, this method encodes the string s into a UTF-8 byte array. (Because the Windows kernel uses the Unicode character set, strings are stored in memory as Unicode, so this method actually converts Unicode to UTF-8.)

string encoding.GetString(byte[] bytes)

An instance method. If the instance corresponds to UTF-8, this method decodes the byte array bytes as UTF-8 into a string (that is, UTF-8 to Unicode). It is often used to convert a stream into a string according to some encoding.
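The GetBytes/GetString pair is the same encode/decode round-trip found in most languages; here is the equivalent in Python for illustration (the in-memory string is Unicode, the byte array is the encoded form):

```python
s = "严"               # in-memory string: Unicode
b = s.encode("utf-8")  # like encoding.GetBytes(s): Unicode -> UTF-8 bytes
s2 = b.decode("utf-8") # like encoding.GetString(b): UTF-8 bytes -> Unicode

print(b.hex())   # e4b8a5
print(s2 == s)   # True
```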

     

Question 1: judging whether a character is Chinese

I saw a colleague do it this way:

static public bool IsChina(char chr)
{
    if (Convert.ToInt32(chr) < Convert.ToInt32(Convert.ToChar(128)))
        return false;
    else
        return true;
}

This is obviously wrong: characters are not limited to Chinese characters, English letters, and digits. Checking the Unicode table, the range of Chinese characters in the Unicode character set should be 0x4E00-0x9FC3, so simply testing whether the value is greater than 128 is not correct. It should also be emphasized that we consult the Unicode table rather than the GB2312 or UTF-8 tables, because char and string are stored as Unicode in Windows managed memory.
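A corrected check based on the range quoted above can be sketched in Python (the 0x4E00-0x9FC3 bound is the one cited in the text; it covers the CJK Unified Ideographs of that Unicode era, not every Han-related code point):

```python
def is_chinese(ch):
    """True if ch falls within the CJK range 0x4E00-0x9FC3 cited above."""
    return 0x4E00 <= ord(ch) <= 0x9FC3

print(is_chinese("严"))  # True
print(is_chinese("A"))   # False
# The original check would wrongly return True here: ü is > 127 but not Chinese.
print(is_chinese("ü"))   # False
```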

     

Question 2: how to correctly decode the stream containing a webpage returned by an HTTP request

This problem stems mainly from inconsistent encodings. Before the Internet took off, localized encodings kept appearing, so once the Internet developed, encodings were not unified. Many websites only consider local encoding: the charset of domestic Chinese web pages, including Baidu, is GB2312, while some international websites such as Microsoft and Google use UTF-8. (Oddly, an HTTP request to www.google.com finds charset=Big5, but opening it in IE shows UTF-8; I do not understand why.)

So if we request a webpage programmatically, how do we decode the returned stream with the proper encoding? The idea is simple: a typical web page carries a meta tag declaring its charset, and English characters are compatible across the common encodings. We can therefore decode the stream into a string with any encoding first, match the charset with a regular expression to learn the page's real encoding, then encode the string back into bytes with the first encoding and decode those bytes with the matched encoding.

This method is powerless for pages that declare no charset!

Key code:

...

responseStream = webRequest.GetResponse().GetResponseStream();
string resultStringContent;
string preStringContent;
using (StreamReader sr = new StreamReader(responseStream, encoding)) // encoding can be any encoding; usually Encoding.Default
{
    preStringContent = sr.ReadToEnd();
    byte[] byteContent = encoding.GetBytes(preStringContent); // encode the string back with the same encoding to recover the original byte[]
    resultStringContent = TryToGetEncoding(preStringContent, encoding).GetString(byteContent); // TryToGetEncoding is shown below
}

     

private static Encoding TryToGetEncoding(string content, Encoding tryEncoding)
{
    /* Try to find a charset declaration in the HTML */
    MatchCollection mc = Regex.Matches(content, "<meta[^>]*>", RegexOptions.IgnoreCase);
    foreach (Match m in mc)
    {
        if (m != null && m.Value != string.Empty)
        {
            Match mm = Regex.Match(m.Value, @"charset[\t ]*=[\t ]*[\w-]+");
            if (mm != null && mm.Value != string.Empty)
            {
                string s = mm.Value;
                int start = s.IndexOf('=') + 1;
                Encoding retEncoding = null;
                try
                {
                    retEncoding = Encoding.GetEncoding(s.Substring(start, s.Length - start).Trim());
                }
                catch
                {
                    // unknown encoding name: fall through and keep looking
                }
                if (retEncoding != null)
                    return retEncoding;
            }
        }
    }
    return tryEncoding;
}
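The same decode / sniff-the-meta-tag / re-decode approach can be sketched in Python (try_get_encoding is a hypothetical helper mirroring the C# code above):

```python
import re

def try_get_encoding(content, fallback):
    """Look for a charset declaration inside <meta ...> tags of decoded HTML."""
    for meta in re.findall(r"<meta[^>]*>", content, re.IGNORECASE):
        m = re.search(r"charset[\t ]*=[\t ]*['\"]?([\w-]+)", meta, re.IGNORECASE)
        if m:
            try:
                b"".decode(m.group(1))  # raises LookupError for unknown names
                return m.group(1)
            except LookupError:
                pass                    # keep scanning the remaining meta tags
    return fallback

raw = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">OK'.encode("latin-1")
pre = raw.decode("latin-1")               # step 1: decode with any byte-preserving codec
enc = try_get_encoding(pre, "latin-1")    # step 2: sniff the real charset
page = pre.encode("latin-1").decode(enc)  # step 3: re-encode, then decode properly
print(enc)  # gb2312
```

latin-1 is a convenient first-pass codec here because it maps every byte 0-255 to a character, so the encode/decode round trip loses nothing.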

     

     

Problem 3: StreamReader reads garbled characters from a stream or file

StreamReader is a stream-reading class with built-in decoding. That is why it has a ReadToEnd() method that returns a string directly, and why its Read method returns char[], not byte[].

Asked why StreamReader produces garbled output, the one-sentence answer is: an encoding problem! But that answer is too general. The essence is indeed encoding; the details of how StreamReader picks its encoding are what we overlook.

After some experiments with StreamReader, I summarize as follows:

The StreamReader constructors

Note that StreamReader has as many as 10 constructor overloads. Four constructor parameters matter here (one further parameter is unrelated to encoding and is not discussed):

1) Stream stream: passes a stream object to the StreamReader.

2) string path: passes a file path to the StreamReader.

3) bool detectEncodingFromByteOrderMarks: specifies whether to check for a BOM automatically; this applies only to the Unicode (LE/BE), UTF-8, and UTF-32 (LE/BE) encodings.

4) Encoding encoding: specifies the encoding.

Combinations of these four parameters make up the 10 overloads, and the combinations make StreamReader's behavior confusing; this is the root cause of the garbled output. Let us go through them one by one:

Each entry below lists the overload, its behavior, and the result of reading a test file saved as UTF-8, Default (gb2312), and Unicode (LE):

1. public StreamReader(Stream stream)
   Behavior: encoding defaults to UTF-8; the BOM is detected automatically.
   UTF-8 file: normal. Default (gb2312) file: garbled (no BOM, so it is decoded as UTF-8). Unicode (LE) file: normal (has a BOM).

2. public StreamReader(string path)
   Behavior: encoding defaults to UTF-8 (MSDN says "the default character encoding", but my experiment shows UTF-8); the BOM is detected automatically.
   UTF-8 file: normal. Default (gb2312) file: garbled (no BOM, so it is decoded as UTF-8). Unicode (LE) file: normal (has a BOM).

3. public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks)
   Behavior: encoding defaults to UTF-8; BOM detection can be switched on or off (on by default).
   UTF-8 file: normal. Default (gb2312) file: garbled (no BOM, decoded as UTF-8). Unicode (LE) file: garbled if detection is off, normal if it is on.

4. public StreamReader(Stream stream, Encoding encoding)
   Behavior: the BOM is checked first; if a BOM exists the encoding parameter is ignored, otherwise the encoding parameter is applied.
   UTF-8 file: normal. Default (gb2312) file: normal when encoding is Encoding.Default. Unicode (LE) file: normal.

5. public StreamReader(string path, bool detectEncodingFromByteOrderMarks)
   Behavior and results: same as 3.

6. public StreamReader(string path, Encoding encoding)
   Behavior and results: same as 4.

7. public StreamReader(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks)
   Behavior: if detectEncodingFromByteOrderMarks is true, same as 4; otherwise encoding is used directly for decoding. Results omitted.

8. public StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)
   Behavior: if detectEncodingFromByteOrderMarks is true, same as 6; otherwise encoding is used directly. Results omitted.

As the summary above shows, automatic BOM detection has the highest priority; if detection is turned off or no BOM is present, the encoding parameter is applied; if no encoding parameter was supplied, UTF-8 is used.

So the final solution is StreamReader sr = new StreamReader(fileStream, Encoding.Default).

In addition, for entry 2 in the table above, my experimental result differs from the MSDN description and still needs verification.
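The priority order in that summary (BOM first, then the encoding parameter, then UTF-8) can be mimicked in a few lines of Python (illustrative sketch; read_text is a hypothetical stand-in for StreamReader):

```python
def read_text(data, encoding=None, detect_bom=True):
    """Mimic StreamReader: a BOM wins; otherwise use 'encoding'; otherwise UTF-8."""
    if detect_bom:
        if data.startswith(b"\xef\xbb\xbf"):
            return data[3:].decode("utf-8")
        if data.startswith(b"\xff\xfe"):
            return data[2:].decode("utf-16-le")
        if data.startswith(b"\xfe\xff"):
            return data[2:].decode("utf-16-be")
    return data.decode(encoding or "utf-8")

# A GB2312 file carries no BOM, so an explicit encoding is required:
# the equivalent of new StreamReader(fileStream, Encoding.Default).
print(read_text("严".encode("gb2312"), encoding="gb2312"))  # 严
```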
