Character sets and encodings in detail

Source: Internet
Author: User

Character sets and encodings, by bibimbap

Today I spent half an afternoon stuck on a Python encoding problem. Encoding has always been a thorny issue, so I am writing this article to briefly summarize some common encoding problems in Python and their solutions. This is the first article in the series; it covers the basic concepts of character sets and encodings.


Character sets and encodings such as ASCII, MBCS, and Unicode come up constantly in programming. Strictly speaking, a character set and an encoding are two different concepts, although they overlap in places. For ASCII and the MBCS families, each character set has only one encoding scheme, so the two are rarely distinguished; for Unicode, the character set and its encoding schemes are clearly separated.


1 ASCII

The ASCII standard specifies both the characters and how they are encoded. It is a single-byte encoding that uses 7 bits and covers 128 characters in total; for example, the space is encoded as 32 and the lowercase letter a as 97. ASCII is therefore both a character set and an encoding scheme.


2 MBCS

For English, 128 codes are enough, but they are clearly insufficient for other languages such as Chinese. Multi-byte character sets (MBCS) appeared to fill the gap; GB2312, GBK, GB18030, and BIG5 all belong to this family. Because most MBCS encodings use two bytes per character, they are sometimes also called DBCS (Double-Byte Character Set). On Linux, the encoding of a file containing Chinese characters is often reported as CP936, which is in fact GBK. The name comes from IBM's concept of the code page: among these multi-byte encodings, GBK happens to be page 936, hence the short name CP936.
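The relationship between CP936 and GBK can be seen from Python, whose standard codec registry treats `cp936` as an alias of `gbk`. This is a small sketch; the sample characters are my own choice.

```python
import codecs

# Looking up "cp936" returns the GBK codec.
codec_name = codecs.lookup("cp936").name

# An ASCII letter stays one byte in GBK; a Chinese character takes two.
ascii_bytes = "a".encode("gbk")
chinese_bytes = "你".encode("gbk")

print(codec_name, len(ascii_bytes), chinese_bytes.hex())
```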


3 Unicode

Maintaining a separate encoding for every script later proved too inconvenient; it was better to have one character set that represents all characters, and so Unicode appeared. Unicode/UCS (Universal Character Set) is only a character set standard: it assigns each character a code point but does not specify how characters are stored or transmitted. Because Unicode is a character set rather than a specific encoding, it has three main encoding schemes. Initially the Unicode standard used 2 bytes per character, which corresponds to UTF-16 (characters outside the Basic Multilingual Plane now take 4 bytes in UTF-16). There is also UTF-32, which uses 4 bytes for every character. Users in English-speaking countries were unhappy with this, since a character that used to take one byte now took two and storage doubled, which led to UTF-8. In UTF-8, an English letter occupies one byte and a Chinese character usually occupies three.
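The difference between a code point and its encoded bytes can be sketched in Python. The character 石 (U+77F3) is used here as a sample; the `-be` codec variants are chosen so that no BOM is added to the output.

```python
ch = "石"

code_point = hex(ord(ch))        # the character's number in the Unicode set
utf8 = ch.encode("utf-8")        # variable length: 3 bytes for this character
utf16 = ch.encode("utf-16-be")   # 2 bytes for characters in the BMP
utf32 = ch.encode("utf-32-be")   # always 4 bytes
english = "a".encode("utf-8")    # ASCII letters stay 1 byte in UTF-8

print(code_point, len(utf8), len(utf16), len(utf32), len(english))
```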


As mentioned above, Unicode text is stored mainly in UTF-8, UTF-16, and similar encodings. How, then, does a computer know which encoding a file uses? The Unicode specification defines a special character, the BOM (Byte Order Mark), which can be placed at the very beginning of a file to indicate the encoding and byte order. Take the character 石 ("stone"), whose code point is U+77F3. Stored in UTF-16 it occupies two bytes, 77 and F3. If 77 comes first and F3 second, the storage is big endian; the reverse order is little endian. The BOM character itself is U+FEFF, which is also exactly two bytes in UTF-16. If the first two bytes of a text file are FE FF, the file is big endian; if they are FF FE, it is little endian. The BOM of UTF-8 is EF BB BF. In summary:

BOM_UTF8      '\xEF\xBB\xBF'
BOM_UTF16_LE  '\xFF\xFE'
BOM_UTF16_BE  '\xFE\xFF'
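These three constants are available in Python's `codecs` module, so a BOM check can be sketched as below. The function name `detect_bom` is hypothetical, and the sketch deliberately covers only the three BOMs listed above (UTF-32 is ignored).

```python
import codecs

# The three BOMs listed above, as exposed by the standard library.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Usage: strip the BOM and decode the rest with the detected encoding.
raw = codecs.BOM_UTF16_LE + "石".encode("utf-16-le")
encoding, skip = detect_bom(raw)
text = raw[skip:].decode(encoding)
print(encoding, text)
```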


Not every editor writes a BOM, but Unicode text can still be read without one. You just have to specify the encoding explicitly; otherwise decoding may fail.
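A short Python sketch of both cases follows. It relies on the standard `utf-8-sig` codec, which writes a BOM when encoding and strips one when decoding; the variable names are only illustrative.

```python
text = "石"
raw = text.encode("utf-8")            # no BOM written
with_bom = text.encode("utf-8-sig")   # EF BB BF prepended

decoded_plain = raw.decode("utf-8")          # works once the encoding is named
decoded_sig = with_bom.decode("utf-8-sig")   # BOM is stripped
decoded_keep = with_bom.decode("utf-8")      # BOM survives as U+FEFF

# Naming the wrong encoding makes decoding fail.
try:
    raw.decode("ascii")
    wrong_ok = True
except UnicodeDecodeError:
    wrong_ok = False
```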


4 ANSI

ANSI is also very common on Windows. "ANSI" is really an umbrella name for the Windows code pages: the actual encoding is chosen according to the current system locale. With a Simplified Chinese locale it is GBK, with Traditional Chinese it is BIG5, and with Japanese it is a JIS encoding (Shift_JIS, code page 932).
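Because "ANSI" depends on the locale, the same bytes can mean different things on different machines. The sketch below decodes one byte pair under three code pages; the byte values are the GBK encoding of 你, chosen as an example, and `errors="replace"` is used so that invalid sequences do not raise.

```python
import locale

# The encoding Python considers the locale default (varies by machine).
preferred = locale.getpreferredencoding(False)

raw = b"\xc4\xe3"
as_gbk = raw.decode("gbk")                            # Simplified Chinese
as_big5 = raw.decode("big5", errors="replace")        # Traditional Chinese
as_sjis = raw.decode("shift_jis", errors="replace")   # Japanese

print(preferred, as_gbk, as_big5, as_sjis)
```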


In addition, Windows likes to call UTF-16 LE with a BOM "Unicode" and UTF-8 with a BOM "UTF-8". Some people say UTF-8 does not need a BOM. That is not quite accurate: files without a BOM usually open correctly only because editors generally try UTF-8 first when detecting the encoding, and if the bytes decode successfully as UTF-8, that result is used. Even a file originally saved as ANSI is first tested against UTF-8 when it is opened. For example, if you create a new file in Windows Notepad, write "Audio Encoding", save it in ANSI encoding, and open it again, you will find that "Audio Encoding" has turned into "Han".
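The "try UTF-8 first" behaviour can be sketched in Python. `guess_decode` is a hypothetical stand-in for an editor's detection logic, not Notepad's actual algorithm, and the sample string is constructed here (it is an assumption, not taken from the text above) so that its GBK bytes happen to be valid UTF-8.

```python
def guess_decode(data: bytes, fallback: str = "gbk") -> str:
    """Hypothetical detector: try UTF-8 first, fall back to a legacy code page."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)

# Build a GBK string whose bytes are also a valid UTF-8 sequence.
intended = "汉a".encode("utf-8").decode("gbk")
saved = intended.encode("gbk")   # what an "ANSI" save would write on a GBK system
reopened = guess_decode(saved)   # misread as UTF-8

# Most GBK text is not valid UTF-8, so the fallback normally gives the right result.
ordinary = guess_decode("你".encode("gbk"))

print(repr(intended), repr(reopened), repr(ordinary))
```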


5 Example analysis

Let's look at the actual bytes produced by each encoding on Windows. Open Notepad, type the character 石, and save it as ANSI, Unicode (really UTF-16 LE with a BOM), Unicode big endian, and UTF-8 in turn, to check whether the results match the analysis above. Use UltraEdit to view the bytes: enable the hexadecimal editing function under the "Edit" menu to see the hexadecimal encoding.


Saved as ANSI, the bytes are ca af. This shows that GBK also stores the high-order (lead) byte first, in the same manner as big endian.

Saved as Unicode, the bytes are ff fe f3 77.

Saved as Unicode big endian, the bytes are fe ff 77 f3.

Saved as UTF-8, the bytes are ef bb bf e7 9f b3.
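The same bytes can be reproduced without Notepad. This is a sketch that assumes the ANSI code page is GBK and builds each result by prepending the matching BOM by hand; `hexdump` is a small hypothetical helper.

```python
import codecs

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

ch = "石"
results = {
    "ANSI (GBK)": ch.encode("gbk"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": codecs.BOM_UTF8 + ch.encode("utf-8"),
}

for label, data in results.items():
    print(f"{label}: {hexdump(data)}")
```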



Encoding of character sets

Character encodings fall mainly into two groups: internal codes and Unicode.

Unicode is a character encoding standard. It has two common storage formats: UTF-8 and UTF-16.
The two differ only in how the code points are stored; both encode Unicode. For example, the Chinese character 你 ("you") is E4 BD A0 in UTF-8 and 60 4F in UTF-16 (little endian). People sometimes say "Unicode encoding" when they mean UTF-16, which is not rigorous.
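The two storage formats for 你 (code point U+4F60) can be compared directly in Python. This is a minimal sketch; the `-le` and `-be` codec variants are used so that no BOM is included.

```python
ch = "你"

utf8_hex = ch.encode("utf-8").hex()
utf16le_hex = ch.encode("utf-16-le").hex()   # low byte first
utf16be_hex = ch.encode("utf-16-be").hex()   # high byte first

# Different bytes, same character once decoded.
same_char = bytes.fromhex(utf8_hex).decode("utf-8") == bytes.fromhex(utf16le_hex).decode("utf-16-le")

print(utf8_hex, utf16le_hex, utf16be_hex, same_char)
```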

Internal codes are the counterpart to Unicode: legacy encodings specific to one language or region. For example, in the Chinese internal code (GBK) the character 你 is encoded as C4 E3. The internal codes of different languages overlap with each other, so text in a Japanese internal code appears garbled on a Chinese operating system. Unicode, by contrast, gives each language its own area with no overlap, so Japanese text in Unicode can be read without trouble on a Chinese operating system. This is the purpose of Unicode's design.
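The overlap between internal codes can be sketched by reading Japanese bytes the way a Chinese system would. The sample text and the choice of Shift_JIS and GBK as the two internal codes are assumptions made for illustration.

```python
japanese = "こんにちは"

# Stored in a Japanese internal code, then read with the Chinese one.
sjis_bytes = japanese.encode("shift_jis")
seen_on_chinese_os = sjis_bytes.decode("gbk", errors="replace")

# Stored as Unicode (UTF-8), the text reads back the same on any system.
utf8_roundtrip = japanese.encode("utf-8").decode("utf-8")

print(repr(seen_on_chinese_os), repr(utf8_roundtrip))
```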

What is a character set encoding, and how do I change the encoding of a text file?

Simply put, it is the method used to store text in binary. For details, refer to the encyclopedia link:
Baike.baidu.com/view/51987.htm

It is easy to change the encoding of a text file. Use Notepad++ to open the file you want to modify, select "format" -> "ANSI format encoding/UTF8 format encoding...", and save it.
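The same conversion can be scripted in Python. This is a sketch only: `convert_encoding` is a hypothetical helper, the default source encoding of GBK is an assumption, and you need to know the file's real encoding before converting.

```python
from pathlib import Path

def convert_encoding(src, dst, from_enc="gbk", to_enc="utf-8"):
    """Re-save the text file src as dst, decoding with from_enc and encoding with to_enc."""
    text = Path(src).read_text(encoding=from_enc)
    Path(dst).write_text(text, encoding=to_enc)
```

For example, `convert_encoding("old.txt", "new.txt")` would read `old.txt` as GBK and write `new.txt` as UTF-8; both file names are placeholders.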
