Character encoding

Source: Internet
Author: User

Reference:

http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386819196283586a37629844456ca7e5a7faa9b94ee8000

As we have already said, strings are also a data type, but what makes strings special is that they come with an encoding problem.

Because a computer can only handle numbers, text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent a larger integer, you must use more bytes. For example, the largest integer two bytes can represent is 65535, and the largest for four bytes is 4294967295.
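The limits quoted above follow from the formula 2^(8n) - 1 for n bytes. A quick check, written as a small Python sketch (the helper name max_unsigned is only illustrative and does not come from the original article):

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes of 8 bits each."""
    return 2 ** (8 * num_bytes) - 1

# Print the limit for 1, 2 and 4 bytes.
for n in (1, 2, 4):
    print(n, max_unsigned(n))
```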

Since the computer was invented by Americans, at first only 128 characters (codes 0 to 127) were encoded: uppercase and lowercase English letters, digits, and some symbols. This table is called ASCII encoding. For example, the code for uppercase A is 65 and the code for lowercase z is 122.

One byte is clearly not enough to handle Chinese. At least two bytes are needed, and the codes must not conflict with ASCII, so China developed the GB2312 encoding to cover Chinese characters.

As you can imagine, there are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and each country has its own standard. Conflicts are inevitable, and the result is that text mixing several languages displays as garbled characters.
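The kind of conflict described above can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax (an assumption; the interactive sessions later in this article use Python 2). It encodes Chinese text as GB2312 and then decodes the same bytes as if they were Shift_JIS:

```python
text = "中文"

# Bytes as a GB2312-based program would write them.
raw = text.encode("gb2312")

# A program that wrongly assumes Shift_JIS interprets the same bytes differently.
# errors="replace" keeps the demo from raising if a byte has no mapping.
garbled = raw.decode("shift_jis", errors="replace")

print(raw)
print(garbled)                      # not the original text
print(raw.decode("gb2312") == text) # the right codec recovers it
```

The bytes never change; only the codec used to interpret them does, which is why mixed-language text ended up garbled.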

Unicode emerged as a result. Unicode unifies all languages into a single set of codes, so the garbled-text problem goes away.

The Unicode standard is still evolving, but the most common representation uses two bytes per character (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.

Now the difference between ASCII encoding and Unicode encoding is clear: ASCII uses 1 byte per character, and Unicode usually uses 2 bytes.

The letter A in ASCII is decimal 65, binary 01000001.

The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different.

The Chinese character 中 is outside the ASCII range. Its Unicode code is decimal 20013, binary 01001110 00101101.

As you can guess, to write the ASCII-encoded A in Unicode you only need to pad it with zeros in front, so the Unicode encoding of A is 00000000 01000001.
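These values can be checked directly. A short sketch in Python 3 syntax (an assumption; the article's own sessions use Python 2), printing each character's code in decimal and as a 16-bit binary string:

```python
for ch in ("A", "0", "中"):
    code = ord(ch)  # numeric code of the character
    # "016b" pads the binary form with leading zeros to 16 bits (two bytes).
    print(ch, code, format(code, "016b"))
```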

A new problem then arises: with everything unified under Unicode, the garbled-text problem disappears. But if your text is almost entirely English, Unicode takes twice the storage space of ASCII, which is wasteful for storage and transmission.

So, in the spirit of thrift, the variable-length UTF-8 encoding was derived from Unicode. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code (the original design allowed up to 6 bytes): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transfer contains a large amount of English, UTF-8 saves space.
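To make the variable length concrete, here is a small sketch in Python 3 syntax (an assumption; the article's sessions use Python 2). The emoji is just an example of a character outside the two-byte range, and utf-16-le stands in for the fixed two-byte Unicode form the text describes:

```python
# ASCII letter, Chinese character, and a character beyond the two-byte range.
samples = ["A", "中", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), "byte(s):", encoded)

# Mostly-English text: UTF-8 needs half the space of a two-byte encoding.
english = "Hello, world"
utf8_size = len(english.encode("utf-8"))
two_byte_size = len(english.encode("utf-16-le"))
print(utf8_size, two_byte_size)
```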

UTF-8 has an added benefit: ASCII can be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8.
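The compatibility claim is easy to verify. A minimal sketch in Python 3 syntax (an assumption; the article's sessions use Python 2): pure-ASCII text produces identical bytes under both encodings, so ASCII data can be read as UTF-8 unchanged.

```python
message = "Hello, world"

ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")

# Same bytes either way, so ASCII-only data is already valid UTF-8.
print(ascii_bytes == utf8_bytes)
print(ascii_bytes.decode("utf-8"))
```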

Having sorted out the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encoding commonly works in today's computer systems.

In memory, Unicode is used uniformly. Text is converted to UTF-8 when it needs to be saved to disk or transmitted.

When you edit with Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory, and when you finish editing, the Unicode is converted back to UTF-8 and saved to the file.
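The save and load cycle described above can be sketched as follows. This uses Python 3 syntax (an assumption; the article's sessions use Python 2) and a temporary file as a stand-in for the document being edited:

```python
import os
import tempfile

text = "Hello, 中文"  # Unicode string held in memory

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Saving: the in-memory string is encoded to UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What is actually stored on disk is a sequence of UTF-8 bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Loading: the UTF-8 bytes are decoded back into a Unicode string.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()
finally:
    os.remove(path)

print(raw)
print(loaded == text)
```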

When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser.

That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.

Python strings

With the annoying character-encoding problem sorted out, let's look at Python's support for Unicode.

Because Python was born before the Unicode standard, the earliest Python supported only ASCII, and an ordinary string 'ABC' is ASCII-encoded inside Python. Python provides the ord() and chr() functions to convert between letters and their corresponding numbers:

>>> ord('A')
65
>>> chr(65)
'A'

Python later added support for Unicode. A Unicode string is written as u'...', for example:

>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'

Writing u'中' is the same as writing u'\u4e2d'; the \u is followed by the hexadecimal Unicode code. Likewise, u'A' and u'\u0041' are the same.

How do the two kinds of string convert to each other? Although the string 'xxx' is ASCII-encoded, it can also be viewed as UTF-8 encoded, whereas u'xxx' can only be Unicode.

To convert u'xxx' to a UTF-8 encoded 'xxx', use the encode('utf-8') method:

>>> u'ABC'.encode('utf-8')
'ABC'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'

For English characters, the UTF-8 value after conversion equals the Unicode value (although the storage size differs). For Chinese characters, 1 Unicode character becomes 3 UTF-8 bytes. The \xe4 you see is one of those bytes: its value is 228, which has no corresponding letter to display, so the byte's value is shown in hexadecimal. The len() function returns the length of a string:

>>> len(u'ABC')
3
>>> len('ABC')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6
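For comparison, here is a hedged sketch of the same length check in Python 3 syntax (an assumption; this section otherwise uses Python 2), where text is str and the encoded form is bytes:

```python
text = "中文"                  # Unicode text
data = text.encode("utf-8")   # UTF-8 encoded bytes

print(len(text))   # counts characters
print(len(data))   # counts bytes
print(data[0])     # first byte as an integer, i.e. the \xe4 byte
```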

In the other direction, to convert a UTF-8 encoded string 'xxx' to a Unicode string u'xxx', use the decode('utf-8') method:

>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
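Decoding only works when the bytes form complete UTF-8 sequences. The following sketch, in Python 3 syntax (an assumption; the sessions above are Python 2), shows what happens when the byte string is cut in the middle of a character, and how the errors="replace" option substitutes a placeholder instead of raising:

```python
data = b"\xe4\xb8\xad\xe6\x96\x87"
print(data.decode("utf-8"))

# Cut the data after the first byte of the second character.
truncated = data[:4]
try:
    truncated.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("cannot decode:", exc.reason)

# errors="replace" inserts U+FFFD for the undecodable part.
repaired = truncated.decode("utf-8", errors="replace")
print(repaired)
```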

Because Python source code is itself a text file, when your source contains Chinese you must make sure to save it as UTF-8. So that the Python interpreter reads the source as UTF-8, we usually write these two lines at the top of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

The first comment tells Linux/OS X that this is an executable Python program; Windows ignores it.

The second comment tells the Python interpreter to read the source code as UTF-8; otherwise the Chinese output you write in the source may be garbled.
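One way to see how the declaration is picked up is the standard library's tokenize.detect_encoding function, which inspects the first two lines of a source file. This sketch assumes Python 3 (the article itself targets Python 2), and the two source snippets are made-up examples:

```python
import io
import tokenize

# A source file that starts with the two lines shown above.
source = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nprint('hi')\n"
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)

# A different declaration is detected the same way.
gb_source = b"# -*- coding: gb2312 -*-\nprint('hi')\n"
gb_encoding, _ = tokenize.detect_encoding(io.BytesIO(gb_source).readline)
print(gb_encoding)
```

The declaration only states how the bytes should be read; it does not change how the editor saved the file.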

If you edit with Notepad++, then besides adding # -*- coding: utf-8 -*-, Chinese strings must also be Unicode strings.

Declaring UTF-8 does not mean your .py file is actually UTF-8 encoded; you must also make sure that Notepad++ is using the UTF-8 without BOM encoding.

If the .py file itself is UTF-8 encoded and also declares # -*- coding: utf-8 -*-, a test at the command prompt displays Chinese correctly.

Formatting

The last common question is how to output a formatted string. We often output strings like 'Dear xxx, hello! Your bill for month xx is xx and your balance is xx', where the xxx content varies with the variables, so we need a simple way to format strings.

In Python, formatting works the same way as in C and is done with the % operator, for example:

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator formats the string. Inside the string, %s is replaced by a string and %d by an integer. There must be as many variables or values after the % operator as there are % placeholders, in matching order. If there is only one placeholder, the parentheses can be omitted.
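The rule that placeholders and values must match can be illustrated with a short sketch (Python 3 print syntax is assumed here; the % operator itself behaves the same way in Python 2). Supplying too few values raises a TypeError:

```python
template = "Hi, %s, you have $%d."

# Two placeholders, two values in matching order.
ok = template % ("Michael", 1000000)
print(ok)

# Two placeholders but only one value: formatting fails.
try:
    template % ("Michael",)
    mismatch_raised = False
except TypeError as exc:
    mismatch_raised = True
    print("TypeError:", exc)
```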

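Common placeholders are %d (integer), %f (floating-point number), %s (string), and %x (hexadecimal integer).

When formatting integers and floating-point numbers, you can also specify whether to pad with zeros and how many digits to use for the integer and fractional parts: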

>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'
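A few more width and precision combinations, as a sketch in Python 3 print syntax (an assumption; the format specifiers work the same in Python 2). repr() is used so that padding spaces stay visible:

```python
padded = "%2d-%02d" % (3, 1)        # width 2 with spaces, width 2 with zeros
one_decimal = "%5.1f" % 3.1415926   # total width 5, one decimal place
zero_float = "%08.3f" % 3.1415926   # total width 8, zero padded, three decimals
hex_value = "%x" % 255              # hexadecimal integer

for value in (padded, one_decimal, zero_float, hex_value):
    print(repr(value))
```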

If you're not sure which to use, %s always works; it converts any data type to a string:

>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'

For Unicode strings the usage is exactly the same, but it is best to make sure the substituted string is also a Unicode string:

>>> u'Hi, %s' % u'Michael'
u'Hi, Michael'

What if the % inside the string is meant as an ordinary character? Then you need to escape it, using %% to represent a single %:

>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'

Summary

For historical reasons, Python 2.x supports Unicode, but its syntax requires two string notations, 'xxx' and u'xxx'.

Python also supports other encodings, of course, such as encoding Unicode into GB2312:

>>> u'中文'.encode('gb2312')
'\xd6\xd0\xce\xc4'

But this is simply asking for trouble. Unless there is a special business requirement, remember to use only Unicode and UTF-8.

In Python 3.x, 'xxx' and u'xxx' are unified as Unicode, so it makes no difference whether you write the u prefix, while a string represented as bytes must carry the b prefix: b'xxx'.
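A minimal Python 3 sketch of this behavior (the variable names are only illustrative):

```python
plain = "中文"      # no prefix
prefixed = u"中文"  # the u prefix is accepted and means the same thing
data = plain.encode("utf-8")  # explicit conversion to bytes

print(plain == prefixed)
print(type(plain).__name__, type(data).__name__)
print(data == b"\xe4\xb8\xad\xe6\x96\x87")
print(data.decode("utf-8") == plain)
```

Text and bytes are separate types in Python 3, so every conversion between them goes through an explicit encode() or decode() call.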

When you format a string, you can use Python's interactive command line to test it quickly and easily.
