Python character encoding

Source: Internet
Author: User

Strings are also a data type, but what makes them special is the question of encoding.

Because a computer can only handle numbers, text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits as one byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255); representing a larger integer requires more bytes. For example, the largest integer two bytes can represent is 65535, and the largest integer four bytes can represent is 4294967295.
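The byte limits above follow from simple arithmetic: n bytes hold 8*n bits, so the largest unsigned value is 2**(8*n) - 1. A minimal sketch (the helper name max_unsigned is made up for illustration):

```python
def max_unsigned(n_bytes):
    """Largest unsigned integer that fits in n_bytes bytes (8 bits each)."""
    return 2 ** (8 * n_bytes) - 1

for n in (1, 2, 4):
    print(n, 'byte(s):', max_unsigned(n))
# 1 byte(s): 255
# 2 byte(s): 65535
# 4 byte(s): 4294967295
```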

Unicode unifies all languages into a single set of encodings, so garbled text is no longer a problem.

The Unicode standard is still evolving, but most commonly a character is represented in two bytes, and very rare characters take 4 bytes. Unicode is now supported directly by modern operating systems and most programming languages.

In computer memory, Unicode is used uniformly; text is converted to UTF-8 when it needs to be saved to disk or transmitted.

When you edit with Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory, and when you save, the Unicode characters are converted back to UTF-8 and written to the file.
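The same round trip can be sketched in Python. This is only an illustration using a temporary file; the sample text and variable names are assumptions, not part of the original article:

```python
import os
import tempfile

text = 'Hello, 中文'  # a str: Unicode text held in memory

# Create a temporary file so the sketch does not touch any real file.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Saving: the str is encoded to UTF-8 bytes on the way to disk.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # The raw bytes actually stored on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Loading: the UTF-8 bytes are decoded back into a str.
    with open(path, 'r', encoding='utf-8') as f:
        loaded = f.read()
finally:
    os.remove(path)

print(raw)             # b'Hello, \xe4\xb8\xad\xe6\x96\x87'
print(loaded == text)  # True
```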

When you browse the Web, the server converts dynamically generated Unicode content to UTF-8 and then sends it to the browser.

This is why the source of many Web pages contains something like <meta charset="UTF-8" />, indicating that the page is encoded in UTF-8.

Python string:

In Python 3, strings are Unicode, which means that Python strings support multiple languages.

For a single character, Python provides the ord() function to get the character's integer code, and the chr() function to convert an integer code back to the corresponding character.
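A short sketch of both functions (the sample characters are chosen only for illustration):

```python
code_a = ord('A')      # integer code of an ASCII letter
code_zh = ord('中')    # integer code of a Chinese character
char_b = chr(66)       # back from integer to character
char_wen = chr(25991)

print(code_a)    # 65
print(code_zh)   # 20013
print(char_b)    # B
print(char_wen)  # 文
```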

If you know the integer code of a character, you can also write it inside a str in hexadecimal, using a \u escape such as '\u4e2d\u6587' for '中文'.

The two notations are completely equivalent.
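The equivalence can be checked directly. In this sketch the escaped string is an illustrative choice, not taken from the article:

```python
escaped = '\u4e2d\u6587'  # characters written as hexadecimal escapes
literal = '中文'          # the same characters written directly

print(escaped)             # 中文
print(escaped == literal)  # True
print(hex(ord('中')))      # 0x4e2d
```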

Python's string type is str. It is represented in memory as Unicode, and one character may correspond to several bytes. To transmit a str over the network or save it to disk, you need to convert it into bytes.

Python represents data of type bytes with single or double quotation marks and a b prefix:

x = b'ABC'

Be aware of the distinction between 'ABC' and b'ABC': the former is a str, while the latter, although displayed the same way, is bytes, in which each character occupies exactly one byte.
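A small sketch of the difference (variable names are illustrative):

```python
s = 'ABC'   # str: a sequence of Unicode characters
x = b'ABC'  # bytes: a sequence of integers in the range 0-255

print(type(s).__name__)  # str
print(type(x).__name__)  # bytes
print(s[0])              # A   (indexing a str gives a one-character str)
print(x[0])              # 65  (indexing bytes gives an integer)
```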

A str, represented in Unicode, can be encoded into bytes in a specified encoding with the encode() method.

A str containing only English text can be encoded to bytes with ASCII, and the content looks the same. A str containing Chinese can be encoded to bytes with UTF-8, but it cannot be encoded with ASCII, because Chinese characters fall outside the ASCII range and Python raises an error.

In bytes, any byte that cannot be displayed as an ASCII character is shown as \x## (two hexadecimal digits).
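The three cases described above can be sketched as follows (sample strings are illustrative):

```python
ascii_bytes = 'ABC'.encode('ascii')  # English text fits in ASCII
utf8_bytes = '中文'.encode('utf-8')  # Chinese text needs UTF-8

print(ascii_bytes)  # b'ABC'
print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'

# Encoding Chinese text as ASCII fails with UnicodeEncodeError.
ascii_failed = False
try:
    '中文'.encode('ascii')
except UnicodeEncodeError as exc:
    ascii_failed = True
    print('cannot encode:', exc)
```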

Conversely, data read from the network or from disk as a byte stream is bytes. To turn bytes into str, use the decode() method.
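A minimal sketch of decoding, using byte literals in place of data that would really come from a socket or a file:

```python
from_ascii = b'ABC'.decode('ascii')
from_utf8 = b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')

print(from_ascii)  # ABC
print(from_utf8)   # 中文
```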

To count how many characters a str contains, use the len() function.
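For example (sample strings are illustrative):

```python
n_english = len('ABC')
n_chinese = len('中文')

print(n_english)  # 3
print(n_chinese)  # 2
```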

For a str, len() counts characters; for bytes, len() counts bytes.

One Chinese character typically takes 3 bytes when UTF-8 encoded, while one English letter takes only 1 byte.
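The byte counts can be checked by encoding first and then calling len() (sample strings are illustrative):

```python
english_bytes = len('ABC'.encode('utf-8'))   # 1 byte per English letter
chinese_bytes = len('中文'.encode('utf-8'))  # 3 bytes per Chinese character here

print(english_bytes)  # 3
print(chinese_bytes)  # 6
print(len('中文'))    # 2 (characters, not bytes)
```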

When working with strings, we often need to convert between str and bytes. To avoid garbled text, always use UTF-8 for these conversions.

Python source code is itself a text file, so when your source code contains Chinese, be sure to save it as UTF-8. So that the Python interpreter reads the source as UTF-8, we usually write these two lines at the beginning of the file:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The first comment line tells Linux/OS X that this is an executable Python program; Windows ignores this comment.

The second comment line tells the Python interpreter to read the source code as UTF-8; otherwise, Chinese text written in the source code may come out garbled.

Declaring UTF-8 encoding does not mean that your .py file is actually UTF-8 encoded; you must also make sure that your text editor saves the file as UTF-8 without BOM.

If the .py file itself is UTF-8 encoded and also declares # -*- coding: utf-8 -*-, running it from the command prompt will display Chinese correctly.

Formatting:

The last common question is how to output formatted strings. We often output strings such as 'Dear xxx, hello! Your bill for month xx is xx, and your balance is xx', where the xxx parts change according to variables, so we need a simple way to format strings.

In Python, formatting works the same way as in C and is done with the % operator.

As you may have guessed, the % operator formats a string. Inside the string, %s is replaced with a string and %d with an integer. There must be as many variables or values after the % operator as there are %? placeholders, in matching order. If there is only one %? placeholder, the parentheses can be omitted.
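A short sketch of the % operator with one and with two placeholders (the names and amounts are made up for illustration):

```python
# One placeholder: the parentheses around the value can be omitted.
greeting = 'Hello, %s' % 'world'

# Two placeholders: values are passed as a tuple, in matching order.
message = 'Hi, %s, you have $%d.' % ('Michael', 1000000)

print(greeting)  # Hello, world
print(message)   # Hi, Michael, you have $1000000.
```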

Common placeholders are:

%d    integer

%f    floating-point number

%s    string

%x    hexadecimal integer

When formatting integers and floating-point numbers, you can also specify whether to pad with zeros and how many integer and decimal digits to show.
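A sketch of width, zero-padding and precision (the values are illustrative):

```python
padded = '%2d-%02d' % (3, 1)    # width 2 with spaces, then width 2 with zeros
rounded = '%.2f' % 3.1415926    # two digits after the decimal point
hex_form = '%x' % 255           # hexadecimal integer

print(repr(padded))  # ' 3-01'
print(rounded)       # 3.14
print(hex_form)      # ff
```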

If you're not sure which placeholder to use, %s always works: it converts any data type to a string.
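For instance, %s accepts an integer and a boolean equally well (the values are illustrative):

```python
line = 'Age: %s. Gentleman: %s' % (25, True)
items = 'Items: %s' % [1, 2]

print(line)   # Age: 25. Gentleman: True
print(items)  # Items: [1, 2]
```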

What if the % inside the string is meant as an ordinary character? In that case it must be escaped: write %% to represent a single %.
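A minimal sketch of the escape (the sentence is made up for illustration):

```python
rate = 'growth rate: %d %%' % 7  # %% produces one literal percent sign

print(rate)  # growth rate: 7 %
```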

Python 3 strings use Unicode and support multiple languages directly.

Converting between str and bytes requires specifying an encoding. The most commonly used encoding is UTF-8.

When formatting strings, you can try things out quickly in Python's interactive command line.
