Python Study Notes-Unicode

Last Update:2018-12-04 Source: Internet

Author: User

Here is a brief introduction. (The following content is largely drawn from Core Python Programming, 2nd edition.)

Unicode is the "secret weapon" that lets computers support the many languages on the planet. Before Unicode, text was stored as ASCII, in which each English character is kept in the computer as a 7-bit binary number. The printable characters occupy the values 32 to 126. How ASCII is implemented is not covered here.

However, ASCII can represent only 95 printable characters. ASCII was later extended to 8 bits, which allows 223 printable characters. That is enough for the alphabetic languages of Europe and the Americas, but far too few for languages such as Chinese. So Unicode was born.
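The count of 95 printable characters can be checked directly. The following is a minimal sketch written for Python 3 (an assumption on my part, since these notes describe Python 2), and the variable name is my own:

```python
# Printable ASCII characters occupy code values 32 (space) through 126 (~).
printable = [chr(code) for code in range(32, 127)]

print(len(printable))                             # 95
print(printable[0] == " ")                        # True: 32 is the space character
print(printable[-1])                              # ~
print(max(ord(c) for c in printable) < 2 ** 7)    # True: every value fits in 7 bits
```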

Unicode represents each character with one or more bytes, which removes the ASCII limit. In this way, Unicode can represent more than 90,000 characters.
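To see what "beyond the ASCII limit" means in practice, here is a small Python 3 sketch (Python 3 is my assumption; the notes describe Python 2). The sample characters are chosen only for illustration:

```python
# ord() returns the Unicode code point of a character.
letter = "A"
han = "\u4e2d"      # the Chinese character zhong ("middle")

print(ord(letter))                  # 65, inside the 0-127 ASCII range
print(ord(han))                     # 20013, far outside the ASCII range
print(len(han.encode("utf-8")))     # 3: this character needs three bytes in UTF-8
```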

Python and Unicode

To make Unicode and ASCII strings look as similar as possible, Python strings changed from a simple data type into real objects. ASCII strings became StringType and Unicode strings became UnicodeType, and the two behave very similarly. The string module contains corresponding functions for them, but it is no longer updated and supports only ASCII. Using the string module is not recommended: do not use it in any code that requires Unicode compatibility. Python retains the module only for backward compatibility.

In Python 2, which these notes describe, all string literals are ASCII-encoded by default. You declare a Unicode string by adding a 'u' prefix to the literal. The 'u' (or 'U') prefix tells Python to compile the string that follows into a Unicode string.

>>> "Hello World"     # ASCII string
'Hello World'
>>> u"Hello World"    # Unicode string
u'Hello World'
The built-in str() and chr() functions cannot process Unicode; they only handle regular ASCII-encoded strings. If a Unicode string is passed to str(), it is first converted to an ASCII string and then handed to str(), which raises an exception if the string contains characters that ASCII cannot represent. Use unicode() and unichr() for Unicode data instead.
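The ASCII limitation described here can be reproduced in Python 3 as well, where str is already a Unicode string and the conversion to ASCII has to be requested explicitly. This is only a sketch under that assumption; the sample text and variable names are mine:

```python
text = "caf\u00e9"      # "cafe" with an accented e, which ASCII cannot represent

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # The same kind of error Python 2 raises when str() meets non-ASCII text.
    ascii_ok = False

print(ascii_ok)                     # False
print("cafe".encode("ascii"))       # b'cafe': pure ASCII text converts fine
```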

Codecs

Codec is short for coder/decoder (COder/DECoder). A codec defines how text is converted to and from binary data. ASCII converts each character to a number stored in a single byte, whereas Unicode uses multiple bytes, which is why Unicode supports several encodings. Four familiar codecs are ASCII, ISO 8859-1/Latin-1, UTF-8, and UTF-16.
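As a rough illustration, the four codecs named above can be looked up with the standard codecs module and applied to the same character. This sketch assumes Python 3; the sample character and the results dictionary are my own additions:

```python
import codecs

text = "\u00e9"     # the letter e with an acute accent

results = {}
for name in ("ascii", "latin-1", "utf-8", "utf-16"):
    codec_info = codecs.lookup(name)        # raises LookupError for unknown codecs
    try:
        results[name] = text.encode(name)
    except UnicodeEncodeError:
        results[name] = None                # this codec cannot represent the character
    print(codec_info.name, results[name])

print(results["ascii"] is None)             # True: ASCII has no accented letters
print(len(results["latin-1"]))              # 1 byte
print(len(results["utf-8"]))                # 2 bytes
```

The UTF-16 result also starts with a 2-byte byte order mark, so it is longer than the character alone would suggest.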

The best known is UTF-8, which also uses a single byte to encode each ASCII character. This makes life easier for programmers who must handle ASCII and Unicode text at the same time, because the UTF-8 and ASCII encodings of ASCII characters are exactly the same.
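That compatibility is easy to confirm. A minimal Python 3 sketch (Python 3 being my assumption, with variable names of my own choosing):

```python
text = "Hello World"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

print(as_ascii == as_utf8)          # True: identical bytes for ASCII-only text
print(len(as_utf8) == len(text))    # True: one byte per character
```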

UTF-8 uses 1 to 4 bytes to represent characters from other languages. This causes trouble for programmers who need to process the encoded data directly, because they cannot step through it one fixed-length character at a time. Fortunately, we do not need to know how to read Unicode data directly. Python takes care of those details for us, so we do not have to worry about the complexities of multi-byte characters.
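The variable width can be observed by encoding a few characters one at a time. The following Python 3 sketch (my assumption; the notes use Python 2) uses sample characters that I picked to cover each width:

```python
# One character for each possible UTF-8 width.
samples = ["A", "\u00e9", "\u4e2d", "\U0001f600"]

widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)                       # [1, 2, 3, 4]

data = "".join(samples).encode("utf-8")
print(len(data))                    # 10 bytes in total
print(len(data.decode("utf-8")))    # 4 characters after decoding
```

Decoding first and then working with the resulting string is what lets us ignore the byte widths entirely.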

UTF-16 is also a variable-length encoding (each character takes 2 or 4 bytes), but it is less commonly used than UTF-8 for files and data exchange.
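A short Python 3 sketch (again an assumption, since the notes target Python 2) shows the two UTF-16 widths. It uses the "utf-16-le" codec so that no byte order mark is added to the counts; the sample characters are mine:

```python
bmp_char = "\u4e2d"             # fits in a single 16-bit unit
astral_char = "\U0001f600"      # needs a surrogate pair (two 16-bit units)

print(len(bmp_char.encode("utf-16-le")))        # 2
print(len(astral_char.encode("utf-16-le")))     # 4
print(len(bmp_char.encode("utf-16")))           # 4: a 2-byte byte order mark is prepended
```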

Encoding and decoding

Unicode supports multiple encoding formats, which puts an extra burden on programmers. Whenever you write a string to a file, you must specify an encoding that converts the Unicode content into the format you want. Python handles this through the encode() method of Unicode strings. The method takes the name of an encoding as its argument and returns the string's characters encoded in that format.

Therefore, every time we write a Unicode string to disk, we need to "encode" it with a specified codec. Correspondingly, when we read data back from the file, we must "decode" it to turn it into a Unicode string object again.
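The encode-on-write, decode-on-read cycle can be tried without touching the disk. In this Python 3 sketch (my assumption), an io.BytesIO buffer stands in for a file opened in binary mode, and the sample text is my own:

```python
import io

text_out = "Hello \u4e16\u754c\n"       # "Hello" followed by two Chinese characters

buffer = io.BytesIO()                   # stand-in for a binary file
buffer.write(text_out.encode("utf-8"))  # encode before writing

raw = buffer.getvalue()                 # what a later read would return: bytes
text_in = raw.decode("utf-8")           # decode after reading

print(type(raw).__name__, type(text_in).__name__)   # bytes str
print(text_in == text_out)                          # True
```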

Simple Example:

The following code creates a Unicode string, encodes it with the UTF-8 codec, writes it to a file, reads the data back from the file, and decodes it into a Unicode string object. Finally, it prints the Unicode string to confirm that the program ran correctly.

On Linux, enter the following code in vim and save it as unifile.py. The comments are my own additions.

# /home/xiaopeng/python/code/unifile.py
'''
An example of reading and writing Unicode strings: writes
a Unicode string to a file in UTF-8 and reads it back in.
'''

codec = 'utf-8'                         # encoding method
filename = 'unicode.txt'                # name of the file to save

hello_out = u"Hello world\n"            # create a Unicode string
bytes_out = hello_out.encode(codec)     # encode it as UTF-8

# Binary mode ('wb'/'rb') keeps the encoded bytes untouched and lets the
# same script run on both Python 2 and Python 3.
f = open(filename, 'wb')
f.write(bytes_out)                      # write to the specified file
f.close()

f = open(filename, 'rb')
bytes_in = f.read()                     # read the raw bytes back
f.close()

hello_in = bytes_in.decode(codec)       # decode into a Unicode string
print(hello_in)                         # print it

Run python unifile.py in the terminal.

The result is Hello world.

Afterwards there is a new file named unicode.txt in the current directory. Run the cat command on it and you will see that its content matches the printed result.
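As an alternative to calling encode() and decode() by hand, the standard io.open() function accepts an encoding argument and performs both conversions itself. The sketch below is an assumption-laden variant of the example, not the original script: it should run on Python 2.6+ and Python 3, and it writes into a temporary directory (my choice) rather than the current one:

```python
import io
import os
import tempfile

# A throwaway directory, so the sketch does not clutter the working directory.
path = os.path.join(tempfile.mkdtemp(), "unicode.txt")

# Text is encoded to UTF-8 automatically as it is written.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"Hello world\n")

# Bytes are decoded from UTF-8 automatically as they are read.
with io.open(path, "r", encoding="utf-8") as f:
    hello_in = f.read()

print(hello_in == u"Hello world\n")     # True
```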

Keep the following four points in mind when using Unicode in practice:

1. Always prefix string literals in your program with u.

2. Do not use the str() function; use unicode() instead.

3. Do not use the outdated string module. If it is given any non-ASCII characters, it will mess everything up.

4. Do not encode or decode Unicode characters in your program unless you have to. Call encode() and decode() only when you write to or read from a file, a database, or the network.
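The fourth point can be sketched as a function that decodes once on the way in, works only on text, and encodes once on the way out. This is a Python 3 illustration under my own assumptions; shout() is a hypothetical helper, not something from the book:

```python
def shout(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode on input, work on text, encode on output."""
    text = raw_bytes.decode(encoding)   # decode once, at the input boundary
    text = text.upper()                 # all processing happens on Unicode text
    return text.encode(encoding)        # encode once, at the output boundary


incoming = "caf\u00e9".encode("utf-8")  # pretend these bytes came from a file or socket
outgoing = shout(incoming)

print(outgoing.decode("utf-8") == "CAF\u00c9")   # True: the accented letter was uppercased too
```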

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
