Python 3.x Resume (9) -- Strings and Encoding (Who Knows My Details)

Source: Internet
Author: User
Tags ming ord string format

Details determine success or failure, so be sure to remember your own strengths and weaknesses. Knowing oneself is a lifelong task.

-----------Hashlinux

Character encoding

As we've already said, strings are also a data type, but what makes strings special is that they come with an encoding problem.

Because a computer can only handle numbers, text must be converted to numbers before it can be processed. Early computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent a larger integer, you must use more bytes. For example, the largest integer two bytes can represent is 65535, and the largest integer four bytes can represent is 4294967295.
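The byte arithmetic above can be checked directly. The following is a minimal sketch; the helper name max_unsigned is made up for illustration and does not come from the original text:

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes (8 bits each)."""
    return 2 ** (8 * num_bytes) - 1

for n in (1, 2, 4):
    print(n, max_unsigned(n))
# 1 255
# 2 65535
# 4 4294967295
```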

Since the computer was invented by Americans, at first only 127 characters were encoded: upper- and lowercase English letters, digits, and some symbols. This code table is called ASCII. For example, the uppercase letter A is encoded as 65 and the lowercase letter z as 122.

But one byte is clearly not enough to handle Chinese. At least two bytes are needed, and the result must not conflict with ASCII, so China developed the GB2312 encoding to hold Chinese characters.

As you can imagine, there are hundreds of languages around the world. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and every country has its own standard. These standards inevitably conflict, and as a result, text that mixes several languages shows up as garbled characters.
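The garbling described above can be reproduced in a few lines. This is only an illustrative sketch: the choice of Latin-1 as the "wrong" encoding on the reading side is an assumption made for the demonstration, not something the original text specifies.

```python
text = '中文'

# Bytes as a GB2312-based program would write them.
raw = text.encode('gb2312')

# A reader that wrongly assumes Latin-1 sees unrelated characters.
garbled = raw.decode('latin-1')
print(garbled)            # ÖÐÎÄ

# Decoding with the encoding that was actually used restores the text.
restored = raw.decode('gb2312')
print(restored == text)   # True
```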

[Figure: char-encoding-problem]

This is why Unicode emerged. Unicode unifies all languages into a single set of codes, so the garbling problem goes away.

The Unicode standard keeps evolving, but the most common form represents a character with two bytes (very rare characters need 4 bytes). Modern operating systems and most programming languages support Unicode directly.

Now the difference between ASCII and Unicode is clear: an ASCII code is 1 byte, while a Unicode code is usually 2 bytes.

The letter A in ASCII is decimal 65, binary 01000001;

The character '0' in ASCII is decimal 48, binary 00110000; note that the character '0' and the integer 0 are different;

The Chinese character 中 is outside the ASCII range; its Unicode encoding is decimal 20013, binary 01001110 00101101.

You can guess that to express the ASCII character A in Unicode, you only need to pad zeros in front, so the Unicode encoding of A is 00000000 01000001.
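The decimal and binary values listed above can be verified in Python. A small sketch, in which the helper name code_point_bits is hypothetical and the 16-bit width is chosen only to match the two-byte view used in the text:

```python
def code_point_bits(ch):
    """Return the code point of ch as a zero-padded 16-bit binary string."""
    return format(ord(ch), '016b')

for ch in ('A', '0', '中'):
    print(repr(ch), ord(ch), code_point_bits(ch))
# 'A' 65 0000000001000001
# '0' 48 0000000000110000
# '中' 20013 0100111000101101
```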

But a new problem arises: once everything is unified under Unicode, the garbling disappears, but if your text is almost entirely English, Unicode needs twice as much storage as ASCII, which is wasteful for storage and transmission.

Therefore, in the spirit of economy, the "variable-length" UTF-8 encoding was created as a more compact form of Unicode. UTF-8 encodes a Unicode character into 1-4 bytes depending on the size of its number (the original design allowed up to 6 bytes, but the current standard stops at 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transmit contains a large amount of English, UTF-8 saves space:

Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         x         01001110 00101101  11100100 10111000 10101101
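The UTF-8 column of the table can be reproduced with str.encode(). The sketch below uses a hypothetical helper, utf8_bits, and adds an emoji (U+1F600) purely as an assumed example of a character that needs 4 bytes:

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary groups."""
    return ' '.join(format(byte, '08b') for byte in ch.encode('utf-8'))

print(utf8_bits('A'))    # 01000001
print(utf8_bits('中'))   # 11100100 10111000 10101101

# Number of bytes per character grows with the code point.
for ch in ('A', '中', '\U0001F600'):
    print(hex(ord(ch)), len(ch.encode('utf-8')))
# 0x41 1
# 0x4e2d 3
# 0x1f600 4
```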

The table above also shows an added benefit of UTF-8: ASCII can be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII continues to work with UTF-8.

Having worked out the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encoding works in today's computer systems:

In memory, Unicode is used uniformly; when text needs to be saved to disk or transmitted, it is converted to UTF-8.

When you edit with Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory; when you finish editing and save, the Unicode is converted back to UTF-8 and written to the file:

[Figure: rw-file-utf-8]
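The same read/write cycle can be sketched in Python with the built-in open() and its encoding parameter. The file name demo.txt and the sample text are assumptions for illustration only:

```python
import os
import tempfile

text = 'Hello, 中文'   # str in memory (Unicode)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')   # hypothetical file name

    # Saving: the str is converted to UTF-8 bytes on the way to disk.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # The raw file content is bytes.
    with open(path, 'rb') as f:
        raw = f.read()

    # Loading: the UTF-8 bytes are converted back to a str.
    with open(path, 'r', encoding='utf-8') as f:
        loaded = f.read()

print(len(text), len(raw))   # 9 13
print(loaded == text)        # True
```

Passing encoding='utf-8' explicitly avoids depending on the platform's default encoding.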

When you browse the Web, the server converts dynamically generated Unicode content to UTF-8 and then transmits it to the browser:

[Figure: web-utf-8]

That is why the source of many web pages contains something like <meta charset="UTF-8" />, indicating that the page is encoded in UTF-8.

Python strings

Having sorted out the annoying character encoding problem, let's look at Python strings.

In Python 3, strings are Unicode, which means Python strings support multiple languages, for example:

>>> print('包含中文的str')
包含中文的str

For a single character, Python provides the ord() function to get the character's integer code, and the chr() function to convert a code into the corresponding character:

>>> ord('A')
65
>>> ord('中')
20013
>>> chr(66)
'B'
>>> chr(25991)
'文'

If you know the integer code of a character, you can also write the str with hexadecimal escapes:

>>> '\u4e2d\u6587'
'中文'

The two formulations are completely equivalent.
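The equivalence, and the round trip between ord() and chr(), can be checked as follows. This is a small sketch; the variable names are illustrative:

```python
literal = '中文'
escaped = '\u4e2d\u6587'

# Both spellings produce the same str object value.
print(literal == escaped)                    # True

# ord() gives the code points, shown here in hexadecimal.
print([hex(ord(ch)) for ch in literal])      # ['0x4e2d', '0x6587']

# chr() turns the code points back into characters.
rebuilt = ''.join(chr(code) for code in (0x4e2d, 0x6587))
print(rebuilt == literal)                    # True
```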

Python's string type is str, which is represented in memory as Unicode, so one character may correspond to several bytes. To transmit a string over a network or save it to disk, you need to convert the str to bytes.

Python writes data of type bytes as single- or double-quoted literals prefixed with b:

x = b'ABC'

Note the difference between 'ABC' and b'ABC': the former is a str, while the latter, although displayed the same way, is bytes in which each character occupies exactly one byte.
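A short sketch of how the two types differ in practice; the variable names s and b are just for illustration:

```python
s = 'ABC'     # str: a sequence of Unicode characters
b = b'ABC'    # bytes: a sequence of integers in range 0-255

print(type(s).__name__, type(b).__name__)   # str bytes
print(s == b)                               # False: different types never compare equal
print(s[0], b[0])                           # A 65
print(s.encode('ascii') == b)               # True once the str is encoded
```

Indexing a str yields a one-character str, while indexing bytes yields an int.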

A str, represented in Unicode, can be encoded into bytes of a specified encoding with the encode() method, for example:

>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

A str containing only English can be encoded to bytes with ASCII, and the content stays the same. A str containing Chinese can be encoded to bytes with UTF-8. A str containing Chinese cannot be encoded with ASCII, because Chinese code points are outside the ASCII range, so Python raises an error.

In bytes, any byte that cannot be displayed as an ASCII character is shown as \x##.
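In a program, the error above is usually caught rather than left to crash. A minimal sketch, assuming a hypothetical helper encode_or_none that returns None when the text cannot be represented in the requested encoding:

```python
def encode_or_none(text, encoding):
    """Encode text, or return None if the encoding cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

print(encode_or_none('ABC', 'ascii'))    # b'ABC'
print(encode_or_none('中文', 'ascii'))   # None
print(encode_or_none('中文', 'utf-8'))   # b'\xe4\xb8\xad\xe6\x96\x87'
```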

Conversely, if we read a byte stream from the network or from disk, the data we get is bytes. To turn bytes into a str, use the decode() method:

>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
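Decoding can also fail when the bytes are not valid for the chosen encoding. The sketch below uses the errors parameter of bytes.decode(); the sample data, with a stray \xff byte appended, is made up for the demonstration:

```python
data = b'\xe4\xb8\xad\xff'   # valid UTF-8 for one character, then an invalid byte

try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)

# errors='ignore' silently drops the invalid bytes.
ignored = data.decode('utf-8', errors='ignore')
# errors='replace' substitutes U+FFFD for each invalid byte.
replaced = data.decode('utf-8', errors='replace')

print(ignored == '中')          # True
print(replaced == '中\ufffd')   # True
```

Dropping or replacing bytes hides data corruption, so the strict default is usually the safer choice.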

To count how many characters a str contains, use the len() function:

>>> len('ABC')
3
>>> len('中文')
2

For a str, len() counts characters; for bytes, len() counts bytes:

>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6

As you can see, one Chinese character usually takes 3 bytes when encoded as UTF-8, while one English character takes only 1 byte.
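For mixed text, the two counts diverge by 2 for every Chinese character. A quick sketch; the helper name describe and the sample strings are illustrative:

```python
def describe(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode('utf-8'))

print(describe('ABC'))         # (3, 3)
print(describe('中文'))        # (2, 6)
print(describe('Python中文'))  # (8, 12)
```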

When working with strings, we often need to convert between str and bytes. To avoid garbled text, always use UTF-8 for these conversions.

Python source code is also a text file, so when your source contains Chinese, be sure to save it as UTF-8. So that the Python interpreter reads the source as UTF-8, we usually write these two lines at the beginning of the file:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The first comment tells Linux/OS X that this is an executable Python program; Windows ignores it;

The second comment tells the Python interpreter to read the source code as UTF-8; otherwise, Chinese output written in the source may be garbled. (Python 3 already assumes UTF-8 for source files by default, so the declaration mainly makes the intent explicit.)

Declaring UTF-8 does not mean that your .py file is actually saved as UTF-8; you must also make sure that the text editor is using UTF-8 without BOM:

[Figure: set-encoding-in-notepad++]

If the .py file itself is saved as UTF-8 and also declares # -*- coding: utf-8 -*-, a test in the command prompt displays Chinese normally:

[Figure: py-chinese-test-in-cmd]

Formatting

The last common question is how to output a formatted string. We often output strings such as 'Dear xxx, hello! Your phone bill for month xx is xx, and your balance is xx', where the xxx parts vary with variables, so a convenient way to format strings is needed.

[Figure: py-str-format]

In Python, formatting works the same way as in C and is done with %, for example:

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator is used to format strings. Inside the string, %s is replaced by a string and %d by an integer. There must be as many variables or values after the % operator as there are %? placeholders, in matching order. If there is only one %?, the parentheses can be omitted.
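Applied to the phone bill message mentioned earlier, the % operator might be used as follows. This is only a sketch: the function name bill_message, the English wording of the template, and the sample values are all assumptions for illustration.

```python
def bill_message(name, month, fee, balance):
    """Fill the four placeholders in order: string, integer, two floats."""
    template = 'Dear %s, your bill for month %d is %.2f and your balance is %.2f'
    return template % (name, month, fee, balance)

print(bill_message('Michael', 5, 49.8, 120.5))
# Dear Michael, your bill for month 5 is 49.80 and your balance is 120.50
```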

Common placeholders are:

%d  Integer
%f  Floating point number
%s  String
%x  Hexadecimal integer

When formatting integers and floating-point numbers, you can also specify whether to pad with zeros and how many digits to use for the integer and fractional parts:

>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'
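The placeholders and width options can be tried side by side. A short sketch with made-up sample values; repr() is used so that padding spaces are visible in the output:

```python
# One example per common placeholder.
print('%d' % 42)        # 42
print('%f' % 3.5)       # 3.500000
print('%s' % 'text')    # text
print('%x' % 255)       # ff

# Width, zero padding and precision.
print(repr('%5d' % 42))          # '   42'
print(repr('%05d' % 42))         # '00042'
print(repr('%8.3f' % 3.14159))   # '   3.142'
```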

If you're not sure which placeholder to use, %s always works; it converts any data type to a string:

>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'

Sometimes the string itself contains a literal % character. In that case you need to escape it, using %% to represent a single %:

>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'

Practice

Xiao Ming's score rose from 72 points last year to 85 points this year. Calculate the percentage by which his score improved and display it with string formatting as 'xx.x%', keeping only 1 digit after the decimal point:

# -*- coding: utf-8 -*-

s1 = 72
s2 = 85
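One possible solution is sketched below. It assumes the improvement is measured relative to last year's score; the names r and result are chosen here and are not part of the exercise as given.

```python
s1 = 72
s2 = 85

# Improvement relative to last year's score, as a percentage.
r = (s2 - s1) / s1 * 100

# %.1f keeps one digit after the decimal point; %% prints a literal %.
result = '%.1f%%' % r
print(result)   # 18.1%
```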
Summary

The Python 3 string uses Unicode and supports multiple languages directly.

When converting between str and bytes, the encoding must be specified. The most commonly used encoding is UTF-8. Python of course supports other encodings as well, for example encoding Unicode as GB2312:

>>> '中文'.encode('gb2312')
b'\xd6\xd0\xce\xc4'

But this is purely asking for trouble. Unless there is a special business requirement, remember to use only UTF-8.
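Why mixing encodings causes trouble can be seen by encoding the same text both ways and then decoding with the wrong one. A minimal sketch; the variable names are illustrative:

```python
text = '中文'

gb = text.encode('gb2312')
utf8 = text.encode('utf-8')
print(gb, len(gb))       # b'\xd6\xd0\xce\xc4' 4
print(utf8, len(utf8))   # b'\xe4\xb8\xad\xe6\x96\x87' 6

# Bytes only make sense together with the encoding that produced them.
try:
    gb.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)        # False
```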

When you format a string, you can use Python's interactive command line to test it quickly and easily.


This article is from the "lake and Laughter" blog, please make sure to keep this source http://hashlinux.blog.51cto.com/9647696/1792723
