A detailed look at strings and character encodings in Python

Source: Internet
Author: User

Character encoding

As discussed earlier, strings are also a data type, but strings have an additional problem of their own: encoding.

Because a computer can only process numbers, text must be converted into numbers before it can be processed. The earliest computers were designed with eight bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent a larger integer, you must use more bytes. For example, the largest integer that two bytes can represent is 65535, and the largest integer that four bytes can represent is 4294967295.
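The byte limits can be checked with a short sketch. Python 3 is assumed, and the helper name max_unsigned is ours, chosen only for illustration:

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes of 8 bits each."""
    return (1 << (8 * num_bytes)) - 1


for n in (1, 2, 4):
    print(n, "byte(s):", max_unsigned(n))
```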

Because computers were invented in the United States, only 128 characters were encoded at first (codes 0 through 127): uppercase and lowercase English letters, digits, and some symbols. This code table is called ASCII. For example, the uppercase letter A is 65, and the lowercase letter z is 122.

However, one byte is clearly not enough to represent Chinese. At least two bytes are needed, and they must not conflict with ASCII, so China developed the GB2312 encoding for Chinese characters.

As you can imagine, there are hundreds of languages around the world. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and each country has its own standard, so conflicts are inevitable. As a result, text that mixes several languages is displayed as garbled characters.
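A minimal sketch of how garbled text comes about, assuming Python 3. The Latin-1 table is our choice of "wrong" encoding purely for illustration; any mismatched table would do:

```python
text = "中文"

# Encode with the Chinese GB2312 standard.
gb_bytes = text.encode("gb2312")

# Decoding with a different table (Latin-1 here) does not fail;
# it silently produces other characters, which is what garbled text is.
garbled = gb_bytes.decode("latin-1")

# Decoding with the table that was used for encoding restores the text.
restored = gb_bytes.decode("gb2312")

print(gb_bytes)             # b'\xd6\xd0\xce\xc4'
print(garbled == text)      # False
print(restored == text)     # True
```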

Unicode was created to solve this. Unicode unifies all languages into a single character set, so the garbling problem goes away.

The Unicode standard keeps evolving, but the most common form uses two bytes per character (very rare characters need four bytes). Modern operating systems and most programming languages support Unicode directly.

Now compare ASCII and Unicode encoding: ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.

The letter A is 65 in decimal in ASCII, which is 01000001 in binary;

The character 0 is 48 in decimal in ASCII, which is 00110000 in binary. Note that the character '0' is different from the integer 0;

The Chinese character 中 is outside the ASCII range. Its Unicode code is 20013 in decimal, which is 01001110 00101101 in binary.

As you might guess, to represent the ASCII character A in Unicode you only need to pad it with zeros in front, so the Unicode encoding of A is 00000000 01000001.
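These code values can be reproduced with ord() and format(). This is a sketch for Python 3; the helper two_byte_bits is a name introduced here for illustration, and it only makes sense for characters whose code fits in 16 bits:

```python
def two_byte_bits(ch):
    """Return the code of ch as two 8-bit groups (codes up to 65535 only)."""
    bits = format(ord(ch), "016b")
    return bits[:8] + " " + bits[8:]


print(ord("A"), two_byte_bits("A"))    # 65 00000000 01000001
print(ord("0"), two_byte_bits("0"))    # 48 00000000 00110000
print(ord("中"), two_byte_bits("中"))  # 20013 01001110 00101101
```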

A new problem then arises: with Unicode everywhere, garbled text disappears, but if your text is almost entirely English, Unicode needs twice as much storage space as ASCII, which is wasteful for both storage and transmission.
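The doubling is easy to measure. In this sketch (Python 3 assumed) the codec "utf-16-be" stands in for the fixed two-byte form described in the text; that mapping is our assumption, chosen because it writes exactly two bytes for each of these characters and adds no byte order mark:

```python
text = "Hello, world"

ascii_size = len(text.encode("ascii"))
two_byte_size = len(text.encode("utf-16-be"))

print(len(text), ascii_size, two_byte_size)  # 12 12 24
```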

So, in the interest of saving space, the variable-length UTF-8 encoding of Unicode appeared. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code value (the original design allowed up to 6 bytes): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. If the text you transfer contains mostly English characters, UTF-8 saves space:

Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         (none)    01001110 00101101  11100100 10111000 10101101

The table also shows an extra benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII keeps working under UTF-8.
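The UTF-8 column and the ASCII compatibility can both be verified in a few lines. This is a sketch assuming Python 3; utf8_bits is a helper name made up for this example:

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


print(utf8_bits("A"))    # 01000001
print(utf8_bits("中"))   # 11100100 10111000 10101101

# Pure ASCII text yields identical bytes under both encodings,
# which is why ASCII-only software can still read it.
same_bytes = "ABC".encode("ascii") == "ABC".encode("utf-8")
print(same_bytes)        # True
```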

With the relationship between ASCII, Unicode, and UTF-8 clear, we can summarize how character encodings are commonly used in computer systems today:

In memory, text is held uniformly as Unicode; when it needs to be saved to disk or transmitted, it is converted to UTF-8.

When you edit a file in Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory; when you finish editing and save, the Unicode characters are converted back to UTF-8 and written to the file.

When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser.
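The same save-and-load cycle can be sketched in Python 3, where open() takes an encoding argument. The file name note.txt and the temporary directory are hypothetical, used only for this illustration:

```python
import os
import tempfile

text = "Hello, 中文"  # in memory: a sequence of Unicode characters

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Saving: the Unicode text is converted to UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# The file holds more bytes than the text has characters,
# because each Chinese character takes 3 bytes in UTF-8.
size_on_disk = os.path.getsize(path)

# Loading: the UTF-8 bytes are decoded back into Unicode characters.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(len(text), size_on_disk, loaded == text)  # 9 13 True
```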

That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.

Python strings

After figuring out the troublesome character encoding problem, let's study Python's support for Unicode.

Because Python was born before the Unicode standard was released, the earliest versions of Python only supported ASCII, and an ordinary string such as 'abc' is ASCII-encoded inside Python. Python provides the ord() and chr() functions to convert between characters and their numeric codes:

>>> ord('A')
65
>>> chr(65)
'A'

Python later added support for Unicode. A Unicode string is written as u'...', for example:

>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'

Writing u'中' and u'\u4e2d' gives the same string: \u is followed by the hexadecimal Unicode code. Likewise, u'A' and u'\u0041' are the same.
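The equivalence between a character and its \u escape can be confirmed directly. The sketch below assumes Python 3, where every string literal is Unicode; escape_for is a helper name invented for this example and only handles codes that fit in four hexadecimal digits:

```python
def escape_for(ch):
    """Build the \\uXXXX escape text for a character (codes up to 0xFFFF)."""
    return "\\u{:04x}".format(ord(ch))


print("\u4e2d" == "中")      # True
print("\u0041" == "A")       # True
print(escape_for("A"))       # \u0041
print(hex(ord("中")))        # 0x4e2d
```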

How do the two kinds of string convert to each other? Although the string 'xxx' is ASCII-encoded, it can also be regarded as UTF-8 encoded, whereas u'xxx' can only be Unicode.

Use the encode('utf-8') method to convert u'xxx' into a UTF-8 encoded 'xxx':

>>> u'abc'.encode('utf-8')
'abc'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'

For English characters, the UTF-8 value after conversion is the same as the Unicode value (although the storage space used is different), while one Chinese Unicode character becomes three UTF-8 bytes. The \xe4 you see is one of those bytes: its value is 228, which has no corresponding displayable letter, so the byte is shown in hexadecimal. The len() function returns the length of a string:

>>> len(u'abc')
3
>>> len('abc')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6

Conversely, use the decode('utf-8') method to convert a UTF-8 encoded string 'xxx' into the Unicode string u'xxx':

>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
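The sessions above are Python 2. For comparison, here is a sketch of the same round trip in Python 3, where the two types are named str (Unicode text) and bytes (encoded data):

```python
text = "中文"

data = text.encode("utf-8")   # str -> bytes
print(data)                   # b'\xe4\xb8\xad\xe6\x96\x87'

# Two characters, but six bytes once encoded.
print(len(text), len(data))   # 2 6

# Indexing bytes gives the numeric value of one byte: \xe4 is 228.
print(data[0])                # 228

decoded = data.decode("utf-8")  # bytes -> str
print(decoded == text)        # True
```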

Python source code is itself a text file, so when your source contains Chinese, you must save the file as UTF-8. So that the Python interpreter also reads the source as UTF-8, we usually put these two lines at the top of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

The first comment tells Linux/OS X that this is an executable Python program; Windows ignores it.

The second comment tells the Python interpreter to read the source code as UTF-8; otherwise the Chinese you write in the source may come out garbled.

If you edit with Notepad++, then besides adding # -*- coding: utf-8 -*-, Chinese strings must be written as Unicode strings (with the u prefix).

Declaring UTF-8 encoding does not by itself make your .py file UTF-8 encoded; you must also make sure that Notepad++ is saving the file as UTF-8 without BOM.

If the .py file itself is saved as UTF-8 and also declares # -*- coding: utf-8 -*-, Chinese is displayed normally when you test it at the command prompt.
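To see what the interpreter makes of such a header, Python 3's standard library offers tokenize.detect_encoding, which inspects the first lines of a source file the same way the interpreter does. The sketch below assumes Python 3 (which already defaults to UTF-8 when no declaration is present); declared_encoding is a wrapper name of our own:

```python
import codecs
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to read these source bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


header = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n"

print(declared_encoding(header))                         # utf-8
# A byte order mark at the start of the file is reported separately.
print(declared_encoding(codecs.BOM_UTF8 + header))       # utf-8-sig
print(declared_encoding(b"# -*- coding: gb2312 -*-\n"))  # gb2312
```

The second result shows why the editor setting matters: a file saved with a BOM is not byte-for-byte the same as one saved without it, even though both carry the same declaration.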

Formatting

The last common problem is how to output formatted strings. We often output strings like 'Dear xxx, hello! Your phone bill for month xx is xx, and your balance is xx', where the xxx parts change according to variables. This calls for a simple way to format strings.

In Python, formatting works the same way as in C. It is done with the % operator, for example:

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator is used to format strings. Inside the string, %s means substitute a string and %d means substitute an integer. There must be as many variables or values after the % operator as there are %? placeholders, in the same order. If there is only one %?, the parentheses can be omitted.
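Applied to the phone bill message mentioned earlier, this could look like the following sketch. The function name bill_message, its parameters, and the sample values are all hypothetical:

```python
def bill_message(name, month, bill, balance):
    # Four placeholders, so the tuple after % carries four values in the same order.
    return ('Dear %s, your phone bill for month %d is %d and your balance is %d.'
            % (name, month, bill, balance))


print(bill_message('Michael', 3, 45, 12))

# With a single placeholder the parentheses can be dropped.
greeting = 'Hello, %s' % 'world'
print(greeting)
```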

Common placeholders include:

  • %d integer
  • %f floating point number
  • %s string
  • %x hexadecimal integer
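A quick sketch that exercises each of the four placeholders once (the sample values are arbitrary):

```python
value = 255

examples = {
    '%d': '%d' % value,     # decimal integer
    '%f': '%f' % 3.5,       # floating point, six decimal places by default
    '%s': '%s' % 'text',    # string
    '%x': '%x' % value,     # hexadecimal integer, lowercase digits
}

for placeholder, result in examples.items():
    print(placeholder, '->', result)
```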

When formatting integers and floating-point numbers, you can also specify zero padding and the number of integer and decimal digits:

>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'

If you are not sure which placeholder to use, %s always works: it converts any data type to a string:

>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'

For Unicode strings, the usage is the same, but it is best to ensure that the replaced string is also a Unicode string:

>>> u'Hi, %s' % u'Michael'
u'Hi, Michael'

Sometimes the % in a string is meant as an ordinary character. In that case it must be escaped: use %% to represent a single %:

>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'

Summary

For historical reasons, although Python 2.x supports Unicode, its syntax needs two kinds of string: 'xxx' and u'xxx'.

Python also supports other encoding methods, such as encoding Unicode into GB2312:

>>> u'中文'.encode('gb2312')
'\xd6\xd0\xce\xc4'

However, doing so is just asking for trouble. Unless there is a special business requirement, remember to use only Unicode and UTF-8.

In Python 3.x, 'xxx' and u'xxx' are unified as Unicode strings, so writing the u prefix or leaving it out makes no difference, while a string expressed as bytes must carry the b prefix: b'xxx'.
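A short sketch of that Python 3 behavior (it will not run the same way under Python 2):

```python
text = u'abc'   # the u prefix is accepted but changes nothing
data = b'abc'   # bytes need the b prefix

print(text == 'abc')                              # True
print(type(text).__name__, type(data).__name__)   # str bytes

# Text and bytes are separate types and cannot be concatenated directly.
try:
    text + data
except TypeError:
    can_mix = False
else:
    can_mix = True
print(can_mix)                                    # False

# An explicit decode bridges the two.
print(data.decode('utf-8') == text)               # True
```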

When formatting strings, you can test in the Python interactive command line, which is quick and convenient.