As discussed earlier, strings are also a data type, but strings come with one extra complication: encoding.
Because a computer can only process numbers, text must be converted to numbers before it can be processed. Early computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). Representing a larger integer takes more bytes: the largest integer two bytes can represent is 65535, and the largest integer four bytes can represent is 4294967295.
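The byte limits above follow from one formula: n bytes of 8 bits each can hold unsigned values up to 2^(8n) - 1. A minimal sketch in Python 3 syntax (the helper name `max_unsigned` is made up for this illustration):

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes 8-bit bytes."""
    return 2 ** (8 * num_bytes) - 1


for n in (1, 2, 4):
    print(n, "byte(s):", max_unsigned(n))
```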
Since computers were invented by Americans, only 128 characters (codes 0 to 127) were encoded at first: uppercase and lowercase English letters, digits, and some symbols. This code table is called the ASCII encoding. For example, the encoding of the uppercase letter A is 65, and the encoding of the lowercase letter z is 122.
However, one byte is clearly not enough to handle Chinese characters. At least two bytes are required, and they must not conflict with the ASCII encoding, so China defined the GB2312 encoding to encode Chinese characters.
As you can imagine, there are hundreds of languages around the world. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and so on. With every country using its own standard, conflicts are inevitable, and text that mixes several languages ends up displayed as garbled characters.
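The conflict between national encodings can be reproduced in a few lines. The sketch below uses Python 3 syntax and the standard codec names `shift_jis` and `euc_kr`; the sample text is chosen only for illustration. It encodes Japanese text with one standard and then reads the same bytes with another:

```python
text = "日本語"  # sample Japanese text, chosen only for illustration

# Encode with the Japanese standard...
raw = text.encode("shift_jis")

# ...then decode the very same bytes as if they were Korean EUC-KR.
# errors="replace" substitutes U+FFFD for byte sequences EUC-KR cannot map.
misread = raw.decode("euc_kr", errors="replace")

print(repr(text), "->", raw, "->", repr(misread))
print("decoded with the right codec:", raw.decode("shift_jis"))
```

The bytes are unchanged in both cases; only the code table used to interpret them differs, which is exactly how garbled text arises.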
This is why Unicode came into being. Unicode unifies all languages into a single set of codes, so the garbling problem goes away.
The Unicode standard keeps evolving, but the most common form represents a character with two bytes (very rare characters need four bytes). Modern operating systems and most programming languages support Unicode directly.
Now compare the ASCII encoding with the Unicode encoding: an ASCII code is 1 byte, while a Unicode code is usually 2 bytes.
The letter A in ASCII is decimal 65, binary 01000001.
The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' is not the same thing as the integer 0.
The Chinese character 中 is beyond the range of ASCII. Its Unicode code is decimal 20013, binary 01001110 00101101.
As you might guess, to express the ASCII letter A in Unicode you only need to pad it with zeros in front, so the Unicode encoding of A is 00000000 01000001.
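These values can be checked with `ord()` and `format()`. This is a small sketch in Python 3 syntax; `code_bits` is a hypothetical helper written for this example:

```python
def code_bits(ch, width):
    """Return the code point of ch as a zero-padded binary string."""
    return format(ord(ch), "0{}b".format(width))


print("A", ord("A"), code_bits("A", 8))
print("0", ord("0"), code_bits("0", 8))
print("中", ord("中"), code_bits("中", 16))

# Padding the 8-bit ASCII code of A with zeros gives its two-byte form.
print("A padded to two bytes:", code_bits("A", 16))
```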
A new problem then appears: with everything unified under Unicode the garbling disappears, but if your text is almost entirely English, Unicode needs twice as much storage as ASCII, which is wasteful for both storage and transmission.
So, in the spirit of economy, the variable-length UTF-8 encoding appeared. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes, but the current standard stops at 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. If the text you transfer is mostly English, UTF-8 saves space:
|Character||ASCII||Unicode||UTF-8|
|A||01000001||00000000 01000001||01000001|
|中||x||01001110 00101101||11100100 10111000 10101101|
The table above also shows an extra benefit of UTF-8: ASCII can be viewed as a subset of UTF-8, so a large body of legacy software that only supports ASCII keeps working under UTF-8.
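The bit patterns and the space saving can be reproduced as follows. This is a sketch in Python 3 syntax; it assumes `utf-16-be` (big-endian, no byte order mark) as a stand-in for the two-byte Unicode form discussed here, and `byte_bits` is a hypothetical helper:

```python
def byte_bits(data):
    """Render a bytes object as space-separated 8-bit groups."""
    return " ".join(format(b, "08b") for b in data)


for ch in ("A", "中"):
    two_byte = ch.encode("utf-16-be")  # assumed stand-in for two-byte Unicode
    utf8 = ch.encode("utf-8")
    print(ch, "|", byte_bits(two_byte), "|", byte_bits(utf8))

# Mostly-English text needs half the space in UTF-8.
english = "Hello, world"
print(len(english.encode("utf-16-be")), "bytes vs", len(english.encode("utf-8")), "bytes")

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
print(english.encode("ascii") == english.encode("utf-8"))
```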
With the relationship between ASCII, Unicode and UTF-8 clear, we can summarize how character encoding commonly works in a computer system:
In memory, Unicode is used throughout. When text needs to be saved to disk or transmitted, it is converted to UTF-8.
When you edit a file in Notepad, the UTF-8 characters read from the file are converted to Unicode in memory. When you finish editing and save, the Unicode is converted back to UTF-8 and written to the file.
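The same load and save cycle can be sketched in Python 3 syntax, where `open()` with `encoding="utf-8"` performs the conversion. The temporary file and the sample text are assumptions made only for this example:

```python
import os
import tempfile

text = "中文 and English"  # Unicode text held in memory

# Saving: the in-memory text is encoded to UTF-8 bytes on disk.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What is actually stored on disk is bytes, not characters.
with open(path, "rb") as f:
    stored = f.read()

# Loading: the UTF-8 bytes are decoded back to Unicode text.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

os.remove(path)
print(len(text), "characters in memory,", len(stored), "bytes on disk")
```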
When you browse a web page, the server converts the dynamically generated Unicode content to UTF-8 before sending it to the browser.
That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.
With the troublesome encoding question out of the way, let's look at Python's support for Unicode.
Because Python was born before the Unicode standard was published, the earliest Python only supported ASCII, and an ordinary string such as 'ABC' is ASCII-encoded inside Python. Python provides the ord() and chr() functions to convert between a character and its numeric code:
>>> ord('A')
65
>>> chr(65)
'A'
Later, Python added support for Unicode. A Unicode string is written as u'...':
>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'
Writing u'中' is the same as writing u'\u4e2d': \u is followed by the hexadecimal Unicode code. Likewise, u'A' and u'\u0041' are the same.
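The equivalence between a literal character and its \u escape can be checked directly. Note that this sketch uses Python 3 syntax, where every string literal is already Unicode and the u prefix is optional:

```python
# Python 3 syntax: every str literal is Unicode, so no u prefix is needed.
escaped = "\u4e2d"
literal = "中"

print(escaped == literal)   # the escape and the literal are the same string
print("\u0041" == "A")      # the same holds for ASCII characters
print(hex(ord(literal)))    # the hexadecimal code that follows \u
print(chr(0x4E2D))          # and back from the code to the character
```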
How do the two kinds of string convert to each other? A 'xxx' string is ASCII-encoded but can also be regarded as UTF-8 encoded, while a u'xxx' string can only be Unicode.
To convert u'xxx' to a UTF-8 encoded 'xxx', use the encode('utf-8') method:
>>> u'ABC'.encode('utf-8')
'ABC'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
For English characters, the UTF-8 value after conversion equals the Unicode value (although the storage space used is different), whereas one Chinese Unicode character becomes three UTF-8 bytes. The \xe4 you see is one of those bytes: its value is 228, which has no corresponding letter to display, so the byte is shown in hexadecimal.
The len() function returns the length of a string:
>>> len(u'ABC')
3
>>> len('ABC')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6
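The difference between counting characters and counting bytes carries over to Python 3, where the two kinds of string are the separate types `str` and `bytes`. A minimal sketch in Python 3 syntax:

```python
text = "中文"                 # str: a sequence of Unicode characters
data = text.encode("utf-8")  # bytes: the UTF-8 encoded form

print(len(text))   # number of characters
print(len(data))   # number of bytes
print(list(data))  # each byte as a decimal value; the first one is 0xe4
```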
Conversely, to convert a UTF-8 encoded 'xxx' string to a Unicode string u'xxx', use the decode('utf-8') method:
>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
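Decoding only succeeds when the bytes form complete UTF-8 sequences. The sketch below, in Python 3 syntax, shows the round trip and what happens when a three-byte character is cut short; `try_decode` is a hypothetical helper written for this example:

```python
data = b"\xe4\xb8\xad\xe6\x96\x87"  # UTF-8 bytes of a two-character string
text = data.decode("utf-8")


def try_decode(raw):
    """Decode raw as UTF-8, returning a short message if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "cannot decode: {}".format(exc.reason)


print(try_decode(data))      # complete sequences decode cleanly
print(try_decode(data[:2]))  # two of the three bytes of the first character
```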
Because Python source code is itself a text file, source that contains Chinese must be saved as UTF-8. So that the Python interpreter also reads the source as UTF-8, we usually put these two lines at the top of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
The first comment tells Linux/OS X that this is an executable Python program; Windows ignores it.
The second comment tells the Python interpreter to read the source as UTF-8; without it, Chinese output written in the source may be garbled.
Formatting
The last common problem is how to output a formatted string. We often need to print something like 'Dear xxx, hello! Your phone bill for month xx is xx, and your balance is xx', where the xxx parts change with variables, so a simple way to format strings is needed.
In Python, formatting works the same way as in C, using the % operator:
>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'
As you may have guessed, the % operator is used to format a string. Inside the string, %s is replaced by a string and %d is replaced by an integer. There must be as many variables or values after the % operator as there are %? placeholders, in the same order. If there is only one %?, the parentheses can be omitted.
Common placeholders include:
|%d||Integer|
|%f||Floating-point number|
|%s||String|
|%x||Hexadecimal integer|
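Each placeholder can be tried on a sample value. The sketch below uses Python 3 syntax, where the % operator behaves the same way for these cases; the sample values are arbitrary:

```python
samples = [
    ("%d", 42),      # integer
    ("%f", 3.5),     # floating-point number, six decimal places by default
    ("%s", "text"),  # string
    ("%x", 255),     # hexadecimal integer
]

rendered = {spec: spec % value for spec, value in samples}

for spec, value in samples:
    print(spec, "with", repr(value), "gives", repr(rendered[spec]))
```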
When formatting integers and floating-point numbers, you can also specify zero padding and the number of integer and decimal digits:
>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'
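Width and precision come together in the phone bill message mentioned at the start of this part. This is a sketch in Python 3 syntax; the function name `bill_message`, its parameters, and the sample figures are all invented for illustration:

```python
def bill_message(name, month, amount, balance):
    """Build the phone bill message with the % operator."""
    template = ("Dear %s, hello! Your phone bill for month %02d is %.2f, "
                "and your balance is %.2f")
    # One value per placeholder, in the same order as in the template.
    return template % (name, month, amount, balance)


print(bill_message("Michael", 7, 42.5, 7.5))
```

Here %02d pads the month to two digits with a leading zero, and %.2f keeps two decimal places for the amounts.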
If you are not sure which placeholder to use, %s always works: it converts any data type to a string:
>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'
For Unicode strings the usage is exactly the same, but it is best to make sure the substituted string is also a Unicode string:
>>> u'Hi, %s' % u'Michael'
u'Hi, Michael'
What if the % inside the string is meant as an ordinary character? In that case it must be escaped, using %% to represent a single %:
>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'
For historical reasons, although Python 2.x supports Unicode, its syntax requires two string notations, 'xxx' and u'xxx'.
Python also supports other encodings, for example encoding Unicode to GB2312:
>>> u'中文'.encode('gb2312')
'\xd6\xd0\xce\xc4'
However, doing this is just asking for trouble. Unless there is a special business requirement, remember to use only Unicode and UTF-8.
In Python 3.x, 'xxx' and u'xxx' are unified as Unicode strings, so writing the u prefix or not makes no difference, while a string represented as bytes must carry the b prefix: b'xxx'.
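The Python 3 behavior can be checked with a short script. This is a minimal sketch that only uses built-in types and the standard `utf-8` and `gb2312` codecs:

```python
plain = "中文"       # str, Unicode by default in Python 3
prefixed = u"中文"   # the u prefix is accepted but changes nothing
raw = b"\xe4\xb8\xad\xe6\x96\x87"  # bytes literals need the b prefix

print(plain == prefixed, type(plain).__name__, type(raw).__name__)

# encode() goes from str to bytes, decode() from bytes back to str.
print(plain.encode("utf-8") == raw)
print(raw.decode("utf-8") == plain)

# Other encodings are still available when a business need requires them.
print(plain.encode("gb2312"))
```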
When working out how to format a string, you can test in Python's interactive command line, which is quick and convenient.
String and encoding