String and encoding

Last Update:2014-09-27 Source: Internet

Author: User

Character encoding

As discussed earlier, a string is a data type like any other, but strings come with one extra complication: character encoding.

Because a computer can only process numbers, text must be converted into numbers before it can be processed. Early computers were designed with eight bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent larger integers you must use more bytes: the largest integer two bytes can represent is 65535, and the largest for four bytes is 4294967295.
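The arithmetic behind these limits can be checked in a few lines. This is a minimal sketch, assuming Python 3; the helper name max_unsigned is made up for illustration and does not come from the article.

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes of 8 bits each."""
    return 2 ** (8 * num_bytes) - 1

for n in (1, 2, 4):
    print(n, "byte(s):", max_unsigned(n))

# One byte with every bit set to 1
print(bin(max_unsigned(1)))  # 0b11111111
```

Each extra byte adds eight more bits, so the limit grows by a factor of 256 per byte.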

Since computers were invented by Americans, only 128 codes (0 through 127) were defined at first, covering the uppercase and lowercase English letters, digits, and some symbols. This code table is called ASCII. For example, the code for uppercase A is 65 and the code for lowercase z is 122.

One byte is clearly not enough for Chinese, however. At least two bytes are needed, and they must not conflict with ASCII, so China created the GB2312 encoding for Chinese characters.

As you can imagine, there are hundreds of languages around the world. Japan put Japanese into Shift_JIS, South Korea put Korean into EUC-KR, and so on. With every country using its own standard, conflicts are inevitable, and text that mixes languages ends up displayed as garbled characters.
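The conflict is easy to reproduce: bytes written under one national standard mean something else under another. The following is a rough sketch, assuming Python 3 and its built-in gb2312 and shift_jis codecs; the sample text and variable names are chosen only for illustration.

```python
text = "\u4e2d\u6587"  # the two Chinese characters meaning "Chinese (language)"

# Bytes as a program using GB2312 would store them
raw = text.encode("gb2312")

# Reading the same bytes under the wrong standard gives unrelated characters
garbled = raw.decode("shift_jis", errors="replace")

# Reading them under the standard they were written with restores the text
restored = raw.decode("gb2312")

print(ascii(garbled))     # escaped form of whatever Shift_JIS makes of the bytes
print(garbled == text)    # False
print(restored == text)   # True
```

Nothing is wrong with the bytes themselves; the garbling comes entirely from the reader and the writer disagreeing about the encoding.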

This is why Unicode came into being. Unicode unifies all languages into a single character set, which eliminates the garbling problem.

The Unicode standard continues to evolve, but the most common form uses two bytes per character (very rare characters need four bytes). Modern operating systems and most programming languages support Unicode directly.
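The two-byte form with a four-byte fallback corresponds to what is called UTF-16 today. A small sketch, assuming Python 3 and its utf-16-be codec; the helper name and the sample characters are illustrative choices, not taken from the article.

```python
def utf16_length(ch):
    """Number of bytes the character occupies in big-endian UTF-16."""
    return len(ch.encode("utf-16-be"))

# A Latin letter, a common Chinese character, and an emoji outside the 16-bit range
for ch in ("A", "\u4e2d", "\U0001F600"):
    print("U+%04X" % ord(ch), utf16_length(ch), "bytes")
```

Only the last character, whose code point does not fit in 16 bits, needs four bytes.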

Now let's look at the difference between ASCII and Unicode: ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.

The letter A in ASCII is decimal 65, binary 01000001;

The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different;

The Chinese character 中 is beyond the range of ASCII; in Unicode it is decimal 20013, binary 01001110 00101101.

As you might guess, to express the ASCII code for A in Unicode you only need to pad it with zeros in front. The Unicode encoding of A is therefore 00000000 01000001.
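These values can be verified with ord() and binary formatting. A minimal sketch, assuming Python 3 (where ord() accepts any Unicode character); the helper code_point_bits is a hypothetical name used only here.

```python
def code_point_bits(ch, width=16):
    """Return the character's code point as a zero-padded binary string."""
    return format(ord(ch), "0%db" % width)

print(ord("A"), code_point_bits("A", 8))        # 65 01000001
print(ord("0"), code_point_bits("0", 8))        # 48 00110000
print(ord("\u4e2d"), code_point_bits("\u4e2d")) # 20013 0100111000101101

# Padding the 8-bit ASCII code of A with leading zeros gives its 16-bit form
print(code_point_bits("A"))                     # 0000000001000001

# The character '0' is not the integer 0
print("0" == 0)                                 # False
```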

This raises a new problem: switching everything to Unicode makes the garbling disappear, but if your text is almost entirely English, Unicode needs twice the storage of ASCII, which is wasteful for storage and transmission.

So, in the interest of economy, a variable-length encoding of Unicode appeared: UTF-8. UTF-8 encodes each Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes, but the current standard stops at 4). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rarely used characters take 4. If the text you need to transmit is mostly English, UTF-8 saves space:

Character	ASCII	Unicode	UTF-8
A	01000001	00000000 01000001	01000001
中	(not representable)	01001110 00101101	11100100 10111000 10101101
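The UTF-8 column of the table can be reproduced directly. This is a sketch assuming Python 3; the helper names bits and utf8_length are illustrative, and the second row's character is written as the escape \u4e2d (the Chinese character 中, decimal 20013).

```python
def bits(data):
    """Render a byte string as space-separated 8-bit binary groups."""
    return " ".join(format(b, "08b") for b in bytearray(data))

def utf8_length(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in ("A", "\u4e2d"):
    print("U+%04X" % ord(ch), utf8_length(ch), "byte(s):", bits(ch.encode("utf-8")))
```

The letter comes out as the single byte 01000001, and the Chinese character as the three bytes 11100100 10111000 10101101, matching the table.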

The table also shows an extra benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a great deal of legacy software that only supports ASCII keeps working with UTF-8 text.
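The compatibility claim can be checked by encoding the same English text both ways. A minimal sketch, assuming Python 3; the sample string is arbitrary.

```python
text = "Hello, world"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure-ASCII text the two encodings produce identical bytes
print(ascii_bytes == utf8_bytes)             # True

# So bytes written by an ASCII-only program are already valid UTF-8
print(ascii_bytes.decode("utf-8") == text)   # True

# The reverse does not hold: a Chinese character has no ASCII form
try:
    "\u4e2d".encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```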

With the relationship between ASCII, Unicode, and UTF-8 sorted out, we can summarize how character encodings are commonly used in computer systems:

In memory, text is held uniformly as Unicode; when it needs to be saved to disk or transmitted, it is converted to UTF-8.

When you edit a file in Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory; when you finish editing and save, the Unicode text is converted back to UTF-8 and written to the file.
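The same save-and-load cycle can be sketched in code. This assumes Python 3 and uses io.open with an explicit encoding; the scratch file name note.txt and the sample text are hypothetical.

```python
import io
import os
import tempfile

text = "Hello, \u4e2d\u6587"  # Unicode text held in memory (9 characters)

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Saving: Unicode in memory is converted to UTF-8 bytes on disk
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What is actually stored is bytes, and there are more of them than characters
with io.open(path, "rb") as f:
    raw = f.read()

# Loading: the UTF-8 bytes are converted back to Unicode
with io.open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(len(text), "characters,", len(raw), "bytes")  # 9 characters, 13 bytes
print(loaded == text)                                # True
```

Passing the encoding explicitly avoids depending on the platform default, which is not always UTF-8.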

When you browse a web page, the server converts the dynamically generated Unicode content to UTF-8 before sending it to the browser.

That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.

Python string

After figuring out the troublesome character encoding problem, let's study Python's support for Unicode.

Because Python was born before the Unicode standard was published, the earliest Python supported only ASCII. An ordinary string such as 'ABC' is ASCII-encoded inside Python. Python provides the ord() and chr() functions to convert between characters and their numeric codes:

>>> ord('A')
65
>>> chr(65)
'A'

Later, Python added support for Unicode. A Unicode string is written as u'...', for example:

>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'

Writing u'中' and writing u'\u4e2d' are equivalent; \u is followed by the Unicode code point in hexadecimal. Likewise, u'A' and u'\u0041' are the same.

How do you convert between the two kinds of string? A string 'xxx' is ASCII-encoded but can equally be regarded as UTF-8-encoded, whereas u'xxx' can only be Unicode.

To convert u'xxx' into the UTF-8-encoded 'xxx', use the encode('utf-8') method:

>>> u'abc'.encode('utf-8')
'abc'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'

For English characters, the UTF-8 value after conversion equals the Unicode value (though the storage used differs), while each Chinese Unicode character becomes three UTF-8 bytes. The \xe4 you see is one of those bytes: its value is 228, which has no corresponding printable letter, so the byte is shown as a hexadecimal value. The len() function returns the length of a string:

>>> len(u'abc')
3
>>> len('abc')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6

Conversely, to convert a UTF-8-encoded string 'xxx' into the Unicode string u'xxx', use the decode('utf-8') method:

>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文

Because Python source code is itself a text file, source that contains Chinese must be saved as UTF-8. So that the Python interpreter also reads the source as UTF-8, we usually put these two lines at the top of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

The first comment line tells Linux/OS X that this is an executable Python program; Windows ignores it;

The second comment line tells the Python interpreter to read the source as UTF-8; otherwise, Chinese text written in the source may come out garbled.

Format

The last common question is how to output formatted strings. We often produce strings like 'Dear xxx, hello! Your phone bill for month xx is xx, and your balance is xx', where the xxx parts change according to variables, so we need a simple way to format strings.

In Python, formatting works the same way as in C and uses the % operator, for example:

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator is what formats the string. Inside the string, %s is replaced by a string and %d by an integer. There must be as many values after the % as there are %? placeholders, in the same order. If there is only one %?, the parentheses can be omitted.

Common placeholders include:

%d	Integer
%f	Floating-point number
%s	String
%x	Hexadecimal integer
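To see the four placeholders side by side, the same value can be passed through each of them. A small sketch, assuming Python 3 (the % operator behaves the same way for these cases in Python 2); the function name demo is made up for illustration.

```python
def demo(value):
    """Format one number with each of the four common placeholders."""
    return "%d | %f | %s | %x" % (value, value, value, value)

print(demo(255))  # 255 | 255.000000 | 255 | ff
```

Note that %f prints six decimal places by default and %x prints lowercase hexadecimal digits.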

When formatting integers and floating-point numbers, you can also specify zero-padding and the number of integer and decimal digits:

>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'

If you are not sure which placeholder to use, %s always works; it converts any data type to a string:

>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'

For Unicode strings the usage is exactly the same, but it is best to make sure the substituted string is also a Unicode string:

>>> u'Hi, %s' % u'Michael'
u'Hi, Michael'

What if the string needs to contain a literal %? In that case you escape it: %% stands for a single %:

>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'

Summary

For historical reasons, Python 2.x supports Unicode but needs two string representations, 'xxx' and u'xxx'.

Python also supports other encoding methods, such as encoding Unicode into gb2312:

>>> u'中文'.encode('gb2312')
'\xd6\xd0\xce\xc4'

However, doing this is simply asking for trouble. Unless you have a special business requirement, stick to Unicode and UTF-8.

In Python 3.x, 'xxx' and u'xxx' are unified as Unicode, so the string is the same with or without the u prefix, while strings expressed as bytes must carry the b prefix: b'xxx'.
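For comparison with the Python 2 sessions earlier in the article, here is roughly how the same conversions look in Python 3. This is a sketch that assumes Python 3.3 or later (where the u prefix is accepted again); the variable names are illustrative.

```python
s = "\u4e2d\u6587"       # a str: two Unicode characters
b = s.encode("utf-8")    # a bytes object: six UTF-8 bytes

print(type(s).__name__, len(s))   # str 2
print(type(b).__name__, len(b))   # bytes 6

# Byte strings are written with the b prefix
print(b == b"\xe4\xb8\xad\xe6\x96\x87")   # True

# decode() goes back from bytes to str
print(b.decode("utf-8") == s)             # True

# The u prefix makes no difference
print(u"abc" == "abc")                    # True
```

Unlike Python 2, str and bytes never compare equal here, so a forgotten encode() or decode() shows up as an error or a mismatch instead of silently working for ASCII text.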

When formatting strings, you can test in the Python interactive command line, which is quick and convenient.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
