String and encoding

Source: Internet
Author: User
Character encoding

As we have already discussed, a string is also a data type, but strings are special in that they raise an additional problem: encoding.

Because a computer can only process numbers, text must be converted into numbers before it can be processed. Early computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent a larger integer, you must use more bytes. For example, the largest integer that two bytes can represent is 65535, and the largest integer that four bytes can represent is 4294967295.
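The byte limits above follow from the formula 2^(8n) - 1 for n bytes. A minimal sketch in Python 3 (the article's own examples use Python 2; the helper name max_unsigned is made up for illustration):

```python
# Largest unsigned integer that fits in num_bytes bytes (8 bits per byte).
# max_unsigned is a hypothetical helper name, not a standard function.
def max_unsigned(num_bytes):
    return 2 ** (8 * num_bytes) - 1

for n in (1, 2, 4):
    # Print the byte count, the maximum value, and its binary form.
    print(n, max_unsigned(n), bin(max_unsigned(n)))
```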

Since computers were invented by Americans, at first only 127 characters were encoded: uppercase and lowercase English letters, digits, and some symbols. This code table is called ASCII. For example, the uppercase letter A is encoded as 65 and the lowercase letter z is encoded as 122.

However, one byte is clearly not enough to handle Chinese. At least two bytes are required, and the encoding must not conflict with ASCII. China therefore created the GB2312 encoding to cover Chinese characters.

As you can imagine, there are hundreds of languages around the world. Japan put Japanese into Shift_JIS, South Korea put Korean into EUC-KR, and so on. With every country defining its own standard, conflicts are inevitable, and the result is that text mixing several languages shows up as garbled characters.
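The conflict between national standards can be reproduced in a few lines. The following is only a sketch, written for Python 3, that stores two Chinese characters as GB2312 bytes and then reads those bytes as if they were Shift_JIS; the variable names are illustrative:

```python
text = "\u4e2d\u6587"  # the two characters of the word "Chinese" (zhong wen)

# Bytes as a program using the GB2312 standard would store them.
raw = text.encode("gb2312")

# The same bytes interpreted under a different national standard.
# errors="replace" keeps the sketch from raising on undecodable bytes.
garbled = raw.decode("shift_jis", errors="replace")

# Decoding with the standard that produced the bytes restores the text.
restored = raw.decode("gb2312")

print(repr(raw), repr(garbled), restored == text)
```

The bytes themselves never change; only the table used to interpret them does, which is why the reader must know which encoding the writer used.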

This is why Unicode came into being. Unicode unifies all languages into a single character set, so the garbled-text problem goes away.

The Unicode standard keeps evolving, but the most common form uses two bytes to represent a character (four bytes are needed for very rare characters). Modern operating systems and most programming languages support Unicode directly.
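The text does not name the two-byte form; the sketch below assumes it corresponds to UTF-16 (big-endian, without a byte order mark) and measures how many bytes a few sample characters take. It is written for Python 3, and the sample characters are chosen only for illustration:

```python
# Sample characters: a Latin letter, a common Chinese character,
# and an emoji from outside the 16-bit range.
samples = ["A", "\u4e2d", "\U0001F600"]

# Number of bytes each character occupies in UTF-16 (assumed encoding).
widths = {ch: len(ch.encode("utf-16-be")) for ch in samples}

for ch in samples:
    print("U+%04X" % ord(ch), widths[ch], "bytes")
```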

Now let's look at the difference between ASCII encoding and Unicode encoding: ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.

The letter A in ASCII is decimal 65, binary 01000001;

The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different;

The Chinese character 中 is beyond the range of ASCII. Its Unicode encoding is decimal 20013, binary 01001110 00101101.

As you can guess, to express the ASCII character A in Unicode, you only need to pad it with zeros in front. Therefore, the Unicode encoding of A is 00000000 01000001.
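These values can be checked with the built-in ord() function. A small sketch for Python 3, where code_point_bits is a hypothetical helper that prints a code point as a zero-padded binary string:

```python
def code_point_bits(ch, width=16):
    """Return the code point of ch as a binary string padded to width bits."""
    return format(ord(ch), "0{}b".format(width))

print(ord("A"), code_point_bits("A", 8))        # one-byte (ASCII) view of A
print(ord("A"), code_point_bits("A"))           # two-byte view, zero-padded
print(ord("0"), code_point_bits("0", 8))        # the character '0', not the integer 0
print(ord("\u4e2d"), code_point_bits("\u4e2d")) # the Chinese character from the text
```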

A new problem now arises: if everything is unified into Unicode, garbled text disappears. However, if the text you write is almost entirely English, Unicode needs twice as much storage space as ASCII, which is wasteful for both storage and transmission.
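The doubling is easy to measure. This sketch (Python 3) again assumes that the fixed two-byte form corresponds to UTF-16 without a byte order mark; the sample sentence is arbitrary:

```python
english = "Hello, world"

# One byte per character in ASCII.
ascii_size = len(english.encode("ascii"))

# Two bytes per character in the assumed two-byte form (UTF-16, big-endian).
two_byte_size = len(english.encode("utf-16-be"))

print(ascii_size, two_byte_size)
```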

Therefore, in the spirit of economy, a variable-length encoding of Unicode appeared: UTF-8. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes, but the current standard caps it at 4). Common English letters are encoded as 1 byte, Chinese characters usually as 3 bytes, and only rare characters as 4 bytes. If the text you want to transfer contains a large proportion of English characters, UTF-8 saves space:

Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         x         01001110 00101101  11100100 10111000 10101101
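The UTF-8 column of the table can be reproduced by encoding each character and printing its bytes in binary. A minimal sketch for Python 3; utf8_bits is a made-up helper name, and the emoji is an extra sample that is not in the table:

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary groups."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))

print(utf8_bits("A"))        # one byte, identical to ASCII
print(utf8_bits("\u4e2d"))   # the Chinese character from the table, three bytes

# A character outside the 16-bit range needs four bytes in UTF-8.
emoji_length = len("\U0001F600".encode("utf-8"))
print(emoji_length)
```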

The table also shows an additional benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can keep working with UTF-8 encoded text.
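The compatibility claim can be verified directly: pure ASCII text produces the same bytes under both encodings, and ASCII bytes decode cleanly as UTF-8. A short Python 3 sketch with an arbitrary sample string:

```python
text = "Hello, ASCII!"

# Pure ASCII text yields identical bytes in both encodings.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# Bytes written by an ASCII-only program can be read back as UTF-8.
decoded = text.encode("ascii").decode("utf-8")

print(same_bytes, decoded)
```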

Now that the relationship between ASCII, Unicode, and UTF-8 is clear, we can summarize how character encoding commonly works in computer systems:

In memory, Unicode is used uniformly. When text needs to be saved to disk or transmitted, it is converted to UTF-8.

When you edit a file with Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory. When editing is finished and you save, the Unicode is converted back to UTF-8 and written to the file.
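The same load and save cycle can be sketched in Python 3, where open() takes an encoding argument. The file name note.txt and the temporary directory are arbitrary choices for the example:

```python
import os
import tempfile

text = "Hello, \u4e2d\u6587"  # Unicode text held in memory

# Write the text to disk as UTF-8 (the "save" step).
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What is actually stored on disk: UTF-8 bytes.
with open(path, "rb") as f:
    raw = f.read()

# Reading with the same encoding gives back the Unicode text (the "open" step).
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(len(text), len(raw), loaded == text)
```

The text has 9 characters but occupies 13 bytes on disk, because each of the two Chinese characters takes 3 bytes in UTF-8.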

When you browse a web page, the server converts the dynamically generated Unicode content to UTF-8 before sending it to the browser.

That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.

Python strings

After figuring out the troublesome character encoding problem, let's study Python's support for Unicode.

Because Python was born before the Unicode standard was published, the earliest versions of Python supported only ASCII, and an ordinary string such as 'ABC' is ASCII-encoded inside Python. Python provides the ord() and chr() functions to convert between a character and its numeric code:

>>> ord('A')
65
>>> chr(65)
'A'

Later, Python added support for Unicode. A Unicode string is written as u'...', for example:

>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'

Writing u'中' and u'\u4e2d' is the same thing: \u is followed by the Unicode code point in hexadecimal. Likewise, u'A' and u'\u0041' are the same.
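The equivalence between a character and its \u escape can be checked as follows. This sketch is written for Python 3, where every string literal is already a Unicode string, so the u prefix is not needed:

```python
zhong = "\u4e2d"  # the Chinese character used in the examples above

# The escape holds the code point in hexadecimal.
code_point_hex = hex(ord(zhong))

# chr() goes the other way, from code point to character.
from_code_point = chr(0x4e2d)

# The same holds for ASCII characters.
letter_matches = ("\u0041" == "A")

print(code_point_hex, from_code_point == zhong, letter_matches)
```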

How do the two kinds of string convert to each other? The string 'xxx' is ASCII-encoded, but it can also be regarded as UTF-8 encoded, whereas u'xxx' can only be Unicode.

To convert u'xxx' into the UTF-8 encoded 'xxx', use the encode('utf-8') method:

>>> u'ABC'.encode('utf-8')
'ABC'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'

For English characters, the UTF-8 value after conversion equals the Unicode value (although the storage used is different), whereas one Chinese Unicode character becomes three UTF-8 bytes. The \xe4 you see is a single byte: its value is 228, which has no corresponding letter to display, so the byte is shown in hexadecimal. The len() function returns the length of a string:

>>> len(u'ABC')
3
>>> len('ABC')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6

Conversely, to convert a UTF-8 encoded string 'xxx' into the Unicode string u'xxx', use the decode('utf-8') method:

>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
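For readers on Python 3, the same round trip looks slightly different because text (str) and encoded data (bytes) are separate types. The following is a sketch under that assumption, not a transcription of the Python 2 session above:

```python
text = "\u4e2d\u6587"              # two characters
data = text.encode("utf-8")        # str -> bytes
round_trip = data.decode("utf-8")  # bytes -> str

# Cutting a multi-byte sequence in half makes the bytes undecodable.
try:
    data[:4].decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(len(text), len(data), round_trip == text, truncated_ok)
```

len() counts characters for str and bytes for bytes, which matches the 2 versus 6 result shown in the Python 2 examples.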

Python source code is itself a text file, so when your source code contains Chinese, you must save it as UTF-8. To make the Python interpreter read the source as UTF-8, we usually put these two lines at the top of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

The first comment line tells Linux/OS X that this file is an executable Python program; Windows ignores it.

The second comment line tells the Python interpreter to read the source code as UTF-8. Without it, Chinese text written in the source may come out garbled.

Formatting

The last common problem is how to output formatted strings. We often output text such as 'Dear xxx, hello! Your phone bill for month xx is xx, and your balance is xx', where the xxx parts change according to variables. A simple way of formatting strings is therefore needed.

In Python, formatting works the same way as in C and is done with the % operator, for example:

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator is used to format strings. Inside the string, %s is replaced with a string and %d is replaced with an integer. There must be as many variables or values after the string as there are %? placeholders, and they must be in matching order. If there is only one %?, the parentheses can be omitted.

Common placeholders include:

%d    Integer
%f    Floating-point number
%s    String
%x    Hexadecimal integer
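Each placeholder in the list can be tried on its own. A small sketch (it behaves the same in Python 2 and Python 3; the sample values are arbitrary):

```python
# One example per placeholder from the list above.
examples = {
    "%d": "%d" % 42,      # integer
    "%f": "%f" % 3.5,     # floating-point number, six decimal places by default
    "%s": "%s" % "text",  # string
    "%x": "%x" % 255,     # hexadecimal integer, lowercase digits
}

for placeholder in ("%d", "%f", "%s", "%x"):
    print(placeholder, "->", examples[placeholder])
```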

When formatting integers and floating-point numbers, you can also specify whether to pad with zeros and how many digits to use for the integer and decimal parts:

>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'
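Putting the pieces together, the phone bill message mentioned at the start of this section could be built as follows. This is only a sketch: the customer name, month, and amounts are invented sample values.

```python
# Template with one string, one zero-padded integer, and two amounts
# rounded to two decimal places.
template = ("Dear %s, hello! Your phone bill for month %02d is %.2f, "
            "and your balance is %.2f.")

# Sample values, made up for illustration; order must match the placeholders.
message = template % ("Michael", 3, 42.5, 7.25)

print(message)
```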

If you are not sure which placeholder to use, %s always works: it converts any data type to a string:

>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'

For Unicode strings the usage is exactly the same, but it is best to make sure the substituted string is also a Unicode string:

>>> u'Hi, %s' % u'Michael'
u'Hi, Michael'

What if the % inside the string is meant as an ordinary character? In that case it must be escaped: use %% to represent a literal %:

>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'

Summary

For historical reasons, although Python 2.x supports Unicode, its syntax requires two string representations: 'xxx' and u'xxx'.

Python also supports other encoding methods, such as encoding Unicode into gb2312:

>>> u'中文'.encode('gb2312')
'\xd6\xd0\xce\xc4'

However, doing this is simply asking for trouble. Unless there is a special business requirement, remember to use only Unicode and UTF-8.

In Python 3.x, 'xxx' and u'xxx' are unified as Unicode, so a string is the same whether or not you write the u prefix, while a string expressed as bytes must carry the b prefix: b'xxx'.
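The Python 3 behaviour described above can be confirmed with a few comparisons. A minimal sketch, to be run on Python 3 only:

```python
plain = "xxx"      # ordinary literal: a Unicode string in Python 3
prefixed = u"xxx"  # the u prefix is accepted but changes nothing
raw = b"xxx"       # bytes literal: encoded data, not text

same_text = (plain == prefixed)
text_type = type(plain).__name__
bytes_type = type(raw).__name__

# bytes and str never compare equal; bytes must be decoded first.
equal_without_decoding = (raw == plain)
equal_after_decoding = (raw.decode("ascii") == plain)

print(same_text, text_type, bytes_type,
      equal_without_decoding, equal_after_decoding)
```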

When formatting strings, you can test your format in Python's interactive command line, which is quick and convenient.

 
