Errors and format conversions caused by Unicode and non-Unicode codes in Python

Last Update:2014-10-07 Source: Internet

Author: User

1.1. Problem

You need to deal with data that doesn't fit in the ASCII character set.

1.2. Solution

Unicode strings can be encoded as plain strings in a variety of ways, according to whichever encoding you choose:

    # Python 2: u"..." is a Unicode string, a plain "..." is a byte string.

    # Convert Unicode to plain Python strings: "encode"
    unicodestring = u"Hello world"
    utf8string = unicodestring.encode("utf-8")
    asciistring = unicodestring.encode("ascii")
    isostring = unicodestring.encode("ISO-8859-1")
    utf16string = unicodestring.encode("utf-16")

    # Convert plain Python strings to Unicode: "decode"
    plainstring1 = unicode(utf8string, "utf-8")
    plainstring2 = unicode(asciistring, "ascii")
    plainstring3 = unicode(isostring, "ISO-8859-1")
    plainstring4 = unicode(utf16string, "utf-16")

    # All four decoded values are the same Unicode string.
    assert plainstring1 == plainstring2 == plainstring3 == plainstring4
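The recipe's code targets Python 2, where unicode is a separate type. As a rough sketch of the same round trip in Python 3 (an assumption about the reader's interpreter, not part of the original recipe), str plays the role of the Unicode string and bytes plays the role of the plain string. The variable names here are illustrative only.

```python
# Python 3 sketch: str is the Unicode type, bytes is the byte-string type.
unicode_string = "Hello world"

# Encode: text (str) -> bytes, once per encoding name.
encoded = {
    name: unicode_string.encode(name)
    for name in ("utf-8", "ascii", "iso-8859-1", "utf-16")
}

# Decode: bytes -> text (str), using the same encoding name each time.
decoded = {name: data.decode(name) for name, data in encoded.items()}

# Every decoded value is the original text again.
assert all(text == unicode_string for text in decoded.values())
```

Each encoding produces a different byte sequence, but decoding with the matching name always recovers the same text.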

1.3. Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don't need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only one of 256 values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. That means that a Unicode character can take more than one byte, so you need to make the distinction between characters and bytes.
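The difference between characters and bytes can be made concrete by counting both for the same text. The following is a minimal Python 3 sketch (the sample string and variable names are made up for illustration); it uses UTF-8 as the encoding, which is only one of several possible choices.

```python
# Sample text: "naive" with a diaeresis on the i, a space, and two Chinese characters.
text = "na\u00efve \u4e2d\u6587"

utf8_bytes = text.encode("utf-8")

char_count = len(text)        # number of characters (code points)
byte_count = len(utf8_bytes)  # number of bytes after encoding

# How many UTF-8 bytes each individual character needs.
per_char = [(ch, len(ch.encode("utf-8"))) for ch in text]
```

Here the ASCII letters cost one byte each, the accented letter costs two, and each Chinese character costs three, so the byte count is larger than the character count.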

Standard Python 2 strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string." In this recipe we will call them byte strings, to remind you of their byte-orientedness.
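To see the byte-orientedness directly, it can help to look at the individual elements of a byte string. This sketch assumes Python 3, where the byte string type is spelled bytes and text is str; the sample value is illustrative.

```python
byte_string = b"caf\xc3\xa9"             # the UTF-8 bytes for "cafe" with an acute accent

# Each element of a byte string is a single byte, a number from 0 to 255.
byte_values = list(byte_string)

# After decoding, each element is a character with its own code point.
text = byte_string.decode("utf-8")
code_points = [ord(ch) for ch in text]
```

The byte string has five elements while the decoded text has four characters, because the accented letter occupies two bytes in UTF-8.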

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
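One way to picture this boundary is a pair of small helpers that encode just before writing and decode just after reading. This is a Python 3 sketch under stated assumptions: write_text and read_text are hypothetical helper names, and io.BytesIO stands in for a real file or socket.

```python
import io

def write_text(sink, text, encoding="utf-8"):
    """Encode text to bytes just before handing it to a byte-oriented sink."""
    sink.write(text.encode(encoding))

def read_text(source, encoding="utf-8"):
    """Decode bytes to text right after reading them from a byte-oriented source."""
    return source.read().decode(encoding)

buffer = io.BytesIO()                     # stand-in for a file or socket
write_text(buffer, "Gr\u00fc\u00dfe")     # German greeting with two non-ASCII letters
buffer.seek(0)
round_tripped = read_text(buffer)
```

Inside the program the value stays text; only the bytes cross the boundary, and both sides have to agree on the encoding name.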

There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode or decode method as a parameter. Here are a few you should know about:
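The case-insensitivity of encoding names can be checked with the standard codecs module, which resolves a name to a codec. A small Python 3 sketch (the variable names are illustrative):

```python
import codecs

# Different spellings of the same name resolve to one codec.
canonical = {codecs.lookup(name).name for name in ("UTF-8", "utf-8", "Utf-8")}

# Aliases work the same way: Latin-1 is another name for ISO-8859-1.
latin1_aliases = {
    codecs.lookup(name).name for name in ("ISO-8859-1", "latin-1", "Latin-1")
}

# The name is simply passed as a parameter to encode and decode.
same_bytes = "abc".encode("UTF-8") == "abc".encode("utf-8")
text = b"abc".decode("ASCII")
```

An unknown name raises LookupError, so a typo in an encoding name fails loudly rather than silently picking a default.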

The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts.
The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages and more efficient for Eastern ones. A variant of UTF-16 is sometimes known as UCS-2.
The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; each can support only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.
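The trade-offs in the list above can be measured by encoding a Western and an Eastern sample and comparing the byte counts. This is a Python 3 sketch; encoded_size is a hypothetical helper, the samples are arbitrary, and "utf-16-le" is used instead of "utf-16" so that the two-byte byte-order mark does not distort the comparison.

```python
def encoded_size(text, encoding):
    """Return the number of bytes text needs in encoding, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

western = "Hello"
eastern = "\u3053\u3093\u306b\u3061\u306f"   # a five-character Japanese greeting

sizes = {
    (label, enc): encoded_size(sample, enc)
    for label, sample in (("western", western), ("eastern", eastern))
    for enc in ("utf-8", "utf-16-le", "iso-8859-1")
}
```

For these samples, UTF-8 is the smaller encoding for the Western text and UTF-16 is the smaller one for the Eastern text, while ISO-8859-1 cannot represent the Eastern text at all and raises UnicodeEncodeError, reported here as None.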

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
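When data does arrive in another encoding, a common approach is to decode it with the encoding it was written in and re-encode it as UTF-8. The following Python 3 sketch assumes the source encoding is known to be ISO-8859-1; transcode is a hypothetical helper name, and the sample bytes are made up.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from source_encoding and re-encode them in target_encoding."""
    return data.decode(source_encoding).encode(target_encoding)

legacy_bytes = b"caf\xe9"   # "cafe" with an acute accent, as ISO-8859-1 bytes
utf8_bytes = transcode(legacy_bytes, "iso-8859-1")

# Guessing the wrong source encoding typically fails with UnicodeDecodeError.
try:
    legacy_bytes.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```

The source encoding has to come from outside knowledge about the data, such as a file header or the producing application's documentation; the bytes alone do not reliably say which encoding they use.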

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
