Python: Converting between Unicode and normal strings

Source: Internet
Author: User

1.1. Problem

You need to deal with data that doesn't fit in the ASCII character set.

1.2. Solution

Unicode strings can be encoded as plain strings in a variety of ways, according to whichever encoding you choose:


# Convert Unicode to a plain Python string: "encoding" (encode)
unicodestring = u"Hello World"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("iso-8859-1")
utf16string = unicodestring.encode("utf-16")

# Convert a plain Python string to Unicode: "decoding" (decode)
# unicode() is the Python 2 built-in; Python 3 has no such name.
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "iso-8859-1")
plainstring4 = unicode(utf16string, "utf-16")

assert plainstring1 == plainstring2 == plainstring3 == plainstring4
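The recipe above is written for Python 2. As a rough sketch only, and assuming Python 3 (where str holds Unicode text and bytes holds encoded data), the same round trip might look like this. The variable names are illustrative and not part of the original recipe.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
unicode_string = "Hello World"

# Encoding: text -> bytes
utf8_bytes = unicode_string.encode("utf-8")
ascii_bytes = unicode_string.encode("ascii")
iso_bytes = unicode_string.encode("iso-8859-1")
utf16_bytes = unicode_string.encode("utf-16")

# Decoding: bytes -> text
text1 = utf8_bytes.decode("utf-8")
text2 = ascii_bytes.decode("ascii")
text3 = iso_bytes.decode("iso-8859-1")
text4 = utf16_bytes.decode("utf-16")

# All four decodings recover the same text.
assert text1 == text2 == text3 == text4 == unicode_string
```

Because the sample text is pure ASCII, the UTF-8, ASCII, and ISO-8859-1 byte strings happen to be identical; only the UTF-16 form differs.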

1.3. Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.


Unicode is a big topic. Luckily, you don't need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only 256 distinct values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. That means a Unicode character cannot, in general, be stored in a single byte, so you need to make the distinction between characters and bytes.
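To make the byte/character distinction concrete, here is a small sketch, assuming Python 3 and UTF-8 as the encoding. The sample text is arbitrary and chosen only to mix ASCII, accented Latin, and CJK characters.

```python
# "naive" with a diaeresis, a space, and two CJK characters
text = "na\u00efve \u4e2d\u6587"
encoded = text.encode("utf-8")

char_count = len(text)      # counts characters (code points): 8
byte_count = len(encoded)   # counts bytes in the UTF-8 form: 13

print(char_count, byte_count)
```

The ASCII letters take one byte each, the accented letter takes two, and each CJK character takes three, which is why the two counts differ.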


Standard Python 2 strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string". In this recipe we will call them byte strings, to remind you of their byte-orientedness.


Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
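The idea of encoding and decoding only at the byte-oriented boundary can be sketched as follows, assuming Python 3. The helper names send_text and receive_text are hypothetical, and an in-memory io.BytesIO object stands in for a real file or socket.

```python
import io

def send_text(stream, text, encoding="utf-8"):
    """Encode text just before handing it to a byte-oriented stream."""
    stream.write(text.encode(encoding))

def receive_text(stream, encoding="utf-8"):
    """Read all bytes from the stream and decode them into text."""
    return stream.read().decode(encoding)

buffer = io.BytesIO()            # stand-in for a file or network socket
send_text(buffer, "caf\u00e9")   # text goes in, bytes are stored
buffer.seek(0)
roundtrip = receive_text(buffer) # bytes come out, text is returned
```

Inside the program the data stays as text; the choice of encoding matters only in the two helper functions that touch the stream.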


There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few you should know about:
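As an illustration of case-insensitive encoding names, the following sketch (assuming Python 3) uses the standard codecs.lookup function to resolve several spellings to one codec. The helper is_known_encoding is a hypothetical name introduced here, and "utf8" is an alias Python accepts in addition to the differently cased spellings.

```python
import codecs

# Different spellings of the same encoding name.
spellings = ["UTF-8", "utf-8", "Utf-8", "utf8"]
canonical = {codecs.lookup(name).name for name in spellings}

# Every spelling resolves to a single codec.
assert len(canonical) == 1

def is_known_encoding(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False
```

A check like is_known_encoding can be handy when the encoding name comes from outside the program, such as a file header or a configuration value.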

  • The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts, whose characters typically take three bytes each.

  • The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for Eastern ones. A variant of UTF-16 is sometimes known as UCS-2.

  • The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; each can support only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.
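The trade-offs in the list above can be checked with a short sketch, assuming Python 3. The helper names encoded_size and can_encode and the sample strings are illustrative only. The "utf-16-le" codec is used so that the two-byte byte-order mark written by the plain "utf-16" codec does not skew the counts.

```python
western = "Hello World"                 # 11 ASCII characters
eastern = "\u4f60\u597d\u4e16\u754c"    # four CJK characters
hungarian = "\u0151"                    # LATIN SMALL LETTER O WITH DOUBLE ACUTE

def encoded_size(text, encoding):
    """Return how many bytes text occupies in the given encoding."""
    return len(text.encode(encoding))

def can_encode(text, encoding):
    """Report whether every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Compare UTF-8 and UTF-16 sizes for Western and Eastern samples.
for label, sample in (("western", western), ("eastern", eastern)):
    print(label, encoded_size(sample, "utf-8"), encoded_size(sample, "utf-16-le"))

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert western.encode("ascii") == western.encode("utf-8")

# Latin-2 covers this Hungarian letter; Latin-1 does not.
print(can_encode(hungarian, "iso-8859-1"), can_encode(hungarian, "iso-8859-2"))
```

For these samples, UTF-8 is the smaller form for the Western text and UTF-16 is the smaller form for the Eastern text, which matches the efficiency remarks in the list.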

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
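When data arrives in one of the other encodings, one common approach is to decode it once and re-encode it as UTF-8. The following is a minimal sketch, assuming Python 3 and assuming the source encoding is already known; transcode_to_utf8 is a hypothetical helper name.

```python
def transcode_to_utf8(data, source_encoding):
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

legacy = b"caf\xe9"   # "cafe" with an acute accent, as ISO-8859-1 bytes
converted = transcode_to_utf8(legacy, "iso-8859-1")
```

If the source encoding is guessed wrongly, decoding may either fail or silently produce the wrong characters, so the encoding should come from the application that produced the data.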

