Python: Converting between Unicode and normal strings


Unicode strings can be encoded as plain strings in a variety of ways, according to the encoding you choose:

    # Convert Unicode to plain Python strings: "encode"
    unicodestring = u"Hello World"
    utf8string = unicodestring.encode("utf-8")
    asciistring = unicodestring.encode("ascii")
    isostring = unicodestring.encode("iso-8859-1")
    utf16string = unicodestring.encode("utf-16")

    # Convert plain Python strings to Unicode: "decode"
    # (unicode() is the Python 2 built-in Unicode type)
    plainstring1 = unicode(utf8string, "utf-8")
    plainstring2 = unicode(asciistring, "ascii")
    plainstring3 = unicode(isostring, "iso-8859-1")
    plainstring4 = unicode(utf16string, "utf-16")

    assert plainstring1 == plainstring2 == plainstring3 == plainstring4

Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don't need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only 256 distinct values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. That means a single byte cannot represent every Unicode character, and a character may take more than one byte, so you need to make the distinction between characters and bytes.

Standard Python strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string." In this recipe we will call them bytestrings, to remind you of their byte-orientedness.
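
The difference between characters and bytes shows up as soon as you compare lengths. The following is a minimal sketch and not part of the original recipe: the sample word is an arbitrary choice, and the code sticks to u"" literals and the encode method so that it should run on both Python 2 and Python 3.

```python
# Illustrative sample text (an assumption, not from the recipe):
# the word "cafe" with an acute accent on the final e.
text = u"caf\u00e9"

utf8_bytes = text.encode("utf-8")

char_count = len(text)        # number of characters: 4
byte_count = len(utf8_bytes)  # number of bytes: 5, the accented e needs two

print(char_count, byte_count)
```

The three ASCII letters take one byte each in UTF-8, while the accented letter takes two, so the bytestring is longer than the Unicode string it was built from.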

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.

There are many ways of converting Unicode objects to bytestrings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter.
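
Encoding just before a byte-oriented write, and decoding just after a read, might look like the sketch below. It is an illustration under stated assumptions: io.BytesIO stands in for a real file or socket, the sample text is arbitrary, and the decode method of the bytestring is used in place of the unicode built-in so that the code should run on both Python 2 and Python 3.

```python
import io

# Illustrative sample text (an assumption): a German greeting with
# two non-ASCII characters.
text = u"Gr\u00fc\u00dfe"

# io.BytesIO stands in for a byte-oriented object such as a file or socket.
sink = io.BytesIO()

# Encode: characters -> bytes, just before handing the text to write().
sink.write(text.encode("utf-8"))

# Decode: bytes -> characters, right after reading the raw data back.
raw = sink.getvalue()
decoded = raw.decode("utf-8")

# Encoding names are case-insensitive, so both spellings give the same bytes.
same_bytes = text.encode("UTF-8") == text.encode("utf-8")

print(decoded == text, same_bytes)
```

Keeping text as Unicode inside the program and converting only at these boundaries means the choice of encoding is made in one place.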

Here are a few encodings you should know about:

The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts.

The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for Eastern ones. A variant of UTF-16 is sometimes known as UCS-2.

The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; they can support only some particular language or family of languages.
ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only if you are handed data in those encodings created by some other application.
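
The trade-offs between these encodings can be checked directly. The sketch below is not from the original recipe: the helper name encoded_size and the sample strings are hypothetical, chosen only for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
# Sample texts (assumptions for illustration).
western = u"Hello World"                 # ASCII-only text, as in the recipe
eastern = u"\u4f60\u597d\u4e16\u754c"    # four Chinese characters


def encoded_size(text, encoding):
    """Return the number of bytes that text occupies in the given encoding."""
    return len(text.encode(encoding))


# "utf-16-le" is used instead of "utf-16" so that the counts exclude the
# two-byte byte-order mark that the plain "utf-16" codec prepends.
sizes = {
    "western": (encoded_size(western, "utf-8"),
                encoded_size(western, "utf-16-le")),
    "eastern": (encoded_size(eastern, "utf-8"),
                encoded_size(eastern, "utf-16-le")),
}

# For pure ASCII text, UTF-8 output is byte-for-byte identical to ASCII.
ascii_compatible = western.encode("utf-8") == western.encode("ascii")

# ISO-8859-1 cannot represent characters outside its 256-character range.
try:
    eastern.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(sizes, ascii_compatible, latin1_ok)
```

For these samples the ASCII text takes 11 bytes in UTF-8 and 22 in UTF-16, while the Chinese text takes 12 bytes in UTF-8 and 8 in UTF-16, which matches the efficiency remarks above. The attempt to encode the Chinese text as ISO-8859-1 raises UnicodeEncodeError.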
