Character encoding is one of those topics in programming that frustrates learners. If every string (str) were English, there would be little to say. But that is not the case: we have to work with Chinese. So even beginners need to understand character encoding and be able to solve the problems it causes.
>>> name = '老齐'
>>> name
'\xe8\x80\x81\xe9\xbd\x90'
Have you run into this situation in your own programs? Do you know what the last line printed means? English looks much friendlier:
>>> name = "Qiwsir"
>>> name
'Qiwsir'
Is this the fault of Chinese? It seems that where you are born really is a matter of skill. Joking aside, the problem above is not the fault of Chinese.
What is encoding? This is a somewhat fuzzy question, and it is hard to give a tidy definition. Some textbooks offer one; I would not dare say their definition is wrong, but at the very least it is not easy to understand.
In ancient warfare, beating drums to advance and sounding gongs to retreat was a form of encoding. The orders to be passed to the soldiers were mapped to some other form. Take the command "attack"; it was transmitted like this:
The officer gives the order to attack, and the herald encodes the command as drumbeats (in a more elaborate scheme, the number and pattern of beats could also say how to attack).
Drumbeats travel through the air farther than a herald's voice, and the soldiers hear them without ambiguity; soldiers do not generally mistake drumbeats for snoring. This is the advantage of encoding the "attack" command as drumbeats.
When the soldiers hear the drums, they receive the information, and if they have been trained or told in advance, they know it means attack. This process is decoding. An encoding scheme therefore needs two copies: one with the sender of the information and one with the receiver. Only after decoding does the soldier understand the order and act on it.
The process above is relatively simple. Real encoding and decoding are more complex, but the principle is the same.
Here is another example that seems distant but that people in fact used for a long time: the telegraph.
The telegraph is a communication service invented in the early 19th century, and it was the earliest way of communicating by electricity. It was one of the important inventions of industrial society and sped up the flow of information. Early telegraphs could only communicate over land; later, submarine cables made transoceanic service possible. By the early 20th century, radiotelegraphy had extended telegraph service to most of the Earth. Telegrams are mainly used to transmit text; using telegraph technology to transmit pictures is called fax.
The first telegraph line to appear in China dates from 1871: a submarine cable laid by Britain, Russia and Denmark from Hong Kong via Shanghai to Nagasaki, Japan. Because the Qing government objected, the cable was barred from landing in Shanghai. The Danish company later ignored the ban, brought the line into the Shanghai International Settlement, and began sending and receiving telegrams on June 3. The first independently built line was constructed in Taiwan by Fujian Governor Ding Richang and completed in October 1877, connecting Tainan and Kaohsiung. In 1879, Li Hongzhang, the Beiyang minister, set up telegraph lines between Tianjin, Dagu and Beitang for military communications. In 1880, Li Hongzhang received approval to set up the General Telegraph Office, headed by Sheng Xuanhuai, and in December 1881 telegraph service between Tianjin and Shanghai opened.
Li Hongzhang said: "Over the past five years, our country has built telegraph lines along the rivers and coasts of the provinces, more than 10,000 li in total. The state has spent little; the money came from the people. At the time the French were provoking us; generals reported military affairs and the court conveyed instructions without any hindrance. Never since ancient times has China conducted military operations so swiftly. Ministers abroad exchange questions and answers, and though thousands of li apart it is as if they shared one courtyard. Setting up the telegraph serves three purposes at once: it guards against foreign aggression, strengthens national defense, and benefits commerce."
The Tianjin Official Telegraph Bureau was completely destroyed in the turmoil of the Gengzi year (1900). In 1887, Taiwan Governor Liu Mingchuan laid a submarine cable from Fuzhou to Taiwan, China's first submarine cable. In 1884, construction of the Beijing telegraph began, following the plan to "install two lines, extended from Tongzhou to the capital, one end led into the government office and the other placed at a chosen site for the use of merchants". On August 5 of the same year, construction of the line began, and all the telegraph poles were painted red. On August 22, the Outer City commercial telegraph office opened in Xique (Magpie) Hutong, west of the street outside Chongwen Gate in Beijing. On August 30, the office at Lügong Hall, west of Paozihe inside Chongwen Gate, opened to send and receive official telegrams exclusively.
To transmit Chinese characters, the telegraph service prepared a code made up of 4 digits or 3 Roman letters, known as the Chinese telegraph code. Chinese characters are rewritten as codes before sending, and the codes are rewritten as Chinese characters after the telegram is received.
Note, reader, that the passage mentions a "Chinese telegraph code". It is an encoding: it maps Chinese characters to Arabic numerals so that Chinese characters can be sent by telegraph.
In 1873, Viguier, a Frenchman stationed in China, followed the radical ordering of the Kangxi Dictionary, selected more than 6,800 commonly used Chinese characters, and compiled the first Chinese telegraph code book, the "New Book of Telegraphy".
The code used in telegrams is called Morse code.
Morse code is a signal code that is switched on and off over time; it expresses different letters, numerals and punctuation marks through different sequences. It was invented by the American Samuel Morse in 1836.
Morse code is an early form of digital communication, but unlike modern binary codes, which use only the two states 0 and 1, its code has five elements: the dot (.), the dash (-), a short pause within each character (the pause between dots and dashes), a medium pause between words, and a long pause between sentences.
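The dot, dash and pause scheme described above can be sketched in a few lines of Python 3. This is only an illustration: the table holds six letters of International Morse code rather than the full alphabet, the function names morse_encode and morse_decode are made up for this example, and a single space and " / " stand in for the pauses between letters and between words.

```python
# A small subset of International Morse code (the full table is much longer).
MORSE_TABLE = {
    "A": ".-",
    "E": ".",
    "N": "-.",
    "O": "---",
    "S": "...",
    "T": "-",
}

# The receiver's copy of the scheme: the same table, inverted.
REVERSE_TABLE = {code: letter for letter, code in MORSE_TABLE.items()}


def morse_encode(text):
    """Encode letters as dots and dashes.

    A space marks the pause between letters, " / " the pause between words.
    """
    words = text.upper().split()
    return " / ".join(
        " ".join(MORSE_TABLE[ch] for ch in word) for word in words
    )


def morse_decode(signal):
    """Turn a dot/dash signal back into letters using the inverted table."""
    words = signal.split(" / ")
    return " ".join(
        "".join(REVERSE_TABLE[code] for code in word.split()) for word in words
    )


encoded = morse_encode("sos")
print(encoded)                # ... --- ...
print(morse_decode(encoded))  # SOS
```

Like the herald and the soldiers, sender and receiver each hold a copy of the same scheme: one table for encoding, its inverse for decoding.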
It seems that being a telegraph operator is skilled work: pauses of different lengths carry different meanings. Oh, and there is an old film, "The Eternal Wave"; after watching it you are sure to have an idea of how telegrams are encoded.
Morse code was used as an international standard in maritime communications until 1999. In 1997, when the French Navy stopped using Morse code, the last message it sent was: "Attention, all: this is our last cry before our eternal silence!"
I stared at this for a long time: aren't these two lines the same?
Either way, in short, this is encoding.
Character encoding in the computer
Here is Wikipedia's explanation of character encoding:
Character encoding maps the characters of a character set to objects in a specified set (for example bit patterns, sequences of natural numbers, octets, or electrical pulses) so that text can be stored in a computer and transmitted over communication networks. Common examples include encoding the Latin alphabet as Morse code and as ASCII. ASCII numbers the letters, digits and other symbols and represents each number as a 7-bit binary integer. An extra bit is usually added so that each character is stored in one byte.
In the early days of computing, character sets such as ASCII (1963) and EBCDIC (1964) gradually became standards. But the limitations of these character sets soon became apparent, and many methods were developed to extend them. Writing systems that need far more characters, including the East Asian CJK family, called for a systematic rather than an ad hoc way of encoding them.
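To make the 7-bit point concrete, here is a small Python 3 sketch (Python 3 is an assumption made so that the example runs on a current interpreter). It prints the ASCII number and 7-bit pattern of a few characters and then shows that a Chinese character has no ASCII code at all. The variable names are only illustrative.

```python
# Each ASCII character is a number below 128, so it fits in 7 bits.
for ch in "Az0":
    code = ord(ch)
    print(ch, code, format(code, "07b"))

# Encoding English text to ASCII yields one small integer per character.
ascii_bytes = "Qiwsir".encode("ascii")
print(list(ascii_bytes))

# A Chinese character lies outside the ASCII range, so encoding it fails.
try:
    "中".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)  # False
```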
There are many different character encodings in the world. They were not each invented from scratch, though; they usually build on a common base, most often the encoding called ASCII. That probably includes North Korea too (I do not know what character encoding they use; this is just a guess, so do not take it seriously; it represents no textbook's position, only my own guesswork).
ASCII (pronounced /ˈæski/, ASS-kee; American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet. It is mainly used to display modern English; its extended version, EASCII, can partially support other Western European languages. It is equivalent to the international standard ISO/IEC 646. The World Wide Web made ASCII widely used, until from December 2007 it was gradually replaced by Unicode.
As the quote above says, the encoding standard we use today is no longer ASCII. When I was in college, ASCII was what my teacher taught. (The most frustrating thing is university education. A few days ago I interviewed a recent computer science graduate who told me his teacher had taught them that the encoding standard is ASCII. I said, do not bad-mouth your teacher, go and check the textbook. Today he actually sent me a text message to say that the textbook says exactly that.) Times have changed, and the standard is now Unicode. So what is Unicode? Again, here is a passage copied from Wikipedia (to be clear, this is Wikipedia speaking, not me, qiwsir; I am only playing a supporting role, haha):
Unicode (known in Chinese as Universal Code, International Code, Unified Code, or Single Code) is an industry standard in computer science. It organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way.
Unicode develops alongside the Universal Character Set standard and is also published in book form. Unicode is still being revised, and each new version adds more characters. The latest version (at the time of writing) is 7.0.0, which contains more than 100,000 characters (the 100,000th character was adopted in 2005). Besides visual glyphs, encoding methods and standard character encodings, Unicode also covers character properties such as upper and lower case.
Listen to the name: universal code. It must include Chinese, and indeed it does. However, Unicode alone is not enough, because .... (a few words are omitted here; reader can follow the Wikipedia link given above), so there are concrete encoding implementations as well. An implementation of Unicode is called a Unicode Transformation Format (UTF for short), and among them is UTF-8, which we will see many times.
What is UTF-8? Again, let's see what Wikipedia says:
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and also a prefix code. It can represent any character in the Unicode standard, and the first byte of its encoding remains compatible with ASCII, so software that originally handled ASCII characters can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for e-mail, web pages and other applications that store or transmit text.
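"Variable-length" and "compatible with ASCII" are easy to check for yourself. The following Python 3 sketch (Python 3 is assumed here; the sample characters are my own choice) encodes four characters to UTF-8 and records how many bytes each one takes.

```python
# Four sample characters, written as escapes so the file itself stays ASCII:
# "A", e-acute, the Chinese character zhong, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # Print the code point and the encoded bytes.
    print("U+%04X" % ord(ch), encoded, len(encoded))

print(lengths)  # [1, 2, 3, 4]

# For pure ASCII text, UTF-8 and ASCII produce exactly the same bytes.
ascii_compatible = "Qiwsir".encode("utf-8") == "Qiwsir".encode("ascii")
print(ascii_compatible)  # True
```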
No more quotes; if you want to read more, please go to the original.
Reader may now understand the line that appeared in the programs written earlier: coding:utf-8. It tells Python which character encoding our source file uses.
Encode and Decode
The history part is done, so what next? It is rather troublesome, because whichever way you approach it, a few words will not make it clear. Let's start with the two built-in functions encode() and decode().
codecs.encode(obj[, encoding[, errors]]): Encodes obj using the codec registered for encoding.
codecs.decode(obj[, encoding[, errors]]): Decodes obj using the codec registered for encoding.
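A minimal sketch of the two functions in use, written for Python 3 (an assumption; there the text type is str and the encoded type is bytes). The variable names are illustrative only.

```python
import codecs

text = "\u4e2d"  # the Chinese character zhong, as a text string

# Text -> bytes, using the codec registered under the name "utf-8".
data = codecs.encode(text, "utf-8")
print(data)  # b'\xe4\xb8\xad'

# Bytes -> text, the inverse operation with the same codec.
back = codecs.decode(data, "utf-8")
print(back == text)  # True

# The optional errors argument decides what happens when a character
# cannot be encoded; "replace" substitutes a question mark.
lossy = codecs.encode(text, "ascii", "replace")
print(lossy)  # b'?'
```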
In Python 2 the default encoding is ASCII. encode converts an object to the specified encoding, and decode is the inverse process.
Let's do an experiment to understand:
>>> a = "中"
>>> type(a)
<type 'str'>
>>> a
'\xe4\xb8\xad'
>>> len(a)
3
>>> b = a.decode('utf-8')    # name the codec: the ASCII default cannot decode these bytes
>>> b
u'\u4e2d'
>>> type(b)
<type 'unicode'>
>>> len(b)
1
Before this experiment, reader may not have been confused at all (the more you know, the more confused you get); after it, you may well be. Do not be impatient: understanding encoding takes time. If it does not make sense right away, just follow the steps for now; after doing it enough times it suddenly becomes clear.
In the experiment above, the variable a refers to a string (str). Strictly speaking it is a byte string: a sequence of bytes produced by encoding. What you see in the experiment is the byte representation of the character "中" as encoded in the computer (reader can Google "byte" for background). len(a) measures its length: it is made up of three bytes.
The decode function then converts the byte string into a Unicode string. In a Unicode string, one Chinese character is one character, so its measured length is 1.
Conversely, a Unicode string can also be converted back into a byte string.
>>> c = b.encode('utf-8')
>>> c
'\xe4\xb8\xad'
>>> type(c)
<type 'str'>
>>> c == a
True
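For readers on Python 3, the same round trip can be sketched as follows. This is an assumption-laden translation rather than the original session: Python 3 calls the byte string bytes and the Unicode string str. The last part shows why the codec should be named explicitly: the three bytes are not valid ASCII.

```python
a = b"\xe4\xb8\xad"       # the three UTF-8 bytes from the experiment
b = a.decode("utf-8")     # bytes -> text
print(len(a), len(b))     # 3 1

c = b.encode("utf-8")     # text -> bytes
print(c == a)             # True

# Decoding with the wrong codec fails: bytes above 127 are not ASCII.
try:
    a.decode("ascii")
    ascii_decoded = True
except UnicodeDecodeError:
    ascii_decoded = False
print(ascii_decoded)      # False
```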
That is enough about encoding for now; let's get to the point, because all of this leads to a practical problem. Reader may not be satisfied, since the why is still unexplained. That does not matter: Google it and you will find the answers.
How to avoid garbled Chinese in Python
This is a very practical problem. Here is a summary of my experience, shared for reference:
First, I recommend the UTF-8 encoding scheme, because it works well across platforms.
Experience one: declare the encoding at the top of the file:
# -*- coding: utf-8 -*-
A friend asked me what the -*- is for. It is there to look good; everyone loves beauty, and programmers are no exception. Of course, you can also write:
# coding:utf-8
Experience two: when you encounter a character (byte) string, convert it to Unicode immediately. Do not use str(); use unicode() directly:
unicode_str = unicode('中文', encoding='utf-8')
print unicode_str.encode('utf-8')
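Why convert right away? A short Python 3 sketch (Python 3 is assumed, and first_character is a hypothetical helper written for this example) shows what goes wrong when you process the bytes directly: slicing bytes can cut a character in half, while slicing decoded text cannot.

```python
def first_character(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode on the way in, work on text, encode on the way out."""
    text = raw_bytes.decode(encoding)    # bytes -> text as soon as data arrives
    return text[:1].encode(encoding)     # slice whole characters, then re-encode


raw = "\u4e2d\u6587".encode("utf-8")     # the two characters of "Chinese", as 6 bytes

broken = raw[:1]                # slicing bytes keeps only a fragment of a character
whole = first_character(raw)    # slicing text keeps the character intact

print(broken)  # b'\xe4'
print(whole)   # b'\xe4\xb8\xad'
```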
Experience three: for file operations, it is best to open files with codecs.open instead of open (this will be covered later; it is noted here first):
codecs.open('filename', encoding='utf8')
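Here is a runnable sketch of the same idea. Note the substitution: it uses io.open rather than codecs.open, because io.open accepts the same encoding argument and works on both Python 2.6+ and Python 3 (where it is the built-in open). The file name demo.txt and the temporary directory are placeholders for this example.

```python
import io
import os
import tempfile

# A throwaway file in a temporary directory (placeholder path).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write text; the encoding argument turns it into UTF-8 bytes on disk.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\u4e2d\u6587")

# Read it back as text: two characters.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Read it back as raw bytes: six bytes, three per character.
with io.open(path, "rb") as f:
    raw = f.read()

print(len(text), len(raw))  # 2 6
```

Naming the encoding at the point where the file is opened means the rest of the program only ever sees Unicode text.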
I also found a good article online and recommend it to reader: python2.x Chinese display method
Finally, one more thing: if you use Python 3, you will not have to worry about these annoying encoding problems.