Python Encoding Knowledge
- Introduction to encoding
- Types
- Storage units
1. Introduction to Encoding
Encoding is the process of converting information from one form or format into another. ("Code" is also used as shorthand for a program written in a computer programming language.) Text, numbers, or other objects are converted into digital form by a predetermined method, or information and data are converted into prescribed electrical pulse signals.
2. Types
A. ASCII (8-bit)
The most widely used character set and encoding in computers is ASCII, standardized by the American National Standards Institute (ANSI). It covers the basic Latin letters and comes in two forms, a 7-bit code and an 8-bit code.
Inside the computer, all information is ultimately represented as a binary string. Each bit has two states, 0 and 1, so eight bits can be combined into 256 states; such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to a symbol, for 256 symbols in total, from 00000000 to 11111111.
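As a quick check of the arithmetic above, here is a minimal Python sketch (the variable names are illustrative and not from the original text) that counts the states of one byte and writes the smallest and largest values as 8-digit binary strings:

```python
# Each bit has two states, so n bits give 2**n combinations.
BITS_PER_BYTE = 8
states = 2 ** BITS_PER_BYTE

# format(value, "08b") renders an integer as 8 binary digits, zero-padded.
lowest = format(0, "08b")
highest = format(states - 1, "08b")

print(states)   # 256
print(lowest)   # 00000000
print(highest)  # 11111111
```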
ASCII defines codes for 128 characters in total (including 33 non-printing control characters, codes 0-31 and 127). These 128 symbols use only the last 7 bits of a byte; the first bit is always set to 0.
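The claim that ASCII uses only the low 7 bits can be checked with Python's built-in `ord()` and `str.encode()`. This is a small sketch; the letter "A" is just a sample character chosen here:

```python
# ord() returns a character's code number; for ASCII it is always below 128.
code = ord("A")
bits = format(code, "08b")  # the byte written as 8 binary digits

# Encoding to ASCII produces exactly one byte per character.
encoded = "A".encode("ascii")

# Every one of the 128 ASCII codes has its highest bit set to 0.
all_high_bits_zero = all((value & 0b10000000) == 0 for value in range(128))

print(code)                # 65
print(bits)                # 01000001
print(encoded)             # b'A'
print(all_high_bits_zero)  # True
```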
Cons: 128 symbols are enough to encode English, but not enough to represent other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. As a result, some European countries decided to use the idle highest bit of the byte to add new symbols. For example, the code for é in the French encoding is 130 (binary 10000010). In this way, the encodings used in these European countries can represent at most 256 symbols.
This creates a new problem, however. Different countries have different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, 0-127 represent the same symbols; they differ only in the range 128-255.
Asian countries use far more symbols; there are about 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65,536 symbols.
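The clash over code 130 can be reproduced in Python. The original text does not name the specific national encodings, so the three legacy code pages below (`cp437`, `cp862`, `cp866`) are assumptions chosen for illustration. The last lines show GB2312 using two bytes for one Chinese character:

```python
# One byte with value 130 (binary 10000010), outside the 0-127 ASCII range.
raw = bytes([130])

# The same byte decoded with three legacy single-byte code pages.
# Which code pages to compare is an assumption made for this sketch.
western = raw.decode("cp437")  # Western European letters
hebrew = raw.decode("cp862")   # Hebrew letters
russian = raw.decode("cp866")  # Cyrillic letters

print(western, hebrew, russian)

# Codes 0-127 decode to the same symbols in all three code pages.
same_low_half = all(
    bytes([n]).decode("cp437")
    == bytes([n]).decode("cp862")
    == bytes([n]).decode("cp866")
    for n in range(128)
)
print(same_low_half)

# GB2312 spends two bytes on one Chinese character.
gb_bytes = "中".encode("gb2312")
print(len(gb_bytes))  # 2
```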
B. Unicode (32-bit)
There are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if the bytes are interpreted with the wrong encoding, the text comes out garbled. Why do e-mails often appear garbled? Because the sender and the recipient are using different encodings.
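The garbling described above is easy to reproduce. In this sketch the sample word and the choice of Latin-1 as the "wrong" encoding are assumptions made for illustration:

```python
text = "café"

# The sender stores the text as UTF-8 bytes.
sent = text.encode("utf-8")

# The recipient wrongly assumes Latin-1 and gets garbled output.
garbled = sent.decode("latin-1")

# Decoding with the encoding that was actually used recovers the text.
recovered = sent.decode("utf-8")

print(garbled)    # cafÃ©
print(recovered)  # café
```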
Imagine a single code that includes every symbol in the world, with each symbol given a unique code; the garbling problem would then disappear. This is Unicode: as its name indicates, it is one encoding for all symbols.
Cons: it wastes memory. We already know that one byte is enough for an English letter. If Unicode uniformly required every symbol to be stored in three or four bytes, then two or three bytes of every English letter would be 0, which is a great waste of storage: a text file would become three or four times as large, which is unacceptable.
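To see the waste concretely, the sketch below stores a short English word at a fixed four bytes per symbol. Using Python's `utf-32-be` codec as the fixed-width form is an assumption of this example; the original text does not name a specific one:

```python
text = "hello"

# Fixed width: four bytes per symbol (big-endian, no byte-order mark).
fixed = text.encode("utf-32-be")

# One byte per symbol is enough for English letters.
compact = text.encode("ascii")

# Count how many of the fixed-width bytes are just zero padding.
zero_bytes = fixed.count(0)

print(len(compact), len(fixed), zero_bytes)  # 5 20 15
print("A".encode("utf-32-be"))               # b'\x00\x00\x00A'
```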
C. UTF-8 (English: 8-bit; European scripts: 16-bit; Chinese characters: 24-bit)
UTF-8 is the most widely used implementation of Unicode on the Internet. Its most distinctive feature is that it is a variable-length encoding: it uses 1 to 4 bytes to represent a symbol, and the number of bytes varies with the symbol.
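The variable length is visible when a few symbols are encoded and their byte counts compared. The four sample characters are chosen here for illustration:

```python
# One sample each: English letter, accented European letter,
# Chinese character, emoji.
samples = ["A", "é", "中", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

print(lengths)  # {'A': 1, 'é': 2, '中': 3, '😀': 4}
```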
The encoding rules of UTF-8 are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits hold the symbol's Unicode code. So for English letters, the UTF-8 encoding and the ASCII code are identical.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits, those not mentioned, hold the symbol's Unicode code.
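The two rules can be followed by hand in a few lines of Python. This is a teaching sketch, not a replacement for the built-in codec: `utf8_encode` is a hypothetical helper written for this example, and it does not validate its input (for instance, it does not reject surrogate code points).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the two rules."""
    # Rule 1: single byte, leading bit 0, remaining 7 bits hold the code.
    if code_point < 0x80:
        return bytes([code_point])

    # Rule 2: choose the byte count n from the size of the code point.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Each following byte is 10xxxxxx: peel off 6 bits at a time,
    # starting from the low end of the code point.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n leading 1 bits, then a 0, then the leftover bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    first = prefix | code_point
    return bytes([first] + tail[::-1])


for ch in "Aé中😀":
    encoded = utf8_encode(ord(ch))
    # The hand-made bytes agree with Python's built-in UTF-8 codec.
    assert encoded == ch.encode("utf-8")
    print(ch, " ".join(format(b, "08b") for b in encoded))
```

In the printed bit patterns, the leading bits of each byte (0, 110, 1110, 11110, and 10 for following bytes) show which rule applied.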
D. GBK (16-bit)
GBK is a Chinese national standard encoding. It is less universal than UTF-8, but UTF-8 takes more space than GBK for Chinese text.
UTF-8 is an international encoding and is more universal: visitors from other countries can browse a forum that uses it, and Chinese text is still displayed correctly. If your forum needs to be international, UTF-8 is the one to use.
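The trade-off between the two can be measured directly. In this sketch the sample strings are assumptions chosen for illustration; the emoji stands in for a symbol outside GBK's repertoire:

```python
text = "中文编码"

# GBK uses two bytes per Chinese character; UTF-8 uses three.
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")
print(len(gbk_bytes), len(utf8_bytes))  # 8 12

# UTF-8 can encode any Unicode symbol; GBK cannot.
try:
    "😀".encode("gbk")
    gbk_can_encode_emoji = True
except UnicodeEncodeError:
    gbk_can_encode_emoji = False

print(gbk_can_encode_emoji)        # False
print(len("😀".encode("utf-8")))   # 4
```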
3. Storage units
The smallest unit of data in a computer is the bit; the byte is the basic unit of storage.
1 byte = 8 bits
1 KB = 1024 bytes
1 MB = 1024 KB
1 GB = 1024 MB
1 TB = 1024 GB
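The conversion factors above can be applied in a small helper. `human_size` and `UNITS` are hypothetical names introduced for this sketch:

```python
UNITS = ["byte", "KB", "MB", "GB", "TB"]


def human_size(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or at the largest unit.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:g} {unit}"
        size /= 1024  # step up to the next unit


print(human_size(1024))       # 1 KB
print(human_size(1536))       # 1.5 KB
print(human_size(1024 ** 3))  # 1 GB
```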
Python Path, Chapter 3: Python Encoding Knowledge