Python encoding issues
Python represents strings internally as Unicode, so an encoding conversion usually uses Unicode as the intermediate form: first decode the source string into Unicode, then encode the Unicode into the target encoding.
decode converts a string in some other encoding into Unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 into Unicode.
encode converts Unicode into a string in some other encoding. For example, str2.encode('gb2312') converts the Unicode string str2 into gb2312.
Therefore, before transcoding you must know which encoding the string str is in. Then decode it into Unicode, and then encode it into the target encoding. The default encoding of a string literal in your code is the same as the encoding of the source file itself.
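The two steps above can be sketched as follows. This is a minimal illustration, not code from the original example: the sample text and variable names are made up, and the u"" prefix is used so the snippet also runs on Python 3, where the byte string type is called bytes rather than str.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; u"" marks it as Unicode text on Python 2 and 3.
text = u"中文"

# Pretend these bytes arrived from a gb2312 source (a file, a socket, a terminal).
gb_bytes = text.encode("gb2312")

# Step 1: decode the source bytes into Unicode, naming their actual encoding.
as_unicode = gb_bytes.decode("gb2312")

# Step 2: encode the Unicode text into the target encoding.
utf8_bytes = as_unicode.encode("utf-8")

print(len(gb_bytes))    # 4: gb2312 uses two bytes per Chinese character here
print(len(utf8_bytes))  # 6: UTF-8 uses three bytes per Chinese character here
```

The same text produces different byte sequences in each encoding, which is why the source encoding has to be known before decoding.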
The following example troubled me for a long time while I was writing a website crawler. Web pages are generally encoded in utf-8, while the Windows terminal uses GBK.
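import platform
if platform.system() == "Windows":
    # This source file is utf-8; the Windows console expects GBK, so re-encode the prompt.
    kw = raw_input("请输入关键字(多个关键字请以空格隔开):".decode("utf-8").encode("gbk"))
    # The console returns GBK bytes; convert them to utf-8 for the web request.
    kw = kw.decode("gbk").encode("utf-8")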
Note: kw is the keyword typed in Chinese. It is submitted to the page being crawled, so it must be converted to utf-8. First decode kw, which is already GBK-encoded, into Python's internal Unicode, and then encode that Unicode into the utf-8 string the web page expects.
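The conversion of kw can be wrapped in a small helper. The sketch below is only an illustration: transcode is a hypothetical name, and the console input is simulated with a fixed keyword so that it runs without a terminal (and on Python 3 as well). It also shows what happens when the source encoding is guessed wrongly.

```python
# -*- coding: utf-8 -*-
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by passing through Unicode (hypothetical helper)."""
    return data.decode(source_encoding).encode(target_encoding)

# Simulated console input: the bytes a GBK terminal would deliver for this keyword.
kw_from_console = u"关键字".encode("gbk")

# GBK -> Unicode -> utf-8, ready to be submitted to a utf-8 web page.
kw_for_web = transcode(kw_from_console, "gbk", "utf-8")

# Guessing the wrong source encoding fails: these GBK bytes are not valid utf-8.
try:
    kw_from_console.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```

Keeping the source and target encodings as explicit arguments makes it harder to decode with the wrong one by accident.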
Character encoding overview
UTF-8 is the most widely used implementation of Unicode on the Internet.
- Encoding rules for UTF-8
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, the UTF-8 encoding is identical to ASCII.
2) For an n-byte symbol (n>1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every subsequent byte are set to 10. All the remaining bits hold the symbol's Unicode code point.
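To make the two rules concrete, here is a hand-written encoder for a single code point. It is a sketch for illustration only: the function name is made up, it stops at 4 bytes (the limit in current UTF-8), and it does not validate the input. In real code, use the built-in encode("utf-8").

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point by hand, following the two rules above."""
    if cp < 0x80:
        # Rule 1: a single byte, leading bit 0, 7 payload bits.
        return bytes(bytearray([cp]))
    # Rule 2: choose the byte count n from the size of the code point.
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4  # assumption: no check for values above 0x10FFFF or surrogates
    out = bytearray(n)
    # Fill the trailing bytes from last to second: 10xxxxxx, 6 payload bits each.
    for i in range(n - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # First byte: n ones, then a zero, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    out[0] = prefix | cp
    return bytes(out)

# U+4E2D needs 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
encoded = utf8_encode_code_point(0x4E2D)
print(encoded == b"\xe4\xb8\xad")  # True
```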
In short, interpreting UTF-8 is very simple. If the first bit of a byte is 0, that byte is a complete character on its own. If the first bit of a character's first byte is 1, the number of consecutive leading 1s is the number of bytes that character occupies.
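The reading rule can be sketched as a function that counts the leading 1 bits of a first byte, plus a loop that uses it to split a byte string into characters. Both function names are hypothetical, and the sketch assumes well-formed input apart from one basic check.

```python
# -*- coding: utf-8 -*-
def utf8_sequence_length(lead_byte):
    """Return how many bytes the character starting with lead_byte occupies."""
    if (lead_byte & 0x80) == 0:
        return 1  # first bit is 0: a single-byte character
    count = 0
    mask = 0x80
    while lead_byte & mask:  # count consecutive leading 1 bits
        count += 1
        mask >>= 1
    if count == 1 or count > 4:
        # 10xxxxxx is a trailing byte, and more than four 1s is not valid UTF-8.
        raise ValueError("not a valid first byte: 0x%02X" % lead_byte)
    return count

def split_utf8(data):
    """Split UTF-8 bytes into one byte string per character."""
    data = bytearray(data)  # gives integer bytes on both Python 2 and 3
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(bytes(data[i:i + n]))
        i += n
    return chars

parts = split_utf8(u"a中".encode("utf-8"))
print(len(parts))  # 2: one 1-byte character and one 3-byte character
```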
String encoding issues