Character encoding in Python 2.x gives many people a real headache. When a problem comes up, a fix found online may make the error go away, but the underlying principle stays only half understood. This article mainly introduces how Python handles string encoding, and uses JSON file output to show how to get Chinese characters written instead of Unicode escapes. It first briefly covers the history of string encoding, then string processing and the detection and conversion of encodings, and finally the problem of Chinese output when JSON data is stored in a Python crawler.
Reference book: Python crawler from beginner to practice by Thalictrum
In both Python 2 and Python 3, a string is in one of only two forms:
(1) the universal Unicode form;
(2) Unicode converted to a specific encoding, such as UTF-8 or GBK;
1. Computer History:
The computer processes only numbers, so text must be converted to numbers before it can be processed.
8 bits = 1 byte = 256 different states = from 00000000 to 11111111;
1 GB = 1024 MB = 1024 × 1024 KB = 1024 × 1024 × 1024 B;
ASCII encoding maps English characters to binary numbers. ASCII defines 128 characters in total; for example, capital A is 65, that is, 01000001, so one letter takes one byte;
GB2312 is the common encoding for Simplified Chinese. It uses two bytes for one Chinese character, so in theory there are 256*256 = 65536 possible codes;
Each country developed its own encoding. To give every country one platform for converting and processing text, Unicode was created as a single unified code. Unicode was typically stored as two bytes per character. For ASCII characters, the difference between Unicode and ASCII is that Unicode adds a zero byte in front: the ASCII encoding of the letter A is 01000001, while its Unicode encoding is 00000000 01000001. But one byte is enough for the English alphabet, so writing English in this form costs an extra byte per character and wastes storage space. Unicode therefore developed the Unicode Transformation Format (UTF); the common ones are UTF-8 and UTF-16;
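The storage trade-off described above can be checked directly. The following is a minimal sketch, assuming Python 3 (where str is Unicode and encode() returns bytes); the two sample characters and the helper name encoded_length are chosen for illustration only.

```python
# Compare how many bytes one character occupies in different encodings.
samples = ["A", "中"]
encodings = ["ascii", "gb2312", "utf-8", "utf-16-be"]


def encoded_length(ch, encoding):
    """Return the number of bytes ch needs in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


sizes = {(ch, enc): encoded_length(ch, enc) for ch in samples for enc in encodings}
for (ch, enc), size in sizes.items():
    # Print the code point rather than the character so the output is plain ASCII.
    print("U+%04X" % ord(ch), enc, size)

# The two-byte form of "A" is the ASCII byte with a zero byte in front of it.
two_byte_a = "A".encode("utf-16-be")
```

In UTF-8 the letter A stays at one byte while the Chinese character grows to three, which is the compromise that makes UTF-8 attractive for mostly-English text.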
2. Python character encoding
Reference Address: https://www.jb51.net/article/139878.htm
(1) encode and decode: encode converts a Unicode object into a string in another encoding, e.g. str.encode('utf-8') encodes to UTF-8; decode converts a string in another encoding into Unicode, e.g. str.decode('utf-8');
- Import chardet to check the specific encoding type:
chardet.detect(str)
Here str must not already be Unicode: the method does not accept a Unicode argument, and passing one raises TypeError: Expected object of type bytes or bytearray, got: <type 'unicode'>;
- As the unified standard, Unicode cannot be decoded any further. To convert UTF-8 to another non-Unicode encoding, first decode to Unicode and then encode to the target encoding.
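The decode-then-encode rule can be sketched as follows. This is a minimal illustration, written for Python 3; the u'' literals keep it valid on Python 2.7 as well. The sample text and variable names are assumptions, not taken from the crawler.

```python
# -*- coding: utf-8 -*-
# Transcode UTF-8 bytes to GBK by going through Unicode.
utf8_bytes = u"中文".encode("utf-8")   # pretend these bytes came from a web page

text = utf8_bytes.decode("utf-8")      # step 1: UTF-8 bytes -> Unicode
gbk_bytes = text.encode("gbk")         # step 2: Unicode -> GBK bytes

# Decoding the GBK bytes with the matching codec gives the same Unicode text back.
restored = gbk_bytes.decode("gbk")

print(len(utf8_bytes), len(gbk_bytes))
```

The same two characters take a different number of bytes in each encoding, but the Unicode text in the middle is identical, which is why Unicode works as the pivot.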
When you crawl a web page, you can see how the page is encoded by pressing F12 and finding the charset in the meta tag under Elements.
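The same check can be done in code once the raw page bytes are available. The sketch below is an assumption-laden illustration, not part of the original crawler: the HTML snippet, the regular expression, and the helper name declared_charset are all hypothetical, and a real page may declare its encoding in other ways (for example in the HTTP headers).

```python
import re

# Hypothetical page content; in a real crawler these bytes would come from the HTTP response.
html_bytes = b'<html><head><meta charset="gbk"><title>demo</title></head></html>'

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
CHARSET_RE = re.compile(br'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(page_bytes, default="utf-8"):
    """Return the charset named in the first matching <meta> tag, or default."""
    match = CHARSET_RE.search(page_bytes)
    return match.group(1).decode("ascii").lower() if match else default


print(declared_charset(html_bytes))
```

The returned name can then be passed to decode() to turn the page bytes into Unicode.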
(2) Writing Chinese to JSON: dictionaries in Python can be serialized and stored in a JSON file
import json
with open("anjuke_salehouse.json", "w", encoding='utf-8') as f:
    json.dump(all_house, f, ensure_ascii=False, sort_keys=True, indent=4)
    print(u'加载入文件完成...')  # "Finished writing to file..."
Storing data
- The first parameter of dump() is the object to serialize and the second is the open file handle. Note that the file must be opened by open() with UTF-8 encoding, and dump() must also be given ensure_ascii=False; otherwise the Chinese characters are written to the JSON file as ASCII \uXXXX escape sequences:
json.dump(all_house, f, ensure_ascii=False, sort_keys=True, indent=4)
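The effect of ensure_ascii on the file contents can be seen by writing the same object twice. This is a minimal sketch assuming Python 3; the record, the helper name dump_and_read, and the use of a temporary file are illustrative assumptions.

```python
import json
import os
import tempfile

# Hypothetical record standing in for one entry of all_house.
record = {"brokername": "王东宇"}


def dump_and_read(obj, ensure_ascii):
    """Dump obj to a temporary UTF-8 JSON file and return the text that was written."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(obj, f, ensure_ascii=ensure_ascii)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


escaped = dump_and_read(record, ensure_ascii=True)    # Chinese stored as \uXXXX escapes
readable = dump_and_read(record, ensure_ascii=False)  # Chinese stored as-is
print(escaped)
print(readable)
```

Both files load back to the same dictionary; the difference is only in whether a person opening the file sees Chinese characters or escape sequences.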
Usage of json.dumps() / json.loads()
json_str = json.dumps(all_house, ensure_ascii=False)  # all_house is a list, dict, or other built-in Python data structure; write it out as JSON
# print json_str  # [{"brokername": "王东宇"},{},{}]
new_dict = json.loads(json_str)  # mainly needed when reading a JSON file
# print new_dict  # {u'house_area': u'95', u'build_year': u'2005'}
- json.dumps() converts a Python data structure into a JSON-encoded string,
{"name": "xiaoming"}
json.loads() converts a JSON-encoded string (in dictionary form) into a Python data structure, {u'name': u'xiaoming'}
After the dumps conversion, the keys and values are wrapped in double quotes; when loads turns the string back into a Python variable, Python 2 displays the elements in single quotes and prefixes each string with u.
As a general requirement, a string passed to loads should be enclosed in single quotes on the outside, with the keys and values inside enclosed in double quotes, e.g. '{"name": "xiaoming"}'.
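The quoting requirement can be demonstrated with two nearly identical strings. This is a minimal sketch; the variable names are illustrative, and the code should behave the same on Python 2.7 and Python 3.

```python
import json

# Valid JSON text: Python-level single quotes outside, JSON double quotes inside.
good = '{"name": "xiaoming"}'
# Not valid JSON: the keys and values use single quotes.
bad = "{'name': 'xiaoming'}"

parsed = json.loads(good)
print(parsed["name"])

try:
    json.loads(bad)
    bad_is_rejected = False
except ValueError:  # json.JSONDecodeError is a subclass of ValueError on Python 3
    bad_is_rejected = True
print(bad_is_rejected)

# dumps always emits double quotes, so its output can be fed straight back to loads.
round_trip = json.loads(json.dumps(parsed))
```

This is why printing a Python dict and pasting the result into a JSON file does not work: the printed form uses Python quoting, not JSON quoting.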
The difference between dump and dumps
dumps(obj, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
dump writes an object into a file. It needs a parameter similar to a file pointer (not a real pointer; it can be called a file-like object) and so combines with file operations: it converts the dict to a str and writes it to the file. In json.dump(all_house, f, ensure_ascii=False, sort_keys=True, indent=4), f is the handle of the JSON file that the data is written to;
dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
dumps, by contrast, returns a str directly: it converts the dictionary to a str without writing to a file, acting as a data format conversion that turns a Python data structure into a JSON string.
- In short, dumps converts a dict into a str, and loads converts a str back into a dict.
dump and load do the same things, just combined with file operations.
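The relationship between the four functions can be sketched with an in-memory file. This is a minimal illustration assuming Python 3; the sample data is hypothetical, and io.StringIO stands in for a real file handle.

```python
import io
import json

# Hypothetical data standing in for all_house.
all_house = [{"house_area": "95", "build_year": "2005", "brokername": "王东宇"}]
options = dict(ensure_ascii=False, sort_keys=True, indent=4)

# dumps returns the JSON text as a str.
as_string = json.dumps(all_house, **options)

# dump writes the same text into a file-like object.
buffer = io.StringIO()
json.dump(all_house, buffer, **options)
written = buffer.getvalue()

# load and loads are the mirror images: one reads from a file, the other from a str.
buffer.seek(0)
from_file = json.load(buffer)
from_string = json.loads(as_string)
```

With the same options, dump puts into the file exactly the text that dumps would have returned.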
(3) Writing Chinese to a txt file
f = open('net_saving_data.txt', 'w', encoding='utf-8')
for item in all_house:
    # house_area = item['house_area']
    # price = item['price']
    output = '\t'.join([str(item['house_area']), str(item['price']), str(item['build_year']), str(item['house_title'])])
    f.write(output)
    f.write('\n')
f.close()
- In Python 2.7.15, this code fails with the error
TypeError: 'encoding' is an invalid keyword argument for this function
because the built-in open() there does not accept an encoding parameter; in version 3.7 you can pass encoding='utf-8' and write Chinese to the txt file.
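One way to write the file with the same code on both versions is io.open, which accepts encoding= on Python 2.6+ and is the same function as the built-in open() on Python 3. The sketch below is not the original author's fix; the sample row, the field list, and the temporary directory are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

# Hypothetical rows standing in for all_house.
all_house = [
    {"house_area": u"95", "price": u"320", "build_year": u"2005", "house_title": u"南北通透"},
]
fields = ["house_area", "price", "build_year", "house_title"]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "net_saving_data.txt")

# io.open takes encoding= on both Python 2 and Python 3.
with io.open(path, "w", encoding="utf-8") as f:
    for item in all_house:
        f.write(u"\t".join(item[name] for name in fields))  # must be unicode text, not bytes
        f.write(u"\n")

# Read the file back to confirm the Chinese text survived the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

shutil.rmtree(tmp_dir)  # clean up the temporary directory
```

On Python 2 the values written must be unicode objects, which is why the sample data uses u'' literals.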
!! NOTE
- Writing Chinese to a txt or JSON file comes down to two things: open() must open the file with UTF-8 encoding, and dump() must be given ensure_ascii=False so that the output is not ASCII-escaped. At first, though, the Python version in use was 2.7.15 rather than 3.7, so the storage kept failing and it looked like a problem with the code. It finally turned out to be a version problem, which was rather painful. There are many questions about Chinese output on the internet, but they do not emphasize the Python version!!! Other 3.x versions have not been tried.
- When reading web page data, check the page charset, use the chardet library to detect the encoding type, and apply decode and encode conversions promptly; this should avoid a lot of encoding problems. The remaining pitfalls can be fixed as you run into them.
Python Chinese encoding & JSON Chinese output problem