Python Chinese encoding &json Chinese output problem

Source: Internet
Author: User
Tags processing text


python2.x version of the character encoding sometimes makes people very headache, encounter problems, online method can solve the error, but the principle or smattering, this article mainly introduces the principle of string processing in Python, with the output of the JSON file to solve the problem of displaying Chinese instead of Unicode. First, we briefly introduce the history of the string encoding, secondly, the processing of the string, the detection and conversion of the code, and finally, the problem of Chinese output when the JSON data is stored in the Python crawler.
Reference book: Python crawler from beginner to practice by Thalictrum



In Python 2 or 3, there are only two types of string encodings:



(1) Universal Unicode encoding;
(2) Converting Unicode to some type of encoding, such as UTF-8,GBK;


1. Computer History:


The computer processes only numbers, so it must be converted to numbers when processing text.
8 bit (bit) = 1 bytes (byte) = 256 different states = from 000000 to 111111;
1gb=1024m=1024 (1024KB) =1024 (1024x768 (1024b));



ASCII encoding is the relationship between the English characters and the binary numbers; ASCII has a total of 128, such as capital A is 65, that is, 01000001; visible one letter one byte;





GB2312 encoding Simplified Chinese common Encoding, two bytes representing a Chinese character, theoretically 256*256 a code, can represent 65536 kinds of Chinese characters;


Different countries coding, for each country can expand the platform for the conversion and processing of text, Unicode is used as a unified code or a single code. Unicode encoding is typically two bytes, the difference between Unicode and ASCII encoding is that Unicode adds a 0 before ASCII encoding, that is, the ASCII encoding of the letter A is 01000001,unicode encoded as 00000000 01000001; But the English alphabet is only a single byte is enough, Unicode encoding to write in English, a byte more, wasting storage space. Thus Unicode has developed a Universal conversion format (Unicode Transformation Format (UTF)), which is common with utf-8 or utf-16;

2. Python ch aracter encoding


Reference Address: www.jb51.net/article/139878.htm



(1) The role of encode is to encode Unicode objects into other encoded strings, Str.encode (' utf-8 '), encoded as UTF-8, and (2) decode to convert other encoded strings into Unicode encoding, Str.decode (' UTF-8 ');


    • Import Chardet Check the specific encoding type, chardet.detect(str) but Str cannot be a Unicode encoding type, but the method does not accept encoded parameters that are already Unicode , there will be typeerror:expected Object of type bytes or ByteArray, got: <type ' Unicode ' > error;
    • As a unified standard, Unicode can no longer be decoded, and if UTF-8 wants to go to other non-Unicode, it must (2) first decode to Unicode, encode to other non-Unicode encodings.


When you crawl a Web page, you can see how the page is encoded in F12 elements meta.






(2) Chinese, dictionaries in Python can be serialized into JSON file to be stored in JSON






With open("anjuke_salehouse.json","w",encoding='utf-8') as f:
     Json.dump(all_house,f,ensure_ascii=False,sort_keys=True, indent=4);
     Print(u'loaded into file completion...');


Storing data





    • The first parameter of dump () is the object to serialize, the second argument is the open file handle, note that when the file is  open edopen() with UTF-8 encoding Open, at dump() the time also add ensure_ascii=false, Otherwise it will become ASCII code written to the JSON filejson.dump(all_house,f,ensure_ascii=False,sort_keys=True, indent=4)


Json.dumps ()/json.loads () etc usage






Json_str = json.dumps(all_house,ensure_ascii=False); #all——books is a data structure that comes with pythons such as lists and dictionaries. Write it as json.
#print json_str; #[{"brokername": "Wang Dongyu"},{},{}]
New_dict = json.loads(json_str);#Mainly when reading json files, you need to use
#print new_dict; #{u'house_area': u'95', u'build_year': u'2005'} 
    • Json.dumps () is the conversion of a Python data structure into a JSON-encoded string,
      {"Name": "Xiaoming"}





Json.loads () Converts a JSON-encoded string (dictionary form) into a Python data structure, {u ' name ': U ' xiaoming '}






After the dumps conversion, the keys and values become double quotes , and when the loads becomes a Python variable, the elements become single quotes, and the string is added with a U.
General requirements when a string is converted to a Python data type by loads, the outer layer is enclosed in single quotes, with the element key and value enclosed in double quotes.


    • Sort_keys: Sort by key


The difference between dump and dumps


    • dumps(obj, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)Dump an object into a file, dump needs a parameter similar to a file pointer (not a real pointer, which can be called a class file object), can be combined with the file operation, that is, you can convert the dict to str into the file, asjson.dump(all_house,f,ensure_ascii=False,sort_keys=True, indent=4)in thefA handle to a JSON file that represents a data to be written to;
    • dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw), and dumps (str) directly to STR, that is, directly to the dictionary to STR, without writing to the file, like a data format conversion method, the Python string into a JSON dictionary.
    • So dumps is converting dict into str format, loads is converting str into dict format.
      Dump and load are similar functions, just combined with file operations.


(3) Chinese deposit txt










f=open('net_saving_data.txt','w',encoding='utf-8');
for item in all_house:
    # house_area=item['house_area'];
    # price=item['price'];
    output='\t'.join([str(item['house_area']),str(item['price']),str(item['build_year']),str(item['house_title'])]);
    f.write(output);
    f.write('\n');
f.close();


    • In the 2.7.15 version of Python, you are prompted with an errorTypeError: 'encoding' is an invalid keyword argument for this functionand cannot pass in the encoding parameter, but in the 3.7 version you can pass in the encoding= ' utf-8 ' parameter to write the TXT in Chinese.
!! NOTE
    • The Chinese write txt, json file is nothing more than the open () file, need to add utf-8,Dump (), need to add ensure_ascii=false, prevent ASCII encoding, But just started because the Python version is 2.7.15, not 3.7, causing the storage to be unsuccessful, always thought to be a problem with the code. So the final discovery is the version of the problem, but also quite hurt. There are many questions about Chinese on the internet, but they do not emphasize the Python version of the problem!!! The other 3.XX versions have not been tried.
    • Reading the Web page data, viewing the page charset, and Chardet Library of the encoding type of query, timely decode and encode encoding conversion, should be able to avoid a lot of coding problems. The rest of the pit stepped up and mended it.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.