Scrapy Practice Issue 1: Unicode Chinese written to a JSON file appears as '\uxxxx'

Source: Internet
Author: User

Crawled item:

2017-10-16 18:17:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.huxiu.com/v2_action/article_list>
{'author': u'\u5546\u4e1a\u8bc4\u8bba\u7cbe\u9009\xa9', 'cmt': 5, 'fav': 194, 'time': u'4\u5929\u524d', 'title': u'\u96f7\u519b\u8c08\u5c0f\u7c73\u201c\u65b0\u96f6\u552e\u201d\uff1a\u50cfzara\u4e00\u6837\u5f00\u5e97\uff0c\u8981\u505a\u5f97\u6bd4costco\u66f4\u597d', 'url': u'/article/217755.html'}

Written to the JSON Lines (.jl) file:

{"title": "\u8fd9\u4e00\u5468\uff1a\u8d2b\u7a77\u66b4\u51fb", "url": "/article/217997.html", "author": "\u864e\u55c5", "fav": 8, "time": "2\u5929\u524d", "cmt": 5}
{"title": "\u502a\u840d\u8001\u516c\u7684\u65b0\u620f\u6251\u8857\u4e86\uff0c\u9ec4\u6e24\u6301\u80a1\u7684\u516c\u53f8\u8981\u8d54\u60e8\u4e86", "url": "/article/217977.html", "author": "\u5a31\u4e50\u8d44\u672c\u8bba", "fav": 5, "time": "2\u5929\u524d", "cmt": 3}

Each item is serialized to a str with the default ensure_ascii=True, so non-ASCII characters are escaped as '\uxxxx', and each '{...}' unit is written to the file.
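The escaping described above can be reproduced with the standard json module alone (a minimal sketch; the item is abbreviated from the scraped output, where \u864e\u55c5 is the author name 虎嗅):

```python
import json

# Abbreviated item with Chinese text, as in the scrape above.
item = {"author": u"\u864e\u55c5", "cmt": 5}  # author is the Chinese name 虎嗅

escaped = json.dumps(item, sort_keys=True)                      # default ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False, sort_keys=True)

print(escaped)   # {"author": "\u864e\u55c5", "cmt": 5}
print(readable)  # {"author": "虎嗅", "cmt": 5}
```

Both strings decode to the same data; only the on-disk representation differs.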

Goal (note: confirm the final result by opening the file with Chrome or Notepad++; opening the .jl file in Firefox may show the Chinese as garbled text, and you then need to specify the encoding manually):

{"title": "This week: a critical hit of poverty", "url": "/article/217997.html", "author": "Huxiu", "fav": 8, "time": "2 days ago", "cmt": 5}
{"title": "Ni Ping's husband's new play flopped, and the company Huang Bo holds shares in will pay dearly", "url": "/article/217977.html", "author": "Entertainment Capital", "fav": 5, "time": "2 days ago", "cmt": 3}

Resources

Scrapy crawls Chinese content but saves it to the JSON file as Unicode escapes; how to resolve this.

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
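Since process_item only needs a dict-like item and never touches the spider argument, the pipeline can be smoke-tested outside Scrapy. A standalone sketch (the constructor is given a path here instead of the hard-coded filename, and the temporary path and sample item are invented for illustration):

```python
import codecs
import json
import os
import tempfile

class JsonWithEncodingPipeline(object):
    def __init__(self, path):
        # codecs.open encodes each unicode line to UTF-8 bytes on write.
        self.file = codecs.open(path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

path = os.path.join(tempfile.mkdtemp(), 'scraped_data_utf8.json')
pipeline = JsonWithEncodingPipeline(path)
pipeline.process_item({"author": u"\u864e\u55c5", "fav": 8}, spider=None)
pipeline.close_spider(spider=None)

with open(path, 'rb') as f:
    raw = f.read()
# raw now holds real UTF-8 bytes for 虎嗅, not \uxxxx escapes.
```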

Scrapy: output Chinese and save it as Chinese

The Scrapy crawler framework saves crawled Chinese results as Unicode escapes; how to convert them to UTF-8 encoding

Lidashuang/imax-spider

The above is essentially the pipeline example from the official documentation, with the addition of specifying ensure_ascii=False when writing items to a JSON file.

The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line, serialized in JSON format:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # additionally specify ensure_ascii=False here
        self.file.write(line)
        return item

Note

The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

A better solution:

Scrapy uses the item exporter to export Chinese to a JSON file, but the content is Unicode-escaped; how to output Chinese?

http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence mentions setting the JSONEncoder ensure_ascii parameter to False.

And Scrapy's item exporter documentation mentions:

The additional constructor arguments are passed to the
BaseItemExporter constructor, and the leftover arguments to the
JSONEncoder constructor, so you can use any JSONEncoder constructor
argument to customize this exporter.

So when calling scrapy.contrib.exporter.JsonItemExporter you can specify ensure_ascii=False.

According to the above solutions, combined with the official site and the source code, the direct fixes are:
1. Add FEED_EXPORT_ENCODING = 'utf-8' to the project's settings.py, or
2. Pass the setting on the command line:
G:\pydata\pycode\scrapy\huxiu_com>scrapy crawl -o new.jl -s FEED_EXPORT_ENCODING=utf-8 huxiu

Https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-encoding

FEED_EXPORT_ENCODING

Default: None

The encoding to be used for the feed.

If unset or set to None (the default), it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.

Use utf-8 if you want UTF-8 for JSON too.
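Worth noting: the "safe numeric encoding" is lossless. json.loads recovers the identical Chinese string from either form, so the escaped output is a readability problem, not data corruption. A quick sketch using the title string from the .jl sample above:

```python
import json

title = u"\u8fd9\u4e00\u5468\uff1a\u8d2b\u7a77\u66b4\u51fb"  # 这一周：贫穷暴击

escaped_line = json.dumps({"title": title})                   # default: \uXXXX escapes
utf8_line = json.dumps({"title": title}, ensure_ascii=False)  # real Chinese characters

# Both lines parse back to the identical unicode title.
assert json.loads(escaped_line)["title"] == title
assert json.loads(utf8_line)["title"] == title
```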

In [615]: json.dump?
Signature: json.dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
Docstring: Serialize ``obj`` as a JSON formatted stream to ``fp`` (a ``.write()``-supporting file-like object).

If ``ensure_ascii`` is true (the default), all non-ASCII characters in the output are escaped with ``\uXXXX`` sequences, and the result is a ``str`` instance consisting of ASCII characters only. If ``ensure_ascii`` is false, some chunks written to ``fp`` may be ``unicode`` instances. This usually happens because the input contains unicode strings or the ``encoding`` parameter is used. Unless ``fp.write()`` explicitly understands ``unicode`` (as in ``codecs.getwriter``) this is likely to cause an error.

C:\Program files\anaconda2\lib\site-packages\scrapy\exporters.py

class JsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class XmlItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        if not self.encoding:
            self.encoding = 'utf-8'
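The kwargs.setdefault('ensure_ascii', not self.encoding) line is what ties FEED_EXPORT_ENCODING to the escaping behaviour: once a feed encoding is set, ensure_ascii defaults to False. The logic can be isolated in plain Python (a sketch; encode_item is a hypothetical helper written for illustration, not Scrapy code):

```python
import json

def encode_item(item, encoding=None, **kwargs):
    # Mirrors JsonLinesItemExporter: escape to \uXXXX only when no feed encoding is set.
    kwargs.setdefault('ensure_ascii', not encoding)
    line = json.dumps(item, **kwargs) + "\n"
    return line.encode(encoding or 'utf-8')

print(encode_item({"author": u"\u864e\u55c5"}))                    # escaped, ASCII-only bytes
print(encode_item({"author": u"\u864e\u55c5"}, encoding='utf-8'))  # UTF-8 bytes with real Chinese
```

An explicit ensure_ascii=... passed by the caller still wins over the default, which is exactly why setdefault is used instead of plain assignment.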
