Crawled item:
2017-10-16 18:17:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.huxiu.com/v2_action/article_list>
{'author': u'\u5546\u4e1a\u8bc4\u8bba\u7cbe\u9009\xa9', 'cmt': 5, 'fav': 194, 'time': u'4\u5929\u524d', 'title': u'\u96f7\u519b\u8c08\u5c0f\u7c73\u201c\u65b0\u96f6\u552e\u201d\uff1a\u50cfzara\u4e00\u6837\u5f00\u5e97\uff0c\u8981\u505a\u5f97\u6bd4costco\u66f4\u597d', 'url': u'/article/217755.html'}
Write to a JSON Lines (.jl) file:
{"title":"\U8FD9\U4E00\U5468\UFF1A\U8D2B\U7A77\U66B4\U51FB","URL":"/article/217997.html","author":"\u864e\u55c5","Fav": 8," Time":"2\u5929\u524d","CMT": 5}{"title":"\u502a\u840d\u8001\u516c\u7684\u65b0\u620f\u6251\u8857\u4e86\uff0c\u9ec4\u6e24\u6301\u80a1\u7684\u516c\ U53f8\u8981\u8d54\u60e8\u4e86","URL":"/article/217977.html","author":"\u5a31\u4e50\u8d44\u672c\u8bba","Fav": 5," Time":"2\u5929\u524d","CMT": 3}
Each item is serialized to a str; with the default ensure_ascii=True, non-ASCII characters are converted to '\uXXXX' escapes, and each '{...}' unit is written to the file as one line.
Note: confirm the result by opening the .jl file with Chrome or Notepad++. Opening it in Firefox may show garbled Chinese, in which case you need to specify the encoding manually.
{"title":"this week: A critical hit of poverty","URL":"/article/217997.html","author":"Tiger Sniffing","Fav": 8," Time":"2 days ago","CMT": 5}{"title":"ni ping Husband's new play on the street, Huang Bo holding the company to compensate miserably","URL":"/article/217977.html","author":"Entertainment Capital","Fav": 5," Time":"2 days ago","CMT": 3}
Resources
Scrapy crawls Chinese content but saves it to the JSON file as Unicode escapes; how to resolve this.
import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        # codecs.open returns a writer that encodes unicode chunks as UTF-8
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters instead of '\uXXXX' escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
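For this pipeline to take effect it also has to be enabled in the project's settings.py. A minimal sketch, assuming the project module is called myproject (use your own module name and priority):

ITEM_PIPELINES = {
    'myproject.pipelines.JsonWithEncodingPipeline': 300,
}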
Scrapy output Chinese: save it as actual Chinese characters.
The Scrapy crawler framework crawls Chinese results as Unicode encoding; how to convert them to UTF-8 encoding.
Lidashuang/imax-spider
The above is essentially the pipeline example from the official documentation, except that it additionally specifies ensure_ascii=False when writing items to a JSON file.
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line, serialized in JSON format:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # additionally specify ensure_ascii=False here to keep Chinese
        self.file.write(line)
        return item
Note
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
A better solution:
Scrapy uses the item exporter to export Chinese to a JSON file, but the content comes out as Unicode escapes; how to output it as Chinese?
http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence mentions setting the JSONEncoder ensure_ascii parameter to False.
And Scrapy's item exporter documentation mentions:
The additional constructor arguments are passed to the BaseItemExporter constructor, and the leftover arguments to the JSONEncoder constructor, so you can use any JSONEncoder constructor argument to customize this exporter.
So when calling scrapy.contrib.exporter.JsonItemExporter you can specify ensure_ascii=False.
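A minimal sketch of driving the exporter by hand with that argument (the file name is a placeholder, and in newer Scrapy versions the import path is scrapy.exporters instead of scrapy.contrib.exporter):

from scrapy.contrib.exporter import JsonItemExporter

f = open('items_cn.json', 'wb')
exporter = JsonItemExporter(f, ensure_ascii=False)  # leftover kwarg is passed on to JSONEncoder
exporter.start_exporting()
# call exporter.export_item(item) for each scraped item
exporter.finish_exporting()
f.close()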
Based on the above, combined with the official documentation and the source code, the direct solutions are:
1. Add FEED_EXPORT_ENCODING = 'utf-8' to the project's settings.py, or
2. Pass it on the command line:
G:\pydata\pycode\scrapy\huxiu_com>scrapy crawl -o new.jl -s FEED_EXPORT_ENCODING='utf-8' huxiu
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-encoding
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed.
If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
Use utf-8 if you want UTF-8 for JSON too.
In [615]: json.dump?
Signature: json.dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
Docstring: Serialize ``obj`` as a JSON formatted stream to ``fp`` (a ``.write()``-supporting file-like object).

If ``ensure_ascii`` is true (the default), all non-ASCII characters in the output are escaped with ``\uXXXX`` sequences, and the result is a ``str`` instance consisting of ASCII characters only. If ``ensure_ascii`` is false, some chunks written to ``fp`` may be ``unicode`` instances. This usually happens because the input contains unicode strings or the ``encoding`` parameter is used. Unless ``fp.write()`` explicitly understands ``unicode`` (as in ``codecs.getwriter()``) this is likely to cause an error.
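The last sentence of that docstring is exactly why the first pipeline above used codecs.open(): with ensure_ascii=False under Python 2, json.dump may pass unicode chunks to fp.write(), so the file object must know how to encode them. A minimal sketch (Python 2; the file name is arbitrary):

# -*- coding: utf-8 -*-
import codecs
import json

with codecs.open('out.json', 'w', encoding='utf-8') as fp:
    # fp.write() accepts unicode and encodes it as UTF-8, so this is safe
    json.dump({u'author': u'\u864e\u55c5'}, fp, ensure_ascii=False)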
C:\Program Files\Anaconda2\lib\site-packages\scrapy\exporters.py (excerpt):
class JsonLinesItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class JsonItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class XmlItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        if not self.encoding:
            self.encoding = 'utf-8'
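This is why FEED_EXPORT_ENCODING works: as soon as an encoding is set, not self.encoding evaluates to False, so ensure_ascii defaults to False and real characters are written. A toy sketch of that setdefault logic (variable names are illustrative):

kwargs = {}
encoding = 'utf-8'                               # e.g. taken from FEED_EXPORT_ENCODING
kwargs.setdefault('ensure_ascii', not encoding)  # not 'utf-8' -> False
print(kwargs)                                    # {'ensure_ascii': False}: Chinese written as-is

kwargs = {}
encoding = None                                  # the historic default
kwargs.setdefault('ensure_ascii', not encoding)  # not None -> True
print(kwargs)                                    # {'ensure_ascii': True}: '\uXXXX' escapes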
Scrapy practice issue 1: Chinese written to a JSON file appears as Unicode '\uXXXX' escapes.