Environment: Windows 10 Home (Chinese edition), Python 3.6.4, virtualenv 16.0.0, Scrapy 1.5.0.
I created a crawler project with the Scrapy command-line tool (scrapy startproject) and generated a spider with scrapy genspider to crawl the headlines and links on the home page of a Chinese portal site. The whole process was carried out in a virtual environment (virtualenv).
When I ran the spider with scrapy crawl and exported the results to a JSON file, the command-line window displayed the news headlines in Chinese, but in the exported JSON file the titles appeared as Unicode escape sequences beginning with \u.
This is not what I wanted: the contents of the file need to be displayed in Chinese.
I tried encode('utf-8'), encode('GBK') and so on, but the output did not change, and the crawler stopped running normally.
The problem was only resolved when I found an article by the blogger "Watermelon melon": add the setting -s FEED_EXPORT_ENCODING=utf-8 when running scrapy crawl.
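The official documentation describes the FEED_EXPORT_ENCODING setting as follows: if it is unset or None (the default), UTF-8 is used for every feed format except JSON, which uses a safe numeric encoding (\uXXXX sequences) for historic reasons; set it to utf-8 to get UTF-8 in JSON output as well.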
Looking at the help for the scrapy crawl command, the -o FILE option is described with the word "dump". The same word appears in the json module, where non-ASCII characters dumped to a file are by default converted to escapes beginning with \u. However, the help message does not say how to change this behavior.
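The escaping behavior of the json module can be reproduced without Scrapy. The following is a minimal sketch, assuming a made-up item (the title and link are hypothetical, not taken from the actual crawl); it only illustrates the standard library's ensure_ascii switch and is not a claim about how Scrapy's exporter is written.

```python
import json

# Hypothetical item; the real spider yields headlines and links from the portal page.
item = {"title": "中文标题", "link": "http://example.com/news/1"}

# Default: ensure_ascii=True, so every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings describe the same data; only the way the characters are written differs.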
It was not until today (the 30th) that I read Scrapy's settings documentation and understood this issue more thoroughly:
Adding the FEED_EXPORT_ENCODING setting to the spider or to the crawler project solves the problem. Settings can be given on the command line (highest priority), per spider, at the project level, or left at their defaults. Configure it in any one of these places and files written with -o will be encoded according to it.
In Watermelon melon's blog post, it is set on the command line, which has the highest priority (method one).
Today I tried setting it in the crawler project's configuration file, settings.py, and got the desired result (method two). With this in place, the FEED_EXPORT_ENCODING option no longer needs to be added on the command line.
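For method two, the change amounts to a single line. A minimal sketch of the relevant part of settings.py, assuming the rest of the generated file is left untouched:

```python
# settings.py of the Scrapy project (only the added line is shown)

# Export feeds, including JSON written via -o, as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"
```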
Of course, there are also methods three, four and five. I will not go into them here; read the Scrapy settings documentation carefully for the details.
Postscript
What is the best practice for setting this encoding? At which level should it be set? Does it need to be set at all?
In the example in this article, even without the setting, a Python program that reads the exported file and parses it with json.loads still gets the correct content.
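This can be checked with a short sketch. The file content below is a hypothetical stand-in for an export made without the setting (pure ASCII with \u escapes), and the file name news.json is an assumption.

```python
import json
import os
import tempfile

# Stand-in for a file exported without FEED_EXPORT_ENCODING: ASCII only, with \u escapes.
exported_text = '[{"title": "\\u4e2d\\u6587"}]'

path = os.path.join(tempfile.mkdtemp(), "news.json")
with open(path, "w", encoding="ascii") as f:
    f.write(exported_text)

# json.load decodes the escapes back into the original characters.
with open(path, encoding="utf-8") as f:
    items = json.load(f)

title = items[0]["title"]
print(len(title))
```

So the escaped file is only hard to read for a person; a program consuming it loses nothing.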
How can this be solved in code instead of through a setting? With the command-line -o option it cannot be; the only way is to open the file yourself and store the content in a way that keeps the Chinese text.
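One way to "open the file yourself" is sketched below with the standard library only. The function name save_items, the sample item and the output file name are all hypothetical; in a real project this logic would need to be hooked into the crawl somewhere (for example an item pipeline), which is not shown here.

```python
import json
import os
import tempfile


def save_items(items, path):
    """Write items as JSON, keeping non-ASCII characters readable."""
    # Open the file as UTF-8 and tell json not to escape non-ASCII characters.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


# Hypothetical scraped data standing in for the spider's output.
items = [{"title": "中文标题", "link": "http://example.com/news/1"}]

out_path = os.path.join(tempfile.mkdtemp(), "news_utf8.json")
save_items(items, out_path)

# Read the raw text back to confirm the characters were stored unescaped.
with open(out_path, encoding="utf-8") as f:
    raw = f.read()
```

Compared with setting FEED_EXPORT_ENCODING, this is more code to maintain, so it probably only makes sense when the export needs custom handling anyway.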
During testing, I also tried writing the setting into the project's scrapy.cfg, which, of course, does not work.
Encoding settings when you use the Scrapy command-line tool to "Export JSON files"