Recently I have been crawling news sites for text analysis, using crawler code downloaded from the Internet. The source I used is:
https://jooop.github.io/2017/01/29/python3%E7%BD%91%E6%98%93%E7%88%AC%E8%99%AB/#1-%e6%a8%a1%e5%9d%97%e7%9a%84%e9%80%89%e6%8b%a9%e5%92%8c%e5%88%97%e8%a1%a8%e9%a1%b5%e9%9d%a2%e7%9a%84%e7%88%ac%e5%8f%96%ef%bc%9a
It can be used directly with Python 2.7 + MySQL 5.6 + Windows 7 + PyCharm (IDE).
Because the crawler stores Chinese text in a MySQL database, I ran into garbled Chinese along the way: Chinese stored in the database did not display correctly, and characters printed from the Python side did not display as Chinese either.
In the end this turned out to be an encoding problem, so I am writing it down as a note. Anyone who works with data can hardly avoid encodings ....
First, the encoding problem on the Python side:
In Python 2 (Python 2.6, Python 2.7, and so on), strings come in two types, str and unicode. The encoding of a str literal is determined by the encoding of the source file, which nowadays is almost always UTF-8, so declare the encoding at the top of the .py file: # -*- coding: utf-8 -*-
Inside a Python program, strings are normally handled as Unicode, which is an in-memory representation. To store such a string in a file or a log, the Unicode string has to be converted (encoded) into a storage encoding for a specific character set.
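As a rough illustration of that conversion, here is a minimal sketch that encodes a Unicode string to UTF-8 when writing a file and decodes it again when reading. The file name note.txt and the temporary directory are assumptions made for the example, not part of the crawler. The syntax is chosen so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"中文"  # Unicode string held in memory

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Writing: io.open encodes the Unicode string to UTF-8 bytes.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading in binary mode shows the bytes that were actually stored.
with io.open(path, "rb") as f:
    raw = f.read()

# Reading with the same encoding decodes the bytes back to Unicode.
with io.open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(repr(raw))
print(restored == text)
```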
What are Unicode and UTF-8, and how are they connected?
Unicode (also rendered as Uniform Code, Universal Code, or Single Code) is an industry standard in computer science that covers a character set, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encodings: it assigns a single, unique code to every character of every language, so that text can be converted and processed across languages and platforms. As long as a computer supports Unicode, a file saved in a Unicode encoding can be interpreted correctly on other computers, whatever script it is written in.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and it is also a prefix code. It stores each character in one to four bytes.
In a nutshell, Unicode is a concept, and UTF-8 is one concrete implementation of that concept. (An analogy: the boss says we are going to build a big data architecture. That is the concept, and the boss does not know what the implementation will be; this corresponds to Unicode. The programmer then sets up a Hadoop architecture, which is an implementation of the big data architecture; this corresponds to UTF-8.)
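To make "variable-length" concrete, the following small sketch counts how many bytes UTF-8 needs for a few sample characters. The sample characters are my own choice for illustration.

```python
# -*- coding: utf-8 -*-
# Each entry: (Unicode code point label, the character itself).
samples = [
    ("U+0041", u"A"),            # Latin letter
    ("U+00E9", u"\u00e9"),       # e with acute accent
    ("U+4E2D", u"\u4e2d"),       # the Chinese character for "middle"
    ("U+1F600", u"\U0001F600"),  # an emoji outside the basic plane
]

# Number of bytes UTF-8 uses for each character.
byte_counts = [len(ch.encode("utf-8")) for _, ch in samples]

for (label, _), count in zip(samples, byte_counts):
    print("%s -> %d byte(s) in UTF-8" % (label, count))
```

The code point is the Unicode side of the story (one number per character), while the byte count is specific to the UTF-8 encoding.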
Here are some experiments in Python:
EX1:
IN[10]: "中文"
OUT[10]:
'\xe4\xb8\xad\xe6\x96\x87'
In this example, entering Chinese directly displays the byte values of its UTF-8 encoding.
- \x: marks a hexadecimal escape; the two hex digits that follow give the value of a single byte;
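The relationship between the \x escapes and the underlying bytes can be checked with a short sketch like the one below. It starts from the same six bytes shown in EX1 and assumes they are UTF-8.

```python
# -*- coding: utf-8 -*-
raw = b"\xe4\xb8\xad\xe6\x96\x87"  # the bytes displayed in EX1

# Each \xNN escape is exactly one byte; list their numeric values.
byte_values = list(bytearray(raw))

# The same bytes written back as two-digit hexadecimal pairs.
hex_pairs = ["%02x" % b for b in bytearray(raw)]

# Interpreting the six bytes as UTF-8 yields two characters.
text = raw.decode("utf-8")

print(byte_values)
print(hex_pairs)
print(len(raw), len(text))
```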
EX2: To get the Chinese itself printed, you have to put print in front of it. I couldn't help wondering why. My understanding is that without print, Python does not assume you want display output; it simply echoes the data, taking the lazy route of showing how the value is encoded in the computer (its repr). With print, it understands that you want the text printed and outputs it according to what the bytes actually represent. The premise is that your system (terminal) uses UTF-8. If it does not, the printed output is garbled. To change the encoding format, see
IN[11]: print "中文"
中文
IN[12]: sys.getdefaultencoding()
OUT[12]:
'utf-8'
Answer: how print displays a value
Figure 1. The print display process
When you call print on a variable var in Python 2.7, the handling depends on the type of var. If var is of type str, its bytes are delivered directly to the terminal for display. If var is of type unicode, it is first encoded into an object of type str (the encoding depends on the encoding of stdout), which is then handed to the terminal. If the encoding of the str does not match the encoding the terminal is set to, the display is likely to be garbled.
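The mismatch described above can be imitated without a real terminal. In this sketch the "terminal" is simulated by decoding the bytes with a chosen encoding; picking GBK as the mismatched encoding is an assumption for illustration (it is a common default on Chinese Windows consoles).

```python
# -*- coding: utf-8 -*-
import sys

text = u"中文"

# What print would hand to the terminal if stdout were UTF-8.
utf8_bytes = text.encode("utf-8")

# A terminal that also expects UTF-8 recovers the original text.
as_utf8 = utf8_bytes.decode("utf-8")

# A terminal that expects GBK interprets the same bytes differently.
# "replace" keeps the sketch from raising on invalid byte sequences.
as_gbk = utf8_bytes.decode("gbk", "replace")

# The encoding Python believes stdout uses (may be None when piped).
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("stdout encoding:", stdout_encoding)
print("UTF-8 terminal shows the original text:", as_utf8 == text)
print("GBK terminal shows the original text:", as_gbk == text)
```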
EX3: Handling Chinese in a list or dictionary
data = {"A": "Hello", "B": "中国"} # assume UTF-8 encoding
If we now output data directly with print, or convert data to a string with the str function, the Chinese appears as escaped bytes, for example:
>>> data = {"A": "Hello", "B": "中国"}
>>> print data
{'A': 'Hello', 'B': '\xd6\xd0\xb9\xfa'}
Printing a Chinese field on its own is no problem, for example:
>>> print data['B']
中国
If you want the whole dictionary to print normally, you can use the dumps method of the json package, for example:
>>> import json
>>> data = {"A": "Hello", "B": "中国"}
>>> s = json.dumps(data, ensure_ascii=False)
>>> print s
{"A": "Hello", "B": "中国"}
>>> print isinstance(s, str)
True
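For comparison, this sketch puts the default behaviour of json.dumps next to ensure_ascii=False. It uses Unicode strings and sort_keys=True (my addition, to make the key order predictable), and is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import json

data = {u"A": u"Hello", u"B": u"中国"}

# Default: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(data, sort_keys=True)

# ensure_ascii=False keeps the Chinese characters as they are.
readable = json.dumps(data, ensure_ascii=False, sort_keys=True)

print(escaped)
print("same text as the escaped form:", readable == escaped)
print("same data after parsing:", json.loads(escaped) == json.loads(readable))
```

Both strings parse back to the same dictionary; only the way the text is written out differs.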
Next, how to store this data correctly in MySQL.
First, MySQL has to support UTF-8 storage. Set this in the my.ini configuration file in the MySQL installation directory:
[client]
#password = your_password
port=3306
socket=/tmp/mysql.sock
default-character-set=utf8
[mysqld]
port=3306
character-set-server=utf8
collation-server=utf8_general_ci
Second, make sure that the tables and columns you created also use UTF-8 by default. For how to view and change this, see this page: https://www.cnblogs.com/wcwen1990/p/6917109.html
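As a rough sketch of what checking and changing the character set can look like, the helper below builds a few standard MySQL statements as strings. The function name charset_statements and the table name news are hypothetical, and the statements are not taken from the linked page. Nothing is sent to a database here; the strings would be run through whatever MySQL client you use.

```python
import re


def charset_statements(table, charset="utf8", collation="utf8_general_ci"):
    """Return SQL strings for inspecting and converting a table's character set."""
    # Only allow simple identifiers, since they are pasted into the SQL text.
    for name in (table, charset, collation):
        if not re.match(r"^[A-Za-z0-9_]+$", name):
            raise ValueError("unexpected characters in %r" % name)
    return [
        # Server and connection character set variables.
        "SHOW VARIABLES LIKE 'character_set%'",
        # Table definition, including its default character set.
        "SHOW CREATE TABLE `%s`" % table,
        # Per-column collation.
        "SHOW FULL COLUMNS FROM `%s`" % table,
        # Convert the table and its text columns to the target character set.
        "ALTER TABLE `%s` CONVERT TO CHARACTER SET %s COLLATE %s"
        % (table, charset, collation),
    ]


statements = charset_statements("news")  # "news" is a hypothetical table name
for sql in statements:
    print(sql)
```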
Then, to store the characters themselves rather than their escaped representation, you can use str.decode("unicode_escape").
Without the decode conversion, what got stored was the Unicode escape sequences as-is.
After adding decode("unicode_escape"), what gets stored is the normal text.
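What decode("unicode_escape") does can be seen in a small sketch. The input value here is an assumed example of escaped text (the kind of string that ends up in the database when the escapes are stored literally), not data from the crawler.

```python
# -*- coding: utf-8 -*-
# Twelve plain ASCII bytes: a backslash, "u", and four hex digits, twice.
escaped = b"\\u4e2d\\u56fd"

# unicode_escape interprets the \uXXXX sequences as real characters.
text = escaped.decode("unicode_escape")

print(len(escaped), len(text))
print(text == u"\u4e2d\u56fd")
```

The escaped form is 12 bytes of ASCII, while the decoded result is the two characters themselves, which is what should be handed to the database.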