Python 2 has an encoding inconsistency problem with Chinese strings.
Inspecting two strings that print identically shows that they are stored differently: one as Unicode code points, the other as raw bytes.
Example:
In [43]: ucontent = u'丽江旅游(sz002033)#股票##炒股##财经##理财##投资#推荐包赢股，盈利对半分成，不算本金，群：46251412'
In [44]: ucontent
Out[44]: u'\u4e3d\u6c5f\u65c5\u6e38(sz002033)#\u80a1\u7968##\u7092\u80a1##\u8d22\u7ecf##\u7406\u8d22##\u6295\u8d44#\u63a8\u8350\u5305\u8d62\u80a1\uff0c\u76c8\u5229\u5bf9\u534a\u5206\u6210\uff0c\u4e0d\u7b97\u672c\u91d1\uff0c\u7fa4\uff1a46251412'
In [45]: content
Out[45]: '\xe4\xb8\xbd\xe6\xb1\x9f\xe6\x97\x85\xe6\xb8\xb8(sz002033)#\xe8\x82\xa1\xe7\xa5\xa8##\xe7\x82\x92\xe8\x82\xa1##\xe8\xb4\xa2\xe7\xbb\x8f##\xe7\x90\x86\xe8\xb4\xa2##\xe6\x8a\x95\xe8\xb5\x84#\xe6\x8e\xa8\xe8\x8d\x90\xe5\x8c\x85\xe8\xb5\xa2\xe8\x82\xa1\xef\xbc\x8c\xe7\x9b\x88\xe5\x88\xa9\xe5\xaf\xb9\xe5\x8d\x8a\xe5\x88\x86\xe6\x88\x90\xef\xbc\x8c\xe4\xb8\x8d\xe7\xae\x97\xe6\x9c\xac\xe9\x87\x91\xef\xbc\x8c\xe7\xbe\xa4\xef\xbc\x9a46251412'
In [46]: print content
丽江旅游(sz002033)#股票##炒股##财经##理财##投资#推荐包赢股，盈利对半分成，不算本金，群：46251412
In [47]:
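The relationship between the two values above can be checked directly. The following is a minimal sketch, assuming the byte string is UTF-8 (which the \xe4\xb8\xbd... pattern suggests); it uses only the first four characters of the sample, written as escapes so it runs on both Python 2.7 and Python 3.

```python
# Short sample taken from the start of the string shown in the session.
ucontent = u'\u4e3d\u6c5f\u65c5\u6e38'                          # text: 4 code points
content = b'\xe4\xb8\xbd\xe6\xb1\x9f\xe6\x97\x85\xe6\xb8\xb8'   # the same text as UTF-8 bytes

# Encoding the text as UTF-8 yields exactly the byte string.
print(ucontent.encode('utf-8') == content)

# The lengths differ because each of these characters takes 3 bytes in UTF-8.
print(len(ucontent))
print(len(content))
```

So the two objects hold the same text; one stores code points and the other stores one particular byte encoding of them.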
Workaround:
Convert the ordinary byte string (str) into a unicode object by decoding it with its actual encoding, here UTF-8.
Example:
In [47]: newcontent = unicode(content, "utf8")
In [48]: newcontent
Out[48]: u'\u4e3d\u6c5f\u65c5\u6e38(sz002033)#\u80a1\u7968##\u7092\u80a1##\u8d22\u7ecf##\u7406\u8d22##\u6295\u8d44#\u63a8\u8350\u5305\u8d62\u80a1\uff0c\u76c8\u5229\u5bf9\u534a\u5206\u6210\uff0c\u4e0d\u7b97\u672c\u91d1\uff0c\u7fa4\uff1a46251412'
In [49]:
This solves the problem.
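The same conversion can also be written with the decode method of the byte string, which in Python 2 does the same job as unicode(content, "utf8") and also works on Python 3, where the unicode builtin no longer exists. A minimal sketch, again assuming the bytes are UTF-8 and using a shortened sample:

```python
content = b'\xe4\xb8\xbd\xe6\xb1\x9f'   # UTF-8 bytes for the first two characters

# Decode with the encoding the bytes were actually written in.
newcontent = content.decode('utf-8')

# Round trip: encoding the result again gives back the original bytes.
print(newcontent.encode('utf-8') == content)
print(repr(newcontent))
```

Decoding with the wrong encoding either raises UnicodeDecodeError or silently produces the wrong characters, so the encoding name has to match the data.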
Unresolved:
I still do not know how to find out which encoding the original ordinary string uses. Checking with type() only shows the basic difference between the two kinds of string.
Example:
In [49]: type(content)
Out[49]: str
In [50]: type(ucontent)
Out[50]: unicode
In [51]: type(newcontent)
Out[51]: unicode
In [52]:
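A byte string does not record which encoding produced it, so type() cannot answer the open question. One common heuristic is to try a short list of likely encodings in order and keep the first that decodes without error. The sketch below is only an illustration: guess_decode is a hypothetical helper, and the candidate list (UTF-8 first, then GB18030) is an assumption suited to Chinese text, not something taken from the original post.

```python
def guess_decode(raw, candidates=('utf-8', 'gb18030')):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: a successful decode is evidence, not proof,
    that the guessed encoding is the right one.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


# The same two characters, encoded two different ways.
utf8_bytes = u'\u4e3d\u6c5f'.encode('utf-8')
gb_bytes = u'\u4e3d\u6c5f'.encode('gb18030')

text1, enc1 = guess_decode(utf8_bytes)
text2, enc2 = guess_decode(gb_bytes)
print(enc1)
print(enc2)
```

UTF-8 is tried first because its decoder is strict and rejects most non-UTF-8 input, whereas GB18030 accepts many arbitrary byte sequences. The result remains a guess; if the source of the data documents its encoding, that information should take precedence.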
'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
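The post does not show which statement produced this message. A frequent cause in Python 2 is that a str containing non-ASCII bytes gets combined with a unicode object, and Python implicitly decodes the bytes with the ascii codec. The sketch below reproduces the same message by decoding explicitly with ascii; the sample bytes (a full-width comma, which starts with 0xef in UTF-8) are an assumption chosen to match the byte in the message.

```python
raw = b'\xef\xbc\x8c'   # UTF-8 bytes of the full-width comma U+FF0C

try:
    # The ascii codec only accepts bytes below 0x80, so this fails on 0xef.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Naming the real encoding avoids the error.
fixed = raw.decode('utf-8')
print(repr(fixed))
```

Decoding every incoming byte string explicitly, as in the workaround above, keeps Python from falling back to the ascii codec.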