Handling Chinese text in Python is still fairly troublesome. A UTF-8 character is 1 to 6 bytes long (the original UTF-8 scheme allows up to 6 bytes; RFC 3629 limits valid sequences to 4, and most Chinese characters take 3), so cutting a string at an arbitrary byte offset can split a character and produce garbled output. The function below takes a fixed-length prefix of a UTF-8 encoded string without splitting a character. ord(char) converts a byte to an integer; from the value of each leading byte, the UTF-8 encoding rules tell us how many bytes that character occupies, so the cut always falls on a character boundary.
Parameters:
string: a UTF-8 encoded string. If it uses another character encoding, convert it to UTF-8 first (we recommend UTF-8 for all strings and files)
length: maximum number of bytes (not the number of Chinese characters)
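To make the truncation problem concrete, here is a minimal sketch. It assumes Python 3, where encoded text is a bytes object, while the function in this article is written for Python 2 byte strings. The helper name decodes_cleanly is hypothetical and exists only for this illustration.

```python
def decodes_cleanly(data: bytes) -> bool:
    """Return True if data is complete, valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


encoded = "中文".encode("utf-8")   # two characters, three bytes each
print(len(encoded))                 # 6
print(decodes_cleanly(encoded[:3])) # True: the cut falls on a character boundary
print(decodes_cleanly(encoded[:4])) # False: the second character is split
```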
A digression:
Python has several functions for converting between character encodings: unicode(str, 'charset'), str.decode('charset'), and str.encode('charset').
For example, to convert GB2312 to GBK:
str = unicode(str, 'gb2312')  # convert to Unicode
str.encode('gbk')  # convert to GBK
On Linux, you can use iconv -f gb2312 -t gbk sourcefile > targetfile for the conversion.
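For readers on Python 3, where unicode() no longer exists, the same GB2312 to GBK conversion might look like the sketch below. The function name transcode and its parameters are assumptions made for this example; only bytes.decode and str.encode are standard calls.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from one charset to another (hypothetical helper)."""
    text = data.decode(source)   # bytes -> str (Unicode)
    return text.encode(target)   # str -> bytes in the target charset


gb2312_bytes = "中文".encode("gb2312")
gbk_bytes = transcode(gb2312_bytes, "gb2312", "gbk")
print(gbk_bytes.decode("gbk"))
```

Decoding first and encoding second mirrors the two-step unicode()/encode() sequence above: Unicode text is the intermediate form in both cases.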
def substring(string, length):
    # string: UTF-8 encoded byte string; length: maximum number of bytes
    if length >= len(string):
        return string
    i = 0  # end of the last whole character that fits
    p = 0  # end of the character currently being examined
    while True:
        ch = ord(string[i])
        # 1111110x
        if ch >= 252:
            p = p + 6
        # 111110xx
        elif ch >= 248:
            p = p + 5
        # 11110xxx
        elif ch >= 240:
            p = p + 4
        # 1110xxxx
        elif ch >= 224:
            p = p + 3
        # 110xxxxx
        elif ch >= 192:
            p = p + 2
        else:
            p = p + 1

        # stop only when the character would run past the limit,
        # so a character that ends exactly at length is kept
        if p > length:
            break
        else:
            i = p
    return string[0:i]
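As a point of comparison, here is a sketch of a different technique for the same goal, assuming Python 3 bytes and well-formed UTF-8 input. Instead of scanning forward over leading bytes as the function above does, it starts at the byte limit and backs up past continuation bytes (those of the form 10xxxxxx). The name truncate_utf8 is hypothetical.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Return the longest prefix of data, at most max_bytes long,
    that does not end in the middle of a UTF-8 character.
    Assumes data is well-formed UTF-8 (hypothetical helper)."""
    if max_bytes >= len(data):
        return data
    end = max(max_bytes, 0)
    # If the byte at the cut point is a continuation byte (10xxxxxx),
    # the cut would split a character, so move back to its leading byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "中国".encode("utf-8")                   # 6 bytes
print(truncate_utf8(encoded, 4).decode("utf-8"))  # 中
```

The backward scan touches at most a few bytes near the cut, whereas the forward scan walks the whole prefix; both return the same prefix for valid input.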
Postscript:
Later, I found a simpler method.
str = '中国'
str.decode('utf-8')[0:1].encode('utf-8')
First decode to Unicode, then take the substring (here the slice counts characters, not bytes), and then encode back to UTF-8.
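A Python 3 sketch of the decode, slice, encode idea follows, together with a byte-budget variant. Both helper names (first_chars and prefix_within) are hypothetical. The second relies on the standard errors='ignore' handler to drop an incomplete trailing sequence; note that this handler would also silently drop any other invalid bytes in the input, so it suits input already known to be valid UTF-8.

```python
def first_chars(data: bytes, count: int) -> bytes:
    """Keep the first `count` characters (not bytes) of UTF-8 data."""
    return data.decode("utf-8")[:count].encode("utf-8")


def prefix_within(data: bytes, max_bytes: int) -> bytes:
    """Keep at most max_bytes bytes without ending mid-character.
    Slicing may split the last character; errors='ignore' discards
    the incomplete tail when decoding."""
    text = data[:max_bytes].decode("utf-8", errors="ignore")
    return text.encode("utf-8")


encoded = "中国".encode("utf-8")
print(first_chars(encoded, 1).decode("utf-8"))    # 中
print(prefix_within(encoded, 4).decode("utf-8"))  # 中
```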