Getting the encoding of Chinese file names in Python 2
Problem:
In Python 2, file names that contain Chinese characters are returned as byte strings. If they are printed without being decoded first, garbled characters may appear.
Assume the folder under test is named test and contains five files with Chinese names (given here in English translation):
Python Performance Analysis and Optimization.pdf
Python Data Analysis and Mining in Practice.pdf
Python Programming in Practice: Creating High-Quality Programs with Design Patterns, Concurrency, and Libraries.pdf
Fluent Python.pdf
59 Effective Ways to Write High-Quality Python Code.pdf
First, print the file names exactly as they are returned, without decoding them. The code is as follows:
import os

for file in os.listdir('./test'):
    print(file)
Output garbled characters:
Python���ܷ������Ż�.pdf
Python���ݷ������ھ�ʵս.pdf
Python���ʵս���������ģʽ�������ͳ���ⴴ������������.pdf
������Python.pdf
��д������Python�����59����Ч����.pdf
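The garbling above can be reproduced without any files. The sketch below is only an illustration: the file name is hypothetical, and it assumes the bytes are GB2312 while the program displaying them interprets them as UTF-8.

```python
# A hypothetical two-character Chinese file name, written with escapes
# so that the source file itself stays pure ASCII.
name = u'\u4e2d\u6587.pdf'

# The raw bytes a GB2312 system would store for that name.
raw = name.encode('gb2312')

# Interpreting those bytes as UTF-8 does not yield valid sequences,
# so the undecodable bytes are replaced with U+FFFD.
garbled = raw.decode('utf-8', 'replace')

# Decoding with the encoding the bytes were written in restores the name.
restored = raw.decode('gb2312')

print(repr(garbled))
print(restored == name)
```

The point is that the bytes are intact; only the encoding used to interpret them is wrong.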
Solution:
First, detect the encoding of the file names. Here we use the chardet module, which can be installed with:
pip install chardet
Use the chardet.detect function to check the encoding of each file name:
{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.73, 'encoding': 'windows-1252'}
{'confidence': 0.99, 'encoding': 'GB2312'}
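The check that produced the results above is not shown in the text. A minimal sketch of what it might look like follows; the helper name detect_name_encodings and the sample name are hypothetical, and the import is guarded so the sketch still runs (reporting None) where chardet is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch then reports None instead of a guess


def detect_name_encodings(raw_names):
    """Pair each raw (byte-string) file name with chardet's guess."""
    results = []
    for raw in raw_names:
        guess = chardet.detect(raw) if chardet is not None else None
        results.append((raw, guess))
    return results


# Hypothetical sample: the GB2312 bytes of a two-character Chinese name.
sample = [u'\u4e2d\u6587.pdf'.encode('gb2312')]
for raw, guess in detect_name_encodings(sample):
    print(guess)
```

In Python 2, detect_name_encodings(os.listdir('./test')) would run the same check on the test folder, because os.listdir returns byte strings there when it is given a byte-string path.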
GB2312 is detected with the highest confidence (0.99) for four of the five names; the single windows-1252 result has a lower confidence (0.73). We therefore decode the file names as GB2312. The code is as follows:
import os

for file in os.listdir('./test'):
    # Decode the raw byte-string name as GB2312 to get a unicode string
    r = file.decode('GB2312')
    print(r)
Output:
Python Performance Analysis and Optimization.pdf
Python Data Analysis and Mining in Practice.pdf
Python Programming in Practice: Creating High-Quality Programs with Design Patterns, Concurrency, and Libraries.pdf
Fluent Python.pdf
59 Effective Ways to Write High-Quality Python Code.pdf
After decoding, the file names are printed correctly.
PS: the longer the string passed to chardet.detect, the more accurate the detection; the shorter the string, the less accurate it is.
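One way to see why a short name is hard to classify, as with the windows-1252 result above, is that the same bytes can be valid in more than one encoding. The sketch below uses a hypothetical name and illustrates only that ambiguity; it is not a description of how chardet works internally.

```python
# Hypothetical short name: GB2312 bytes of a two-character Chinese name.
raw = u'\u4e2d\u6587.pdf'.encode('gb2312')

# The same bytes decode without error under both encodings, so nothing
# in the bytes themselves rules either one out.
as_gb2312 = raw.decode('gb2312')
as_cp1252 = raw.decode('cp1252')

print(repr(as_gb2312))
print(repr(as_cp1252))
```

With only a few non-ASCII bytes to go on, a detector has little statistical evidence to separate the two readings; a longer name supplies more.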
Another issue is that the code above was tested on Windows, whereas file names on Linux are encoded as UTF-8. To work on both Windows and Linux, the code needs to be modified. Here it is wrapped in a function:
# -*- coding: utf-8 -*-
import os


def get_filename_from_dir(dir_path):
    file_list = []
    if not os.path.exists(dir_path):
        return file_list
    for item in os.listdir(dir_path):
        basename = os.path.basename(item)
        # print(chardet.detect(basename))  # find out the file name encoding
        # The file names contain Chinese characters:
        # on Windows they are encoded as GB2312, on Linux as UTF-8
        try:
            decode_str = basename.decode("GB2312")
        except UnicodeDecodeError:
            decode_str = basename.decode("UTF-8")
        file_list.append(decode_str)
    return file_list


# test code
r = get_filename_from_dir('./test')
for i in r:
    print(i)
The function first tries GB2312 decoding and falls back to UTF-8 if a UnicodeDecodeError is raised. This works on both Windows and Linux (tested on Windows 7 and Ubuntu 16.04).
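The same fallback idea can be written as a small helper that works on a single raw name. The sketch below is a variant, not the code tested above: decode_filename and its encodings parameter are hypothetical, and it tries UTF-8 first. The reasoning is that UTF-8 decoding is strict, so GB2312 bytes are unlikely to pass as UTF-8, whereas some UTF-8 byte sequences do decode as GB2312 without raising an error.

```python
def decode_filename(raw, encodings=('utf-8', 'gb2312')):
    """Decode a byte-string file name with the first encoding that fits."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the name usable and mark the undecodable bytes.
    return raw.decode(encodings[0], 'replace')


name = u'\u4e2d\u6587.pdf'  # hypothetical name, for illustration only
print(decode_filename(name.encode('gb2312')) == name)
print(decode_filename(name.encode('utf-8')) == name)
```

Passing encodings=('gb2312', 'utf-8') instead reproduces the order used in the function above.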