Simple usage of chardet, Python's string encoding detection module
chardet is a third-party Python library for detecting the encoding of strings.
Project page: http://pypi.python.org/pypi/chardet
    import urllib.request
    import chardet

    # Any byte data can be used here as needed
    TestData = urllib.request.urlopen('http://www.baidu.com/').read()
    print(chardet.detect(TestData))

    # Running result:
    # {'confidence': 0.99, 'encoding': 'gb2312'}

The running result indicates a 99% probability that the data is encoded in GB2312.

    import urllib.request
    from chardet.universaldetector import UniversalDetector

    usock = urllib.request.urlopen('http://www.baidu.com/')
    # Create a detection object
    detector = UniversalDetector()
    for line in usock.readlines():
        # Feed the data piece by piece until the detector reaches its confidence threshold
        detector.feed(line)
        if detector.done:
            break
    # Close the detection object to finalize the result
    detector.close()
    usock.close()
    # Output the detection result
    print(detector.result)

    # Running result:
    # {'confidence': 0.99, 'encoding': 'gb2312'}
Background: To identify the encoding of a large file, you do not have to read all of it. Feeding only the first part of the file to the detector is usually enough to identify the encoding, and it speeds up detection. To use one detection object for several data items, call detector.reset() between items to clear the data from the previous detection.
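The two points above can be sketched together as follows. This is only a minimal illustration: the helper name detect_prefix and its max_bytes limit are hypothetical and not part of chardet, and the in-memory byte strings stand in for real files read in pieces.

```python
from chardet.universaldetector import UniversalDetector


def detect_prefix(detector, chunks, max_bytes=64 * 1024):
    """Feed chunks to a shared detector until it is done or max_bytes were read.

    detect_prefix and max_bytes are illustrative names, not chardet API.
    """
    detector.reset()  # clear the data left over from the previous detection
    seen = 0
    for chunk in chunks:
        detector.feed(chunk)
        seen += len(chunk)
        # Stop early: either the detector is confident or enough was sampled
        if detector.done or seen >= max_bytes:
            break
    detector.close()  # finalize the result
    return detector.result


# One detector object reused for several data items
detector = UniversalDetector()
samples = {
    "utf8": "编码检测示例，这是一段中文文本。".encode("utf-8") * 20,
    "ascii": b"plain ASCII text\n" * 20,
}
for name, data in samples.items():
    # Split into 256-byte pieces to imitate reading a file part by part
    chunks = (data[i:i + 256] for i in range(0, len(data), 256))
    result = detect_prefix(detector, chunks)
    print(name, result["encoding"], result["confidence"])
```

For a real file, the chunks could come from repeated calls to read() on a file opened in binary mode. A smaller max_bytes makes detection faster but gives the detector less data to work with.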