I have been using Python's pytesser module, originally to recognize Chinese text in images. After working on it for some time, Chinese recognition still has a lot of problems, so here I am recording and sharing what I have so far.
Pytesser ("OCR in Python using the Tesseract engine from Google") is a Python module built on Google's open-source Tesseract OCR project. It converts the text in an image into a string (mainly English text).
1.pytesser Installation
Environment: Windows 8, 64-bit
Pytesser first converts the image to a format that Tesseract accepts, then runs the tesseract executable to extract the text. The Pytesser package ships with the Tesseract executable, so you do not need to install the Tesseract OCR engine separately, but you must first install the PIL module (Python Imaging Library, Python's image-processing library).
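The "run tesseract" step described above comes down to a single command line. Here is a minimal Python 3 sketch of that step, assuming the tesseract executable is on the PATH and using the usual `tesseract <image> <output base>` form, where Tesseract itself appends .txt to the output base. The helper names build_tesseract_args, output_text_path and run_tesseract are my own and are not part of Pytesser.

```python
import subprocess


def build_tesseract_args(image_path, output_base, exe_name="tesseract"):
    """Return the argument list for `tesseract <image> <output base>`."""
    return [exe_name, image_path, output_base]


def output_text_path(output_base):
    """Tesseract appends '.txt' to the output base name it is given."""
    return output_base + ".txt"


def run_tesseract(image_path, output_base, exe_name="tesseract"):
    """Run Tesseract on image_path and return the recognized text.

    Raises subprocess.CalledProcessError if Tesseract exits with a
    non-zero status, and FileNotFoundError if the executable is missing.
    """
    args = build_tesseract_args(image_path, output_base, exe_name)
    subprocess.run(args, check=True)
    with open(output_text_path(output_base), encoding="utf-8") as f:
        return f.read()
```

For example, run_tesseract("temp.bmp", "temp") would run `tesseract temp.bmp temp` and then read temp.txt.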
Pytesser download: http://code.google.com/p/pytesser/
If that page does not open, it can be downloaded from Baidu network disk: http://pan.baidu.com/s/1o69LL8Y
PIL Official Download: http://www.pythonware.com/products/pil/
PIL can be installed by running the EXE installer directly. Pytesser needs no installer: after unzipping, place it under \Lib\site-packages\ in the Python installation folder and it can be used directly (you need to add a pytesser.pth file).
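A .pth file is a plain text file in site-packages whose lines name extra directories to add to sys.path. Here is a minimal Python 3 sketch of creating one, assuming Pytesser was unzipped into a folder named pytesser inside site-packages. The folder name and the helpers write_pth and demo are my assumptions. The demo uses a temporary folder in place of the real site-packages.

```python
import os
import site
import sys
import tempfile


def write_pth(site_packages, package_dir="pytesser"):
    """Create <site_packages>/<package_dir>.pth with one line: the folder name."""
    pth_path = os.path.join(site_packages, package_dir + ".pth")
    with open(pth_path, "w") as f:
        f.write(package_dir + "\n")
    return pth_path


def demo():
    """Simulate site-packages in a temp folder and let `site` process the .pth file."""
    fake_site = tempfile.mkdtemp()
    package_path = os.path.join(fake_site, "pytesser")
    os.mkdir(package_path)
    write_pth(fake_site)
    # addsitedir applies the same .pth handling Python performs at startup.
    site.addsitedir(fake_site)
    return os.path.abspath(package_path) in sys.path
```

On a real installation you would call write_pth with the actual site-packages path instead of a temporary folder. After that, the pytesser folder should be importable from any script.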
2.pytesser Source Code
By looking at pytesser.py's source code, you can see several key functions:
(1) call_tesseract(input_filename, output_filename)
This function calls the external tesseract executable to extract the text from the image.
(2) image_to_string(im, cleanup = cleanup_scratch_flag)
This function works on an image object, so you first need im = Image.open(filename) to open the file and get an image object. It calls util.image_to_scratch(im, scratch_image_name) to save the in-memory image as a BMP file so that the tesseract program can handle it.
(3) image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True)
This function passes the image file directly to Tesseract. If the image format is incompatible and graceful_errors is True, the image is converted to a compatible format before the text is extracted.
"""OCR in Python using the Tesseract engine from googlehttp://code.google.com/p/pytesser/by Michael J.T. O ' Kellyv 0.0.1, 3/10/07"""ImportImageImportsubprocessImportutilImportErrorstesseract_exe_name='tesseract' #Name of executable to being called at command lineScratch_image_name ="temp.bmp" #This file must is. bmp or other tesseract-compatible formatScratch_text_name_root ="Temp" #Leave out the. txt extensionCleanup_scratch_flag = False#temporary files cleaned up after OCR operationdefcall_tesseract (Input_filename, output_filename):"""Calls external Tesseract.exe on input file (restrictions on types), outputting output_filename+ ' txt '"""args=[Tesseract_exe_name, Input_filename, Output_filename] proc=subprocess. Popen (args) Retcode=proc.wait ()ifretcode!=0:errors.check_for_errors ()defImage_to_string (IM, cleanup =cleanup_scratch_flag):"""converts IM to file, applies tesseract, and fetches resulting text. If cleanup=true, delete scratch files after operation.""" Try: Util.image_to_scratch (IM, scratch_image_name) call_tesseract (Scratch_image_name, Scratch_text_name_roo T) text=Util.retrieve_text (scratch_text_name_root)finally: ifCleanup:util.perform_cleanup (Scratch_image_name, Scratch_text_name_root)returntextdefimage_file_to_string (filename, cleanup = Cleanup_scratch_flag, graceful_errors=True):"""applies tesseract to filename; or, if image was incompatible and graceful_errors=true, converts to compatible format And then applies tesseract. fetches resulting text. If cleanup=true, delete scratch files after operation.""" Try: Try: call_tesseract (filename, scratch_text_name_root) text=Util.retrieve_text (scratch_text_name_root)excepterrors. 
Tesser_general_exception:ifGraceful_errors:im=image.open (filename) text=image_to_string (IM, cleanup)Else: Raise finally: ifCleanup:util.perform_cleanup (Scratch_image_name, Scratch_text_name_root)returntextif __name__=='__main__': Im= Image.open ('phototest.tif') Text=image_to_string (IM)PrinttextTry: Text= Image_file_to_string ('fnord.tif', graceful_errors=False)excepterrors. Tesser_general_exception, Value:Print "fnord.tif is incompatible filetype. Try graceful_errors=true" PrintValue Text= Image_file_to_string ('fnord.tif', graceful_errors=True)Print "fnord.tif Contents:", text text= Image_file_to_string ('Fonts_test.png', graceful_errors=True)PrintText
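To make the control flow of image_file_to_string easier to follow, here is a minimal Python 3 sketch of the same "try the file directly, then convert and retry" logic. This is not Pytesser's code. run_ocr and convert_image are hypothetical callables standing in for the Tesseract call and the PIL conversion, and OcrFormatError stands in for the library's exception, so the flow can be exercised without PIL or Tesseract installed.

```python
import os
import tempfile


class OcrFormatError(Exception):
    """Stand-in for the error raised when the OCR engine rejects a file."""


def file_to_string(filename, run_ocr, convert_image,
                   scratch_image="temp.bmp", cleanup=True, graceful_errors=True):
    """Try OCR on the file as-is; on a format error, convert and retry.

    run_ocr(path) -> str      hypothetical: runs the engine, returns the text
    convert_image(src, dst)   hypothetical: saves src in a compatible format
    """
    try:
        try:
            return run_ocr(filename)
        except OcrFormatError:
            if not graceful_errors:
                raise
            convert_image(filename, scratch_image)
            return run_ocr(scratch_image)
    finally:
        # Runs whether OCR succeeded or failed, so scratch files never pile up.
        if cleanup and os.path.exists(scratch_image):
            os.remove(scratch_image)


def demo():
    """Exercise the fallback path with a fake engine that only accepts .bmp."""
    scratch = os.path.join(tempfile.mkdtemp(), "temp.bmp")

    def fake_ocr(path):
        if not path.endswith(".bmp"):
            raise OcrFormatError(path)
        return "hello"

    def fake_convert(src, dst):
        with open(dst, "wb") as f:
            f.write(b"BM")  # placeholder bytes, not a real bitmap

    text = file_to_string("scan.png", fake_ocr, fake_convert, scratch_image=scratch)
    return text, os.path.exists(scratch)
```

With graceful_errors=False the format error is re-raised to the caller, which matches the fnord.tif case in the library's own test code.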
3.pytesser Usage
Import the pytesser module in your code. A simple test looks like this:
    from pytesser import *

    im = Image.open('fonts_test.png')
    text = image_to_string(im)
    print "Using image_to_string():"
    print text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print "Using image_file_to_string():"
    print text
Recognition results: English characters are basically extracted correctly from simple images. For more complex images the results are not ideal. For example, I tried to recognize images of pages from some English papers and the output was poor.
Chinese recognition still has many problems, so I will study it further and share the results later.
Reference: HK_JH's Column http://blog.csdn.net/hk_jh/article/details/8961449
Python uses Pytesser module to recognize image text