Environment: Windows 10 | Python 3.6 | ImageMagick-6.9.9-38-Q8-x64-dll | Ghostscript 9.22 for Windows
Overall approach: 1. Convert the PDF to images and run text recognition (OCR) on them | 2. Parse the PDF directly with PDFMiner (higher accuracy)
Contents
1. Download and install Tesseract
2. Install pyocr, Wand, Pillow
3. Download and install ImageMagick, Ghostscript
4. Configure the TESSDATA_PREFIX environment variable
5. Modify the tesseract.py file in the pyocr package
6. Write and run the program
   Generated images
   Run results
7. Problems encountered and solutions
   ① OSError: cannot find library; tried paths:
   ② DelegateError: PDFDelegateFailed `garbled text` (can be skipped)
   ③ DelegateError: PDFDelegateFailed `The system could not find the specified file.`
   ④ "No OCR tool found" or pytesseract.TesseractError:
8. Use PDFMiner to parse PDF files (higher accuracy)
   ① Install pdfminer3k
   ② Write and run the program
1. Download and install Tesseract
Download the tesseract-xxx.exe installer from github.com/UB-Mannheim/tesseract/wiki and run it. Note that on the installer's component selection page you must tick, under "Language data", every language you want to recognize; otherwise only English can be recognized.
2. Install pyocr, Wand, Pillow
pip install pyocr
pip install Wand
pip install Pillow
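A quick way to confirm that the three packages installed correctly is to check that Python can locate them. The sketch below is not from the original write-up: `missing_modules` is a hypothetical helper name, and `pyocr`, `wand`, and `PIL` are the import names provided by the pip packages above. It only checks that the packages are present; Wand still needs ImageMagick when it is actually imported.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names provided by the pyocr, Wand and Pillow packages.
    missing = missing_modules(["pyocr", "wand", "PIL"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("pyocr, Wand and Pillow can all be located")
```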
If you are on Python 2.x, you need to download the Python Imaging Library (PIL) exe installer from http://pythonware.com/products/pil/ and install it.
3. Download and install ImageMagick, Ghostscript
Wand depends on ImageMagick, and ImageMagick depends on Ghostscript. Download and install both from the links below.
ImageMagick: http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows
Be careful not to download an ImageMagick 7.x version; Wand does not support it.
Ghostscript: https://ghostscript.com/download/gsdnld.html
4. Configure the TESSDATA_PREFIX environment variable
Set the variable's value to the Tesseract installation directory, and restart the project after the change so that the new value is picked up.
5. Modify the tesseract.py file in the pyocr package
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
# TESSERACT_CMD = 'tesseract.exe' if os.name == 'nt' else 'tesseract'
# TESSERACT_CMD = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe' if os.name == 'nt' else 'tesseract'
TESSERACT_CMD = os.environ["TESSDATA_PREFIX"] + '/tesseract.exe' if os.name == 'nt' else 'tesseract'
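The modified line builds the executable path from the environment variable. The same logic can be written as a small function, which makes the behaviour explicit when the variable is missing (on Windows the one-liner above raises KeyError in that case). This is only a sketch: `resolve_tesseract_cmd` is a hypothetical helper, not part of pyocr, and the fallback to the stock default is an assumption about what you would want.

```python
import os


def resolve_tesseract_cmd(environ=None, os_name=None):
    """Mirror the edited TESSERACT_CMD line, but fall back to the stock
    pyocr default when TESSDATA_PREFIX is not set."""
    environ = os.environ if environ is None else environ
    os_name = os.name if os_name is None else os_name

    if os_name != 'nt':
        return 'tesseract'
    prefix = environ.get('TESSDATA_PREFIX')
    if not prefix:
        # No variable set: rely on tesseract.exe being on PATH.
        return 'tesseract.exe'
    # Drop a trailing slash or backslash so the separator is not doubled.
    return prefix.rstrip('/\\') + '/tesseract.exe'
```

Passing `environ` and `os_name` explicitly makes it possible to try the function on any platform, for example `resolve_tesseract_cmd({'TESSDATA_PREFIX': 'D:/Tesseract-OCR'}, 'nt')`.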
6. Write and run the program
With the preparation finished, let's run the program.
# -*- coding: utf-8 -*-  (not needed on Python 3)
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import sys


def main():
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]
    print("Will use tool '%s'" % (tool.get_name()))

    langs = tool.get_available_languages()
    print("Available languages: %s" % ", ".join(langs))
    lang = langs[0]
    print("Will use lang '%s'" % (lang))

    req_image = []   # JPEG bytes, one entry per PDF page
    final_text = []  # recognized text, one entry per page

    # Render the PDF at a resolution of 400 and convert it to JPEG
    image_pdf = Image(filename="./pdf_file/stackoverflow.pdf", resolution=400)
    image_jpeg = image_pdf.convert('jpeg')
    image_jpeg.save(filename='./pdf2img/stackoverflow.jpeg')

    # Collect each page as an in-memory JPEG blob
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpeg'))

    # Run OCR on every page
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt)

    print(final_text)
    # for text in final_text:
    #     print(text)


if __name__ == '__main__':
    main()
Generated images
Run results
7. Problems encountered and solutions
① Running the program raises OSError: cannot find library; tried paths:
'D:\Program Files\ImageMagick-7.0.7-Q16\CORE_RL_wand_.dll',
'D:\Program Files\ImageMagick-7.0.7-Q16\libMagickWand.dll',
...
I checked the official Wand documentation and found this passage:
"Wand yet doesn't support ImageMagick 7 which has several incompatible APIs with previous versions. For more details, see the issue #287."
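The restriction quoted above comes down to the major version number. As an illustration only, the check can be written as a small function that parses a version string such as the one in an installer or directory name. `imagemagick_major` and `is_supported_by_wand` are hypothetical helpers, and the rule they encode (major version 6 only) is the one stated here for the Wand release used in this setup.

```python
import re


def imagemagick_major(version_text):
    """Extract the major version from text such as
    'ImageMagick-6.9.9-38-Q8-x64-dll'."""
    match = re.search(r'(\d+)\.(\d+)\.(\d+)', version_text)
    if match is None:
        raise ValueError("no version number found in %r" % version_text)
    return int(match.group(1))


def is_supported_by_wand(version_text):
    """Rule taken from the passage above: 6.x works, 7.x does not."""
    return imagemagick_major(version_text) == 6
```

For example, `is_supported_by_wand('ImageMagick-7.0.7-Q16')` evaluates to `False`, which matches the failure shown in the traceback paths.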
The problem was solved by uninstalling ImageMagick 7 and installing the ImageMagick-6.9.9-38-Q8-x64-dll version.
② DelegateError: PDFDelegateFailed `garbled text` (can be skipped)
After half a day of searching I found someone else who had hit this problem on Python 3.6, but no solution had been posted, so I decided to try switching to Python 2.7. If you need to switch Python versions, see the steps below.
(Making multiple Python versions coexist with Spyder in Anaconda3)
Run all of the following in the Anaconda command prompt.
1) Create a conda environment named python2 and download the matching Python 2.7
conda create --name python2 python=2.7
2) Activate the python2 environment
activate python2
3) Install Spyder and Jupyter Notebook in the python2 environment
conda install spyder
# conda install jupyter
③ DelegateError: PDFDelegateFailed `The system could not find the specified file.
' @ error/pdf.c/ReadPDFImage/
Up to this point I had been running the program in PyCharm. After switching to Python 2.7 I first tried it in Spyder, and the previously garbled message now appeared as readable Chinese. Looking into it, the cause was that Ghostscript was not installed; installing it solved the problem. I then switched back to Python 3.6, and the error no longer occurred there either.
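Since the root cause was a missing Ghostscript, it can help to look for the Ghostscript executable before converting a PDF. The sketch below assumes the usual executable names (`gswin64c` and `gswin32c` on Windows, `gs` elsewhere) and that the Ghostscript `bin` directory is on PATH, which is not guaranteed after installation. `find_ghostscript` is a hypothetical helper. ImageMagick may locate Ghostscript by other means on Windows, so a negative result is a hint rather than proof.

```python
import shutil

# Assumed executable names: console builds on Windows, `gs` elsewhere.
GHOSTSCRIPT_NAMES = ('gswin64c', 'gswin32c', 'gs')


def find_ghostscript(which=shutil.which):
    """Return the path of the first Ghostscript executable found on PATH,
    or None if none of the assumed names can be found."""
    for name in GHOSTSCRIPT_NAMES:
        path = which(name)
        if path:
            return path
    return None
```

The `which` parameter defaults to `shutil.which`; it is only there so the lookup can be replaced when trying the function out.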
Ghostscript: https://ghostscript.com/download/gsdnld.html
④ "No OCR tool found" or pytesseract.TesseractError:
(1, 'Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/chi_sim.traineddata')
After installing Tesseract, set the TESSDATA_PREFIX environment variable and modify the tesseract.py file:
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
# TESSERACT_CMD = 'tesseract.exe' if os.name == 'nt' else 'tesseract'
# TESSERACT_CMD = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe' if os.name == 'nt' else 'tesseract'
TESSERACT_CMD = os.environ["TESSDATA_PREFIX"] + '/tesseract.exe' if os.name == 'nt' else 'tesseract'
D:\Program Files (x86)\Tesseract-OCR (for reference only; use your actual installation path)
If the environment variable is not set, running Tesseract from a drive other than the one it is installed on produces this message:
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
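Taken together with the "Error opening data file" error, this message suggests checking two things: that TESSDATA_PREFIX is set, and that the language file exists under its `tessdata` subdirectory. A minimal sketch follows, assuming the layout the message describes (`<TESSDATA_PREFIX>/tessdata/<lang>.traineddata`); `traineddata_path` and `check_language` are hypothetical helpers.

```python
import os


def traineddata_path(prefix, lang):
    """Build the expected location of a language data file."""
    return os.path.join(prefix, 'tessdata', lang + '.traineddata')


def check_language(lang, environ=None):
    """Return (ok, detail): whether the data file for `lang` can be found,
    plus either its path or a description of what is missing."""
    environ = os.environ if environ is None else environ
    prefix = environ.get('TESSDATA_PREFIX')
    if not prefix:
        return False, 'TESSDATA_PREFIX is not set'
    path = traineddata_path(prefix, lang)
    if not os.path.isfile(path):
        return False, 'missing data file: ' + path
    return True, path
```

Calling `check_language('chi_sim')` before running OCR shows whether the Chinese data file from the error above is where Tesseract will look for it.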
Note: you need to restart the project after setting the environment variable.
8. Use PDFMiner to parse PDF files (higher accuracy)
See:
http://blog.csdn.net/u011389474/article/details/60139786
https://www.cnblogs.com/jamespei/p/5339769.html
① Install pdfminer3k
pip install pdfminer3k
② Write and run the program
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

'''
Parse the text of a PDF and save it to a TXT file
'''
path = r'./pdf_file/stackoverflow.pdf'


def parse():
    fp = open(path, 'rb')  # open in binary read mode
    # Create a PDF parser from the file object
    parser = PDFParser(fp)
    # Create a PDF document object
    doc = PDFDocument()
    # Connect the parser and the document object
    parser.set_document(doc)
    doc.set_parser(parser)

    # Supply the initial password;
    # if there is no password, an empty string is used
    doc.initialize()

    # Check whether the document allows text extraction; raise if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Loop over the pages, processing one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the pages
            interpreter.process_page(page)
            # Get the LTPage object for this page
            layout = device.get_result()
            """
            Here layout is an LTPage object that contains the objects parsed
            from the page, typically LTTextBox, LTFigure, LTImage,
            LTTextBoxHorizontal, and so on. To obtain the text, read it from
            the object with get_text().
            """
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    with open(r'./pdf_file/stackoverflow.txt', 'a') as f:
                        results = x.get_text()
                        print(results)
                        f.write(results + '\n')


if __name__ == '__main__':
    parse()