http://www.jb51.net/article/89955.htm
https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/
You may have heard of using Python for OCR. The best-known OCR engine used from Python is Tesseract, whose development Google has sponsored. With Tesseract, recognizing text in images is straightforward. But what if you want to run OCR on a PDF document? Read on.
On a recent project, I needed to take a PDF file as input, extract its text, and save that text in a database. I spent a long time searching for a solution and finally settled on Tesseract. Without further ado, let's get started.
1. Installing Tesseract
Tesseract is easy to install on most systems. For simplicity, we use Ubuntu as the example.
On Ubuntu you only need to install the tesseract-ocr package with apt-get: sudo apt-get install tesseract-ocr
This installs Tesseract with support for three languages.
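As a quick sanity check after installation, the minimal sketch below looks for the Tesseract executable on the PATH. The helper name find_tesseract is hypothetical, and the sketch assumes the executable is called tesseract, which is the usual name on Ubuntu.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH.

    find_tesseract is an illustrative helper, not part of any OCR library.
    """
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found; check the installation step above")
    else:
        print("tesseract found at", path)
```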
2. Installing pyocr
Now we also need a Python interface to Tesseract. Fortunately, there are several good ones.
We use the most recent one, pyocr, which can be installed with pip: pip install pyocr
3. Installing Wand and PIL
Before we start, we need two more dependencies. The first is Wand, the Python interface to ImageMagick.
We use it to convert the PDF file into images. It can be installed with pip: pip install wand
We also need PIL, because pyocr depends on it. See the official documentation for how to install PIL on your operating system.
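Before moving on, it can help to confirm that all three packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the REQUIRED_MODULES tuple are hypothetical, and the import names pyocr, wand, and PIL are assumed to match the packages installed above (PIL is usually provided today by the Pillow package).

```python
import importlib.util

# Top-level import names used later in this tutorial (assumed, see lead-in).
REQUIRED_MODULES = ("pyocr", "wand", "PIL")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All OCR dependencies are importable")
```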
4. Warm Up
Let's start our script. First, we need to import some important libraries:
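A sketch of the imports this step calls for is shown here. The try/except guard and the OCR_DEPS_AVAILABLE flag are not part of the original article; they are assumptions added so that the snippet reports a missing package instead of stopping with a traceback.

```python
import io  # used later to wrap raw image bytes in a file-like object

try:
    from wand.image import Image   # ImageMagick binding: reads and converts the PDF
    from PIL import Image as PI    # PIL's Image module under an alias
    import pyocr
    import pyocr.builders
    OCR_DEPS_AVAILABLE = True
except (ImportError, OSError) as exc:
    # Illustrative guard (an addition): record that something is missing.
    OCR_DEPS_AVAILABLE = False
    print("Missing OCR dependency:", exc)
```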
Note: I imported the Image module from PIL under the alias PI, because otherwise its name would clash with the Image class from the wand.image module.
5. Start
Now we need a handle to the OCR tool (Tesseract, in this case) and the language that pyocr will use:
We use the second language returned by tool.get_available_languages(), because in my earlier tests the second language was English.
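The two lookups described above can be sketched as follows. Because the order of the language list may differ between machines, this sketch looks up "eng" (Tesseract's code for English) by name instead of relying on the second position, which is a deliberate departure from the article. The helper name pick_tool_and_language and its parameters are hypothetical; only get_available_tools() and get_available_languages() are pyocr calls.

```python
def pick_tool_and_language(pyocr_module, preferred_lang="eng"):
    """Return (tool, lang) using the first OCR tool that pyocr reports.

    pyocr_module is the imported pyocr module, passed in explicitly so the
    helper stays easy to test. This helper is illustrative, not pyocr API.
    """
    tools = pyocr_module.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed?")
    tool = tools[0]

    languages = tool.get_available_languages()
    if not languages:
        raise RuntimeError("The OCR tool reports no installed languages")

    # Prefer the requested language; otherwise fall back to the first one listed.
    lang = preferred_lang if preferred_lang in languages else languages[0]
    return tool, lang
```

With pyocr imported, a call such as tool, lang = pick_tool_and_language(pyocr) gives the two values used in the rest of the script.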
Next, we need to create two lists for storing our images and the final text.
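A minimal sketch of this step, using the list names req_image and final_text that the rest of the article refers to:

```python
req_image = []   # will hold the JPEG bytes of each PDF page
final_text = []  # will hold the recognized text of each page
```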
Next, we need to use wand to convert a PDF file into a JPEG file. Let's give it a try!
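One way to sketch the conversion is shown here. The helper name pdf_to_jpeg is hypothetical, PDF_FILE_NAME is only a placeholder path, and the resolution of 300 is an assumed value chosen to give the OCR engine reasonably sharp input. Wand is imported inside the function so that the definition itself loads even where Wand is not installed.

```python
PDF_FILE_NAME = "./PDF_FILE_NAME"  # placeholder: point this at a real PDF file


def pdf_to_jpeg(pdf_path=PDF_FILE_NAME, resolution=300):
    """Read a PDF with Wand and return it converted to JPEG format.

    The returned Wand image still contains every page of the document.
    """
    from wand.image import Image  # lazy import; requires Wand and ImageMagick

    image_pdf = Image(filename=pdf_path, resolution=resolution)
    return image_pdf.convert("jpeg")
```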
Note: replace PDF_FILE_NAME with the name of a PDF file in the current directory.
Wand has turned each page of the PDF into a separate image. We can loop over the pages of the converted image and append each one, as a binary blob, to the req_image list.
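The loop described above might look like the following sketch. The helper name pages_to_jpeg_blobs is hypothetical; it assumes image_jpeg is the multi-page Wand image produced by the conversion step and relies on Wand's sequence attribute and make_blob method.

```python
def pages_to_jpeg_blobs(image_jpeg):
    """Return a list with the JPEG bytes of every page in a Wand image."""
    from wand.image import Image  # lazy import; requires Wand and ImageMagick

    blobs = []
    for page in image_jpeg.sequence:
        img_page = Image(image=page)              # wrap the single page
        blobs.append(img_page.make_blob("jpeg"))  # serialize it to JPEG bytes
    return blobs
```

The result can then be added to the list from the earlier step, for example with req_image.extend(pages_to_jpeg_blobs(image_jpeg)).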
Now, we just need to run OCR on the Image object, which is very simple:
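A minimal sketch of the OCR step follows. The helper name ocr_jpeg_blobs is hypothetical; it assumes tool and lang come from the earlier pyocr lookup and that each blob holds the JPEG bytes of one page. The pyocr and PIL imports are done inside the function so that the definition loads even where those packages are missing.

```python
import io


def ocr_jpeg_blobs(tool, lang, blobs):
    """Run OCR on each JPEG blob and return the recognized text, one string per page."""
    from PIL import Image as PI  # aliased to avoid clashing with wand.image.Image
    import pyocr.builders

    texts = []
    for blob in blobs:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(blob)),            # decode the bytes as a PIL image
            lang=lang,
            builder=pyocr.builders.TextBuilder(),  # plain-text output
        )
        texts.append(txt)
    return texts
```

In the script, final_text.extend(ocr_jpeg_blobs(tool, lang, req_image)) would fill the result list.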
Now all the recognized text has been added to the final_text list. You can use it any way you please. That is how to use Python to run OCR on the entire content of a PDF file. I hope this tutorial helps!
English Original: https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/
OCR recognition of PDF files based on Python