OCR recognition of PDF files based on Python

Source: Internet
Author: User



You may have heard of using Python for OCR recognition operations. In Python, the most famous library is the tesseract that Google has funded. With tesseract, images can be easily identified. Now the question is, what if you want to make OCR recognition for a PDF document? Take a look below.

When you're working on a project recently, you need to take the PDF file as input, output text from it, and then save the text in the database. For this reason, I searched for a long-lasting solution, and finally decided to use tesseract. So don't waste your time, let's get started.

1. Installing Tesseract

It is easy to install tesseract in different systems. For simplicity, let's take Ubuntu as an example.

In Ubuntu you just need to run the following command:

This will install tesseractthat support 3 different languages.

2. Installing PYOCR

Now we also need to install the tesseract python interface. Fortunately, there are many excellent Python interfaces.

We use the latest one:

3. Installing Wand and PIL

Before we start, we also need to install two additional dependent packages. One is Wand. It is the Imagemagick python interface.

We need to use it to convert a PDF file into an image:

We also need PIL because pyocr need to use it. You can view the official documentation to determine how to install PIL into your operating system.

4. Warm Up

Let's start our script. First, we need to import some important libraries:

Note: I renamed the Image module imported from PIL to Pi, because if you do not do this, it will conflict with thewand.imagemodule duplicate name.

5. Start

Now we need to get the handle to the OCR library (in this case, tesseract) and the language we will use in PYOCR :

We usetool.get_available_languagesthe second language in, because I have tried before, the second language is English.

Next, we need to create two lists for storing our images and the final text.

Next, we need to use wand to convert a PDF file into a JPEG file. Let's give it a try!

Note: replace Pdf_file_name with an available PDF file name under the current path.

Wand has turned all the standalone pages in the PDF into separate binary image objects. We can traverse this large object and add them to the req_image sequence.

Now, we just need to run OCR on the Image object, which is very simple:

Now, all the recognized text has been added to the final_text sequence. You can use it any way you please. The above is the use of Python to make OCR recognition of the entire content of the PDF file, I hope this tutorial can help you!

English Original: https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/

OCR recognition of PDF files based on Python

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.