OCR text recognition in Ubuntu (PDF, TIFF, etc.)

Source: Internet
Author: User
Tags: linux mint
I usually read documents as scanned copies or PDFs. On an iPad, however, the text in such files is often small and cannot be zoomed effectively, and having to pan around the screen on every page is inconvenient. To solve this, we want to extract the text from the PDF or image so that it can be processed freely, which requires OCR. The steps below work through the problem.

1. Technical preparation

OS: Linux Mint 13 (based on Ubuntu 12.04)

OCR software: tesseract (the executable is also named tesseract)

gocr

PDF processing software: the pdftoxxx tools, such as pdftotext

TIFF processing tools, for example tiff2pdf

2. Install software

sudo apt-get install gocr

sudo apt-get install tesseract-ocr

sudo apt-get install libtiff-tools

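To add a language pack for tesseract, download the trained data file for the language you need, for example chi_sim for Simplified Chinese. Move it into the tessdata directory and point the TESSDATA_PREFIX environment variable at that location. In Tesseract 3.x, which Ubuntu 12.04 ships, TESSDATA_PREFIX names the parent of the tessdata directory; newer releases expect the tessdata directory itself.

mv chi_sim.traineddata /usr/local/share/tessdata

export TESSDATA_PREFIX=/usr/local/share/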
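Before running OCR, it can help to confirm that tesseract actually sees the new language pack. The following is a minimal sketch, assuming a tesseract build that supports the --list-langs option; lang_installed is a hypothetical helper name, not part of tesseract.

```shell
#!/bin/sh
# Hypothetical helper: succeed if tesseract reports the given language code.
# Assumes the installed tesseract supports --list-langs. Older releases print
# the list on stderr, so both streams are merged before searching.
lang_installed() {
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}
```

For example, `lang_installed chi_sim || echo "chi_sim not found"` prints a warning when the pack is missing or TESSDATA_PREFIX points at the wrong place.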

3. Convert a TIFF file to text (tif --> text)

Use tesseract directly, as shown below. tesseract appends .txt to the output name by itself, so the command writes a.txt:

tesseract a.tif a -l chi_sim

Multi-page TIFF files stored as a single file are also supported.

4. Convert PDF files to text (pdf --> text)

If the PDF already contains text, it can be converted directly:

pdftotext a.pdf a.txt
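One rough way to tell whether a PDF contains text is to run pdftotext and check whether it produces any non-whitespace output. The sketch below assumes that heuristic is good enough for your files; has_text_layer is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if pdftotext extracts any non-whitespace text.
# Passing "-" as the output file makes pdftotext write to standard output.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, `has_text_layer a.pdf && pdftotext a.pdf a.txt` converts the file only when text is present. A scanned PDF that carries a few stray text characters would still pass this check, so treat the result as a hint.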

If the pages in the PDF are images, the method above does not work. First convert the PDF pages to PPM images, then run OCR on each image: pdf --> multiple ppm --> multiple txt

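pdftoppm a.pdf a

This generates a-1.ppm, a-2.ppm, and so on (the width of the page number depends on the pdftoppm version and the page count).

Then convert each page image with tesseract:

tesseract a-1.ppm a-1 -l chi_sim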
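Running tesseract by hand for every page gets tedious, so the last step can be wrapped in a loop. This is a minimal sketch, assuming the page images were produced by `pdftoppm a.pdf a` and are named like a-1.ppm, a-2.ppm; ocr_pages is a hypothetical helper name, and the language defaults to chi_sim as in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: OCR every <root>-N.ppm page and merge the text
# into <root>.txt. Usage: ocr_pages ROOT [LANG]
ocr_pages() {
    root=$1                # prefix that was passed to pdftoppm, e.g. "a"
    lang=${2:-chi_sim}     # tesseract language code
    : > "$root.txt"        # start with an empty combined file
    for page in "$root"-*.ppm; do
        [ -e "$page" ] || continue            # glob matched nothing
        base=${page%.ppm}
        tesseract "$page" "$base" -l "$lang"  # writes $base.txt
        cat "$base.txt" >> "$root.txt"
    done
}
```

Calling `ocr_pages a` after pdftoppm leaves one text file per page plus the combined a.txt. Pages are merged in shell glob order, which matches page order only when the page numbers are zero-padded to the same width.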
