I usually read documents as scanned copies or PDFs. On the iPad, however, the text in these files is often small and cannot be zoomed in effectively, and having to pan around the screen while reading is inconvenient. To solve this, we want to extract the text from the PDF or image so that it can be processed freely. This requires OCR. The rest of this post works through the problem.
1. Technical preparation:
OS: Linux Mint 13 (based on Ubuntu 12.04)
OCR software: tesseract (the executable is also named tesseract)
gocr
PDF processing software: the pdftoXXX tools, such as pdftotext
TIFF processing tools: for example, tiff2pdf
2. Install software
sudo apt-get install gocr
sudo apt-get install tesseract-ocr
sudo apt-get install libtiff-tools
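Before going further, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch; missing_tools is a hypothetical helper name introduced here, not part of any package. On Ubuntu-based systems the pdftoXXX tools usually come from the poppler-utils package, so if they are reported missing, that is the package to look at.

```shell
#!/bin/sh
# Sketch: print the name of every tool that is NOT found on the PATH.
# missing_tools is a hypothetical helper name used only for illustration.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (prints nothing when everything is installed):
# missing_tools tesseract gocr pdftotext pdftoppm tiff2pdf
```

An empty result means all listed tools were found.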
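To give tesseract a language pack, download the one you need online. For Simplified Chinese the pack is chi_sim (the file chi_sim.traineddata). Move it into a tessdata directory (create the directory first if it does not exist; writing there may require sudo), then set the TESSDATA_PREFIX environment variable to the parent of that directory:
mv chi_sim.traineddata /usr/local/share/tessdata/
export TESSDATA_PREFIX=/usr/local/share/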
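A quick way to check that the language pack ended up where tesseract will look for it is to test for the file directly. This is only a sketch: has_lang is a hypothetical helper name, and it assumes the layout used in this post, where TESSDATA_PREFIX is the parent of the tessdata directory (other tesseract versions may expect a different layout).

```shell
#!/bin/sh
# Sketch: succeed if <lang>.traineddata exists under $TESSDATA_PREFIX/tessdata.
# has_lang is a hypothetical helper; the default prefix is the one used in this post.
has_lang() {
    lang="$1"
    prefix="${TESSDATA_PREFIX:-/usr/local/share/}"
    # ${prefix%/} drops one trailing slash so the path has no "//"
    [ -f "${prefix%/}/tessdata/$lang.traineddata" ]
}

# Example call:
# has_lang chi_sim && echo "chi_sim is installed" || echo "chi_sim is missing"
```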
3. Convert a TIFF file to text (tif --> text)
Use tesseract directly. Note that tesseract appends .txt to the output name itself, so the command below writes a.txt:
tesseract a.tif a -l chi_sim
This also works for a multi-page TIFF stored in a single file.
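A multi-page TIFF is handled in one call, but a folder of separate TIFF files needs a loop. The following is a minimal sketch: ocr_dir, OCR_CMD and OCR_LANG are hypothetical names introduced here so that the command and language can be swapped without editing the function.

```shell
#!/bin/sh
# Sketch: run OCR on every .tif file in a directory.
# OCR_CMD and OCR_LANG are hypothetical variables, not tesseract settings.
OCR_CMD="${OCR_CMD:-tesseract}"
OCR_LANG="${OCR_LANG:-chi_sim}"

ocr_dir() {
    dir="$1"
    for img in "$dir"/*.tif; do
        # Skip the unexpanded pattern when the directory has no .tif files
        [ -e "$img" ] || continue
        # Output base without extension; tesseract adds ".txt" itself
        base="${img%.tif}"
        "$OCR_CMD" "$img" "$base" -l "$OCR_LANG"
    done
}

# Example call:
# ocr_dir ./scans
```

Each x.tif then gets a matching x.txt next to it. Setting OCR_CMD=echo gives a dry run that only prints the commands.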
4. Convert a PDF file to text (pdf --> text)
If the PDF contains real text rather than scanned images, it can be converted directly:
pdftotext a.pdf a.txt
If the pages in the PDF are images, the method above does not work. First convert the PDF pages to PPM images, then convert each PPM image to text: pdf --> multiple ppm --> multiple txt
pdftoppm a.pdf a
This generates one image per page: a-1.ppm, a-2.ppm, ... (the number of digits in the page number can vary with the pdftoppm version).
Then convert each image with tesseract:
tesseract a-1.ppm a-1 -l chi_sim
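The two cases in this section can be combined into one script: try pdftotext first, and fall back to pdftoppm plus tesseract when no text comes out. This is a sketch under stated assumptions, not a definitive implementation: pdf_to_text is a hypothetical function name, the per-page files are assumed to be named <root>-<number>.ppm with page numbers padded so that the shell glob lists them in page order, and tesseract is assumed to read PPM input directly.

```shell
#!/bin/sh
# Sketch: convert a PDF to text, using OCR only when direct extraction is empty.
# pdf_to_text is a hypothetical helper; chi_sim is the language used in this post.
pdf_to_text() {
    pdf="$1"
    out="$2"
    lang="${3:-chi_sim}"

    # Step 1: direct extraction, enough for text-based PDFs.
    pdftotext "$pdf" "$out"
    # Any non-whitespace character means real text was extracted.
    if grep -q '[^[:space:]]' "$out" 2>/dev/null; then
        return 0
    fi

    # Step 2: render each page to a PPM image in a temporary directory.
    tmp=$(mktemp -d) || return 1
    pdftoppm "$pdf" "$tmp/page"

    # Step 3: OCR every page and append its text to the output file.
    : > "$out"
    for ppm in "$tmp"/page-*.ppm; do
        [ -e "$ppm" ] || continue
        tesseract "$ppm" "${ppm%.ppm}" -l "$lang" >/dev/null 2>&1
        cat "${ppm%.ppm}.txt" >> "$out"
    done
    rm -rf "$tmp"
}

# Example call:
# pdf_to_text a.pdf a.txt chi_sim
```

The temporary directory keeps the intermediate images out of the working directory, since a scanned PDF can produce many large PPM files.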