Convert PDF to txt under Linux

Source: Internet
Author: User

There are two cases when converting a PDF. If the PDF is not a scan, conversion is fast, the recognition rate is 100%, and plenty of software can do it. If it is a scan, it is more troublesome and requires OCR (optical character recognition).


Under Linux:

Required packages: poppler-utils, tesseract (packaged as tesseract-ocr on Debian/Ubuntu), tesseract-ocr-chi-sim

Case one: the pdftotext command converts non-scanned PDFs. It is free and convenient, but the formatting and fonts are lost.

Case two: pdftoppm + tesseract can convert scanned PDFs.
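One way to decide which case applies is to run pdftotext first and check whether it produced any real text. The following is only a sketch: the function name has_text and the file name probe.txt are made up for illustration, and it assumes that a scanned PDF yields output containing nothing but whitespace and page-break characters.

```shell
# Succeeds if the given text file contains at least one
# non-whitespace character (form feeds count as whitespace).
has_text() {
    grep -q '[^[:space:]]' "$1"
}

# Hypothetical usage:
#   pdftotext name.pdf probe.txt
#   if has_text probe.txt; then echo "case one"; else echo "case two"; fi
```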


Case one, operation: pdftotext name.pdf new.txt

Case two, first step: pdftoppm name.pdf new generates new-1.ppm, new-2.ppm, and so on, one per page.

Second step: tesseract new-1.ppm result generates result.txt. A script can convert every page this way, and the results are finally merged into one txt.
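The two steps can be sketched as a simple sequential loop. This is a minimal version, not the script used below: the function name ocr_to_one_file is made up, and it relies on tesseract accepting stdout as the output base so that every page lands in one file. It also assumes that the shell's glob order matches the page order, which is worth checking on documents with ten or more pages.

```shell
# $1: image prefix passed to pdftoppm (e.g. new)
# $2: the single text file to write
ocr_to_one_file() {
    for page in "$1"-*.ppm; do
        [ -e "$page" ] || continue    # the glob matched nothing
        tesseract "$page" stdout      # print this page's text
    done > "$2"
}

# Hypothetical usage:
#   pdftoppm name.pdf new
#   ocr_to_one_file new name.txt
```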


My script is as follows:

Step one: pdftoppm -r 450 -freetype yes test.pdf b. In my tests, recognition was better at 450 DPI, and the ppm files are not too large, around 60M each.

Step two: the script

for i in b-*.ppm    # mind the path
do
    n=1
    while [ "$n" -eq 1 ]    # wait until a slot is free
    do
        # No more than four concurrent jobs; adjust to taste, and watch memory and CPU.
        # The count includes the grep process itself.
        num=$(ps aux | grep tesser | wc -l)
        if [ "$num" -le 4 ]
        then
            # -l chi_sim tells tesseract the content is Chinese.
            # This produces one txt per page; do not forget to merge them at the end.
            tesseract "$i" "$i" -l chi_sim &
            n=0
        else
            sleep 3
        fi
    done
done
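The final merge can be sketched as follows. Because the script starts each tesseract job in the background, a wait is needed before merging, and the per-page files are named after the images (b-1.ppm.txt and so on). The function name merge_pages is made up, and the sketch sorts by page number explicitly so that page 10 does not land before page 2 if the numbers are not zero-padded. It assumes file names without tabs or newlines.

```shell
# $1: image prefix (e.g. b), $2: merged output file
merge_pages() {
    : > "$2"                              # start with an empty file
    for f in "$1"-*.ppm.txt; do
        [ -e "$f" ] || continue           # the glob matched nothing
        n=${f#"$1"-}                      # strip the prefix ...
        n=${n%.ppm.txt}                   # ... and the suffix: page number
        printf '%s\t%s\n' "$n" "$f"
    done | sort -n | cut -f2- |
    while IFS= read -r f; do
        cat "$f" >> "$2"                  # append pages in numeric order
    done
}

# Hypothetical usage, after the loop above:
#   wait                      # let the background jobs finish
#   merge_pages b test.txt
```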



