Converting a PDF to text comes in two cases. If the PDF is not scanned (it has a real text layer), conversion is quick, recognition is essentially 100%, and plenty of software can do it. If it is a scanned PDF, it is more troublesome and needs OCR (optical character recognition).
Under Linux:
Required packages: poppler-utils, tesseract, tesseract-ocr-chi-sim
Case one: the pdftotext command converts non-scanned PDFs. It is free and convenient, but the layout and fonts are lost.
Case two: pdftoppm + tesseract can convert scanned PDFs.
Case one usage: pdftotext name.pdf new.txt
Case two usage, step one: pdftoppm name.pdf new generates new-1.ppm, new-2.ppm, and so on, one file per page;
Step two: tesseract new-1.ppm result generates result.txt. A script can run this for every page and finally merge everything into a single txt.
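To decide which case a given PDF falls into, one option is to check whether pdftotext extracts any non-whitespace characters at all. The following is only a rough sketch: has_text_layer is a hypothetical helper name, and looking at page 1 alone is an assumption that fails for files whose first page is a cover image.

```shell
# Sketch: succeed if the first page of a PDF yields any extractable text.
# has_text_layer is a hypothetical helper, not part of poppler-utils.
has_text_layer() {
    # -l 1 stops after page 1; "-" sends the extracted text to stdout
    chars=$(pdftotext -l 1 "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c)
    [ "$chars" -gt 0 ]
}

# Usage: pick the conversion route for name.pdf
# if has_text_layer name.pdf; then
#     pdftotext name.pdf new.txt     # case one
# else
#     pdftoppm name.pdf new          # case two, then tesseract per page
# fi
```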
My script is as follows:
Step one: pdftoppm -r 450 -freetype yes test.pdf b
In my tests, recognition was better at 450 DPI, and the ppm files are not very large, around 60M each.
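The per-page size follows from the resolution, so it can be estimated before rendering. The sketch below assumes an uncompressed 24-bit PPM (3 bytes per pixel), ignores the few header bytes, and uses a hypothetical helper named ppm_bytes. The exact pixel rounding pdftoppm applies may differ slightly.

```shell
# Sketch: estimate the size in bytes of one raw 24-bit PPM page.
# ppm_bytes is a hypothetical helper; arguments are page width and
# height in inches, then the resolution in DPI.
ppm_bytes() {
    awk -v w="$1" -v h="$2" -v dpi="$3" \
        'BEGIN { printf "%d\n", int(w * dpi) * int(h * dpi) * 3 }'
}

# Usage: an A4 page (8.27 x 11.69 inches) at 450 DPI
# ppm_bytes 8.27 11.69 450
```

For an A4 page at 450 DPI this comes out a little under 60 million bytes, which matches the size observed above. Lowering -r shrinks the files quadratically, at the cost of recognition quality.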
Step two: the script
for i in b-*.ppm    # mind the path
do
    n=1
    while [ $n -eq 1 ]    # poll until a job slot is free
    do
        # Keep no more than four concurrent jobs; adjust to suit your memory and CPU.
        # The count usually includes the grep process itself.
        num=$(ps aux | grep tesser | wc -l)
        if [ $num -le 4 ]
        then
            # -l chi_sim tells tesseract the content is Simplified Chinese.
            # This produces one txt per page; do not forget to merge them at the end.
            tesseract "$i" "$i" -l chi_sim &
            n=0
        else
            sleep 3
        fi
    done
done
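The final merge could look roughly like this. It is a sketch under a few assumptions: merge_pages is a hypothetical helper name, result.txt is an example output name, the per-page files are named b-N.ppm.txt (tesseract appends .txt to the output base it is given), and sort -V from GNU coreutils is available to order page 2 before page 10.

```shell
# Sketch: merge per-page OCR output into one file, in page order.
# merge_pages is a hypothetical helper; "b" and result.txt are example names.
merge_pages() {
    prefix=$1
    out=$2
    : > "$out"    # start from an empty output file
    # sort -V (GNU coreutils) puts b-2 before b-10
    ls "$prefix"-*.ppm.txt | sort -V | while read -r page; do
        cat "$page" >> "$out"
    done
}

# Usage, appended to the same script so that wait sees the background jobs:
# wait
# merge_pages b result.txt
```

Calling wait first matters because the tesseract jobs run in the background, so the loop can finish while the last pages are still being recognized.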
Converting PDF to txt under Linux