Compile and install Tesseract-OCR on CentOS
2012-01-30
York_gu
It has been nearly three months since my previous post, "Automatic identification of simple verification codes using gocr". Recently I have been cracking verification codes again, but this time the codes are more complicated and gocr is not powerful enough. Its accuracy on digit-only codes is indeed high, but it cannot handle codes that mix digits and letters. So this time I switched to the more advanced Tesseract-OCR.
For the most popular free Linux distribution, the yum repositories that come with CentOS are really poor: they do not even include Tesseract-OCR.
To install tesseract, first install the library it depends on, leptonica.
    wget [download URL for leptonica-1.68.tar.gz]
    tar xvf leptonica-1.68.tar.gz
    cd leptonica-1.68
    ./configure; make; make install
Next, compile and install tesseract from source. At the time of writing, the latest tesseract version is 3.0.1.
    wget [download URL for tesseract-3.0.1.tar.gz]
    tar xvf tesseract-3.0.1.tar.gz
    cd tesseract-3.0.1
    ./autogen.sh
    mkdir m4; ./configure
    make; make install
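The ./autogen.sh step only works if the autotools are already installed. As a rough sketch (the function name missing_tools is made up for illustration and is not part of tesseract), a check like this could be run before starting the build:

```shell
# Print the name of every given command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call for the tools this build needs:
# missing_tools automake libtool
```

If the call prints anything, the missing packages can be installed with yum install automake libtool before running ./autogen.sh again.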
Compiling tesseract requires automake and libtool, both of which can be installed directly through yum. Beyond that, the tesseract source contains one trap that has to be fixed by hand before it will compile:
    vim ccutil/strngs.h
    # Delete the garbled character at the very start of the first line; vim displays it as <feff> (a UTF-8 byte order mark).
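The same fix can probably be done without opening an editor. The sketch below assumes the stray character is a UTF-8 byte order mark, that is, the three bytes EF BB BF at the start of the file. The function names has_bom and strip_bom are my own, chosen for illustration.

```shell
# Succeeds (exit status 0) if the file starts with the bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Rewrite the file without its leading BOM; files without one are left alone.
strip_bom() {
    if has_bom "$1"; then
        # tail -c +4 outputs everything from the fourth byte onward.
        tail -c +4 "$1" > "$1.nobom" && mv "$1.nobom" "$1"
    fi
}
```

From inside the tesseract-3.0.1 directory this would be called as strip_bom ccutil/strngs.h. Because it checks for the marker first, running it twice does no harm.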
Once compilation and installation are complete, you still need to install the language pack for the language you want to recognize. Installing one is simple: unpack it and move its contents into the tessdata directory.
    tar xvf tesseract-ocr-3.01.eng.tar.gz
    mv tesseract-ocr/tessdata/* /usr/local/share/tessdata/
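To confirm the language pack ended up in the right place, something like the following sketch may help. It assumes the English data is stored as eng.traineddata in the tessdata directory used above. The helper has_language and the TESSDATA_DIR variable are names I made up for this example.

```shell
# Directory the language files were moved into (default /usr/local prefix).
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

# Succeeds if the trained data file for the given language code is present.
has_language() {
    [ -f "$TESSDATA_DIR/$1.traineddata" ]
}
```

A first recognition run could then look like has_language eng && tesseract captcha.png result -l eng, where captcha.png is a placeholder image name. tesseract writes the recognized text to result.txt.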