"This article was written on July 5, 2018"
Tess4J is a Java JNA wrapper for Tesseract. This article describes the steps and pitfalls of using Tess4J on CentOS 7. Before getting started, here is a brief introduction to the technologies involved.
A little background
Tesseract
Tesseract is a well-known open source OCR engine that supports more than 100 languages out of the box and can be trained to support more. It originated at HP in the mid-1980s and was open-sourced in 2005; Google has sponsored its development since 2006. At the time of writing, the latest stable version is 3.05.01, released on June 1, 2017. Version 4.0, which adds an engine based on LSTM (long short-term memory, a kind of recurrent neural network), is under active development; its latest release is 4.0.0-beta.3, dated June 26, 2018. Tesseract is written in C++.
Site:
https://github.com/tesseract-ocr/tesseract
Leptonica
As an OCR engine, Tesseract cannot avoid image processing, and most of the image processing it uses is provided by Leptonica. Leptonica is a library with many image processing and image analysis functions.
Site:
http://www.leptonica.com/
Java JNA Wrapper
JNA stands for Java Native Access. As the name implies, it is a library that lets Java code call native libraries of the operating system. Native calls from Java usually bring JNI to mind, but JNI is complex enough to be daunting. JNA takes a more natural approach: native functions are declared as ordinary Java interfaces, and no JNI glue code has to be written.
Site:
https://github.com/java-native-access/jna
Tess4J
Tess4J exposes the Tesseract API to Java through JNA. It also bundles the Tesseract DLLs for 32-bit and 64-bit Windows and some sample images, so using Tesseract from Java on Windows is very convenient. On Linux, macOS, and other operating systems, you have to build Tesseract yourself before you can use Tess4J.
In other words, the native part of Tess4J is not cross-platform; it works out of the box only on Windows.
That is the purpose of this article: to record the steps and pitfalls of using Tess4J on Linux.
The versions used in this article
Why emphasize versions separately? Anyone who has spent enough time in the open source trenches knows that the quality of most open source projects (working features, correct documentation, timely updates) is fairly average. To get by, you need the ability to separate true from false in a flood of information, the skill to ask the community for help, strong hands-on ability, and an indomitable spirit...
When you search Google for a technical problem, a large share of the results do not work; they waste a lot of time or even lead you astray. After talking with many authors, I found that in most cases the cause is an article that describes its solution incompletely or is simply not rigorous.
So I believe every shared practice should meet one basic requirement: it must be repeatable. I will therefore state as precisely as I can the software and environment versions I used, and I hope this helps.
Tess4J: 4.0.2
Tesseract: 4.0.0-beta.1
Leptonica: 1.76.0
JDK: 1.8 Update 102, 64-bit
Operating environment: CentOS 7 (kernel 3.10.0-862.3.3.el7.x86_64), 64-bit
gcc: 4.8.5
clang: 3.4
Development environment: Windows, 64-bit
Why these versions?
The pitfall here is that you cannot simply pick the newest version of everything. I first used Tesseract 4.0.0-beta.3, but the JVM reported a fatal error and then exited. Judging from the error, some Tesseract function signatures had probably changed and no longer matched the ones Tess4J declares.
Remember the official documentation (https://github.com/tesseract-ocr/tesseract/wiki)?
The official wiki says that the Linux shared libraries (.so) can be installed from pre-built packages, and following the wiki installs the latest version automatically. I also tried installing an older version found through yum list, and even that so-called old version made Tess4J fail at runtime. (The RPM packages downloaded and automatically installed via yum are shown below.)
So the Tesseract version must be chosen to match the version that Tess4J was adapted to.
The changes in the last few versions are described in Tess4J's versionchanges.txt:
Version 4.0.0 (April 2018)
- Upgrade to Tesseract 4.0.0-beta.1 (45bb942)
- Update lept4j to 1.9.3 (Leptonica 1.75.3)
Version 4.0.1 (2 May 2018)
- Fix a path issue when extracting resources from JAR to temp directory on Windows Server
Version 4.0.2 (3 May 2018)
- Replace JNA string constant Platform.RESOURCE_PREFIX
- Update jai-imageio URL
- Update lept4j to 1.9.4
As you can see, the latest version of Tess4J supports only Tesseract 4.0.0-beta.1, which is how I arrived at the combination of versions listed above. Since there is no way to tell which of the pre-built packages were built from Tesseract 4.0.0-beta.1, the only option is to build it from source.
Build Tesseract
1 Modify Yum's repo
Perhaps my network environment is just poor: yum uses mirrors by default, but the vast majority of them were unreachable from my machine, so downloads spent most of their time on failed mirror attempts.
I therefore disabled yum's fastestmirror plugin (configured under /etc/yum/pluginconf.d) and modified CentOS-Base.repo.
2 Installing Prerequisite Packages
yum -y update
yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
yum group install -y "Development Tools"
Note: autoconf-archive is required, although the official instructions do not mention it.
3 Download the source code
Leptonica
http://www.leptonica.com/source/leptonica-1.76.0.tar.gz
Tesseract
https://codeload.github.com/tesseract-ocr/tesseract/tar.gz/4.0.0-beta.1
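After the download completes, you should have the two archives that the following steps extract: leptonica-1.76.0.tar.gz and 4.0.0-beta.1.tar.gz.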
4 Installing Leptonica
tar -zxvf leptonica-1.76.0.tar.gz
cd leptonica-1.76.0
./autobuild
./configure
make -j
make install
5 Installing Tesseract
tar -zxvf 4.0.0-beta.1.tar.gz
cd tesseract-4.0.0-beta.1/
./autogen.sh
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
make install
ldconfig
6 Confirm Installation
After a lengthy compilation, Tesseract is installed.
Run the following command:
tesseract -v
If it prints the version information, the installation is complete.
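Because everything in this article depends on the exact Tesseract version, it can be useful to repeat this check from Java. The following is only a sketch: it assumes that the first line printed by tesseract -v has the form "tesseract <version>", and the class and method names are hypothetical (they are not part of Tess4J).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tess4J.
final class TesseractVersionCheck {

    // Extracts the version from a line such as "tesseract 4.0.0-beta.1".
    // Returns null if the line does not have the assumed form.
    static String parseVersion(String firstLine) {
        if (firstLine == null) {
            return null;
        }
        String line = firstLine.trim();
        String prefix = "tesseract ";
        if (!line.startsWith(prefix)) {
            return null;
        }
        String version = line.substring(prefix.length()).trim();
        return version.isEmpty() ? null : version;
    }

    // Runs "tesseract -v" and returns the parsed version, or null if parsing fails.
    static String installedVersion() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("tesseract", "-v");
        pb.redirectErrorStream(true); // read stdout and stderr as one stream
        Process process = pb.start();
        String first;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            first = reader.readLine();
            while (reader.readLine() != null) {
                // drain the remaining output so the process can exit
            }
        }
        process.waitFor();
        return parseVersion(first);
    }

    public static void main(String[] args) throws Exception {
        String expected = "4.0.0-beta.1"; // the version this article builds
        String actual = installedVersion();
        System.out.println("installed: " + actual + ", expected: " + expected);
        if (!expected.equals(actual)) {
            System.exit(1);
        }
    }
}
```

Running main on the build machine exits with a non-zero status when the installed version differs from the one Tess4J 4.0.2 was adapted to.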
Get the .so libraries
In /usr/local/lib you can find libtesseract.so, the shared library that Tess4J depends on. libtesseract.so is actually a symbolic link to libtesseract.so.4.0.0. liblept.so is the Leptonica library, which Tesseract itself needs to call.
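To see which of these entries are symbolic links and which real files they point to, the directory can also be inspected from Java. This is a minimal sketch using only the JDK; the class name and the glob pattern are my own choices for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for inspecting a library directory.
final class SharedLibraryLister {

    // Maps each entry name matching the glob to the file name of its real target.
    // For a regular file, the name maps to itself.
    static Map<String, String> resolve(Path dir, String glob) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : stream) {
                Path real = entry.toRealPath(); // follows symbolic links
                result.put(entry.getFileName().toString(), real.getFileName().toString());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/usr/local/lib");
        Map<String, String> libs = resolve(dir, "lib{tesseract,lept}*.so*");
        for (Map.Entry<String, String> e : libs.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Entries whose name differs from the resolved name are links; the resolved names are the real files.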
Set the library (.so) file location for Tess4J
The Tess4J JAR contains the DLLs for Windows, which are located in the following locations:
At runtime, Tess4J extracts the native libraries for the current operating system from the JAR and loads them; on Linux these are .so files. By Tess4J's convention, the Linux library files are stored under linux-x86-64 at the root of the classpath.
Now that you have the .so files, there are two obvious options:
1 Modify tess4j-4.0.2.jar and store the .so files at the agreed path, so that Tess4J extracts them automatically at runtime.
However, this modifies the officially released Tess4J JAR, which makes future upgrades harder, so it is not recommended.
2 Place the linux-x86-64 directory at the root of the Java project's classpath, so that Tess4J can find the libraries at runtime. My Java project is built with Gradle, so I put these files under src/main/resources.
The files in the linux-x86-64 directory were copied from /usr/local/lib on the Linux system, with some of the symbolic links removed, as follows:
At this point, the library files for Tess4J are ready.
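A quick way to confirm that option 2 worked is to ask the class loader for the library resource before Tess4J does. The sketch below uses only the JDK. Its resourcePrefix method is a simplified approximation of the folder naming described above (it covers only Linux and Windows on x86), and the class and method names are hypothetical.

```java
import java.net.URL;
import java.util.Locale;

// Hypothetical helper; a simplified approximation of the folder naming, not the JNA code.
final class NativeResourceCheck {

    // Builds a folder name such as "linux-x86-64" from os.name and os.arch values.
    // Only the platforms discussed in this article are covered.
    static String resourcePrefix(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String family;
        if (os.startsWith("linux")) {
            family = "linux";
        } else if (os.startsWith("windows")) {
            family = "win32";
        } else {
            throw new IllegalArgumentException("OS not covered by this sketch: " + osName);
        }

        String cpu;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            cpu = "x86-64";
        } else if (arch.equals("x86") || arch.equals("i386")) {
            cpu = "x86";
        } else {
            throw new IllegalArgumentException("Architecture not covered by this sketch: " + osArch);
        }
        return family + "-" + cpu;
    }

    // True if the resource can be found by this class's class loader.
    static boolean isOnClasspath(String resourcePath) {
        URL url = NativeResourceCheck.class.getClassLoader().getResource(resourcePath);
        return url != null;
    }

    // Intended to be run on the Linux target machine.
    public static void main(String[] args) {
        String prefix = resourcePrefix(System.getProperty("os.name"), System.getProperty("os.arch"));
        String resource = prefix + "/libtesseract.so";
        System.out.println(resource + (isOnClasspath(resource) ? " found" : " NOT found")
                + " on the classpath");
    }
}
```

If the resource is reported as not found on CentOS 7, the linux-x86-64 directory is most likely not at the classpath root of the packaged application.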
Set the Tesseract data directory (tessdata) in Tess4J
At runtime Tesseract needs to load language training data, which by convention is placed in a directory named tessdata. However, Tess4J 4.0.2 handles this directory differently on Windows and on Linux.
In the code that initializes Tesseract, setDatapath is used to set the tessdata directory.
ITesseract instance = new Tesseract();
// Set the tessdata directory
instance.setDatapath("/path/to/tessdata");
Tesseract training data files are named "<language>.traineddata".
In my tests, on Windows you must point directly at the directory that contains the .traineddata files, whereas on Linux you must point at a directory that contains a folder named tessdata, which in turn holds the .traineddata files.
To illustrate:
On Windows: instance.setDatapath("lngData/tessdata");
On Linux: instance.setDatapath("lngData");
This is obviously a bug, but bugs are commonplace in open source projects.
To work around it, I added a small adaptation to the operating system in practice.
// Imports needed by this fragment
import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

ITesseract instance = new Tesseract();
File tessTrainedDataLoc;
if (SystemDetector.isWindows()) {
    // On Windows, point the data path directly at the directory that contains *.traineddata
    tessTrainedDataLoc = new File(System.getProperty("user.dir"), "lngData\\tessdata");
} else {
    // On Linux (such as CentOS 7), point the data path at the parent of the tessdata directory
    tessTrainedDataLoc = new File(System.getProperty("user.dir"), "lngData");
}
instance.setDatapath(tessTrainedDataLoc.getAbsolutePath());
The SystemDetector code used above:
import java.util.Properties;

public class SystemDetector {
    private static boolean isWindows = false;
    private static boolean isLinux = false;

    static {
        // Detect the operating system once, from the os.name system property
        Properties props = System.getProperties();
        String systemName = props.getProperty("os.name");
        if (systemName.toLowerCase().indexOf("windows") != -1) {
            isWindows = true;
        }
        if (systemName.toLowerCase().indexOf("linux") != -1) {
            isLinux = true;
        }
    }

    public static boolean isWindows() {
        return isWindows;
    }

    public static boolean isLinux() {
        return isLinux;
    }
}
That is all. A project developed on Windows that uses Tess4J can now run properly on Linux.
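Since a wrong data path only shows up when recognition is attempted, it may help to verify the directory layout before calling setDatapath. The sketch below uses only the JDK and encodes the behaviour I observed with Tess4J 4.0.2; the class name, method names, and the lngData layout are assumptions taken from my own project, not a Tess4J API.

```java
import java.io.File;

// Hypothetical helper; encodes the layout lngData/tessdata/<language>.traineddata.
final class TessdataLocator {

    // Directory to pass to setDatapath, following the observed Tess4J 4.0.2 behaviour:
    // the tessdata directory itself on Windows, its parent on Linux.
    static File datapathFor(File lngData, boolean windows) {
        return windows ? new File(lngData, "tessdata") : lngData;
    }

    // Location where "<language>.traineddata" is expected on either OS.
    static File trainedDataFile(File lngData, String language) {
        return new File(new File(lngData, "tessdata"), language + ".traineddata");
    }

    // Returns the data path after checking that the training data file exists.
    static File checkedDatapath(File lngData, String language, boolean windows) {
        File data = trainedDataFile(lngData, language);
        if (!data.isFile()) {
            throw new IllegalStateException("Missing training data: " + data.getAbsolutePath());
        }
        return datapathFor(lngData, windows);
    }

    public static void main(String[] args) {
        File lngData = new File(System.getProperty("user.dir"), "lngData");
        boolean windows = System.getProperty("os.name").toLowerCase().contains("windows");
        // "eng" is only an example language code
        System.out.println(checkedDatapath(lngData, "eng", windows).getAbsolutePath());
    }
}
```

The returned path can then be handed to instance.setDatapath in place of the hand-written branches, and a missing .traineddata file fails early with a readable message.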
About Tesseract training data
Tesseract's biggest advantage is that it works out of the box with training data for a large number of languages; in practice, you can add languages according to the kind of content you need to recognize.
But do not add too many: the wider the recognition scope, the slower it runs, and accuracy can suffer as well. It is recommended to specify the language and the type of content to be recognized, in order to strike a balance between accuracy and efficiency.
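Tesseract accepts several languages at once when their codes are joined with "+", for example eng+chi_sim. The small sketch below builds such a string from an explicit, deliberately short list; the class name is hypothetical and the language codes are only examples. The result is the kind of value that would be passed to Tess4J's setLanguage.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for building a Tesseract language string.
final class LanguageSpec {

    // Joins language codes with "+", keeping their order and dropping duplicates.
    static String of(List<String> languages) {
        Set<String> unique = new LinkedHashSet<>();
        for (String language : languages) {
            String code = language.trim();
            if (code.isEmpty()) {
                throw new IllegalArgumentException("empty language code");
            }
            unique.add(code);
        }
        if (unique.isEmpty()) {
            throw new IllegalArgumentException("at least one language is required");
        }
        return String.join("+", unique);
    }

    public static void main(String[] args) {
        // Keep the list short: every extra language widens the search and slows recognition
        System.out.println(of(Arrays.asList("eng", "chi_sim")));
    }
}
```

Keeping the list explicit in one place makes it easier to notice when it grows beyond what the documents actually contain.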
Tesseract training data can be obtained from:
https://github.com/tesseract-ocr/tessdata
Tesseract 4 uses LSTM, so there is also a repository called tessdata_best, which contains the most accurate LSTM-trained data for each language. (Recommended)
https://github.com/tesseract-ocr/tessdata_best
If you are not satisfied with Tesseract's results, you can also prepare your own data and train it. All Tesseract projects are on GitHub:
https://github.com/tesseract-ocr
"End of the full text"
Reprinting is welcome; please credit the source.
tess4j Linux Practice [FIX: Tess4j-native library (linux-x86-64/libtesseract.so) not found in resource path]