An in-depth look at recognizing Chinese with Tesseract-OCR and training your own font library

Source: Internet
Author: User

The previous article (www.cnblogs.com/wj-1314/p/9428909.html) covered recognizing English text in images with Tesseract-OCR, and the results looked good. This article goes a step further and uses Tesseract-OCR to recognize Chinese text in images.

First, prepare the Chinese language data

Download chi_sim.traineddata, the Simplified Chinese language data file; Tesseract needs it in order to recognize Chinese. Then put it in the tessdata folder of the Tesseract-OCR installation. (Note: make sure the language data you download matches your Tesseract version.)

Why stress the version? I will describe my own silly mistake here so that you do not fall into the same trap.

The previous article only recognized English, so for this one I downloaded the Chinese language data, as follows:

I could not work out why, but it kept failing. The error was as follows:

I tried all kinds of fixes, including reinstalling the library and configuring environment variables, but none of them solved the problem. At that point I started to suspect the Tesseract version, so I decided to upgrade to w64-v4.0.0 and try again. The download links are below.

Tesseract download address: digi.bib.uni-mannheim.de/tesseract/

Tesseract-OCR language data download address: github.com/tesseract-ocr/tesseract/wiki/data-files

Tesseract GitHub wiki address: github.com/tesseract-ocr/tesseract/wiki

After a day of struggling, I stumbled on the cause on Tesseract's GitHub page, and I have to say I felt very stupid. Please see:

In other words, each Tesseract version needs its own Chinese language package, and I had installed the wrong one. That is why the error kept appearing and nothing I tried fixed it. I will not be that careless next time.
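To avoid repeating my mistake, it helps to check which Tesseract version is actually installed before picking a language package. Below is a minimal Python sketch of that check. The helper names are my own, and it assumes the executable is on the PATH and accepts the -v option; it only reports the major version and does not decide which package is the right one.

```python
import re
import subprocess


def parse_major_version(version_output):
    """Extract the major version number from the text printed by `tesseract -v`."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    if match is None:
        raise ValueError("could not find a Tesseract version in the output")
    return int(match.group(1))


def installed_major_version(executable="tesseract"):
    """Run `tesseract -v` and return the major version as an int."""
    result = subprocess.run(
        [executable, "-v"], capture_output=True, text=True, check=True
    )
    # Some releases print the version to stderr, so look at both streams.
    return parse_major_version(result.stdout + result.stderr)
```

For example, parse_major_version("tesseract v4.0.0.20181030") gives 4 (the version string here is only an illustration). Compare that number with the version the language data was built for before copying the file into tessdata.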

Second, prepare the font training tool

Download jTessBoxEditor, which is used to train the font library.

It can be found through Baidu, so I will not go into detail (if you cannot find it, leave me a message). After downloading, it looks like this:

Third, install the Java virtual machine (long live Java)

If you have only just started with the Java language and want to keep studying it, this section explains how to install the Java JDK, which is your first step in Java.

First, download the Java JDK (JDK stands for Java Development Kit, the software development kit for the Java language). Currently the latest JDK version is 1.8. Java originally belonged to Sun, which was later acquired by Oracle, so the JDK is downloaded from the Oracle website: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. Open this URL and you will see the page below.

The page above shows the Java JDK downloads provided by Oracle. There are two radio buttons, and the default is not to accept the license; you must accept it before you can download the JDK. Then choose the JDK that matches your operating system and whether it is 32-bit or 64-bit. Below the JDK downloads, Oracle also provides Demos and Samples for the JDK, which are worth downloading if you are interested.

Here I download the JDK for 64-bit Windows. The picture below shows the downloaded installer.

Double-click the JDK installation package and click Next.

The public JRE is not installed here. The public JRE is a standalone JRE that is installed separately, under a different path on Windows. It registers the Java runtime environment with the browser and the system, so that any application on the system can use it. However, there is now little occasion to run applets in web pages, and the JRE inside the JDK directory is fully functional, so it is common to skip the public JRE. If you do not want to install to the default path, you can change the directory.

Click Next and the installation progress bar appears.

Open a command prompt and enter java -version to view your Java version. If the version is printed, the JDK was installed successfully.

Fourth, the effect of recognizing Chinese

1, make any picture containing Chinese characters. The picture I made is as follows:

2, recognize it with the Chinese language data. The program is as follows:

import pytesseract
from PIL import Image

# Open the picture
image = Image.open('07.jpg')
# Load the image once to prevent errors; this can be omitted
image.load()
# Call show() to display the picture; this can be omitted when not debugging
image.show()
# Recognize the text with the Simplified Chinese language data
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)

  

3, the result of recognizing with the Chinese language data is as follows:

The result is not ideal. To get better results we need to train our own font library, which is what I do next.

Fifth, train your own library

1, convert the image to TIF format, which is used to generate the box file later. You can do this by opening the image in Paint and saving it as TIF.

Rename the picture; this is required.

The TIF file naming format is [lang].[fontname].exp[num].tif
lang is the language
fontname is the font
For example, to train a custom library named myfontlab with the font name normal, we rename the picture file to myfontlab.normal.exp0.jpg and then convert it to TIF.
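The naming rule is easy to get wrong by hand, so here is a small Python sketch that builds the name and, optionally, does the TIF conversion. The function names are my own. The conversion assumes the Pillow library is installed; I used Paint myself, so treat this as one possible alternative rather than the required way.

```python
def training_image_name(lang, fontname, num, ext="tif"):
    """Build a file name of the form [lang].[fontname].exp[num].<ext>."""
    return f"{lang}.{fontname}.exp{num}.{ext}"


def convert_to_tif(src_path, lang, fontname, num):
    """Save the image at src_path as a TIF named by the convention above."""
    # Pillow is imported here so the naming helper works without it installed.
    from PIL import Image

    dst_path = training_image_name(lang, fontname, num)
    Image.open(src_path).save(dst_path)  # format is inferred from the .tif suffix
    return dst_path


print(training_image_name("myfontlab", "normal", 0))  # myfontlab.normal.exp0.tif
```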

  

2, generate the box file

tesseract myfontlab.normal.exp0.jpg myfontlab.normal.exp0 -l chi_sim batch.nochop makebox

 

The box file and the corresponding TIF must be in the same directory, otherwise the image will not open in the later steps.
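The box file is plain text, so it can be inspected with a few lines of Python before opening it in the editor. The sketch below assumes the usual layout of one box per line: the character, then left, bottom, right and top coordinates, then the page number, separated by spaces, in a UTF-8 file. The Box type and function names are my own.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line of a box file: symbol, four coordinates, page number."""
    # Split from the right so that the symbol itself is kept intact.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def read_box_file(path):
    """Read every non-empty line of a box file into a list of Box tuples."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]
```

Calling len(read_box_file("myfontlab.normal.exp0.box")) gives the number of boxes Tesseract found, which is a quick way to notice an empty box file.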

 

 

3, open jTessBoxEditor to correct errors and train

Open train.bat

Open the TIF file with jTessBoxEditor.jar and modify the box file according to the actual situation

Find the TIF image, open it, and correct the boxes.

4, train and generate the .tr file.

Just enter the command at the command line.

tesseract myfontlab.normal.exp07.jpg myfontlab.normal.exp07 nobatch box.train

Generate the unicharset file

unicharset_extractor myfontlab.normal.exp07.box

  

I had already made the corrections at this point, but 1 character still could not be recognized. The reported error did not seem to match the actual situation, and I do not know whether it is a bug. In the final result the character "one" was not recognized.

5, create a new font_properties file
Its content is normal 0 0 0 0 0, which indicates the default normal font
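The font_properties file can also be written from Python. This is a sketch under the assumption that each line is the font name followed by five 0/1 flags in the order italic, bold, fixed, serif, fraktur; the function names are my own.

```python
def font_properties_line(fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Format one font_properties entry: the font name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname, *(str(int(bool(flag))) for flag in flags)])


def write_font_properties(path, fontnames):
    """Write one all-zero entry per font name to the given file."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for name in fontnames:
            handle.write(font_properties_line(name) + "\n")


print(font_properties_line("normal"))  # normal 0 0 0 0 0
```

The commands in this step read the file as font_properties.txt, so write_font_properties("font_properties.txt", ["normal"]) would produce a file with that name.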

 

Run the commands

shapeclustering -F font_properties.txt -U unicharset myfontlab.normal.exp07.tr

mftraining -F font_properties.txt -U unicharset -O unicharset myfontlab.normal.exp07.tr

cntraining myfontlab.normal.exp07.tr
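Typing these commands one by one is error-prone, so the commands from steps 4 and 5 can be collected in a short Python script. This is only a sketch: the function names are my own, it assumes the Tesseract training tools are on the PATH, and it writes the options as -F, -U and -O.

```python
import subprocess


def training_commands(image, base, font_properties="font_properties.txt"):
    """Return the command lines from steps 4 and 5, in the order they are run."""
    tr_file = f"{base}.tr"
    return [
        ["tesseract", image, base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", font_properties, "-U", "unicharset", tr_file],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", "unicharset", tr_file],
        ["cntraining", tr_file],
    ]


def run_all(commands):
    """Run each command in turn and stop at the first one that fails."""
    for command in commands:
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

With the file names used in this article, the call would be run_all(training_commands("myfontlab.normal.exp07.jpg", "myfontlab.normal.exp07")). Because check=True raises an error on a non-zero exit code, a failing step is not silently skipped.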

  

Five files are generated in the directory. Rename them by adding the prefix normal. in front of each of the five file names.
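The renaming can be done in one go with Python. In this sketch the five file names are my assumption about what the training tools leave in the directory (unicharset, inttemp, pffmtable, normproto, shapetable); check them against your own directory first. The function name is also my own.

```python
from pathlib import Path

# Assumed names of the five training outputs; check them against your directory.
GENERATED_FILES = ("unicharset", "inttemp", "pffmtable", "normproto", "shapetable")


def add_prefix(directory, prefix="normal"):
    """Rename each generated file to <prefix>.<name>, skipping missing ones."""
    renamed = []
    for name in GENERATED_FILES:
        source = Path(directory) / name
        if source.is_file():
            target = source.with_name(f"{prefix}.{name}")
            source.replace(target)
            renamed.append(target.name)
    return renamed
```

add_prefix(".") renames the files in the current directory and returns the new names, so it is easy to see whether all five were found.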

6, run combine_tessdata normal.

This merges the five files. Afterwards, normal.traineddata in the directory is the trained font file

combine_tessdata normal.

  

The trained font file is shown below:

Sixth, test the font

1, copy normal.traineddata to the "tessdata" directory under the Tesseract-OCR program directory.

2, in the Tesseract-OCR program directory, run

tesseract.exe myfontlab.normal.exp07.jpg out -l normal
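The copy in step 1 can be scripted so that it is harder to put the file in the wrong place. This is a minimal Python sketch; the function name is my own, and the tessdata path in the usage note is only a hypothetical install location.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_path, tessdata_dir):
    """Copy a .traineddata file into tessdata and return the installed path."""
    source = Path(traineddata_path)
    if source.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {source.name}")
    target = Path(tessdata_dir) / source.name
    shutil.copy2(source, target)
    return target
```

For example, install_traineddata("normal.traineddata", r"C:\Program Files\Tesseract-OCR\tessdata") copies the file trained above; replace the second argument with the tessdata folder of your own installation.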

The recognized text is saved in the following file.

Much of this is available online, but most descriptions are neither detailed nor complete. Here I have described the method and steps for training a font library with Tesseract-OCR one step at a time, and I have tested them myself without problems.

Seventh, how to train samples for Tesseract 3.02.02 with jTessBoxEditor

After Tesseract generates the .box file, you need to correct it with the jTessBoxEditor tool. The steps for using jTessBoxEditor are as follows.

1, load the .tif file to be corrected

The contents of the box file are loaded into jTessBoxEditor as well. If this part is empty, no .box file was generated! For example:

2, the loading steps are as follows:

The pictures here are borrowed from other users for convenience. If this infringes your rights, please contact me and I will delete them promptly.

3, correct the text

When one character is recognized as two, hold down the CTRL key to select both boxes, then click Merge to merge them!

Correction is mainly a matter of adjusting box coordinates. Note that you need to select the character first before you can split it.

4, how to remove blank boxes

Some gaps may also be mistaken by jTessBoxEditor for a character and enclosed in a blue box.

Select such a box and delete it!

5, summary

Normally each character has its own blue box. If two adjacent characters are not boxed, then even after adding a blue box with Insert, the final recognition still has problems, and I did not understand whether I was doing something wrong! In the end I found that the two characters were too close together to be told apart. At my boss's suggestion, I moved the two characters a little further apart, and they were then boxed normally! (If there is a better way, please point it out, thank you.)

Save after the modifications are complete! Here I corrected the samples one picture at a time, and every picture needs the same kind of correction. I do not know whether there is a way to make the changes in batch.

Before training on a picture, it is best to preprocess it with OpenCV, for example by binarizing it, which removes some of the interference! Note, however, that the same preprocessing must also be applied to an image before it is recognized! This raises the recognition rate!
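To make the "same preprocessing for training and recognition" point concrete, here is a sketch in plain Python rather than OpenCV. It treats an image as nested lists of grayscale values, which is a simplification of my own, and uses a fixed threshold of 128, which is an arbitrary assumption. With OpenCV the thresholding step itself would normally be done with cv2.threshold.

```python
def binarize(gray_rows, threshold=128):
    """Map each grayscale value above the threshold to 255 and the rest to 0."""
    return [[255 if value > threshold else 0 for value in row] for row in gray_rows]


def preprocess(gray_rows):
    """Single preprocessing entry point shared by training and recognition."""
    return binarize(gray_rows)


# The same function is applied to a training sample and to an image to recognize.
training_sample = preprocess([[12, 200], [130, 90]])
image_to_recognize = preprocess([[40, 250], [128, 129]])
print(training_sample)     # [[0, 255], [255, 0]]
print(image_to_recognize)  # [[0, 255], [0, 255]]
```

The point of routing both cases through one preprocess function is that any later change to the preprocessing automatically applies to training images and to images being recognized alike.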

Eighth, how to set the font in the software: set a Chinese font under Settings > Font
