How to do Japanese OCR with Tesseract (C# Implementation)

Source: Internet
Author: User

First, some background. Tesseract is an open-source OCR component aimed mainly at recognizing printed text; its handwriting recognition is poor. It supports many languages (Chinese, English, Japanese, Korean, and others) and is the strongest OCR component in the open-source world. Compared with ABBYY, the world's strongest OCR tool, there is still a gap, and the gap is obvious when image quality is poor.

Plenty of introductions to this component exist online, but they all cover English recognition. Recognizing Chinese, Japanese, or similar scripts requires not only a different language pack but also some special Tesseract settings; without them the recognition rate is very low. Here I share my experience doing Japanese OCR with Tesseract.

The first step is to download the Tesseract component. The simplest way is to install it through NuGet in Visual Studio; select the first component in the search results.

The second step is to download the Japanese language pack. Because Google is not accessible from mainland China, the official website cannot be opened to download language packs directly. Here is the address of the file, which can be downloaded with Thunder (Xunlei):

http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.jpn.tar.gz

After the download is complete, extract the language pack and place its files in the tessdata folder.

The preparation is now complete and you can start writing code.

The third step is to initialize the Tesseract engine with the following code. TesseractEngine is disposable, so in real code wrap it in a using statement.

var engine = new TesseractEngine(@"tessdata folder path", "jpn", EngineMode.Default);

The fourth step is to set the OCR parameters. For an explanation of each parameter, refer to the official website:

Useful parameters for Japanese and Chinese

Some Japanese Tesseract users found these parameters helpful for increasing tesseract-ocr (3.02) accuracy for Japanese:

Name                            Suggested value  Description
chop_enable                     T                Chop enable.
use_new_state_cost              F                Use new state cost heuristics for segmentation state evaluation.
segment_segcost_rating          F                Incorporate segmentation cost in word rating?
enable_new_segsearch            0                Enable new segmentation search path. It could solve the problem of dividing one character into two characters.
language_model_ngram_on         0                Turn on/off the use of the character ngram model.
textord_force_make_prop_words   F                Force proportional word segmentation on all rows.
edges_max_children_per_outline  40               Max number of children inside a character outline. Increase this value if some kanji characters are not recognized (rejected).

Here is the code:

engine.SetVariable("chop_enable", "F");
engine.SetVariable("enable_new_segsearch", 0);
engine.SetVariable("use_new_state_cost", "F");
engine.SetVariable("segment_segcost_rating", "F");
engine.SetVariable("language_model_ngram_on", 0);
engine.SetVariable("textord_force_make_prop_words", "F");
engine.SetVariable("edges_max_children_per_outline", 40);

Note that my chop_enable value differs from the official recommendation. With the official setting (T), I found that many characters could not be recognized.
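
The same settings can also be kept in one table and applied in a loop, which makes it easier to see when the engine rejects a name (for example because of a typo). The following is only a sketch: the class name JapaneseOcrSettings and the Apply helper are made up for illustration and are not part of the Tesseract wrapper. It assumes the wrapper's SetVariable(string, string) overload returns false when a variable cannot be set, and it passes the numeric values as strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the Tesseract wrapper): keeps the
// Japanese settings from this article in one place.
public static class JapaneseOcrSettings
{
    // Values used in this article. chop_enable is "F" here, unlike the
    // "T" suggested on the official website.
    public static readonly IReadOnlyList<KeyValuePair<string, string>> Values = new[]
    {
        new KeyValuePair<string, string>("chop_enable", "F"),
        new KeyValuePair<string, string>("enable_new_segsearch", "0"),
        new KeyValuePair<string, string>("use_new_state_cost", "F"),
        new KeyValuePair<string, string>("segment_segcost_rating", "F"),
        new KeyValuePair<string, string>("language_model_ngram_on", "0"),
        new KeyValuePair<string, string>("textord_force_make_prop_words", "F"),
        new KeyValuePair<string, string>("edges_max_children_per_outline", "40"),
    };

    // Calls the supplied setter once per setting and returns the names
    // for which the setter reported failure.
    public static List<string> Apply(Func<string, string, bool> setVariable)
    {
        var rejected = new List<string>();
        foreach (var pair in Values)
        {
            if (!setVariable(pair.Key, pair.Value))
            {
                rejected.Add(pair.Key);
            }
        }
        return rejected;
    }
}
```

With the engine from step three, the call would be JapaneseOcrSettings.Apply(engine.SetVariable); an empty result means every variable was accepted.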

The fifth step is to run the recognition.

var page = engine.Process(p);
var testtext = page.GetText();
var c = page.GetMeanConfidence();

The first line of code returns a Page object, from which you can obtain both the recognized text and its position. The position is useful for documents without a fixed layout: a field can be located dynamically by searching for a keyword.
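
As a rough illustration of reading positions from the Page object, the sketch below walks the result line by line and prints each line's bounding box. It assumes a recent version of the .NET wrapper (Pix.LoadFromFile, Page.GetIterator, TryGetBoundingBox); the file names and the class name are placeholders, not taken from the original code.

```csharp
using System;
using Tesseract;

class TextLinePositions
{
    static void Main()
    {
        // Placeholder paths: point these at your tessdata folder and scanned image.
        using (var engine = new TesseractEngine(@"./tessdata", "jpn", EngineMode.Default))
        using (var image = Pix.LoadFromFile("scan.png"))
        using (var page = engine.Process(image))
        using (var iter = page.GetIterator())
        {
            iter.Begin();
            do
            {
                // Bounding box of the current text line, in pixels.
                Rect box;
                if (iter.TryGetBoundingBox(PageIteratorLevel.TextLine, out box))
                {
                    string text = iter.GetText(PageIteratorLevel.TextLine) ?? "";
                    Console.WriteLine("({0},{1})-({2},{3}): {4}",
                        box.X1, box.Y1, box.X2, box.Y2, text.Trim());
                }
            } while (iter.Next(PageIteratorLevel.TextLine));
        }
    }
}
```

To locate a field by keyword, one would check whether the line text contains the keyword and then derive the field's region from that line's box.
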
The example performs full-page recognition, but full-page recognition often gives mediocre quality. It is better to pass a recognition region and set the PageSegMode parameter to PageSegMode.SingleBlock (for multiple lines of text of the same size) or PageSegMode.SingleLine (for a single line of text of the same size).
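
A minimal sketch of region-based recognition follows, again assuming a recent version of the .NET wrapper in which Process accepts a Pix, a Rect, and a PageSegMode. The region coordinates, file names, and class name are invented for illustration and must be adapted to the actual document.

```csharp
using System;
using Tesseract;

class RegionOcrSketch
{
    static void Main()
    {
        // Placeholder paths: point these at your tessdata folder and scanned image.
        using (var engine = new TesseractEngine(@"./tessdata", "jpn", EngineMode.Default))
        using (var image = Pix.LoadFromFile("scan.png"))
        {
            // Made-up region: x = 100, y = 200, width = 600, height = 80 pixels.
            var region = new Rect(100, 200, 600, 80);

            // SingleLine treats the region as one line of text;
            // use SingleBlock for several lines of the same size.
            using (var page = engine.Process(image, region, PageSegMode.SingleLine))
            {
                Console.WriteLine(page.GetText());
            }
        }
    }
}
```

The page is disposed before the engine processes anything else, since the wrapper allows only one active Page per engine at a time.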

The second and third lines return the recognized text and the mean confidence, respectively. In practice, I found the confidence value not particularly useful: whether the result is right or wrong, it stays around 0.7, and sometimes a higher confidence comes with a wrong result.

After these steps, Japanese OCR is complete. For the code above to run, however, the Visual C++ 2012 runtime must also be installed; otherwise an error occurs.

I tested scanned images with the method above and found the recognition accuracy fairly high, especially after specifying a region and the PageSegMode parameter. The Japanese language data still makes some basic mistakes, though, such as recognizing the digit "1" as the kanji "一" (one). Fixing this requires training Japanese from scratch, which is a very large amount of work. This is a real weakness of Tesseract: it should support adding training content on top of the existing trained data, or the official website should provide the box files and training TIF images for developers to download.
