Successfully trained Tesseract on Slender Gold script (瘦金体)!

Source: Internet
Author: User
Tags: character set, tesseract, ocr

Successfully trained Tesseract to recognize Slender Gold script (24 characters :))

There are very good training files from the Early Modern OCR Project (eMOP). They appear to be the work of primalabs.
Tesseracttraining
Testing with Tesseract

There are plenty of resources online. A few of them lay out the steps clearly:
The article on using the jTessBoxEditor tool to train Tesseract 3.02.02 on samples and improve CAPTCHA recognition rates describes merging the sample images into a multi-page TIFF file. Attention: don't mix fonts in an image file (in a single .tr file, to be precise). It includes the console output of the intermediate steps, which you can check your own training output against; that is very useful. It also uses shapeclustering (some tutorials skip this tool, apparently without problems). It ends with a list of steps:
1. Merge the images
2. Generate the box file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
3. Correct the box file
4. Generate font_properties
echo fontyp 0 0 0 0 0 >font_properties
5. Generate the training file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
6. Generate the character set file
unicharset_extractor langyp.fontyp.exp0.box
7. Generate the shape file
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
8. Generate the clustered character feature file
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
9. Generate the character normalization feature file
cntraining langyp.fontyp.exp0.tr
10. Rename the files
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
11. Merge the training files to generate fontyp.traineddata
combine_tessdata fontyp.

The results of my training

Step 5 above is the really important one. The official website says the language used there does not matter, but in my tests the default English seemed to produce somewhat more errors.
tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 nobatch box.train
Tesseract Open Source OCR Engine V3.05.00dev with Leptonica
Page 1
Row xheight=86, but median xheight = 0.5
Row xheight=77, but median xheight = 0.5
Row xheight=68, but median xheight = 0.5
Row xheight=77, but median xheight = 0.5
fail!
Apply_boxes:boxfile Line 1/li ((73,490), (185,582)): failure! Couldn't find a matching blob
fail!
Apply_boxes:boxfile Line 2/Chicken ((226,480), (355,600)): failure! Couldn't find a matching blob
fail!
Apply_boxes:boxfile Line 5/Dew ((677,478), (803,603)): failure! Couldn't find a matching blob
fail!
Apply_boxes:boxfile Line 6/Qui-gon ((828,480), (951,592)): failure! Couldn't find a matching blob
Apply_boxes:
Boxes read from boxfile:24
Boxes failed Resegmentation:4
apply_boxes:unlabelled word at:bounding box= (54,610), (974,611)
Found good blobs.
Leaving unlabelled blobs in 0 words.
1 remaining unlabelled words deleted.
Generated Training data for words

In the end I chose to train with chi_sim.
tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 -l chi_sim nobatch box.train
Tesseract Open Source OCR Engine V3.05.00dev with Leptonica
Page 1
Row xheight=29.5, but median xheight = 58.1667
Row xheight=460, but median xheight = 58.1667
Row xheight=117.5, but median xheight = 58.1667
Row xheight=117.5, but median xheight = 58.1667
Row xheight=121, but median xheight = 58.1667
Row xheight=121, but median xheight = 58.1667
Row xheight=28.6667, but median xheight = 58.1667
Row xheight=44.6667, but median xheight = 58.1667
Row xheight=81.3333, but median xheight = 58.1667
Row xheight=29, but median xheight = 58.1667
Apply_boxes:
Boxes read from boxfile:24
apply_boxes:unlabelled word at:bounding box= ( -611,961) (0,1020)
apply_boxes:unlabelled word at:bounding box= ( -610,54), (-609,974)
apply_boxes:unlabelled word at:bounding box= (398,183), (459,250)
apply_boxes:unlabelled word at:bounding box= (371,184), (400,262)
apply_boxes:unlabelled word at:bounding box= ( -611,0) (0,58)
Found good blobs.
Leaving 3 unlabelled blobs in 0 words.
5 remaining unlabelled words deleted.
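
To compare the two box.train runs above at a glance, the counters can be pulled out of the console output. The following is only a minimal sketch: the function and class names are made up for illustration, and the patterns assume the log lines look like the ones quoted in this post (they are matched case-insensitively because the casing may differ between builds).

```python
import re
from dataclasses import dataclass


@dataclass
class BoxTrainSummary:
    boxes_read: int
    boxes_failed: int
    unmatched_boxes: int
    unlabelled_words_deleted: int


def _first_int(pattern: str, text: str) -> int:
    """Return the first integer captured by pattern, or 0 if the line is absent."""
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else 0


def summarize_box_train_log(text: str) -> BoxTrainSummary:
    """Collect the apply_boxes counters from one box.train console log."""
    return BoxTrainSummary(
        boxes_read=_first_int(r"Boxes read from boxfile:\s*(\d+)", text),
        boxes_failed=_first_int(r"Boxes failed resegmentation:\s*(\d+)", text),
        unmatched_boxes=len(
            re.findall(r"find a matching blob", text, flags=re.IGNORECASE)
        ),
        unlabelled_words_deleted=_first_int(
            r"(\d+)\s+remaining unlabelled words deleted", text
        ),
    )


# Shortened excerpts of the two logs quoted in this post.
ENG_LOG = """\
Apply_boxes:boxfile Line 1/li ((73,490), (185,582)): failure! Couldn't find a matching blob
Apply_boxes:boxfile Line 2/Chicken ((226,480), (355,600)): failure! Couldn't find a matching blob
Apply_boxes:boxfile Line 5/Dew ((677,478), (803,603)): failure! Couldn't find a matching blob
Apply_boxes:boxfile Line 6/Qui-gon ((828,480), (951,592)): failure! Couldn't find a matching blob
Apply_boxes:
Boxes read from boxfile:24
Boxes failed Resegmentation:4
1 remaining unlabelled words deleted.
"""

CHI_SIM_LOG = """\
Apply_boxes:
Boxes read from boxfile:24
Leaving 3 unlabelled blobs in 0 words.
5 remaining unlabelled words deleted.
"""

for name, log in (("default (eng)", ENG_LOG), ("chi_sim", CHI_SIM_LOG)):
    print(name, summarize_box_train_log(log))
```

With the excerpts above, the default run shows 4 of the 24 boxes failing, while the chi_sim run reports no failed boxes.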

shapeclustering.exe reports many "Bad properties" errors.

The output looks like this:
shapeclustering.exe -F font_properties -U unicharset chi.slimqjs.exp0.tr
Reading chi.slimqjs.exp0.tr ...
Bad properties for index 3, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char Phoenix: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char from: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char cloud: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char Dew: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char kui: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 9, char sharp: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 10, char medullary: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 11, char green: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 12, char heap: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 13, char and: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 14, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 15, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 16, char point: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 17, char dust: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 18, char light: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 19, char smoke: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 20, char fetch: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 21, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 22, char will: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 23, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 24, char move: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 25, char ramming: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 26, char branch: 0,255 0,255 0,0 0,0 0,0
Building Master Shape table
Computing shape distances ...
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ...
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ...
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ...
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ...
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Stopped with 0 merged, Min Dist 0.348285
Master Shape_table:number of shapes = unichars max = 1 Number with multiple Unichars = 0

mftraining.exe reports the same "Bad properties" errors, plus two warnings.
mftraining.exe -F font_properties -U unicharset -O chi.unicharset chi.slimqjs.exp0.tr
Read shape table shapetable of shapes
Reading chi.slimqjs.exp0.tr ...
Bad properties for index 3, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char Phoenix: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char from: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char cloud: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char Dew: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char kui: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 9, char sharp: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 10, char medullary: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 11, char green: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 12, char heap: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 13, char and: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 14, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 15, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 16, char point: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 17, char dust: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 18, char light: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 19, char smoke: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 20, char fetch: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 21, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 22, char will: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 23, char: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 24, char move: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 25, char ramming: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 26, char branch: 0,255 0,255 0,0 0,0 0,0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
done!

Finally, it succeeded.
combine_tessdata.exe chi.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 (chi.config) is -1
Offset for type 1 (chi.unicharset) is 140
Offset for type 2 (chi.unicharambigs) is -1
Offset for type 3 (chi.inttemp) is 1661
Offset for type 4 (chi.pffmtable) is 200388
Offset for type 5 (chi.normproto) is 200668
Offset for type 6 (chi.punc-dawg) is -1
Offset for type 7 (chi.word-dawg) is -1
Offset for type 8 (chi.number-dawg) is -1
Offset for type 9 (chi.freq-dawg) is -1
Offset for type 10 (chi.fixed-length-dawgs) is -1
Offset for type 11 (chi.cube-unicharset) is -1
Offset for type 12 (chi.cube-word-dawg) is -1
Offset for type 13 (chi.shapetable) is 203778
Offset for type 14 (chi.bigram-dawg) is -1
Offset for type 15 (chi.unambig-dawg) is -1
Offset for type 16 (chi.params-model) is -1
Output chi.traineddata created successfully.
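
In the combine_tessdata output, an offset of -1 means that component was not packed into the traineddata file. A quick way to see which pieces actually made it in is to filter the log. This is a rough sketch only: `present_components` is a made-up helper name, and the pattern assumes lines of the form "Offset for type N (name) is OFFSET" as in the output above.

```python
import re

# Matches e.g. "Offset for type 13 (chi.shapetable) is 203778".
# The space before the offset is optional, since some logs print "is-1".
OFFSET_RE = re.compile(
    r"Offset for type\s+(\d+)\s*\(([^)]+)\)\s+is\s*(-?\d+)", re.IGNORECASE
)


def present_components(log_text: str) -> dict:
    """Map component file name -> offset for every entry whose offset is not -1."""
    found = {}
    for _type_id, name, offset in OFFSET_RE.findall(log_text):
        if int(offset) != -1:
            found[name.strip()] = int(offset)
    return found


# Excerpt of the combine_tessdata output from this post.
SAMPLE_LOG = """\
Offset for type 0 (chi.config) is -1
Offset for type 1 (chi.unicharset) is 140
Offset for type 2 (chi.unicharambigs) is -1
Offset for type 3 (chi.inttemp) is 1661
Offset for type 4 (chi.pffmtable) is 200388
Offset for type 5 (chi.normproto) is 200668
Offset for type 13 (chi.shapetable) is 203778
"""

for name, offset in present_components(SAMPLE_LOG).items():
    print(f"{name}: offset {offset}")
```

For the run above, this lists chi.unicharset, chi.inttemp, chi.pffmtable, chi.normproto and chi.shapetable, which are exactly the files produced by the earlier steps.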

"Character recognition in tesseract training": if there are multiple image files and multiple box files, they need to be merged into a single character set, and the same applies to the training files that follow.
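
As a rough illustration of that merge, the sketch below only builds the command lines (it does not run anything) for several image/box pairs: box.train runs once per image, while unicharset_extractor, shapeclustering, mftraining and cntraining each receive all the box or .tr files in a single call. The helper name and the second base name `chi.slimqjs.exp1` are hypothetical; the flags follow the step list earlier in this post.

```python
from typing import List, Optional


def build_training_commands(
    lang: str, bases: List[str], base_lang: Optional[str] = None
) -> List[List[str]]:
    """Build (not run) the training commands for one or more image/box pairs.

    bases are file names without extension, e.g. "chi.slimqjs.exp0".
    base_lang is the optional -l language used for the box.train pass.
    """
    if not bases:
        raise ValueError("at least one image/box base name is required")

    commands = []
    # One box.train pass per image; each pass writes <base>.tr.
    for base in bases:
        cmd = ["tesseract", f"{base}.tif", base]
        if base_lang:
            cmd += ["-l", base_lang]
        cmd += ["nobatch", "box.train"]
        commands.append(cmd)

    box_files = [f"{base}.box" for base in bases]
    tr_files = [f"{base}.tr" for base in bases]

    # The remaining tools take every box/.tr file at once, so the result is
    # a single character set and a single set of training files.
    commands.append(["unicharset_extractor", *box_files])
    commands.append(
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", *tr_files]
    )
    commands.append(
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", *tr_files]
    )
    commands.append(["cntraining", *tr_files])
    return commands


# Hypothetical second sample page "exp1" next to the exp0 page used in this post.
for command in build_training_commands(
    "chi", ["chi.slimqjs.exp0", "chi.slimqjs.exp1"], base_lang="chi_sim"
):
    print(" ".join(command))
```

The renaming and combine_tessdata steps are unchanged from the single-file case, so they are left out here.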

"TESSERACT-OCR recognition Chinese and training font example" also runs into recognition errors at step 5.

"The simple use and training of TESSERACT-OCR" has nothing special in it.

"Training tesseract OCR for a New Font and Input Set on Mac" has next to nothing to do with the Mac. Its most useful part is the mention of the font identification site Identifont. If the font file is installed, you can use the TIFF/Box Generator directly to generate the images and box files (through)

Testing with Tesseract

How to prepare training files for tesseract OCR and improve characters recognition?

The following three articles are good:
Training tesseract for labels, receipts and such

A Guide to OCR with Tesseract 3.03

Adding New Fonts to tesseract 3 OCR Engine
