Tess4J is a Java JNA wrapper for the Tesseract OCR API. It lets Java programs run Tesseract OCR by calling the Tess4J API. Supported formats: TIFF, JPEG, GIF, PNG, BMP, and PDF.
Tesseract's GitHub address: https://github.com/tesseract-ocr/tesseract
Tess4J's GitHub address: https://github.com/nguyenq/tess4j
Features provided by the Tess4J API:
1. Recognize supported document files directly
2. Recognize an image stream (an in-memory image)
3. Recognize a region of an image
4. Save the recognition result as text/hOCR/PDF/UNLV/box
5. Extract the recognized text at a chosen level (for example, word by word)
6. Obtain the coordinates of each recognized region
7. Correct the skew of an image
8. Crop an image
9. Adjust the image resolution
10. Get an image from the clipboard
11. Clone an image (purpose: create an identical copy, so that modifying one does not affect the other)
12. Convert an image to binary (black and white) or grayscale
13. Invert the image colors
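To make items 8, 12, and 13 of the list above concrete, here is a minimal sketch of what cropping, grayscale conversion, binarization, and color inversion do to the pixels. It uses only the JDK (`java.awt.image.BufferedImage`) and is not Tess4J's own implementation; the class name `ImageOpsSketch`, its method names, and the fixed-threshold binarization are assumptions made for illustration.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.function.IntUnaryOperator;

/** Plain-JDK illustration of common OCR preprocessing steps (hypothetical helper, not part of Tess4J). */
final class ImageOpsSketch {

    /** Applies a per-pixel function to the RGB value and returns a new TYPE_INT_RGB image. */
    static BufferedImage mapPixels(BufferedImage src, IntUnaryOperator f) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                out.setRGB(x, y, f.applyAsInt(src.getRGB(x, y)));
            }
        }
        return out;
    }

    /** Brightness of a pixel in 0..255, using the Rec. 601 weights with integer arithmetic. */
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Grayscale: every channel is set to the pixel's brightness. */
    static BufferedImage toGrayscale(BufferedImage src) {
        return mapPixels(src, rgb -> {
            int v = luma(rgb);
            return (v << 16) | (v << 8) | v;
        });
    }

    /** Binary image: pixels at or above the threshold become white, all others black. */
    static BufferedImage toBinary(BufferedImage src, int threshold) {
        return mapPixels(src, rgb -> luma(rgb) >= threshold ? 0xFFFFFF : 0x000000);
    }

    /** Inverted colors: each channel value v becomes 255 - v. */
    static BufferedImage invert(BufferedImage src) {
        return mapPixels(src, rgb -> ~rgb & 0xFFFFFF);
    }

    /** Crop: copies the region whose upper-left corner is (r.x, r.y) into a new image. */
    static BufferedImage crop(BufferedImage src, Rectangle r) {
        BufferedImage out = new BufferedImage(r.width, r.height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < r.height; y++) {
            for (int x = 0; x < r.width; x++) {
                out.setRGB(x, y, src.getRGB(r.x + x, r.y + y));
            }
        }
        return out;
    }
}
```

Because every method returns a new image, the input is never modified, which is the same property the "clone" feature in item 11 is meant to provide.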
Demo.java:
// These methods belong to a JUnit test class that is assumed to declare the fields
// instance (an ITesseract), logger, testResourcesDataPath and testResourcesLanguagePath,
// and the constant MINIMUM_DESKEW_THRESHOLD.

/**
 * Test of doOCR method, of class Tesseract.
 * Recognition from an image file.
 *
 * @throws Exception while processing image.
 */
@Test
public void testDoOCR_File() throws Exception {
    logger.info("doOCR on a PNG image");
    File imageFile = new File(this.testResourcesDataPath, "ocr.png");
    // Set the language
    instance.setDatapath(testResourcesLanguagePath);
    instance.setLanguage("chi_sim");
    String result = instance.doOCR(imageFile);
    logger.info(result);
}

/**
 * Test of doOCR method, of class Tesseract.
 * Recognition from an image stream (BufferedImage).
 *
 * @throws Exception while processing image.
 */
@Test
public void testDoOCR_BufferedImage() throws Exception {
    logger.info("doOCR on a buffered image of a PNG");
    File imageFile = new File(this.testResourcesDataPath, "ocr.png");
    BufferedImage bi = ImageIO.read(imageFile);
    // Set the language
    instance.setDatapath(testResourcesLanguagePath);
    instance.setLanguage("chi_sim");
    String result = instance.doOCR(bi);
    logger.info(result);
}

/**
 * Test of getSegmentedRegions method, of class Tesseract.
 * Gets the coordinates of each segmented region.
 *
 * @throws java.lang.Exception
 */
@Test
public void testGetSegmentedRegions() throws Exception {
    logger.info("getSegmentedRegions at given TessPageIteratorLevel");
    File imageFile = new File(testResourcesDataPath, "ocr.png");
    BufferedImage bi = ImageIO.read(imageFile);
    int level = TessPageIteratorLevel.RIL_SYMBOL;
    logger.info("PageIteratorLevel: " + Utils.getConstantName(level, TessPageIteratorLevel.class));
    List<Rectangle> result = instance.getSegmentedRegions(bi, level);
    for (int i = 0; i < result.size(); i++) {
        Rectangle rect = result.get(i);
        logger.info(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d", i, rect.x, rect.y, rect.width, rect.height));
    }
    assertTrue(result.size() > 0);
}

/**
 * Test of doOCR method, of class Tesseract.
 * Recognition within a defined coordinate range.
 *
 * @throws Exception while processing image.
 */
@Test
public void testDoOCR_File_Rectangle() throws Exception {
    logger.info("doOCR on a PNG image with bounding rectangle");
    File imageFile = new File(this.testResourcesDataPath, "ocr.png");
    // Set the language library
    instance.setDatapath(testResourcesLanguagePath);
    instance.setLanguage("chi_sim");
    // Define the region: (x, y) is its upper-left corner,
    // and width and height are measured from that point
    Rectangle rect = new Rectangle(84, 21, 15, 13);
    String result = instance.doOCR(imageFile, rect);
    logger.info(result);
}

/**
 * Test of createDocuments method, of class Tesseract.
 * Stores the results.
 *
 * @throws java.lang.Exception
 */
@Test
public void testCreateDocuments() throws Exception {
    logger.info("createDocuments for PNG");
    File imageFile = new File(this.testResourcesDataPath, "ocr.png");
    String outputbase = "target/test-classes/docrenderer-2";
    List<RenderedFormat> formats = new ArrayList<RenderedFormat>(Arrays.asList(RenderedFormat.HOCR, RenderedFormat.TEXT));
    // Set the language library
    instance.setDatapath(testResourcesLanguagePath);
    instance.setLanguage("chi_sim");
    instance.createDocuments(new String[]{imageFile.getPath()}, new String[]{outputbase}, formats);
}

/**
 * Test of getWords method, of class Tesseract.
 * Extracts the recognized text unit by unit.
 *
 * @throws java.lang.Exception
 */
@Test
public void testGetWords() throws Exception {
    logger.info("getWords");
    File imageFile = new File(this.testResourcesDataPath, "ocr.png");
    // Set the language library
    instance.setDatapath(testResourcesLanguagePath);
    instance.setLanguage("chi_sim");
    // Iterate character by character
    int pageIteratorLevel = TessPageIteratorLevel.RIL_SYMBOL;
    logger.info("PageIteratorLevel: " + Utils.getConstantName(pageIteratorLevel, TessPageIteratorLevel.class));
    BufferedImage bi = ImageIO.read(imageFile);
    List<Word> result = instance.getWords(bi, pageIteratorLevel);
    // Print the complete result
    for (Word word : result) {
        logger.info(word.toString());
    }
}

/**
 * Test of Invalid memory access.
 * Handles a skewed image.
 *
 * @throws Exception while processing image.
 */
@Test
public void testDoOCR_SkewedImage() throws Exception {
    // Set the language library
    instance.setDatapath(testResourcesLanguagePath);
    instance.setLanguage("chi_sim");
    logger.info("doOCR on a skewed JPG image");
    File imageFile = new File(this.testResourcesDataPath, "ocr_skewed.jpg");
    BufferedImage bi = ImageIO.read(imageFile);
    ImageDeskew id = new ImageDeskew(bi);
    double imageSkewAngle = id.getSkewAngle(); // determine skew angle
    if (imageSkewAngle > MINIMUM_DESKEW_THRESHOLD || imageSkewAngle < -(MINIMUM_DESKEW_THRESHOLD)) {
        bi = ImageHelper.rotateImage(bi, -imageSkewAngle); // deskew image
    }
    String result = instance.doOCR(bi);
    logger.info(result);
}
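The deskew step in the last test follows a simple rule: rotate the image back by the measured angle, but only when the skew exceeds a small threshold in either direction. The sketch below reproduces that decision with the JDK alone, as an illustration rather than Tess4J's code. `DeskewSketch` and its methods are hypothetical names, the skew angle is assumed to come from some detector such as `ImageDeskew`, and the rotation here keeps the original canvas size and fills uncovered corners with white, which may differ from what `ImageHelper.rotateImage` does.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Plain-JDK illustration of threshold-based deskewing (hypothetical helper, not part of Tess4J). */
final class DeskewSketch {

    /** True when the skew is larger than the threshold in either direction. */
    static boolean needsDeskew(double skewAngleDegrees, double thresholdDegrees) {
        return Math.abs(skewAngleDegrees) > thresholdDegrees;
    }

    /** Rotates the image about its center by the given angle in degrees, on a white background. */
    static BufferedImage rotate(BufferedImage src, double degrees) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, w, h);
            g.rotate(Math.toRadians(degrees), w / 2.0, h / 2.0);
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    /** Returns the image unchanged when the skew is negligible, otherwise rotates it back by -angle. */
    static BufferedImage deskewIfNeeded(BufferedImage src, double skewAngleDegrees, double thresholdDegrees) {
        return needsDeskew(skewAngleDegrees, thresholdDegrees) ? rotate(src, -skewAngleDegrees) : src;
    }
}
```

Skipping the rotation for very small angles avoids resampling an image that is already straight, since any rotation blurs character edges slightly.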
Tess4JDemo on Gitee (Code Cloud): https://gitee.com/zhaohuihbwj/Tess4JDemo
Java OCR text recognition (Tess4J)
October 17, 2017 10:11:10
OCR (Optical Character Recognition) is the process in which an electronic device (such as a scanner or digital camera) examines the characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using a character recognition method. In other words, the printed text of a paper document is converted optically into a black-and-white bitmap image file, and recognition software then converts the text in the image into a text format that word-processing software can edit further. How to use auxiliary information to improve recognition accuracy is the most important topic in OCR, and it gave rise to the term ICR (Intelligent Character Recognition). The main indicators for measuring the performance of an OCR system are rejection rate, error rate, recognition speed, user-interface friendliness, product stability, ease of use, and feasibility.
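The paragraph above names rejection rate and error rate without defining them. One common way to put numbers on them, assumed here rather than taken from the original text, is to treat the rejection rate as the fraction of characters the system declines to classify, and the error rate as the character error rate: the edit distance between the expected and recognized text divided by the expected length. The class and method names below are hypothetical.

```java
/** Illustrative OCR quality metrics (hypothetical helper; the definitions are assumptions). */
final class OcrMetricsSketch {

    /** Levenshtein distance: minimum number of single-character insertions, deletions and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Character error rate: edit distance relative to the length of the expected text. */
    static double characterErrorRate(String expected, String recognized) {
        if (expected.isEmpty()) {
            return recognized.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(expected, recognized) / expected.length();
    }

    /** Rejection rate: fraction of input characters the system declined to classify. */
    static double rejectionRate(int rejectedCharacters, int totalCharacters) {
        if (totalCharacters <= 0) {
            throw new IllegalArgumentException("totalCharacters must be positive");
        }
        return (double) rejectedCharacters / totalCharacters;
    }
}
```

The comparison works on UTF-16 code units, which is adequate for common Chinese characters but would count a character outside the Basic Multilingual Plane as two units.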
Tess4J is a Java library that wraps Google's Tesseract OCR engine.
1. Add the Maven dependency
<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.2.1</version>
</dependency>
2. Write the utility class
import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.tess4j.util.LoadLibs;

// Const is the author's own constants class; Const.ENG and Const.CHI_SIM
// presumably hold the Tesseract language codes "eng" and "chi_sim".

/**
 * Tesseract for Java, OCR (Optical Character Recognition) utility class
 * @author Wind
 */
public class Tess4jUtils {

    /**
     * Extracts text from a picture. By default it uses the English language data
     * and the training library under the classpath.
     * @param path picture path
     * @return the recognized text, or null if recognition failed
     */
    public static String readChar(String path) {
        // JNA Interface Mapping
        ITesseract instance = new Tesseract();
        // JNA Direct Mapping
        // ITesseract instance = new Tesseract1();
        File imageFile = new File(path);
        // In case you don't have your own tessdata, let it also be extracted for you,
        // so that you can use the training library under the classpath directory
        File tessDataFolder = LoadLibs.extractTessResources("tessdata");
        // Set the tessdata path
        instance.setDatapath(tessDataFolder.getAbsolutePath());
        // The English library recognizes digits more accurately
        instance.setLanguage(Const.ENG);
        return getOCRText(instance, imageFile);
    }

    /**
     * Extracts text from a picture.
     * @param path picture path
     * @param dataPath training library path
     * @param language language library
     * @return the recognized text, or null if recognition failed
     */
    public static String readChar(String path, String dataPath, String language) {
        File imageFile = new File(path);
        ITesseract instance = new Tesseract();
        instance.setDatapath(dataPath);
        // Use the requested language library
        instance.setLanguage(language);
        return getOCRText(instance, imageFile);
    }

    /**
     * Recognizes the text in a picture file.
     * @param instance
     * @param imageFile
     * @return the recognized text, or null if recognition failed
     */
    private static String getOCRText(ITesseract instance, File imageFile) {
        String result = null;
        try {
            result = instance.doOCR(imageFile);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
        return result;
    }

    public static void main(String[] args) {
        /*String path = "src/main/resources/image/text.png";
        System.out.println(readChar(path));*/
        String ch = "src/main/resources/image/ch.png";
        System.out.println(readChar(ch, "src/main/resources", Const.CHI_SIM));
    }
}
Note: recognition results for Chinese are not very accurate; to improve them you need to train your own language data (font library).
For details on training language data, and for the complete code, see https://github.com/followwwind/javautils