I. What is Tesseract-OCR?
Tesseract-OCR is an OCR engine that was developed at HP Labs between 1985 and 1995 and is now at Google.
It is an open-source text recognition engine based on the Leptonica (http://leptonica.com/) image processing library.
It supports the Linux, Windows, and Mac platforms.
It supports .NET, C++, Python, Java, and other development languages: https://code.google.com/p/tesseract-ocr/wiki/AddOns
Project address: https://code.google.com/p/tesseract-ocr/
II. Usage
Download and install: https://tesseract-ocr.googlecode.com/files/tesseract-ocr-setup-3.02.02.exe
During installation, note the install path and the math-symbol and language options, and select them as needed.
Run: tesseract yourpic.png res
The text in the image yourpic.png is recognized and stored in res.txt.
For more accurate recognition, download the tessdata language pack for the language you need from the project address.
Example:
Simplified Chinese: https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.chi_sim.tar.gz
Traditional Chinese
After decompressing the archive, copy chi_sim.traineddata to Tesseract-ocr\tessdata.
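The decompress-and-copy step can also be scripted. The following is a minimal Python sketch, not part of the original post: the function name install_traineddata is hypothetical, and it assumes the download is a .tar.gz archive with one or more *.traineddata files somewhere inside it (the internal layout of the archive is an assumption).

```python
import os
import shutil
import tarfile


def install_traineddata(archive_path, tessdata_dir):
    """Copy every *.traineddata file found in a .tar.gz archive into tessdata_dir.

    Returns the list of file names that were copied.
    """
    installed = []
    os.makedirs(tessdata_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = os.path.basename(member.name)
            # Only regular files with the .traineddata suffix are of interest.
            if member.isfile() and name.endswith(".traineddata"):
                src = tar.extractfile(member)
                with open(os.path.join(tessdata_dir, name), "wb") as dst:
                    shutil.copyfileobj(src, dst)
                installed.append(name)
    return installed
```

The second argument would be the tessdata folder under the install path chosen during setup, for example the Tesseract-ocr\tessdata directory mentioned above.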
Usage:
tesseract yourpic.png eng (uses the default eng language pack; output goes to eng.txt)
tesseract yourpic.png sim -l chi_sim (uses the chi_sim language pack)
tesseract yourpic.png tra -l chi_tra (uses the chi_tra language pack)
Pick whichever result is closest to the real data; that makes later correction easier.
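To call the same commands from a script, something like the following Python sketch could be used. It is an illustration only: build_tesseract_cmd and recognize are hypothetical helper names, the sketch assumes the tesseract executable is on the PATH, and it uses only the positional arguments and the -l option shown in the commands above.

```python
import subprocess


def build_tesseract_cmd(image, out_base, lang=None):
    """Return the argument list for: tesseract IMAGE OUT_BASE [-l LANG]."""
    cmd = ["tesseract", image, out_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def recognize(image, out_base, lang=None):
    """Run tesseract and return the text it wrote to OUT_BASE.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_cmd(image, out_base, lang), check=True)
    with open(out_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

For instance, recognize("yourpic.png", "sim", "chi_sim") would run the second command above and return the contents of sim.txt.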
III. Advanced use: training
An article on training Tesseract-OCR for Chinese documents:
http://yy-programer.blogspot.tw/2012/08/training-tesseract-ocr-301.html
Training is worth studying if you need high accuracy; for everyday use, the default recognition plus later correction is enough.
IV. Application example: a proxy harvester
The harvester collects proxies from several proxy list pages on http://www.proxyfire.net/
Without further ado, here is the code.
pf.bat
pf.pl http://www.proxyfire.net/index.php?pageid=eliteproxylist Elite.txt
pf.pl http://www.proxyfire.net/index.php?pageid=anonymousproxylist Anony.txt
pf.pl http://www.proxyfire.net/index.php?pageid=transparentproxylist Trans.txt
pf.pl http://www.proxyfire.net/index.php?pageid=socks4proxylist S4.txt
pf.pl http://www.proxyfire.net/index.php?pageid=socks5proxylist S5.txt
type *.txt > all.tmp
del *.txt /s /q
ren all.tmp all.txt
@pause
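The last steps of pf.bat (concatenate every .txt file, delete the parts, rename the result) can be sketched in Python as follows. This is an illustrative equivalent, not part of the original post; merge_lists is a hypothetical name, and the files are assumed to be UTF-8 text.

```python
import glob
import os


def merge_lists(directory, out_name="all.txt"):
    """Concatenate every .txt file in directory into out_name, then delete the parts."""
    parts = sorted(glob.glob(os.path.join(directory, "*.txt")))
    tmp_path = os.path.join(directory, "all.tmp")
    # Write to a temporary name first so the output is not picked up as an input.
    with open(tmp_path, "w", encoding="utf-8") as out:
        for part in parts:
            with open(part, encoding="utf-8") as fh:
                out.write(fh.read())
    for part in parts:
        os.remove(part)
    out_path = os.path.join(directory, out_name)
    os.replace(tmp_path, out_path)
    return out_path
```

As in the batch file, an all.txt left over from an earlier run is folded into the new one, so results accumulate across runs.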
pf.pl
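use strict;
our $url = $ARGV[0];   # proxy list page to fetch
our $file = $ARGV[1];  # output file for the recognized proxies
my $res = undef;
my @tmp = ();
my @pxy = ();
# Fetch the page into a temporary file (-O sets the output file name)
`wget $url -q -O ___html`;
open FH, "<___html";
@tmp = <FH>;
close FH;
$res = join("", @tmp);
undef(@tmp);
`del ___html /s /q`;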
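# NOTE: the HTML tags in this pattern were lost when the post was extracted.
# Only ]+"), ><\/td> and (\d+) survive; the tag names below are an assumption.
@tmp = ($res =~ /<td><img src="([^"]+)"><\/td><td>(\d+)/g);
# @tmp holds (image path, port) pairs, so each pass consumes two entries
for (my $i = 0; $i < @tmp; $i++) {
    $pxy[$i] = {'ip' => 'http://www.proxyfire.net'.$tmp[$i], 'port' => $tmp[$i+1]};
    $i = $i + 1;
}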
# Download each IP image, run OCR on it, and correct common misreadings
for (my $i = 0; $i < @pxy; $i++) {
    if (length(${$pxy[$i]}{ip}) > 0)
    {
        `echo off & wget ${$pxy[$i]}{ip} -q -O ___png`;
        `tesseract ___png ___ -l chi_tra`;
        my $txt = undef;
        open FH, "<___.txt";
        $txt = <FH>;
        close FH;
        if (length($txt) > 11)
        {
            $txt =~ s/\s+//g;
            # Characters that the OCR returns in place of digits
            $txt =~ s/日/8/g;
            $txt =~ s/昍/88/g;
            $txt =~ s/s0/60/g;
            $txt =~ s/s1/61/g;
            $txt =~ s/s2/62/g;
            $txt =~ s/s3/69/g;
            $txt =~ s/s4/64/g;
            $txt =~ s/s5/65/g;
            $txt =~ s/s7/67/g;
            $txt =~ s/s8/68/g;
            $txt =~ s/s9/69/g;
            $txt =~ s/0s/06/g;
            $txt =~ s/1s/16/g;
            $txt =~ s/2s/26/g;
            $txt =~ s/3s/96/g;
            $txt =~ s/4s/46/g;
            $txt =~ s/5s/56/g;
            $txt =~ s/6s/66/g;
            $txt =~ s/7s/76/g;
            $txt =~ s/8s/86/g;
            $txt =~ s/9s/96/g;
            $txt =~ s/ss/66/g;
            $txt =~ s/\.s/\.6/g;
            ${$pxy[$i]}{ip} = $txt;
            # Two alternative readings, because 3 and 9 are easily confused
            my $bak1 = $txt;
            my $bak2 = $txt;
            $bak1 =~ s/13/19/g;
            $bak1 =~ s/\.32\./\.92\./g;
            $bak1 =~ s/\.33\./\.99\./g;
            $bak2 =~ s/19/13/g;
            $bak2 =~ s/\.243/\.249/g;
            $bak2 =~ s/203\./209\./g;
            # Append all three candidates to the output file
            open FHX, ">>$file";
            print FHX ${$pxy[$i]}{ip}.":".${$pxy[$i]}{port}."\n";
            print FHX $bak1.":".${$pxy[$i]}{port}."\n";
            print FHX $bak2.":".${$pxy[$i]}{port}."\n";
            close FHX;
        }
        $txt = undef;
    }
}
# Remove the temporary files and release the variables
`del ___* /s /q`;
undef($url);
undef($file);
undef($res);
undef(@tmp);
undef(@pxy);
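The post-correction idea in pf.pl (fix characters that the OCR returns in place of digits, then write out two alternative readings) can be isolated as a small, testable function. The Python sketch below mirrors the substitution rules listed in the script; the names build_fixes, correct, and alternates are hypothetical, and the rules themselves are specific to the digit images on that site rather than a general-purpose table.

```python
import re


def build_fixes():
    """Return the ordered (old, new) substitutions used in pf.pl."""
    fixes = [("日", "8"), ("昍", "88")]
    for d in "012345789":  # pf.pl has no rule for "s6"
        fixed = "9" if d == "3" else d  # a "3" next to "s" is treated as a misread "9"
        fixes.append(("s" + d, "6" + fixed))
    for d in "0123456789":
        fixed = "9" if d == "3" else d
        fixes.append((d + "s", fixed + "6"))
    fixes += [("ss", "66"), (".s", ".6")]
    return fixes


def correct(text):
    """Strip whitespace and apply the substitutions in order."""
    text = re.sub(r"\s+", "", text)
    for old, new in build_fixes():
        text = text.replace(old, new)
    return text


def alternates(text):
    """Return the two fallback readings that pf.pl writes next to the main one."""
    bak1 = text.replace("13", "19").replace(".32.", ".92.").replace(".33.", ".99.")
    bak2 = text.replace("19", "13").replace(".243", ".249").replace("203.", "209.")
    return bak1, bak2
```

Writing all three candidates and checking them later matches the approach recommended earlier: default recognition plus later correction, rather than training a custom language pack.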