Phone numbers on 58.com (58 Tongcheng) are mostly displayed as images, precisely to stop crawler software from scraping them. But as someone who writes crawlers for a living, I have to crack them for my own peace of mind, or I will be dreaming about how to break these damned number images!
Here we take advantage of Google's open source project Tesseract-OCR (project address: https://github.com/tesseract-ocr).
There are already plenty of Tesseract tutorials online, so I won't introduce it here. Straight to the point!
First, initialize Tesseract. We use the default recognition library here and, given the characteristics of the number images on 58.com, initialize it as follows.
The program needs to reference Tesseract.dll, and the recognition library file tessdata\eng.traineddata must exist under the program's root directory.
Tesseract.TesseractEngine te = new TesseractEngine(Application.StartupPath + "\\tessdata", "eng", EngineMode.Default); // Initialize, using the default recognition library
te.SetVariable("tessedit_char_whitelist", "0123456789"); // Set the character whitelist to digits only
te.DefaultPageSegMode = PageSegMode.SingleLine; // Set the recognition mode to single-line mode
Note: the .NET version seems to have to be 3.5 or above, otherwise Tesseract initialization always fails. This problem plagued me for a long time.
First we take a number image from 58.com, at an address like the following:
http://image.58.com/showphone.aspx?t=v55&v=6E0C227B5A963FC4VD7B70A4FC12D1D01
Downloading it gives the following image:
First we binarize the image (that is, convert it to pure black and white; any search engine will turn up plenty of algorithms) to get the following image:
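The article does not say which binarization algorithm was used. As a minimal sketch, assuming a simple fixed-threshold approach and an opaque source image, it could look like this. The class name `Binarizer` and the choice of threshold are illustrative assumptions, not the author's code.

```csharp
using System.Drawing;

public static class Binarizer
{
    // Convert an image to pure black and white using a fixed threshold.
    // Pixels darker than the threshold become black, all others white.
    // The caller picks the threshold (for example 128); the article
    // does not specify one.
    public static Bitmap Binarize(Bitmap source, int threshold)
    {
        var result = new Bitmap(source.Width, source.Height);
        for (int y = 0; y < source.Height; y++)
        {
            for (int x = 0; x < source.Width; x++)
            {
                Color c = source.GetPixel(x, y);
                // Common luminance weights for turning RGB into a gray level.
                int gray = (int)(0.299 * c.R + 0.587 * c.G + 0.114 * c.B);
                result.SetPixel(x, y, gray < threshold ? Color.Black : Color.White);
            }
        }
        return result;
    }
}
```

`GetPixel` and `SetPixel` are slow, but for an image as small as a phone number that hardly matters. If the digits come out too thin or too bold, adjusting the threshold is the first thing to try.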
This monochrome image is much friendlier to the OCR engine. Now for the recognition step:
We use a Bitmap variable bTelImg to hold the binarized number image and a string variable sTelNumber to hold the recognized number, and obtain the recognition result as follows:
Page pg = te.Process(PixConverter.ToPix(bTelImg), PageSegMode.SingleLine);
sTelNumber = KeyReplace(pg.GetText());
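The helper that post-processes the `GetText()` output is not shown in the article. A plausible guess is that it strips whitespace and line breaks from the OCR text. The sketch below is a hypothetical stand-in under that assumption (the name `KeepDigits` is made up here); it simply keeps the digits and drops everything else.

```csharp
using System.Text;

public static class OcrText
{
    // Hypothetical stand-in for the article's post-processing helper:
    // keep only the characters '0'..'9' and discard spaces, newlines, etc.
    public static string KeepDigits(string raw)
    {
        var sb = new StringBuilder();
        foreach (char ch in raw ?? string.Empty)
        {
            if (ch >= '0' && ch <= '9')
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }
}
```

With the whitelist restricted to digits, the raw text should contain little besides digits, spaces, and trailing line breaks, so a filter like this is probably all that is needed.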
Recognition result:
Uh... cue ten thousand curses. What do we do with this many errors? Do we have to use Tesseract's advanced training to train our own library? This was supposed to be simple recognition, so let's not make it that complicated. I'm lazy!
In fact, 58.com generates these images dynamically, so every request returns a different image, including different spacing between the digits. In the first download, adjacent characters were stuck together, which made the recognition result wrong. Let's download the image from the same address again:
Binarized:
Recognition result:
Haha, finally right!
This proves that the free OCR engine, downloaded as-is and without any complex training, is still effective. Next we improve the recognition rate without changing the recognition algorithm (after all, for pure-digit images like these, a truly high recognition rate requires either training or writing a dedicated OCR engine).
The first recognition result was wrong, but the second download gave a correct one. So we can start from the recognition result: if it is wrong, re-download the image and recognize it again, until the result is correct or a retry threshold is reached. If it is still wrong at the threshold, there is nothing more I can do! That's the only way to do it for free!
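The retry idea can be sketched roughly as follows. This is only an illustration: `RetryingRecognizer` and its parameter names are invented here, and the download-plus-OCR step and the "is this result correct?" check are passed in as delegates rather than implemented.

```csharp
using System;

public static class RetryingRecognizer
{
    // fetchAndRecognize: caller-supplied; downloads a fresh image and returns the OCR text.
    // isPlausible:       caller-supplied; decides whether a result looks correct.
    // maxAttempts:       the retry threshold.
    // Returns the first plausible result, or null if every attempt failed.
    public static string Recognize(
        Func<string> fetchAndRecognize,
        Func<string, bool> isPlausible,
        int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string candidate = fetchAndRecognize();
            if (isPlausible(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

Because each request returns a freshly rendered image, every attempt is a new chance that the digits are not stuck together. Returning null at the threshold leaves it to the caller to decide what to do with a number that could not be read.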
Since what we are recognizing is a mobile phone number and we know the rules mobile numbers follow, we can check the result against those rules to get a first indication of whether it is right:
The mobile phone number must be 11 digits, all numeric (since we set the whitelist to digits only, we just need to make sure the result has 11 digits).
The mobile phone number must start with 13, 15, or 18 (this rules out a large share of the errors).
Well, these two rules alone are enough to improve the recognition rate effectively. I won't write out the algorithm here; you're a programmer, right?
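For readers who want a starting point anyway, the two rules fit into a single regular expression. This is a minimal sketch, not the author's code; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class MobileNumberRules
{
    // Exactly 11 digits, beginning with 13, 15 or 18 (the two rules above).
    // \z anchors at the very end of the string, so a trailing newline
    // left over from the OCR output does not slip through.
    private static readonly Regex Pattern = new Regex(@"^1[358][0-9]{9}\z");

    public static bool LooksValid(string number)
    {
        return number != null && Pattern.IsMatch(number);
    }
}
```

A result that passes this check is only plausible, not guaranteed correct: a misread digit in the middle of the number still goes through. The prefix set follows the rules stated above and may need extending if other number ranges have to be accepted.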
At this point, simple recognition of the mobile number images on 58.com is done. Other cases, such as landline numbers or simple character CAPTCHAs, work on similar principles. I hope this gives beginners a little help. When I next get the chance, I'll talk about more advanced OCR recognition methods.
Using Tesseract to recognize image-based mobile phone numbers on 58.com