Using Tesseract to recognize image-based mobile phone numbers on 58.com (58 Tongcheng)

Source: Internet
Author: User

Most phone numbers on 58.com are displayed as images, precisely to stop crawler software from scraping them. But as a developer who writes crawlers for a living, I can't rest until I've cracked this; otherwise I'll be dreaming about how to beat these damned image numbers!

Here we use Google's open-source project Tesseract OCR (project address: https://github.com/tesseract-ocr).

There are already plenty of Tesseract tutorials online, so I'll skip the introduction and go straight to the point.

First, initialize Tesseract. We use the default recognition library and configure the engine to suit the characteristics of the 58.com number images, as shown below.

The program must reference Tesseract.dll, and the recognition library file tessdata\eng.traineddata must exist under the program's root directory.

Tesseract.TesseractEngine te = new Tesseract.TesseractEngine(Application.StartupPath + "\\tessdata", "eng", EngineMode.Default); // Initialize, using the default recognition library

te.SetVariable("tessedit_char_whitelist", "0123456789"); // Restrict recognition to a whitelist of digits

te.DefaultPageSegMode = PageSegMode.SingleLine; // Set the recognition mode to single-line mode

Note: the .NET Framework version apparently has to be 3.5 or above; otherwise Tesseract initialization always fails. This problem plagued me for a long time.

First, we obtain a phone-number image from 58.com at an address like the following:

http://image.58.com/showphone.aspx?t=v55&v=6E0C227B5A963FC4VD7B70A4FC12D1D01

Downloading it gives the following image:

First, binarize the image (that is, convert it to pure black and white; a web search turns up plenty of algorithms for this) to get the following image:

This monochrome image is much friendlier to the OCR engine.
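The article does not show its binarization code, so the following is only a minimal sketch of one common approach: a fixed brightness threshold applied pixel by pixel. The class name ImageBinarizer, the method name Binarize, and the threshold parameter are all illustrative assumptions, not the author's implementation.

```csharp
using System.Drawing;

public static class ImageBinarizer
{
    // Returns a new bitmap in which every pixel is either pure black or pure white.
    // threshold is an assumed cut-off on a 0-255 brightness scale (128 is a common start).
    public static Bitmap Binarize(Bitmap source, int threshold)
    {
        Bitmap result = new Bitmap(source.Width, source.Height);
        for (int y = 0; y < source.Height; y++)
        {
            for (int x = 0; x < source.Width; x++)
            {
                Color c = source.GetPixel(x, y);
                // Integer weighted average approximating perceived brightness.
                int gray = (c.R * 299 + c.G * 587 + c.B * 114) / 1000;
                result.SetPixel(x, y, gray < threshold ? Color.Black : Color.White);
            }
        }
        return result;
    }
}
```

GetPixel and SetPixel are slow on large bitmaps, but that hardly matters for an image as small as a phone number. The sketch ignores the alpha channel, so an image with a transparent background may need to be drawn onto a white canvas first.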

Now for recognition. We declare a Bitmap variable bTelImg to hold the binarized number image and a string variable sTelNumber to hold the recognition result, then obtain the result as follows:

Page pg = te.Process(PixConverter.ToPix(bTelImg), PageSegMode.SingleLine);

sTelNumber = KeyReplace(pg.GetText()); // KeyReplace is the author's own helper, not shown in this article

Recognition result:

Uh... what a mess. Why are there so many errors? Do I really have to use Tesseract's advanced training to build my own recognition library? This was supposed to be simple recognition, so let's not make it that complicated. I'm lazy!

In fact, the images on 58.com are generated dynamically, so every request returns a different picture, including different spacing between the digits. In the first download the digits were stuck together, which made the recognition result wrong. So we download the image from the same address once more:

Binarized:

Recognition result:

Haha, finally right!

This shows that the free OCR engine is still effective as downloaded, without any complex training. Next, we improve the recognition rate without changing the recognition algorithm. (After all, for pure-digit images like these, a truly high OCR recognition rate requires either training or writing a dedicated OCR engine.)

The first recognition result was wrong, but downloading the image a second time gave the correct one. So we can work from the recognition result: if it is not correct, download the image again and recognize it again, repeating until the result is correct or a retry threshold is reached. If the result is still wrong at the threshold, there is nothing more I can do! This is the only way to do it for free.
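The retry idea can be sketched roughly as below. This is an illustration, not the author's code: the names RetryingRecognizer, downloadAndRecognize, looksValid, and maxAttempts are hypothetical, and both the download-and-OCR step and the validity check are passed in as delegates so that the loop itself stays independent of how they are implemented.

```csharp
using System;

public static class RetryingRecognizer
{
    // downloadAndRecognize: hypothetical delegate that fetches a fresh image,
    //   binarizes it, runs OCR, and returns the recognized text.
    // looksValid: hypothetical delegate that decides whether a result is plausible.
    // Returns the first plausible result, or null once maxAttempts is used up.
    public static string Recognize(Func<string> downloadAndRecognize,
                                   Func<string, bool> looksValid,
                                   int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string candidate = downloadAndRecognize();
            if (candidate != null && looksValid(candidate))
            {
                return candidate;
            }
        }
        return null; // threshold reached without a plausible result
    }
}
```

Returning null at the threshold leaves the decision about what to do with an unreadable number to the caller.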

Because we are recognizing mobile phone numbers, and we know the rules that mobile numbers follow, we can check each result against those rules to make a preliminary judgment about whether it is right or wrong.

    1. A mobile phone number must consist of exactly 11 digits (since the whitelist already restricts the output to digits, we only need to check that the result is 11 characters long)

    2. A mobile phone number must start with 13, 15, or 18 (this rules out a large share of the errors)

Applying just these two rules improves the recognition rate effectively. I won't write out the algorithm here; you're a programmer, right?
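For readers who want a starting point anyway, the two rules can be expressed as a single regular expression. This is a minimal sketch under the article's stated rules only; the class name PhoneNumberCheck and the method name LooksLikeMobileNumber are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class PhoneNumberCheck
{
    // Rule 1: exactly 11 digits. Rule 2: the first two digits are 13, 15, or 18.
    private static readonly Regex Pattern = new Regex("^(13|15|18)[0-9]{9}$");

    public static bool LooksLikeMobileNumber(string text)
    {
        if (text == null)
        {
            return false;
        }
        // OCR output often carries trailing whitespace or a newline, so trim first.
        return Pattern.IsMatch(text.Trim());
    }
}
```

A result that fails this check can simply be discarded and the image downloaded again. The prefix list mirrors the article's rule; number ranges introduced later would need more prefixes added to the pattern.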

That completes simple recognition of 58.com phone-number images. Other cases, such as landline numbers and simple character CAPTCHAs, work on a similar principle. I hope this gives beginners a little help; when I get the chance, I'll talk about more advanced OCR recognition methods.
