Using Tesseract to recognize image-format mobile phone numbers on 58.com (58 Tongcheng)

Last Update:2017-01-13 Source: Internet

Author: User

Most phone numbers on 58.com (58 Tongcheng) are served as images, precisely to keep crawler software from scraping them. But as a programmer who builds crawlers for a living, I can't rest until I've cracked it; otherwise I'd be dreaming about how to beat these damn number images!

Here we use Google's open source project Tesseract OCR (project page: https://github.com/tesseract-ocr).

There are plenty of Tesseract tutorials online, so I'll skip the introduction and get straight to the point.

First, initialize Tesseract. We use the default recognition data here and, given the characteristics of the 58.com number images, initialize it as follows:

The program needs to reference Tesseract.dll, and the recognition data file tessdata\eng.traineddata must be present under the program's root directory.

Tesseract.TesseractEngine te = new TesseractEngine(Application.StartupPath + "\\tessdata", "eng", EngineMode.Default); // Initialize with the default recognition data

te.SetVariable("tessedit_char_whitelist", "0123456789"); // Restrict recognition to a character whitelist of digits

te.DefaultPageSegMode = PageSegMode.SingleLine; // Set the page segmentation mode to single-line

Note that the .NET Framework version apparently has to be 3.5 or above; otherwise Tesseract initialization always fails. This problem troubled me for a long time.

First, we get a number image from 58.com at an address like the following:

Http://image.58.com/showphone.aspx?t=v55&v=6E0C227B5A963FC4VD7B70A4FC12D1D01

Downloading it gives the following image:

First, binarize the image (that is, convert it to pure black and white; a web search turns up plenty of algorithms) to get the following image:
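The original does not say which binarization algorithm it used. As a minimal sketch, assuming a simple fixed-threshold approach on 0xAARRGGBB pixel values (the class name `Binarizer`, its method names, and the threshold of 128 are all illustrative and not from the original), it could look like this:

```csharp
using System;

public static class Binarizer
{
    // Approximate luminance of one pixel using the common Rec. 601 weights,
    // in integer arithmetic. Returns a value from 0 to 255.
    public static int Luminance(int r, int g, int b)
    {
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    // Maps each 0xAARRGGBB pixel to opaque black or opaque white.
    // threshold is an assumed fixed cut-off, not a value from the article.
    public static uint[] Binarize(uint[] argbPixels, int threshold)
    {
        var result = new uint[argbPixels.Length];
        for (int i = 0; i < argbPixels.Length; i++)
        {
            int r = (int)((argbPixels[i] >> 16) & 0xFF);
            int g = (int)((argbPixels[i] >> 8) & 0xFF);
            int b = (int)(argbPixels[i] & 0xFF);
            // Darker than the threshold becomes black, everything else white.
            result[i] = Luminance(r, g, b) < threshold ? 0xFF000000u : 0xFFFFFFFFu;
        }
        return result;
    }
}
```

With a System.Drawing.Bitmap you would read each pixel (for example with GetPixel), apply the same test, and write black or white back with SetPixel. The right threshold depends on the image, so treat 128 as a starting guess.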

This monochrome image is much friendlier to the OCR engine. Recognition works as follows:

We store the binarized number image in a Bitmap variable, bTelImg, and use a string variable, sTelNumber, to hold the recognized number. The recognition result is obtained like this:

Page pg = te.Process(PixConverter.ToPix(bTelImg), PageSegMode.SingleLine); // Run OCR on the binarized image

sTelNumber = KeyReplace(pg.GetText()); // KeyReplace is the author's own cleanup helper (not shown)
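The KeyReplace helper is not shown in the original. A plausible stand-in, assuming its job is just to strip whitespace and any other non-digit characters from the OCR text (the name `OcrCleanup.KeepDigits` is made up for this sketch), might be:

```csharp
using System.Text;

public static class OcrCleanup
{
    // Returns only the ASCII digits found in raw, in their original order.
    // A null input yields an empty string.
    public static string KeepDigits(string raw)
    {
        if (raw == null) return string.Empty;
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            if (c >= '0' && c <= '9') sb.Append(c);
        }
        return sb.ToString();
    }
}
```

OCR text usually comes back with trailing newlines and sometimes stray spaces between digits, so some cleanup of this kind is needed before checking the length of the result.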

The recognition result:

Uh... that's infuriating. What do we do with so many wrong digits? Do we have to use Tesseract's advanced training to build our own recognition data? I said this would be simple recognition, so let's not make it that complicated. I'm lazy!

In fact, the image on 58.com is generated dynamically, so every request returns a different image, including different spacing between the digits. The first download was misread because adjacent characters were touching, so we download the image again from the same address:

Binarized:

Recognition:

Haha, finally right!
This proves that the free OCR engine works straight after download, without any complex training. Next, we improve the recognition rate without changing the recognition algorithm (after all, for pure-digit images like these, a truly high OCR recognition rate requires either training or a purpose-built OCR engine).

Because the first recognition result was wrong and the second download gave a correct one, we can work from the recognition result: if it looks wrong, download the image again and recognize it again, repeating until the result is correct or a set threshold of attempts is reached. If it is still wrong at the threshold, there's nothing more I can do! This is the only way if you want it for free!
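The retry idea can be sketched roughly as follows. The download-and-recognize step is passed in as a delegate so the loop itself stays independent of Tesseract, and what counts as a plausible result is left to the caller. `OcrRetry`, `RecognizeWithRetry`, and `maxAttempts` are names invented for this sketch, not from the original.

```csharp
using System;

public static class OcrRetry
{
    // Calls recognizeOnce until isPlausible accepts its result or
    // maxAttempts calls have been made. Returns null if every attempt
    // was rejected.
    public static string RecognizeWithRetry(
        Func<string> recognizeOnce,
        Func<string, bool> isPlausible,
        int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            // In the article's setting this would download a fresh image,
            // binarize it, and run it through Tesseract.
            string candidate = recognizeOnce();
            if (isPlausible(candidate)) return candidate;
        }
        return null;
    }
}
```

Keeping the threshold small is a sensible default, since every attempt is another request to the site.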

Because what we are recognizing is a mobile phone number, we know the rules mobile numbers follow, and checking the result against them gives a first indication of whether it is right or wrong:

The mobile phone number must be 11 digits long (since the whitelist is digits only, we only need to check that the result has 11 characters)
The mobile phone number must start with 13, 15, or 18 (this alone rules out a large share of misreads; other prefixes such as 14 and 17 are also in use, so extend the list if you need them)

Well, these two rules are about enough to improve the recognition rate effectively. I won't write out the code here; you're a programmer, right?
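For readers who would rather not write it themselves, one possible sketch of the two rules is below. `PhoneRules.IsPlausibleMobile` is a hypothetical name, and the prefix list is simply the 13, 15, 18 given above.

```csharp
using System;

public static class PhoneRules
{
    // Prefixes taken from the rule in the text; extend as needed.
    private static readonly string[] AllowedPrefixes = { "13", "15", "18" };

    public static bool IsPlausibleMobile(string digits)
    {
        // Rule 1: exactly 11 characters, all of them ASCII digits.
        if (digits == null || digits.Length != 11) return false;
        foreach (char c in digits)
        {
            if (c < '0' || c > '9') return false;
        }

        // Rule 2: the number starts with one of the allowed prefixes.
        foreach (string prefix in AllowedPrefixes)
        {
            if (digits.StartsWith(prefix, StringComparison.Ordinal)) return true;
        }
        return false;
    }
}
```

A result that passes both checks can still be wrong (a misread digit in the middle goes unnoticed), so this only filters out the obvious failures.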

At this point, our simple recognition of 58.com mobile number images is done. Other cases, such as landline numbers and simple character CAPTCHAs, work on similar principles. I hope this gives beginners a little help; next time I get the chance, I'll talk about more advanced OCR methods.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
