12306 ticket sales website new version Verification code identification confrontation

Source: Internet
Author: User
Tags image processing library

Previous articles covered the rollout: the official 12306 website pushed out a new "image verification code", ticket-grabbing software was expected to fail collectively, and once the new verification code went live at login, every verification code had been changed. Currently, all ticket-grabbing tools are broken.

Recently, the 12306 ticket booking website upgraded its verification code: the user must now recognize the content of a set of images and select the ones that match a prompt.

Media coverage has run headlines such as "ticket-grabbing tools fail collectively" and "12306's ultimate verification code". This verification code has an advantage and a matching disadvantage: it is hard for machines to recognize, but it is not easy for human eyes either.

The main worry with a verification code of this kind is that someone crawls its images, by script or by hand, saves them all, labels them with keywords, and stores the pairs in a database. That attack only works, of course, if the images are static.

Is the 12306 verification code static or dynamic? I tested this question last night: http://linux.im/2015/03/17/12306-captcha-md5-go.html. Simply put, the test found that the entire picture is generated dynamically on the server back end, which also explains why the verification code page loads slowly.
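The linked test is not reproduced in this article. As a rough sketch of the idea, assuming the check simply hashes each downloaded verification code image with MD5 and counts repeats, something like the following would do. It is written for Python 3; the function names and the sample byte strings are hypothetical stand-ins for real downloads.

```python
import hashlib
from collections import Counter


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw image bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def duplicate_report(blobs):
    """Return (total samples, distinct digests, count of the most repeated digest)."""
    counts = Counter(md5_hex(blob) for blob in blobs)
    total = sum(counts.values())
    most_repeated = max(counts.values()) if counts else 0
    return total, len(counts), most_repeated


# Hypothetical stand-ins for the bytes of three downloaded images.
samples = [b"captcha-A", b"captcha-B", b"captcha-A"]
print(duplicate_report(samples))  # (3, 2, 2)
```

If every download produces a new digest, the full image is being regenerated per request, which is what the test above reported.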

This morning we ran a second test. We split each verification code into its eight small images and hashed each one with what is rendered here as a "heartbeat hash" (presumably a perceptual hash). Out of 72225 samples, 15478 were distinct images, and the most-repeated image appeared 869 times.
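The article does not show which hash was used. As an illustration only, the sketch below uses a simple average hash, one common perceptual hash, and is not necessarily the author's method. It works on a small grid of grayscale values so that it stays self-contained; with PIL such a grid could be obtained by converting a tile to grayscale and shrinking it to, say, 8x8. All names and the tiny 2x2 grids are hypothetical.

```python
from collections import Counter


def average_hash(pixels):
    """Hash a grid of grayscale values (0-255): one bit per pixel, set when above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def summarize(hashes):
    """Return (sample count, distinct hashes, count of the most repeated hash)."""
    counts = Counter(hashes)
    return len(hashes), len(counts), max(counts.values())


# Hypothetical 2x2 grayscale tiles: tile_b is a slightly noisy copy of tile_a.
tile_a = [[10, 200], [200, 10]]
tile_b = [[12, 198], [201, 9]]
tile_c = [[200, 10], [10, 200]]

hashes = [average_hash(t) for t in (tile_a, tile_b, tile_c)]
print(hashes)                         # [6, 6, 9]
print(hamming(hashes[0], hashes[2]))  # 4
print(summarize(hashes))              # (3, 2, 2)
```

Unlike MD5, a hash of this kind maps slightly different renderings of the same picture to the same or a nearby value, which is why it suits counting near-duplicate tiles.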

Since the images are not static (judging by the near-duplicate hash comparison), we will not waste time crawling images to label them and store them in a database. Still, we want to "break" this verification code anyway, for no particular reason.

Finally, all the code fragments shown below will be open-sourced, so there is no need to worry.

Keyword Recognition

Verification Code process:

Select answers to verification code questions (multiple answers)

Take the verification code above as an example: it is delivered as one whole image. To identify the keyword, we must first crop the keyword area out of the image and then recognize it as text.

Here we use the Python PIL image processing library to select a region:

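    # Python 2 syntax, as in the rest of the article's snippets
    from PIL import Image

    def imgCut():
        # downloadImg() (not shown) fetches the full image and returns its file name
        pic_file = downloadImg()
        pic_path = "./12306_pic/%s.jpg" % pic_file
        pic_text_path = './12306_pic/%s_text.jpg' % pic_file
        pic_obj = Image.open(pic_path)
        # (left, upper, right, lower) box covering the keyword area
        box = (120,0,290,25)
        region = pic_obj.crop(box)
        region.save(pic_text_path)
        print '[*] Picture Text Picture: {}'.format(pic_text_path)
        return pic_path, pic_text_path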

 

The imgCut function first downloads the full verification code image (containing the prompt, the keyword, and the 8 images) and saves it in the ./12306_pic/ directory. It then uses the PIL library to crop the (120, 0, 290, 25) region, which is the keyword area of the image.

We can now download the verification code and crop out the keyword area. Next we identify the keyword and convert it to text.

Some open-source optical character recognition (OCR) modules should be able to do this, but they are inconvenient for users to set up and run, so I chose an online OCR website instead. It performs text recognition on the image you upload (the one we just cropped). Its accuracy is not that high, though. Remember this!

Part of the code that implements this is pasted here (pass in the image, get back the text of the keyword):

    upload_pic_url = "http://cn.docs88.com/pdftowordupload2.php"
    filename_tmp = filename.split('/')[-1]
    pic_text_content = open(filename).read()
    para = {'Filename': filename_tmp,
            'sourcename': filename_tmp,
            'sourcelanguage': 'cn',
            'desttype': 'txt',
            'Upload': 'Submit Query',}
    upload_pic = requests.post(upload_pic_url, data=para, files={"Filedata" : open(filename, 'rb')})
    text_result_url = 'http://cn.docs88.com/' + upload_pic.content[3:]
    text_result = requests.get(text_result_url)
    return text_result.content

Let's try the following:

    [+] Download Picture: https://kyfw.12306.cn/otn/passcode..
    [*] Picture Text Picture: ./12306_pic/1426580454_text.jpg
    [*] Text: shirt lining
    [+] Download Picture: https://kyfw.12306.cn/otn/passcod..
    [*] Picture Text Picture: ./12306_pic/1426580454_text.jpg
    [*] Text:) hat
    [+] Download Picture: https://kyfw.12306.cn/otn/passcod..
    [*] Picture Text Picture: ./12306_pic/1426580454_text.jpg
    [*] Text: Spring Festival couplets

The results are good enough for our testing. Do you still remember the caveat about its accuracy?

Clever Image Recognition

I previously published an article on image recognition in Buzz: "A Python script for Image Recognition Using the CloudSight API". We will not use that script this time: its recognition accuracy is high, but it is rather slow, and I do not like that much. Coincidentally, a friend wrote an article, with code, that uses Baidu image recognition. Google's image recognition is certainly good too, but we will use Baidu's here, so you don't have to worry about that.

We split the verification code image into its small images and feed each one to Baidu image recognition. The API function returns Baidu's recognition result.

The images are arranged in two rows of four; we recognize each one in turn and collect the results:

    dict_list = {}
    count = 0
    for y in range(2):
        for x in range(4):
            count += 1
            im2 = get_sub_img(pic_path, x, y)
            result = baidu_stu_lookup(im2)
            dict_list[count] = result
            print (y,x), result

The helper functions are not posted here to keep the article short. The recognition results look like this:

    (0, 0) ice sculpture | night view of the Building
    (0, 1) fried summer bread | fast food
    (0, 2) lighthouse | high tower
    (0, 3) Hamburg | McDonald's fries | shop
    (1, 0) sports coat | protective clothing | sportswear
    (1, 1) silver gray | mobile phone | mobile edition
    (1, 2) bidding | planning
    (1, 3) mobile phone
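Since get_sub_img is not shown, here is a hedged sketch of the part that matters: computing the crop box for the tile at column x, row y. The offsets, tile size, and gap are placeholder values chosen for illustration, not measurements of the real 12306 image, and tile_box and all_tile_boxes are hypothetical names. Each box is in the (left, upper, right, lower) form that PIL's crop expects.

```python
def tile_box(x, y, left=5, top=41, tile=67, gap=5):
    """Crop box for the tile at column x (0-3), row y (0-1).

    left/top/tile/gap are assumed layout values; measure the real image before use.
    """
    box_left = left + x * (tile + gap)
    box_upper = top + y * (tile + gap)
    return (box_left, box_upper, box_left + tile, box_upper + tile)


def all_tile_boxes():
    """Map tile numbers 1-8 (row by row, left to right) to their crop boxes."""
    return {y * 4 + x + 1: tile_box(x, y) for y in range(2) for x in range(4)}


boxes = all_tile_boxes()
print(boxes[1])  # (5, 41, 72, 108)
print(boxes[8])  # (221, 113, 288, 180)
```

A helper along these lines could then open the image with PIL and return the cropped tile for each box, which is roughly what the loop above appears to rely on.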

Now we can identify both the keyword and the 8 images of the verification code. We still need the machine to help us decide which images to choose.

Possible Answers

As mentioned twice above, the accuracy of the online OCR is not that high. To let the program reason about the choice somewhat intelligently, we compare the results using a pseudo word segmentation.

First, split the keyword into characters, then loop over the image recognition results and compare. This way, characters that were recognized incorrectly are simply ignored, and the image results that match the remaining characters are picked out. The 8 image results are numbered 1-8: the first row, left to right, is 1-4, and the second row is 5-8:

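    # captcha_text is a UTF-8 byte string; require more than 2 bytes of text
    if len(captcha_text.strip()) > 2:
        print '\n[*] Maybe the result of the:'
        maybe_result = []
        # Decode once so the loop walks characters rather than bytes
        captcha_chars = unicode(captcha_text.strip(), 'utf8')
        for v in dict_list:
            for text in captcha_chars:
                # Keep the tile if any keyword character appears in its labels
                if text in dict_list[v]:
                    _str_res = '%s --- %s' % (v, dict_list[v])
                    maybe_result.append(_str_res)
        # set() removes tiles that matched on more than one character
        for r in list(set(maybe_result)):
            print r
    else:
        print '[-] False'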
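The same character-overlap idea can be written as a small pure function, which makes it easier to test. This is a Python 3 sketch, not the author's code; candidate_tiles is a hypothetical name, and the Chinese keyword and labels are made-up sample data in the spirit of the results above (character-level matching only makes sense on the original Chinese text).

```python
def candidate_tiles(keyword, tile_labels):
    """Return the sorted tile numbers whose labels share a character with the keyword."""
    chars = set(keyword.strip())
    chars.discard(" ")
    hits = []
    for number, labels in tile_labels.items():
        # One shared character is enough, so OCR mistakes in other characters are tolerated.
        if any(ch in labels for ch in chars):
            hits.append(number)
    return sorted(hits)


# Made-up recognition results for three of the eight tiles.
tile_labels = {
    1: "冰雕|大厦夜景",
    6: "银灰色|手机|手机版",
    8: "手机",
}

print(candidate_tiles("手机", tile_labels))   # [6, 8]
print(candidate_tiles(")手饥", tile_labels))  # [6, 8]
```

The second call shows why the per-character comparison helps: the keyword was "recognized" with a stray bracket and one wrong character, yet the surviving character still selects the same tiles. The trade-off is false positives whenever an unrelated label happens to share a character.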

Even though the recognition rate is not that high, this lets us narrow down the likely answers as far as possible.

Not over

Is it over? Actually no.

We ran a large number of tests with the scripts, and the success rate was good enough to let some ill-intentioned people get up to something. The verification code confrontation is ongoing, though, and it is certainly getting more and more interesting :)
