1. Preface
This experiment uses a simple example to explain the principle behind cracking a verification code (CAPTCHA), and practices the following points of knowledge:
Basic Python knowledge
Use of the PIL module
2. A detailed example
First, install the Pillow (PIL) library:
$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libtiff5-dev libjpeg8-dev zlib1g-dev \
    libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python-tk
$ sudo pip install Pillow
Download the files for the experiment:
$ wget http://labfile.oss.aliyuncs.com/courses/364/python_captcha.zip
$ unzip python_captcha.zip
$ cd python_captcha
This is the verification code used in the experiment: captcha.gif
Extract text pictures
Create a new crack.py file in the working directory and edit it.
# -*- coding: utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
# convert the image to 8-bit palette mode
im = im.convert("P")
# print the color histogram
print(im.histogram())
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 3, 1, 3, 3, 0, 0, 0, 0, 0, 0, 1, 0, 3, 2, 132, 1, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 15, 0, 1, 0, 1, 0, 0, 8, 1, 0, 0, 0, 0, 1, 6, 0, 2, 0, 0, 0, 0, 18, 1, 1, 1, 1, 1, 2, 365, 115, 0, 1, 0, 0, 0, 135, 186, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 10, 0, 0, 0, 0, 1, 0, 625]
Each entry of the color histogram is the number of pixels in the image that have the corresponding palette color.
Each pixel can take one of 256 palette values. You will find that white is the most common color (index 255, the last entry: there are 625 white pixels). The red pixels are around index 220. We can extract the useful colors by sorting:
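As a quick sanity check of how histogram() counts palette indices, here is a tiny synthetic example (not part of the original experiment) that runs without captcha.gif:

```python
from PIL import Image

# A 4x4 image in palette ("P") mode, filled with index 255 (white),
# with a single pixel set to index 0 (black).
im = Image.new("P", (4, 4), 255)
im.putpixel((0, 0), 0)

hist = im.histogram()   # 256 counts, one per palette index
print(hist[0])          # 1  (one black pixel)
print(hist[255])        # 15 (the remaining fifteen pixels)
```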
his = im.histogram()
values = {}

for i in range(256):
    values[i] = his[i]

# print the 10 most common colors
for j, k in sorted(values.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(j, k)
Output:
255 625
212 365
220 186
219 135
169 132
227 116
213 115
234 21
205 18
184 15
We get the 10 most common colors in the image; 220 and 227 are the red and gray we need. Using this information we can construct a black-and-white binary image:
# -*- coding: utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
im = im.convert("P")
im2 = Image.new("P", im.size, 255)

for x in range(im.size[1]):
    for y in range(im.size[0]):
        pix = im.getpixel((y, x))
        if pix == 220 or pix == 227:  # these are the colors we found above
            im2.putpixel((y, x), 0)

im2.show()
The results obtained:
Extract a single character picture
The next task is to get the set of pixels belonging to each single character. Because this example is simple, we cut vertically:
inletter = False
foundletter = False
start = 0
end = 0

letters = []

# scan column by column; a column containing any black pixel is inside a letter
for y in range(im2.size[0]):
    for x in range(im2.size[1]):
        pix = im2.getpixel((y, x))
        if pix != 255:
            inletter = True
    if foundletter == False and inletter == True:
        foundletter = True
        start = y
    if foundletter == True and inletter == False:
        foundletter = False
        end = y
        letters.append((start, end))
    inletter = False

print(letters)
Output:
[(6, 14), (15, 25), (27, 35), (37, 46), (48, 56), (57, 67)]
This gives the column index at which each character starts and ends.
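The column-scan logic above can be sanity-checked without PIL on a small made-up bitmap (the data here is illustrative, not taken from the experiment):

```python
# 0 = ink, 255 = background; columns 1-2 and column 4 contain "letters"
bitmap = [
    [255, 0, 0, 255, 0, 255],
    [255, 0, 255, 255, 0, 255],
]
width, height = len(bitmap[0]), len(bitmap)

letters = []
foundletter = False
start = 0
for x in range(width):
    # a column is "inside a letter" if any of its pixels is not background
    inletter = any(bitmap[y][x] != 255 for y in range(height))
    if not foundletter and inletter:
        foundletter = True
        start = x
    if foundletter and not inletter:
        foundletter = False
        letters.append((start, x))

print(letters)  # [(1, 3), (4, 5)]
```

Note that, like the loop in the experiment, this sketch only closes a letter when a blank column follows it, so a letter touching the right edge of the image would be missed.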
import hashlib
import time

count = 0
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))
    m.update(("%s%s" % (time.time(), count)).encode("utf8"))
    im3.save("./%s.gif" % (m.hexdigest()))
    count += 1
(This continues from the code above.)
Cutting the image gives a small image of the region where each character is located.
AI and vector space image recognition
Here we use a vector space search engine to do the character recognition. It has many advantages:
It does not require a large number of training iterations
It cannot be over-trained
You can add or remove mis-recognized data at any time and see the effect
It is very easy to understand and to write in code
It provides graded results, so you can view the closest several matches
Anything it cannot recognize can be added to the search engine and recognized immediately
Of course it also has shortcomings: for example, classification is much slower than with a neural network, and it cannot find its own method of solving problems, and so on.
The name "vector space search engine" sounds lofty, but the principle is actually very simple. Take the example in the article: you have three documents; how do we calculate the similarity between them? The more words two documents share, the more similar the two articles are. But what if there are too many words? We choose a few key words; the chosen words are called features. Each feature is like a dimension in space (x, y, z, and so on), and a set of features forms a vector. For each document we can obtain such a vector, and the similarity of two articles is given by the angle between their vectors.
To implement a vector space with a Python class:
import math

class VectorCompare:
    # compute the magnitude of a vector
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.items():
            total += count ** 2
        return math.sqrt(total)

    # compute the cosine value between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.items():
            if word in concordance2:
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
It compares two Python dictionaries and outputs their similarity (expressed as a number between 0 and 1).
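As a quick check, identical documents should give a similarity of 1.0 and documents with no words in common should give 0.0. The word counts below are made up for illustration, and the class is repeated so the snippet runs on its own:

```python
import math

class VectorCompare:
    # magnitude of a vector stored as a dict of feature -> count
    def magnitude(self, concordance):
        return math.sqrt(sum(count ** 2 for count in concordance.values()))

    # cosine of the angle between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = sum(count * concordance2[word]
                       for word, count in concordance1.items()
                       if word in concordance2)
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

v = VectorCompare()
doc1 = {'captcha': 2, 'python': 3, 'image': 1}
doc2 = {'captcha': 2, 'python': 3, 'image': 1}  # identical to doc1
doc3 = {'cooking': 4, 'recipes': 2}             # nothing in common with doc1

print(v.relation(doc1, doc2))  # ~1.0 (up to floating point error)
print(v.relation(doc1, doc3))  # 0.0
```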
Put the previous content together
Extracting single-character images from a large number of captchas to build a training set is a lot of work, but anyone who has read the above carefully will know how to do it, so that work is omitted here. You can use the provided training set directly for the following steps.
The iconset directory is our training set.
Finally, append:
import os

# convert an image into a vector (pixel index -> pixel value)
def buildvector(im):
    d1 = {}
    count = 0
    for i in im.getdata():
        d1[count] = i
        count += 1
    return d1

v = VectorCompare()

iconset = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
           'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
           'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
           'u', 'v', 'w', 'x', 'y', 'z']

# load the training set
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/' % (letter)):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":
            temp.append(buildvector(Image.open("./iconset/%s/%s" % (letter, img))))
        imageset.append({letter: temp})

count = 0
# iterate over the cut-out captcha slices
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))

    guess = []
    # compare the slice with every training image
    for image in imageset:
        for x, y in image.items():
            if len(y) != 0:
                guess.append((v.relation(y[0], buildvector(im3)), x))

    guess.sort(reverse=True)
    print("", guess[0])
    count += 1
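To see the matching step in isolation, here is a PIL-free sketch: buildvector and relation mirror the functions above, but the "training set" is two made-up 2x2 images given as flattened pixel lists; none of this data comes from the real iconset:

```python
import math

def buildvector(pixels):
    # same idea as above: pixel index -> pixel value
    return {i: p for i, p in enumerate(pixels)}

def relation(c1, c2):
    # cosine similarity between two pixel vectors
    magnitude = lambda c: math.sqrt(sum(v * v for v in c.values()))
    topvalue = sum(v * c2[k] for k, v in c1.items() if k in c2)
    return topvalue / (magnitude(c1) * magnitude(c2))

# two made-up training "images" for the characters '7' and '9'
trainingset = {
    '7': buildvector([0, 255, 0, 0]),
    '9': buildvector([0, 0, 255, 0]),
}
# a captcha slice that is almost identical to the '7' training image
slice_vector = buildvector([0, 250, 0, 5])

guesses = sorted(((relation(vec, slice_vector), char)
                  for char, vec in trainingset.items()), reverse=True)
print(guesses[0])  # the best match is '7'
```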
Get results
Everything is ready, run our code and try it:
$ python crack.py
Output
(0.96376811594202894, '7')
(0.96234028545977002, 's')
(0.9286884286888929, '9')
(0.98350370609844473, 't')
(0.96751165072506273, '9')
(0.96989711688772628, 'j')
That is the correct answer. Good work!
Summarize
The above is the entire content of this article. I hope it can be of some help to everyone's study or work; if you have any questions, you can leave a message to discuss them.