Source: http://www.moonlord.cn/blog/blog.php?id=1408361938
[Wuhan University educational system: a full record of the intrusion and data theft] [PHP verification code recognition (OCR) write-up] [On the importance of changing the default password for the educational system] [Does death really not die?]
2014.8.15
While playing with the subscription account I wrote myself ("Month Wing technology"), I suddenly thought of the legendary tool "Wuhan University assistant" (legendary because I have only heard of it, never used it), which apparently lets you check your grades. A Baidu search showed that it crawls data from the educational system. But doesn't the educational system have a verification code? Even with an account and password, you cannot crawl any data without getting past the verification code. So how does it solve that problem?
I looked at the login page of the educational system. Uh, every time a verification code is requested, the server allocates a new session, and the client only stores a SessionID cookie (as shown in the original screenshot).
In other words, when loading this page, the client receives no information about the content of the verification code other than the image itself.
This is the most standard and soundest practice I know of, and it looks invulnerable.
Is the verification code so simple that Wuhan University assistant can crack it with a third-party library?
I promptly searched Baidu. There is a library called Valite that can recognize digits, but it demands a very clean verification code (sharp, black and white). There is also tesseract, a third-party library that can recognize letters and digits and appears to be open source... but after downloading it I found the installation and training process complicated...
In short, searching for PHP + verification code recognition turns up basically no good tutorial or source code. Simply hopeless.
Damn it. For an IT person, depending too much on third-party libraries is definitely not a good habit. No library? Then write it myself!
Following the approach used for recognizing numeric verification codes, first...
Split the picture into its pixels and store them in a matrix (a 2-dimensional array).
Analyzing the RGB value of each point and looking at an enlarged view of the verification code, it is obvious that the R (red) value is higher in the parts that belong to a character.
So the pixels that belong to a character are given a different value to tell them apart.
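A minimal sketch of this step in Python (the original was written in PHP, and its code is not reproduced in this text). The threshold value, the function names, and the tiny sample image are all assumptions for illustration; the post only says that the R value is "higher" on character pixels.

```python
# Hypothetical cut-off: the post does not state the actual R value used.
RED_THRESHOLD = 128


def binarize_by_red(pixels, threshold=RED_THRESHOLD):
    """Turn rows of (r, g, b) tuples into a 0/1 matrix (1 = character pixel)."""
    return [[1 if r > threshold else 0 for (r, g, b) in row] for row in pixels]


def render(matrix):
    """Draw the matrix as text so it can be inspected by eye."""
    return "\n".join("".join("#" if v else "." for v in row) for row in matrix)


# Made-up 3x2 image: dark background, reddish character pixels.
sample = [
    [(30, 30, 30), (200, 40, 40), (30, 30, 30)],
    [(200, 40, 40), (200, 40, 40), (30, 30, 30)],
]
matrix = binarize_by_red(sample)
print(render(matrix))
```

Printing the matrix as text is a cheap way to check the threshold by eye before any recognition logic exists.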
The final effect looks something like this:
Well, now the program can "analyze" the picture, but how can it "understand" it?
Following what I found on Baidu, this "figure" needs some processing. First of all, remove the useless "noise":
Then:
Now this "picture" looks much more recognizable.
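The noise removal described above could look roughly like the following Python sketch. It assumes a 0/1 matrix where 1 marks a character pixel, and it treats a pixel as noise when it has too few set neighbours among the 8 surrounding cells. The rule, the `min_neighbors` default, and the `passes` parameter are assumptions; the post does not say which rule the author's PHP code used.

```python
def count_neighbors(matrix, y, x):
    """Count set pixels among the 8 cells around (y, x)."""
    height, width = len(matrix), len(matrix[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                total += matrix[ny][nx]
    return total


def remove_noise(matrix, min_neighbors=2, passes=1):
    """Clear set pixels that have fewer than min_neighbors set neighbours."""
    for _ in range(passes):
        # The comprehension reads the old matrix; the name is rebound afterwards.
        matrix = [
            [1 if v and count_neighbors(matrix, y, x) >= min_neighbors else 0
             for x, v in enumerate(row)]
            for y, row in enumerate(matrix)
        ]
    return matrix


noisy = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cleaned = remove_noise(noisy)
```

Here the isolated pixel in the top-left corner is cleared while the 2x2 block survives. Calling it with `passes=2` simply repeats the same filter on its own output.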
Of course, the picture is not this clean every time. For example:
So I figured the "noise removal" pass could simply be run once more:
That completes the "removing points" stage. Next, I "fill in" some points so that the letters look more "connected":
The code has 4 letters, and obviously they need to be separated, so I made a few vertical cuts:
That works, but sometimes the cut lands in the wrong place, uh, like this:
After some effort (and n dead brain cells), the cuts stopped landing in the wrong place. Then I cut horizontally:
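The vertical and horizontal cuts above can be sketched in Python as a column projection followed by row trimming. This is the simplest possible version and an assumption on my part: it cuts wherever a column contains no character pixel, so it does not handle letters that touch each other, which the author's code eventually did. Function names and the sample matrix are hypothetical.

```python
def split_columns(matrix):
    """Return (start, end) column ranges separated by empty columns."""
    width = len(matrix[0])
    col_has_ink = [any(row[x] for row in matrix) for x in range(width)]
    ranges, start = [], None
    for x, ink in enumerate(col_has_ink):
        if ink and start is None:
            start = x                      # a letter begins here
        elif not ink and start is not None:
            ranges.append((start, x))      # the letter ended at the previous column
            start = None
    if start is not None:
        ranges.append((start, width))      # letter touches the right edge
    return ranges


def crop_letter(matrix, start, end):
    """Keep columns [start, end), then drop empty rows at top and bottom."""
    cols = [row[start:end] for row in matrix]
    rows_with_ink = [y for y, row in enumerate(cols) if any(row)]
    return cols[rows_with_ink[0]:rows_with_ink[-1] + 1]


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
ranges = split_columns(image)
letters = [crop_letter(image, s, e) for s, e in ranges]
```

Each entry in `letters` is a tightly cropped 0/1 matrix for one character, which is the form the later comparison steps work on.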
2014.8.16
At this point I suddenly realized that recognizing the verification code was not that hard.
In fact, when the same letter is cut out, its height and width, as well as the positions of the black and white points inside it, are basically the same every time.
So I promptly saved each letter's dimensions and point layout (for example, into n TXT files).
For example, from the height and width alone you can basically determine which letters are possible.
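As a rough Python illustration of narrowing down letters by size, assuming the saved letters are kept in a dictionary that maps each letter to a list of 0/1 matrices. The storage layout, the `tolerance` parameter, and the sample templates are hypothetical; the post only says the information was saved in TXT files.

```python
def size_of(matrix):
    """Return (height, width) of a 0/1 matrix."""
    return (len(matrix), len(matrix[0]))


def candidates_by_size(letter, templates, tolerance=0):
    """List template names whose height and width are within tolerance."""
    height, width = size_of(letter)
    found = set()
    for name, samples in templates.items():
        for sample in samples:
            s_height, s_width = size_of(sample)
            if abs(s_height - height) <= tolerance and abs(s_width - width) <= tolerance:
                found.add(name)
    return sorted(found)


# Made-up templates, one sample per letter.
templates = {
    "I": [[[1], [1], [1]]],
    "L": [[[1, 0], [1, 0], [1, 1]]],
    "T": [[[1, 1, 1], [0, 1, 0], [0, 1, 0]]],
}
unknown = [[1, 0], [1, 0], [1, 1]]
exact = candidates_by_size(unknown, templates)
loose = candidates_by_size(unknown, templates, tolerance=1)
```

With an exact size match only one candidate remains here, but loosening the tolerance by one pixel already brings all three letters back, which is why size alone stops being enough as more samples are collected.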
But as I collected more and more letters with the same height and width, the situation became this:
Of course, when you are lucky, you get this:
But relying on luck gives far too low a recognition success rate. Sure enough, it is better to judge by the similarity of the letter matrices (two-dimensional arrays).
(I wrote n levels of nested loops here, and my IQ was completely inadequate.)
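One way to judge matrix similarity, sketched in Python under stated assumptions: the two matrices are aligned at the top-left corner, missing cells count as background, and the score is the fraction of cells that agree. The author's actual scoring rule is not given in the text, and the template dictionary and names below are hypothetical.

```python
def similarity(a, b):
    """Fraction of cells that agree when a and b are aligned at the top-left."""
    height = max(len(a), len(b))
    width = max(len(a[0]), len(b[0]))

    def cell(matrix, y, x):
        # Cells outside a matrix are treated as background (0).
        if y < len(matrix) and x < len(matrix[0]):
            return matrix[y][x]
        return 0

    same = sum(
        1
        for y in range(height)
        for x in range(width)
        if cell(a, y, x) == cell(b, y, x)
    )
    return same / (height * width)


def best_match(letter, templates):
    """Return (name, score) of the most similar stored sample."""
    best_name, best_score = None, -1.0
    for name, samples in templates.items():
        for sample in samples:
            score = similarity(letter, sample)
            if score > best_score:
                best_name, best_score = name, score
    return best_name, best_score


templates = {
    "L": [[[1, 0], [1, 0], [1, 1]]],
    "T": [[[1, 1, 1], [0, 1, 0], [0, 1, 0]]],
}
damaged_l = [[1, 0], [1, 0], [1, 0]]   # an "L" with one pixel lost to noise
name, score = best_match(damaged_l, templates)
```

Because the score is a ratio rather than an exact comparison, a letter that lost a pixel during noise removal still matches its template, which is the point of fuzzy matching over exact lookup.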
As the number of letter TXT files kept growing, fuzzy matching was definitely still the way to go:
Then, continuing the recognition tests, I finally hit a maddening example (3 letters stuck together):
(Damn, you can't be serious!)
(So I went back and changed the letter-cutting code. Damn it, but finally, even when all 4 letters are joined together, it can cut them accurately.)
On a whim: a "black" border can be added around the picture at the very start of the code, so that "white" points on the edge can also be removed as "noise" during the "noise reduction" pass.
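The border idea can be sketched in Python as padding the pixel matrix with one extra ring of a fixed value. Which value corresponds to the author's "black" depends on how the matrix was encoded, which the text does not say, so `value=0` is an assumption, as is the function name.

```python
def add_border(matrix, value=0):
    """Surround a matrix with a one-cell border filled with value."""
    width = len(matrix[0]) + 2
    top_bottom = [value] * width
    return (
        [list(top_bottom)]
        + [[value] + list(row) + [value] for row in matrix]
        + [list(top_bottom)]
    )


padded = add_border([[1, 1], [1, 0]])
```

After padding, every original pixel has a full set of 8 neighbours, so a neighbour-counting filter can treat edge pixels exactly like interior ones without special bounds handling.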
Although there are all kinds of bizarre verification codes, there are also ones like the following, and these fully standard verification codes gave me confidence:
As the collection of letter "style" TXT files grew, I found the two letters that are recognized best: O and C. They are almost completely symmetrical, and over many experiments O and C only ever showed these different forms:
I tested 200+ verification codes: test, find errors, save TXT, fix code, over and over...
Code Part 1 (analyzing the picture based on RGB color):
Code Part 2 (fuzzy matching against the letter information saved in the TXT files):
Previously the verification code images were downloaded locally and then recognized; now I switched to fetching and recognizing them on the fly.
Recognition success:
Recognition failure:
Timing statistics (recognition is quite fast; compared with the time needed to fetch a verification code, the recognition time is negligible):
Then I started validating the results against the educational system (obviously no big problem):
2014.8.17
Next, I started doing what "Wuhan University assistant" does and crawled the data of my own account:
The person who built the system writes very tidy code (I like this kind of person). It obviously made development convenient for him (and of course it is convenient for me too).
The locations of the grades and the timetable are easy to find.
= =
So I happily started experimenting with fetching data:
And:
And:
The time needed to crawl the data seems to be slightly long:
Next, I wrote a password-guessing PHP script to go with the program I had previously written in .NET to attack the teacher's homework site, and started brute-force guessing passwords. Yeah.
(That said, PHP is effectively single-threaded, which is a weakness relative to ASP.NET and JSP, and in theory PHP execution time should be limited, so it is not suitable for writing a DDoS attack program directly.)
(Knowing several languages definitely makes this more fun =.=)
Guessing passwords. Yeah. (So much fun ~\(≧▽≦)/~)
In fact, the passwords of the educational system are not hard to guess. I guess from 19930101 straight through 19971231, which is at most 366*5=1830 tries.
For a machine, looping 1830 times... so easy.
The attack program I wrote can be run from the command line:
(The parameters mean: test from 20133025800011 to 2013302580315, using 5 threads.)
Like this:
Later I found it too slow, uh, so:
Because the verification code image is small (around 1-2 KB), it does not seem to take much bandwidth:
Then, uh, there was a program bug: after an account's password had been guessed, the program kept on trying passwords. Uh, I promptly fixed the code.
Experimental results: 30 threads, 1 hour, passwords guessed for about 30 accounts.
2014.8.18
It took about 6-7 hours last night to guess a bunch of passwords:
(Fail means all 1830 passwords were tried without success. Uh, it is possible that the verification code was misrecognized, but more likely these kids were clever enough to change their passwords!!! 23333)
I merged the TXT files into the PHP file, which makes it easy to upload to my SAE server (and seems to improve efficiency a bit).
After changing the code, the following version is definitely more efficient:
I compressed the PHP files and promptly uploaded them to my SAE server, uh.
And:
Will using SAE to fetch verification codes from the educational system be faster? I have no idea.
But this way it at least fits the so-called "load balancing" (I could also get my friends to help)...
(I have not even spent 50 of my Sina beans; this way I can burn through more beans =.=)
I started setting up the database, ready to crawl a lot of data:
At 7 in the evening, uh, I shut down the password-guessing program and counted: passwords for 119 student IDs in total. Time to start saving data...
In the end: 2,655 grade records in total, and 1,452 course selection records for next semester.
There was not much time this round; otherwise I could have swept the data of the whole college, and with spare time the whole university. ~\(≧▽≦)/~
Using the existing data to compute averages, uh, I found that the highest-scoring course is Military Theory (92.8) and the lowest is Digital Logic (60.1). Uh, luckily I thought it leaned too much toward hardware and did not choose it. I am rather clever.
Oddly enough, the average for C++ Programming (Lab) is 78.6, but the average for C++ Programming is only 60.1 (0.06 higher than Digital Logic).
The evil University Physics B (Part 1) has an average of 66.9. Uh.
The average for College English 2 is only 75.2. Does that mean more than half of the students will not be able to take the Level 4 English exam next semester?
Well, back to the point: how does Wuhan University assistant (and, as I later remembered, Course Lattice) import information from the educational system?
In fact, this is what they do:
Damn! I should have thought of it! Of course the verification code is simply shown to the user to recognize!!!
Well, it doesn't matter anymore.
= =
[PS: Doesn't this article feel a bit like a lame joke?]