URL submission is a webmaster tool provided by Baidu that offers an interface for submitting URLs manually. The interface is protected by a verification code (CAPTCHA), which makes it hard to use from a script. The following program was therefore written to recognize the verification code automatically.
Main ideas
Download the verification code image several times and submit each copy to http://lab.ocrking.com/ for recognition. Then, for each character position, count how often each letter or digit was recognized. The character with the highest count at each position is taken as that character of the verification code.
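The per-position vote described above can be sketched in isolation. This is a minimal illustration, not the author's code: the function name `vote_captcha` and the sample readings are hypothetical, and `collections.Counter` is swapped in for the hand-written counting dictionaries.

```python
from collections import Counter


def vote_captcha(readings, length=4):
    """Return the most frequent character at each position across readings."""
    # Keep only readings of the expected length; shorter or longer OCR
    # results cannot be compared position by position.
    valid = [reading for reading in readings if len(reading) == length]
    if not valid:
        raise ValueError("no readings of the expected length")
    result = []
    for position in range(length):
        counts = Counter(reading[position] for reading in valid)
        # most_common(1) returns a one-element list of (character, count).
        # Ties are not resolved in any meaningful way.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Hypothetical OCR readings of the same image; two of them contain one error.
print(vote_captcha(["a3k9", "a8k9", "a3k0", "a3k9"]))  # a3k9
```

The more readings are collected, the more likely it is that an occasional misread character is outvoted, at the cost of more requests to the OCR service.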
The full program is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import os
import re
import time

import requests

if __name__ == "__main__":
    i = 1
    # Session for Baidu; the headers mimic a normal browser visit.
    s = requests.session()
    s.headers.update({
        'Referer': 'http://zhanzhang.baidu.com/sitesubmit/index',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36',
    })
    r = s.get('http://zhanzhang.baidu.com/sitesubmit/index')
    # Separate session for the OCR service.
    s2 = requests.session()
    # Ask Baidu for the URL of the verification code image.
    r = s.post('http://zhanzhang.baidu.com/captcha',
               data={'async': 'false', 'n': time.time()})
    url = json.loads(r.text)['url']
    temp = []  # recognized 4-character strings
    while 1:
        try:
            # Download the verification code image.
            r = s.get(url)
            img_data = r.content
            # Load the OCR page to read the tokens needed for an upload.
            r = s2.get('http://lab.ocrking.com/')
            try:
                # Strip all whitespace so the patterns can ignore spacing.
                content = ''.join(r.text.split())
                sid = re.findall(r'"sid":"(.+?)"', content)[0]
                hash_1 = re.findall(r'"hash":"(.+?)"', content)[0]
                timestamp = re.findall(r'"timestamp":"(.+?)"', content)[0]
            except IndexError:
                print('Failed to get OCRKing info!')
                continue
            # Upload the image, then request recognition of the uploaded file.
            files = {'Filedata': ('icode.jpeg', img_data)}
            data = {'Filename': 'icode.jpeg', 'sid': sid,
                    'hash': hash_1, 'timestamp': timestamp}
            r = s2.post('http://lab.ocrking.com/upload.html',
                        files=files, data=data)
            r = s2.post('http://lab.ocrking.com/ocrking.html',
                        data={'upfile': r.text, 'type': 'captcha',
                              'charset': '7'})
            # NOTE: the markup around the capture group is missing from this
            # pattern as published. As written it captures a single character,
            # so the delimiters that wrap the recognized text in the OCR
            # response must be restored before the length check can pass.
            icode = re.findall(r'(.+?)', r.text)[0]
            if len(icode) != 4:
                continue
            temp.append(icode)
            i = i + 1
            # i starts at 1, so the loop stops after two usable results.
            if i == 3:
                break
        except Exception as e:
            print(e)

    # Count, for each of the four positions, how often each character occurs.
    a = {'0': {}, '1': {}, '2': {}, '3': {}}
    for aa in temp:
        for i in range(4):
            a[str(i)][aa[i]] = a[str(i)].get(aa[i], 0) + 1

    # Pick the most frequent character at each position.
    icode = ['', '', '', '']
    for index in a:
        temp_times = 0
        for index_1 in a[index]:
            if a[index][index_1] >= temp_times:
                temp_times = a[index][index_1]
                icode[int(index)] = index_1
    icode = ''.join(icode)

    # Save the last downloaded image under the recognized code.
    if not os.path.isdir('temp'):
        os.makedirs('temp')
    img_name = os.path.join('temp', icode + '.png')
    with open(img_name, 'wb') as file_object:  # binary mode for image data
        file_object.write(img_data)

    # r = s.post('http://zhanzhang.baidu.com/sitesubmit/sitepost', data={'url': 'http://lab.ocrking.com/', 'captcha': icode})
    # print(r.text)
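The program reads three upload tokens from the OCR page by stripping all whitespace and then matching quoted key/value pairs. The sketch below isolates that step so it can be tried without network access. The helper name `extract_tokens`, the lowercase key names and the sample page fragment are assumptions made for illustration; the real page layout may differ.

```python
import re


def extract_tokens(page_text):
    """Return a dict with sid, hash and timestamp, or None if any is missing."""
    # Remove every whitespace character so the patterns need not allow
    # for spaces or line breaks between the key, the colon and the value.
    compact = "".join(page_text.split())
    tokens = {}
    for name in ("sid", "hash", "timestamp"):
        match = re.search(r'"%s":"(.+?)"' % name, compact)
        if match is None:
            return None
        tokens[name] = match.group(1)
    return tokens


# Hypothetical page fragment, made up for this example only.
sample = '''
    var settings = { "sid" : "abc123",
                     "hash" : "9f8e7d",
                     "timestamp" : "1400000000" };
'''
print(extract_tokens(sample))
```

Returning `None` when a token is missing lets the caller skip the current attempt and retry, which mirrors the `continue` in the main loop.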