URL submission is a webmaster tool provided by Baidu: an interface through which webmasters can manually submit URLs for inclusion in the index. The interface is protected by a verification code (captcha), which makes it hard to automate. The following program was therefore written to recognize the verification code automatically:
Main ideas
Fetch the verification code image several times and submit each copy to http://lab.ocrking.com/ for recognition. Then count, for each character position, how often each letter or digit was recognized; the most frequent character at each position is taken as the verification code.
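The counting step can be sketched on its own, independent of any network calls. This is a minimal illustration using `collections.Counter` rather than the nested dictionaries of the original script; the function name `vote_per_position` and the sample readings are hypothetical and exist only for this example.

```python
from collections import Counter


def vote_per_position(readings, length=4):
    """Return the most common character at each position across readings."""
    # Readings of the wrong length cannot be aligned, so they are ignored.
    usable = [r for r in readings if len(r) == length]
    if not usable:
        raise ValueError("no usable readings")
    result = []
    for position in range(length):
        counts = Counter(r[position] for r in usable)
        # most_common(1) returns [(char, count)] for the top character.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    # Made-up OCR readings of one captcha, for illustration only.
    print(vote_per_position(["a3k9", "a8k9", "a3k0", "a3x9"]))  # prints a3k9
```

Each reading contains one misrecognized character at most, yet the vote recovers the full code because the errors fall on different positions. With only two readings a tie is likely, so an odd number of samples (three or more) tends to give a more reliable result.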
The complete script is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import re
import time

import requests

if __name__ == "__main__":
    i = 1
    # Session for Baidu; the Referer and User-Agent mimic a normal browser visit.
    s = requests.session()
    s.headers.update({
        'Referer': 'http://zhanzhang.baidu.com/sitesubmit/index',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36',
    })
    r = s.get('http://zhanzhang.baidu.com/sitesubmit/index')
    # Separate session for the OCR service.
    s2 = requests.session()
    # Ask Baidu for the URL of the current captcha image.
    r = s.post('http://zhanzhang.baidu.com/captcha', data={'async': 'false', 'n': time.time()})
    url = json.loads(r.text)['url']
    temp = []
    while 1:
        try:
            # Download the captcha image.
            r = s.get(url)
            img_data = r.content
            # Read the upload parameters embedded in the OCRKing page.
            r = s2.get('http://lab.ocrking.com/')
            try:
                content = ''.join(r.text.split())
                sid = re.findall(r'"sid":"(.+?)"', content)[0]
                hash_1 = re.findall(r'"hash":"(.+?)"', content)[0]
                timestamp = re.findall(r'"timestamp":"(.+?)"', content)[0]
            except IndexError:
                print('Error getting OCRKing info!')
                continue
            # Upload the image, then request recognition of the uploaded file.
            files = {'Filedata': ('icode.jpeg', img_data)}
            data = {'Filename': 'icode.jpeg', 'sid': sid, 'hash': hash_1, 'timestamp': timestamp}
            r = s2.post('http://lab.ocrking.com/upload.html', files=files, data=data)
            r = s2.post('http://lab.ocrking.com/ocrking.html', data={'upfile': r.text, 'type': 'captcha', 'charset': '7'})
            icode = re.findall(r'<OcrResult>(.+?)</OcrResult>', r.text)[0]
            # Discard results that are not exactly four characters long.
            if len(icode) != 4:
                continue
            temp.append(icode)
            i = i + 1
            # Stops after two successful recognitions; raise the limit for a more reliable vote.
            if i == 3:
                break
        except Exception as e:
            print(e)
    # Count, for each of the four positions, how often each character was recognized.
    a = {'0': {}, '1': {}, '2': {}, '3': {}}
    for aa in temp:
        i = 0
        while i <= 3:
            try:
                a[str(i)][aa[i]] = a[str(i)][aa[i]] + 1
            except KeyError:
                a[str(i)][aa[i]] = 1
            i = i + 1
    # Keep the most frequent character at each position (on a tie, the one iterated last wins).
    icode = ['', '', '', '']
    for index in a:
        temp_times = 0
        for index_1 in a[index]:
            if a[index][index_1] >= temp_times:
                temp_times = a[index][index_1]
                icode[int(index)] = index_1
    icode = ''.join(icode)
    # Save the last downloaded image under the recognized code.
    # Binary mode is required for image bytes, and the temp directory must already exist.
    img_name = 'temp\\' + icode + '.png'
    file_object = open(img_name, 'wb')
    file_object.write(img_data)
    file_object.close()
    # Uncomment to submit a URL together with the recognized code:
    #r = s.post('http://zhanzhang.baidu.com/sitesubmit/sitepost', data={'url': 'http://lab.ocrking.com/', 'captcha': icode})
    #print(r.text)
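The script indexes the result of `re.findall` directly, so a missing field surfaces as an exception. As a rough sketch, the same extraction can be wrapped in helpers that return `None` instead, which also makes the parsing testable without contacting either service. The function names `extract_upload_params` and `extract_ocr_result` are hypothetical, the lowercase key names `sid`, `hash` and `timestamp` are an assumption about the page's embedded data, and the sample strings are made up for illustration.

```python
import re


def extract_upload_params(page_text):
    """Return (sid, hash, timestamp) from the page text, or None if any is missing."""
    # Strip all whitespace first so the patterns need not allow for spacing.
    compact = "".join(page_text.split())
    values = []
    for key in ("sid", "hash", "timestamp"):  # assumed key names
        match = re.search(r'"%s":"(.+?)"' % key, compact)
        if match is None:
            return None
        values.append(match.group(1))
    return tuple(values)


def extract_ocr_result(xml_text):
    """Return the text inside <OcrResult>...</OcrResult>, or None if absent."""
    match = re.search(r"<OcrResult>(.+?)</OcrResult>", xml_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up page fragment and response, for illustration only.
    page = '{ "sid" : "abc123", "hash" : "deadbeef", "timestamp" : "1400000000" }'
    print(extract_upload_params(page))
    print(extract_ocr_result("<Results><OcrResult>a3k9</OcrResult></Results>"))
```

Returning `None` lets the calling loop decide whether to retry or give up, rather than relying on a broad exception handler to hide the failure.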