Python Crawl Verification Code

Last Update:2017-08-11 Source: Internet

Author: User


Main features implemented:
- Logging in to the page
- Dynamically waiting for the page to load
- Downloading verification codes

Early on I had the idea of letting a script carry out a routine task on the site automatically, which saves a lot of manual effort (I am rather lazy). I spent a few days writing it, hoping to solve the problem at the root by recognizing the verification code as well. That turned out to be too ambitious: the recognition accuracy was far too low, and the plan was shelved once again.
I hope sharing this experience is useful, and I welcome discussion.

Note: The username and password in the code are not valid!

Opening the browser with Python

Working with the urllib2 module is rather cumbersome, and for some pages it is very inconvenient to keep cookies. So I use the Selenium module under Python 2.7 to operate on the web page.

Test page: http://graduate.buct.edu.cn

Opening the web page (chromedriver needs to be downloaded first):
To support the output of Chinese characters, the sys module is used to change the default encoding to UTF-8.

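    # Python 2.7 with the Selenium bindings
    from selenium.webdriver.support.ui import Select, WebDriverWait
    from selenium import webdriver
    from selenium import common
    from PIL import Image
    import pytesser
    import sys

    # Python 2 only: switch the default encoding to UTF-8 so Chinese text prints
    reload(sys)
    sys.setdefaultencoding('utf8')

    broswer = webdriver.Chrome()  # requires chromedriver
    username = 'test'
    password = 'test'
    url = 'http://graduate.buct.edu.cn'
    broswer.get(url)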

Waiting for page loading to complete

Selenium's WebDriverWait is used for this. The code above has already imported it and loaded the page.

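    url = 'http://graduate.buct.edu.cn'
    broswer.get(url)
    wait = WebDriverWait(broswer, 5)  # set the timeout to 5 s
    # The code that fills in and submits the form goes here
    # xpathMenuCheck is not defined in this excerpt; presumably it is the XPath
    # of an element on the page that loads next
    wait.until(lambda driver: driver.find_element_by_xpath(xpathMenuCheck))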
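The idea behind such an explicit wait can be tried without a browser: call a condition again and again until it returns something truthy or the timeout runs out. The helper below is only a simplified sketch written for Python 3, not Selenium's implementation. The names wait_until, poll_interval and page_ready are hypothetical, and page_ready merely pretends that an element shows up on the third attempt.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1, ignored=()):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # treat an ignored exception as "not ready yet"
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Hypothetical stand-in for "the element has appeared on the page":
# it fails twice and succeeds on the third call.
attempts = {"n": 0}


def page_ready():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LookupError("element not found yet")
    return "menu element"


found = wait_until(page_ready, timeout=5.0, poll_interval=0.01,
                   ignored=(LookupError,))
print(found, "after", attempts["n"], "attempts")
```

WebDriverWait works along the same lines: by default it swallows NoSuchElementException while polling and raises a timeout error if the condition never succeeds.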

Locating elements and entering text

Next we need to log in. I use Chrome here: right-click the field that needs to be filled in and choose Inspect, and the browser automatically opens the developer tools (F12). This feature is needed throughout to find the relevant elements.

The UserRole section below is the part that selects the "teacher side".

Here we see value="1". Since this is a drop-down box, we only need a way to assign the desired value to UserRole.
This is done with Selenium's Select class. The controls are located with the find_element_by_* methods, which map one-to-one onto the attribute being searched and are very convenient.

    select = Select(broswer.find_element_by_id('UserRole'))
    select.select_by_value('2')
    name = broswer.find_element_by_id('username')
    name.send_keys(username)
    pswd = broswer.find_element_by_id('password')
    pswd.send_keys(password)
    btnlg = broswer.find_element_by_id('btnLogin')
    btnlg.click()

This is the result of the script filling in the form automatically; the site then jumps to the next page.

What I need here is the ability to sign up for academic reports automatically.

Right-click an existing report to inspect the information about the event. There are no reports today, so only the header row is displayed, but valid reports that appear later can be recognized in the same place.

To locate the elements I chose XPath first. In my tests, being able to locate an element uniquely proved very useful.

//*[@id="dgData00"]/tbody/tr/td[2]  (the preceding is the XPath)

Crawling information

The next step we will take is to crawl the existing valid reports:

    # Find the valid reports: row 1 is the header, so counting starts at row 2
    flag = 1
    count = 2
    count_valid = 0
    while flag:
        try:
            category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
            count += 1
        except common.exceptions.NoSuchElementException:
            break  # no such row: the end of the table has been reached

    # Get the report information
    flag = 1
    for currentLecture in range(2, count):
        # Category
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
        # Name
        name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
        # Publishing unit
        unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
        # Start time
        startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
        # Cut-off time
        endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
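The row-by-row lookup can also be tried offline on a small sample table. The sketch below is written for Python 3 and uses the standard library's xml.etree.ElementTree in place of a live browser. ElementTree supports only a subset of XPath and needs a leading "." in the path. The table contents and the helper names cell_xpath and read_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: row 1 is the header and the data rows start at
# row 2, as in the report table.
SAMPLE = """
<html><body>
<table id="dgData00"><tbody>
  <tr><td>Category</td><td>Name</td><td>Unit</td><td>Start</td><td>End</td></tr>
  <tr><td>A</td><td>Report one</td><td>Unit X</td><td>09:00</td><td>10:00</td></tr>
  <tr><td>B</td><td>Report two</td><td>Unit Y</td><td>14:00</td><td>15:00</td></tr>
</tbody></table>
</body></html>
"""


def cell_xpath(table_id, row, col):
    """Build the XPath of one table cell (row and col are 1-based)."""
    return ".//*[@id='%s']/tbody/tr[%d]/td[%d]" % (table_id, row, col)


def read_rows(root, table_id, first_row=2, columns=5):
    """Collect the text of every data row until a row is missing."""
    rows = []
    row = first_row
    while True:
        first_cell = root.find(cell_xpath(table_id, row, 1))
        if first_cell is None:  # no such row: the table has ended
            break
        rows.append([root.find(cell_xpath(table_id, row, col)).text
                     for col in range(1, columns + 1)])
        row += 1
    return rows


root = ET.fromstring(SAMPLE.strip())
for record in read_rows(root, "dgData00"):
    print(record)
```

Building the path in one helper keeps the string concatenation in a single place, so a change in the table layout only has to be fixed once.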

Crawl Verification Code

After inspecting the verification code element in the web page, we find that it links to identifyingcode.apsx. We will load this page and fetch verification codes in bulk.

The idea of the crawl is to use Selenium to take a screenshot of the current page (only the visible part) and save it locally. Paging and capturing a specific position require studying functions such as broswer.set_window_position(**). The verification code is then located by hand, cropped out with the PIL module, and saved.
Finally, the pytesser character recognition module (based on Google's Tesseract) is called from Python. However, the site's verification codes contain a lot of interference and the characters are rotated, so only some of the characters are recognized.

    # Get the verification code and recognize it (a single image only)
    authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # address of the verification code
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (0, 0, 64, 28)  # crop box: (left, upper, right, lower)
    i = Image.open('text.png')
    frame4 = i.crop(rangle)
    frame4.save('authcode.png')
    qq = Image.open('authcode.png')
    text = pytesser.image_to_string(qq).strip()
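PIL's crop expects its box as (left, upper, right, lower) in pixels, measured from the top-left corner of the image. The small Python 3 helper below turns a position and a size into such a box and clips it to the screenshot. The name crop_box is hypothetical and the 800x600 screenshot size is an assumed value, not something measured on the site.

```python
def crop_box(x, y, width, height, image_size):
    """Return a (left, upper, right, lower) box clipped to image_size."""
    image_width, image_height = image_size
    left = max(0, x)
    upper = max(0, y)
    right = min(image_width, x + width)
    lower = min(image_height, y + height)
    if right <= left or lower <= upper:
        raise ValueError("the requested region lies outside the image")
    return (left, upper, right, lower)


# A 64x28 region at the top-left corner of an assumed 800x600 screenshot.
box = crop_box(0, 0, 64, 28, image_size=(800, 600))
print(box)  # (0, 0, 64, 28)
```

The resulting tuple can be passed directly to Image.crop.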

    # Get verification codes in bulk
    authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # address of the verification code
    # Collect learning samples
    for count in range(10):
        broswer.get(authCodeURL)
        broswer.save_screenshot('text.png')
        # The right and lower bounds are garbled in the source; 64 and 28 are
        # assumed here from the single-image example
        rangle = (1, 1, 64, 28)
        i = Image.open('text.png')
        frame4 = i.crop(rangle)
        frame4.save('authcode' + str(count) + '.png')
        print('count: ' + str(count))
        broswer.refresh()
    broswer.quit()

Some of the original verification codes:

As the verification codes above show, the characters are rotated, and the overlap caused by the rotation has a very large effect on recognition. I tried training a neural network, but without extracting feature vectors the accuracy was ridiculously low.
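One common first step toward extracting features from such images is to binarize them and count the dark pixels in each column. Gaps in that profile hint at where one character ends and the next begins, although overlapping rotated characters blur exactly those gaps. The Python 3 sketch below works on a tiny hand-made grayscale grid rather than a real verification code, and it is not the approach I used. The function names, the threshold of 128 and the sample grid are all assumptions for illustration.

```python
def binarize(gray, threshold=128):
    """Map grayscale values (0-255) to 1 for dark pixels and 0 for light ones."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def column_profile(binary):
    """Count the dark pixels in every column."""
    return [sum(column) for column in zip(*binary)]


def split_columns(profile):
    """Return (start, end) column ranges whose profile is non-zero."""
    segments = []
    start = None
    for index, count in enumerate(profile):
        if count and start is None:
            start = index
        elif not count and start is not None:
            segments.append((start, index))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


# Hand-made 4x7 grid: two dark blobs separated by one light column.
SAMPLE = [
    [250,  20,  30, 240,  10, 245, 250],
    [250,  25, 240, 240,  15,  20, 250],
    [250,  30,  35, 240,  10, 245, 250],
    [250, 250, 250, 240, 250, 250, 250],
]

binary = binarize(SAMPLE)
profile = column_profile(binary)
print(profile)                 # [0, 3, 2, 0, 3, 1, 0]
print(split_columns(profile))  # [(1, 3), (4, 6)]
```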

This is another author's write-up after hands-on practice:
http://www.cnblogs.com/sweetwxh/p/captcha_recognize.html

After reading it I sobered up and decided not to continue with recognizing the verification code. Still, this was a very practical experience, and the same approach can be used to crawl all sorts of data later.

This article is an English version of an article originally published in Chinese on aliyun.com.
