Main features implemented:
- Logging in to the site
- Waiting dynamically for web pages to load
- Downloading verification-code (captcha) images
The original idea was to have a script perform a task automatically and save a lot of manual effort (I am lazy). I spent a few days writing it, hoping to solve the problem at its root by fully automating captcha recognition, but the difficulty was too high and the recognition accuracy too low, so I have shelved that part for now.
I hope sharing this experience is useful to you.
Opening the browser with Python
The urllib2 module is troublesome to work with for this task: some pages also require cookies to be saved, which is very inconvenient. So I use the selenium module under Python 2.7 to drive the web pages instead.
Test page: http://graduate.buct.edu.cn
Opening the page (this requires downloading ChromeDriver first):
To support Chinese character output, we also call the sys module and change the default encoding to UTF-8:
from selenium import webdriver
from selenium import common
from selenium.webdriver.support.ui import Select, WebDriverWait
from PIL import Image
import pytesser
import sys

reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
Waiting for the web page to finish loading
For this we use WebDriverWait from selenium; with the imports above already loaded, the wait is set up as follows (xpathMenuCheck is the XPath of an element that only appears once the page has finished loading):
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5)  # set the timeout to 5 s
# the code that fills in and submits the form goes here
elm = wait.until(lambda driver: driver.find_element_by_xpath(xpathMenuCheck))
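Under the hood, WebDriverWait simply polls a condition until it returns something truthy or the timeout expires. A minimal plain-Python sketch of that idea (the helper name and error type below are my own, not Selenium's):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Mirrors the core idea of Selenium's WebDriverWait: call the condition
    repeatedly, return its result as soon as it is truthy, raise on timeout.
    (Sketch only; names are hypothetical, not Selenium's API.)
    """
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            value = condition()
            if value:
                return value
        except Exception as exc:  # e.g. NoSuchElementException in Selenium
            last_error = exc
        time.sleep(poll)
    raise RuntimeError('condition not met within %.1f s (last error: %r)'
                       % (timeout, last_error))
```

Selenium's real implementation additionally accepts a set of exceptions to ignore and raises its own TimeoutException.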
Element positioning and text input
Next we need to sign in. I use Chrome here: right-click the part of the page that needs to be filled in and choose Inspect, and the browser jumps straight to that element in the developer tools (F12). This is how we find the attributes of the relevant controls.
[Screenshot of the inspected form element: http://www.jb51.net/uploadfile/Collfiles/20160414/20160414092144893.png]
Here we see a value="1". Given how a drop-down box works, we just have to find a way to assign the right value to the UserRole control.
This is done with Selenium's Select class; the controls are located with the find_element_by_** methods, which map directly onto the attributes seen in the inspector, very convenient.
select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()
This is the form being filled out automatically by the script, after which it jumps to the next page.
What I need here is a function that automatically registers for academic reports.
Right-click a report entry to find the information related to that activity. Since no report is open at the moment, only the titles are displayed, but identifying subsequent valid reports works the same way.
For locating elements I prefer XPath; in my tests it uniquely locates an element's position, which is very useful.
//*[@id="dgData00"]/tbody/tr/td[2] (this is the XPath)
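Every cell in the report table follows this same pattern, so a small helper can build the locator for any row and column (a hypothetical helper, not part of the original script; XPath indices are 1-based):

```python
def row_xpath(row, col, table_id='dgData00'):
    """Build the XPath for cell (row, col) of the report table.

    Hypothetical helper: it just fills the row and column indices into the
    XPath pattern used throughout this post.
    """
    return '//*[@id="%s"]/tbody/tr[%d]/td[%d]' % (table_id, row, col)
```

For example, row_xpath(2, 1) yields the locator of the first cell in the first data row, '//*[@id="dgData00"]/tbody/tr[2]/td[1]'.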
Crawling information
The next step is to crawl the reports that are currently valid:
# find the valid reports
flag = 1
count = 2
count_valid = 0
while flag:
    try:
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
        count += 1
    except common.exceptions.NoSuchElementException:
        break

# fetch the report information
flag = 1
for currentLecture in range(2, count):
    # category
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
    # name
    name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
    # organizer
    unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
    # start time
    startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
    # end time
    endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
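The five lookups per row can also be driven by a single field table. The sketch below is a hypothetical refactoring, not the original script: get_text stands in for broswer.find_element_by_xpath(...).text, so the logic can be shown without a live browser.

```python
# column order matches the td[1]..td[5] cells of the report table
FIELDS = ['category', 'name', 'unitsPublish', 'startTime', 'endTime']

def read_report_row(get_text, row):
    """Return one report row as a dict mapping field name -> cell text.

    `get_text` is any callable taking an XPath and returning the cell text;
    with Selenium it would be: lambda xp: broswer.find_element_by_xpath(xp).text
    """
    xpath = '//*[@id="dgData00"]/tbody/tr[%d]/td[%d]'
    return dict((field, get_text(xpath % (row, col + 1)))
                for col, field in enumerate(FIELDS))
```

Collecting all rows is then just a loop: [read_report_row(get_text, r) for r in range(2, count)].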
Crawling the verification code
Inspecting the verification-code element on the page, we find the link behind it, identifyingcode.apsx. We load that page directly and fetch verification codes in bulk.
The crawling idea is to use selenium to screenshot the current page (only what is visible) and save it locally. For pages that need scrolling, or for capturing a specific region, look into the broswer.set_window_position(**) family of functions. Then locate the verification code manually, and crop and save it with the PIL module.
Finally, we call pytesser (the Python wrapper around Google's Tesseract) for character recognition. However, the site's verification code has a lot of interference, and the characters are rotated, so only some of the characters are recognized.
# fetch the verification code and recognize it (a single image)
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # URL of the captcha image
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
i = Image.open('text.png')
frame4 = i.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()
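Instead of hard-coding the crop rectangle rangle, it could be derived from the captcha element's location and size attributes, which Selenium exposes on every element. A sketch (the helper is my own, and it assumes the screenshot uses the same pixel coordinates as the page, i.e. no device-pixel scaling):

```python
def crop_box(location, size):
    """Compute a PIL crop box (left, upper, right, lower) from a Selenium
    element's `location` ({'x': ..., 'y': ...}) and `size`
    ({'width': ..., 'height': ...}) dictionaries.

    Hypothetical alternative to a hard-coded rectangle; assumes no
    device-pixel scaling between the page and the screenshot.
    """
    left = location['x']
    upper = location['y']
    return (left, upper, left + size['width'], upper + size['height'])
```

Usage with Selenium would look like: elem = broswer.find_element_by_xpath(...); frame4 = i.crop(crop_box(elem.location, elem.size)).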
# fetch verification codes in bulk
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # URL of the captcha image
# collect training samples
for count in range(10):
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (1, 1, 62, 27)
    i = Image.open('text.png')
    frame4 = i.crop(rangle)
    frame4.save('authcode' + str(count) + '.png')
    print 'count:' + str(count)
    broswer.refresh()
broswer.quit()
The crawled verification codes
Some of the original captcha images:
As the verification codes above show, the characters are rotated, and the overlap caused by the rotation has a significant effect on subsequent recognition. I tried training a neural network on them, but without extracting feature vectors the recognition accuracy was far too low.
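A first preprocessing step that usually precedes feature extraction is binarization: mapping the grayscale image to pure foreground/background pixels so interference can be cleaned up. A minimal sketch on a plain list-of-rows image (real code would operate on PIL images, e.g. via Image.point, and would also need to deskew the rotated characters):

```python
def binarize(pixels, threshold=128):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Pixels at or above `threshold` become background (0); darker pixels
    become foreground (1).  Minimal sketch of the usual first step before
    feature extraction / recognition.
    """
    return [[0 if p >= threshold else 1 for p in row] for row in pixels]
```

For example, binarize([[200, 30], [120, 255]]) gives [[0, 1], [1, 0]].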
That concludes this detailed walkthrough of crawling verification codes with a Python crawler. I hope it helps you!