Python Crawler for Crawling Verification Codes: Implementation Details

Main implementation features:

-Login page

-Dynamically waiting for the web page to load

-Verification code download

My initial idea was to have a script carry out the whole task automatically, saving a lot of manual effort (I am lazy by nature). I spent a few days writing it, hoping to also complete the recognition of the verification codes and solve the problem once and for all, but the difficulty was too high and the recognition accuracy too low, so that part is shelved for now.
I hope this experience can be shared and discussed with you.

Opening the browser with Python

Operating with the urllib2 module is rather troublesome, and for some pages you also need to save cookies, which is very inconvenient. So I use the Selenium module under Python 2.7 to operate on the web pages instead.
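
For comparison, here is a minimal sketch (not from the original script) of what the same first request looks like with urllib2 plus manual cookie handling; every later request has to reuse the opener, and JavaScript never runs, which is exactly the inconvenience Selenium avoids:

import urllib2
import cookielib

# manual cookie plumbing that Selenium handles for us automatically
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open('http://graduate.buct.edu.cn').read()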

Test page: http://graduate.buct.edu.cn

Open the web page (ChromeDriver needs to be downloaded first):

To support Chinese character output, we also call the sys module and change the default encoding to UTF-8.

from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium import webdriver
from selenium import common
from PIL import Image
import pytesser
import sys

reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)

Waiting for the web page to finish loading

This uses WebDriverWait from Selenium; the page itself was already loaded by the code above:

url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5) # set the timeout to 5 s
# the code that fills in and submits the form goes here
elm = wait.until(lambda driver: driver.find_element_by_xpath(xpathMenuCheck)) # xpathMenuCheck is the XPath of the element to wait for
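
As an aside, Selenium also ships ready-made wait conditions in expected_conditions, which can replace the hand-written lambda; a small sketch (the dgData00 XPath is taken from the report table used later in this post):

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(broswer, 5)
# wait until the report table appears in the DOM
elm = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="dgData00"]')))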

Element positioning and character input

Next we need to sign in. I use Chrome here: right-click the field that needs to be filled in and choose Inspect; the browser jumps to the developer tools (F12) with that element selected (this is how the relevant elements are found).

[Screenshot: the inspected login form in Chrome's developer tools, http://www.jb51.net/uploadfile/Collfiles/20160414/20160414092144893.png]

Here we see a value="1". Given how a drop-down box works, we just need to find a way to assign the right value to UserRole.
This is done with Selenium's Select module; the controls are located with the find_element_by_* methods, which correspond directly to the page elements, very convenient.

select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()

With this, the form is filled out automatically by the script, which then jumps to the next page.
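
If you want the script to notice a failed login instead of silently carrying on, one simple sketch is to wait for the URL to change; this assumes the site redirects after a successful login, which is not guaranteed:

wait = WebDriverWait(broswer, 5)
# assumption: a successful login navigates away from the login URL
wait.until(lambda driver: driver.current_url != url)
print 'logged in, now at: ' + broswer.current_url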


What I need here is a function that automatically registers for the academic reports.


Right-click on a report entry that should be available to find the information related to this activity. Since there is no report at the moment, only the title is displayed, but recognizing valid reports later works in a similar way.


For locating elements I prefer XPath; according to my tests it can uniquely locate an element's position, which is very useful.

//*[@id="dgData00"]/tbody/tr/td[2] (前面是xpath)

Crawling information

The next step we will take is to crawl the existing valid reports:

# look for valid reports
flag = 1
count = 2
count_valid = 0
while flag:
  try:
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
    count += 1
  except common.exceptions.NoSuchElementException:
    break

# get the report information
flag = 1
for currentLecture in range(2, count):
  # category
  category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
  # name
  name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
  # publishing unit
  unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
  # start time
  startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
  # end time
  endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
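
The five look-ups above differ only in the td index, so they can be folded into a small helper; cell_text below is a hypothetical name, not something from the original script:

def cell_text(row, col):
  # hypothetical helper: read one cell of the dgData00 table
  xpath = '//*[@id="dgData00"]/tbody/tr[%d]/td[%d]' % (row, col)
  return broswer.find_element_by_xpath(xpath).text

for currentLecture in range(2, count):
  print cell_text(currentLecture, 2) + ' | ' + cell_text(currentLecture, 4)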

Crawling the verification codes

By inspecting the verification-code element on the page, we find the link behind it, identifyingcode.aspx. We load this page directly and fetch verification codes in bulk.

The crawling idea is to use Selenium to screenshot the current page (only the visible part is captured) and save it locally. If you need to scroll the page or capture a specific region, look into the broswer.set_window_position(**) family of functions; the verification code is then located manually, and the PIL module crops and saves that region.

Finally I call Google's pytesser character recognition from Python, but the site's verification codes contain a lot of interference and, on top of that, the characters are rotated, so only some of the characters are recognized.

# get the verification code and recognize it (a single image)
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src') # URL of the verification code image
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
i = Image.open('text.png')
frame4 = i.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()

# get verification codes in bulk
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src') # URL of the verification code image
# collect training samples
for count in range(10):
  broswer.get(authCodeURL)
  broswer.save_screenshot('text.png')
  rangle = (1, 1, 62, 27)
  i = Image.open('text.png')
  frame4 = i.crop(rangle)
  frame4.save('authcode' + str(count) + '.png')
  print 'count:' + str(count)
  broswer.refresh()
broswer.quit()
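
Since the interference is a big part of why recognition fails, it may be worth cleaning the image before handing it to pytesser. A minimal sketch, assuming dark glyphs on a light background (the threshold of 140 is a guess that would need tuning for this site):

# minimal preprocessing sketch before recognition
img = Image.open('authcode.png').convert('L')     # convert to grayscale
img = img.point(lambda p: 255 if p > 140 else 0)  # binarize; 140 is an assumed threshold
img.save('authcode_clean.png')
text = pytesser.image_to_string(img).strip()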

The downloaded verification codes

Some of the original verification code images:

As the verification codes above show, the characters are rotated, and the overlap caused by the rotation has a significant effect on subsequent recognition. I have tried training a neural network on them, but without proper feature-vector extraction the accuracy is ridiculously low.

This concludes the detailed introduction to implementing verification-code crawling with a Python crawler. I hope it helps you!
