Python crawler verification code implementation

Source: Internet
Author: User
This article describes how to implement a Python crawler that logs in to a webpage and downloads its verification codes (CAPTCHAs).

Main functions:

- Log in to a webpage

- Wait dynamically for the webpage to load

- Download the verification code

A long time ago I had the idea of automating a task with a script to save a lot of manual work (I am rather lazy). It took a few days to write the code. I set out to recognize the verification code and solve the problem at its root, but that proved too difficult and the recognition accuracy was too low, so the plan stalled once again.
I hope sharing this experience is useful to you.

Open a browser from Python

The urllib2 module is cumbersome to work with, and for some webpages it makes saving cookies inconvenient. So here I use the selenium module under Python 2.7 to operate on webpages.
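To make the cookie point concrete: with the standard library alone, cookies persist across requests only if you wire a cookie jar into an opener yourself. The sketch below shows that plumbing. `make_cookie_opener` is a hypothetical helper name, the imports fall back from the Python 3 module names to the Python 2.7 ones, and no request is actually sent.

```python
try:  # Python 3 module names
    from http.cookiejar import CookieJar
    from urllib.request import build_opener, HTTPCookieProcessor
except ImportError:  # Python 2.7 module names
    from cookielib import CookieJar
    from urllib2 import build_opener, HTTPCookieProcessor


def make_cookie_opener():
    """Return (opener, jar); the jar collects cookies set by responses."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# Every page must then be fetched through opener.open(url) so that the
# same jar is reused; a real browser driven by selenium does this itself.
```

A browser session keeps its cookies without any of this, which is one reason selenium is more convenient for login flows.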

Test webpage: http://graduate.buct.edu.cn

Open the webpage (chromedriver must be downloaded first):

To support Chinese character output, we reload the sys module and change the default encoding to UTF-8.


    from selenium.webdriver.support.ui import Select, WebDriverWait
    from selenium import webdriver
    from selenium import common
    from PIL import Image
    import pytesser
    import sys

    # Python 2 only: make UTF-8 the default encoding for Chinese output
    reload(sys)
    sys.setdefaultencoding('utf8')

    # Start Chrome (requires chromedriver) and open the test page
    broswer = webdriver.Chrome()
    broswer.maximize_window()
    username = 'test'
    password = 'test'
    url = 'http://graduate.buct.edu.cn'
    broswer.get(url)


Wait for the webpage to load

Use selenium's WebDriverWait, which was already imported in the code above.


    url = 'http://graduate.buct.edu.cn'
    broswer.get(url)
    wait = WebDriverWait(broswer, 5)  # set the timeout to 5 s
    # Wait here until the form field has loaded.
    # xpathMenuCheck is the XPath of an element that appears once the page
    # is ready; its value is not given in this article.
    elm = wait.until(lambda driver: driver.find_element_by_xpath(xpathMenuCheck))
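WebDriverWait works by polling: it calls the condition repeatedly (every 0.5 s by default), treats NoSuchElementException as "not ready yet", and raises TimeoutException once the timeout has passed. The following is a minimal standard-library sketch of that idea, not selenium's actual implementation. `wait_until`, `WaitTimeout` and `fake_lookup` are hypothetical names, and `LookupError` stands in for NoSuchElementException.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=5.0, poll=0.5, ignored=(LookupError,)):
    """Call condition() repeatedly until it returns a truthy value.

    Exceptions listed in `ignored` are treated as "not ready yet".
    """
    deadline = time.time() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # element not there yet, keep polling
        if time.time() >= deadline:
            raise WaitTimeout('condition not met within %.1f s' % timeout)
        time.sleep(poll)


# Dry run with a stand-in lookup that only succeeds on the third poll.
attempts = []


def fake_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError('not loaded yet')
    return 'element'


found = wait_until(fake_lookup, timeout=2.0, poll=0.01)
```

Polling with a deadline is preferable to a fixed `time.sleep`, because the script continues as soon as the element appears and still fails clearly if it never does.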


Element positioning and character input

Next we need to log in. I am using Chrome: right-click the field that needs to be filled in and choose Inspect, which opens the developer tools (also reachable with F12). This feature is needed throughout the process to locate the relevant elements.

Selenium's Select class is used for the drop-down selection, and the controls are located with the find_element_by_* methods, which match one element at a time and are very convenient.

    # Choose the user role from the drop-down list
    select = Select(broswer.find_element_by_id('UserRole'))
    select.select_by_value('2')
    # Type the user name and password
    name = broswer.find_element_by_id('username')
    name.send_keys(username)
    pswd = broswer.find_element_by_id('password')
    pswd.send_keys(password)
    # Submit the login form
    btnlg = broswer.find_element_by_id('btnLogin')
    btnlg.click()
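The locate-then-type pattern repeats for every text field, so it can be factored into a small helper that receives the lookup function as an argument. This is only a sketch: `fill_fields` and `FakeField` are hypothetical names, and the fake element exists solely so the helper can be tried without a browser. With the code above the lookup would be `broswer.find_element_by_id`. Note that Selenium 4.3 and later removed the find_element_by_* helpers in favour of `find_element(By.ID, ...)`, so on a current install the lookup would be `lambda i: broswer.find_element(By.ID, i)`.

```python
def fill_fields(find_by_id, values):
    """Type each (element_id, text) pair into the element with that id."""
    for element_id, text in values:
        field = find_by_id(element_id)
        field.clear()          # remove any pre-filled text first
        field.send_keys(text)


class FakeField(object):
    """Stand-in for a selenium WebElement, for a dry run without a browser."""

    def __init__(self):
        self.text = 'old'

    def clear(self):
        self.text = ''

    def send_keys(self, text):
        self.text += text


page = {'username': FakeField(), 'password': FakeField()}
fill_fields(page.__getitem__, [('username', 'test'), ('password', 'secret')])
```

Passing the lookup in, rather than calling the driver directly, keeps the helper independent of the selenium version in use.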


The script fills in the form automatically and clicks the login button, after which the site jumps to the next page.

Crawling information

The next step is to crawl the existing valid reports:

    # Count the valid reports: rows start at tr[2]; stop at the first missing row
    flag = 1
    count = 2
    count_valid = 0
    while flag:
        try:
            category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
            count += 1
        except common.exceptions.NoSuchElementException:
            break

    # Obtain the information of each report
    flag = 1
    for currentLecture in range(2, count):
        # category
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
        # name
        name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
        # publishing unit
        unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
        # start time
        startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
        # end time
        endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
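The two loops above share one idea: build the XPath of a cell from its row and column, and keep reading rows until a lookup fails. A compact sketch of that idea follows, with the lookup injected so it can be tried without a browser. `cell_xpath`, `read_rows` and `fake_table` are hypothetical names, and `KeyError` stands in for NoSuchElementException in the dry run.

```python
def cell_xpath(row, col, table_id='dgData00'):
    """Build the XPath of one table cell, in the form used by the loops."""
    return '//*[@id="%s"]/tbody/tr[%d]/td[%d]' % (table_id, row, col)


def read_rows(get_text, first_row=2, columns=5, missing=LookupError):
    """Collect rows of cell text until a cell lookup fails.

    get_text(xpath) must return the cell text or raise `missing`.
    """
    rows = []
    row = first_row
    while True:
        try:
            cells = [get_text(cell_xpath(row, col))
                     for col in range(1, columns + 1)]
        except missing:
            break  # no such row: the table has ended
        rows.append(cells)
        row += 1
    return rows


# Dry run against a stand-in table that has rows 2 and 3 only.
fake_table = {cell_xpath(r, c): 'r%dc%d' % (r, c)
              for r in (2, 3) for c in range(1, 6)}
rows = read_rows(fake_table.__getitem__, missing=KeyError)
```

With the driver from this article, the call would look like `read_rows(lambda xp: broswer.find_element_by_xpath(xp).text, missing=common.exceptions.NoSuchElementException)`, which reads the table in a single pass instead of counting rows first.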


Crawl verification code

    # Obtain and recognize a single verification code
    authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification code address
    # The screenshot and crop/save calls were garbled in the source;
    # they are restored here from the batch loop below.
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (0, 0, 64, 28)
    i = Image.open('text.png')
    frame4 = i.crop(rangle)
    frame4.save('authcode.png')
    qq = Image.open('authcode.png')
    text = pytesser.image_to_string(qq).strip()

    # Obtain verification codes in batches
    authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification code address
    # Collect learning samples
    for count in range(10):
        broswer.get(authCodeURL)
        broswer.save_screenshot('text.png')
        rangle = (1, 1, 62, 27)
        i = Image.open('text.png')
        frame4 = i.crop(rangle)
        frame4.save('authcode' + str(count) + '.png')
        print 'count:' + str(count)
        broswer.refresh()
    broswer.quit()
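PIL's `Image.crop` takes a box `(left, upper, right, lower)` in pixels, and the result is `right - left` wide and `lower - upper` tall. The pure-Python sketch below applies the same rule to a nested list so the effect of the two boxes used above can be checked without any image files. `crop_grid` is a hypothetical helper, and the 64x28 size is an assumption taken from the first box, `(0, 0, 64, 28)`.

```python
def crop_grid(pixels, box):
    """Crop a 2D list of pixels with a PIL-style (left, upper, right, lower) box."""
    left, upper, right, lower = box
    return [row[left:right] for row in pixels[upper:lower]]


# Stand-in 64x28 "screenshot" in which each pixel stores its own (x, y).
screenshot = [[(x, y) for x in range(64)] for y in range(28)]

cropped = crop_grid(screenshot, (1, 1, 62, 27))
width, height = len(cropped[0]), len(cropped)   # 61 wide, 26 tall
top_left = cropped[0][0]                        # pixel (1, 1) of the original
bottom_right = cropped[-1][-1]                  # pixel (61, 26) of the original
```

Under that size assumption, the batch box `(1, 1, 62, 27)` trims one pixel from the top, left and bottom edges and two from the right edge, which would remove a thin border around the code.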


The crawled verification code

As the crawled verification codes show, the characters are rotated, and the overlap caused by the rotation seriously hampers subsequent recognition. I tried training a neural network, but because no feature vectors were extracted, the accuracy was far from adequate.
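For readers unfamiliar with the term, a feature vector is a fixed-length list of numbers summarizing an image, fed to the classifier instead of raw pixels. The sketch below shows one very simple choice, projection histograms over a binarized image. It is an illustration only, not the approach tried in this article; `binarize` and `projection_features` are hypothetical names, and the 128 threshold is an assumption.

```python
def binarize(gray, threshold=128):
    """Map a 2D list of 0-255 grayscale values to 1 (ink) or 0 (background)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def projection_features(binary):
    """Concatenate per-column and per-row ink counts into one feature vector."""
    column_counts = [sum(col) for col in zip(*binary)]
    row_counts = [sum(row) for row in binary]
    return column_counts + row_counts


# Stand-in 4x5 grayscale "glyph": a dark vertical bar in the middle column.
gray = [
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
    [255, 255, 255, 255, 255],
]
features = projection_features(binarize(gray))
```

Projection counts change when a character is rotated, so features like these would not by themselves solve the rotation and overlap problem described above.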

That concludes this introduction to implementing a Python crawler for verification codes. I hope it is helpful to you!
