This article describes in detail how to implement verification-code (CAPTCHA) handling in a Python crawler.
Main functions:
- Log in to the web page
- Wait dynamically for the page to load
- Download the verification code
A long time ago I had the idea of automating this task with a script to save a lot of manual effort (I am rather lazy), and it took a few days to write the code. Recognizing the verification code automatically would have solved the problem at its root, but the difficulty was too high and the recognition accuracy too low, so that plan was shelved once again.
I hope this experience is useful to you.
Opening a browser from Python
The urllib2 module is cumbersome for this kind of interaction, and saving cookies is inconvenient for some web pages, so here I use the selenium module under Python 2.7 to drive the browser instead.
Test webpage: http://graduate.buct.edu.cn
Open the web page (chromedriver needs to be downloaded first).
To support Chinese character output, we need to reload the sys module and change the default encoding to UTF-8:
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium import webdriver
from selenium import common
from PIL import Image
import pytesser
import sys

reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
Wait for the webpage to load
Use WebDriverWait from selenium; the required imports are already included in the code above.
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5)  # set the timeout to 5 seconds
# wait here until the form element has loaded (xpathMenuCheck is the XPath of that element)
elm = wait.until(lambda driver: broswer.find_element_by_xpath(xpathMenuCheck))
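Conceptually, `wait.until` simply polls a condition until it returns a truthy value or a timeout expires. As a rough pure-Python sketch of that behaviour (the `timeout` and `poll` parameter names and the `until` helper are my own illustration, not selenium's actual implementation):

```python
import time

def until(condition, timeout=5.0, poll=0.1):
    # Poll `condition` until it returns a truthy value,
    # or raise once `timeout` seconds have elapsed.
    end = time.time() + timeout
    while time.time() < end:
        value = condition()
        if value:
            return value
        time.sleep(poll)
    raise RuntimeError('condition not met within %.1f seconds' % timeout)

# Example: a fake "page loaded" check that becomes true on the third poll.
state = {'loaded': False, 'polls': 0}

def fake_page_loaded():
    state['polls'] += 1
    if state['polls'] >= 3:
        state['loaded'] = True
    return state['loaded']

result = until(fake_page_loaded, timeout=2.0, poll=0.01)
```

This is why passing a lambda to `wait.until` works: selenium keeps calling it until it returns a found element instead of raising or returning a falsy value.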
Element positioning and character input
Next, we need to log in. I am using Chrome here: right-click the field that needs to be filled in and choose Inspect, which jumps straight to the developer tools (F12). This feature is needed throughout the process to find the relevant elements.
Here, selenium's Select class handles the drop-down selection, and elements are located with the find_element_by_* family of methods, which match one-to-one and are very convenient.
select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()
This is the result of the script filling in the form automatically; the browser then jumps to the next page.
Crawling information
The next step is to crawl the existing valid reports:
# search for the valid reports
flag = 1
count = 2
count_valid = 0
while flag:
    try:
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
        count += 1
    except common.exceptions.NoSuchElementException:
        break

# obtain the report information
flag = 1
for currentLecture in range(2, count):
    # category
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
    # report name
    name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
    # publishing unit
    unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
    # start time
    startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
    # end time
    endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
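Since the loop above rebuilds almost the same XPath for every cell, a small helper keeps the string-building in one place. This is a sketch with a made-up name (`row_cell_xpath`); the table id `dgData00` is the one used on the page above:

```python
def row_cell_xpath(row, col, table_id='dgData00'):
    # Build the XPath for cell (row, col) of the results table,
    # matching the pattern used in the crawling loop above.
    return '//*[@id="%s"]/tbody/tr[%d]/td[%d]' % (table_id, row, col)

# e.g. the category cell of the second table row:
xpath = row_cell_xpath(2, 1)
```

Each `find_element_by_xpath(...)` call in the loop could then take `row_cell_xpath(currentLecture, n)` instead of concatenating strings inline.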
Crawl verification code
# obtain and recognize a single verification code
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification-code image address
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
I = Image.open('text.png')
frame4 = I.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()  # recognize the characters with pytesser
# obtain verification codes in batches
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification-code image address
# collect learning samples
for count in range(10):
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (1, 1, 62, 27)
    I = Image.open('text.png')
    frame4 = I.crop(rangle)
    frame4.save('authcode' + str(count) + '.png')
    print 'Count:' + str(count)
    broswer.refresh()
broswer.quit()
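PIL's `crop()` takes a `(left, upper, right, lower)` box with the right and lower edges exclusive, so the tuple `(1, 1, 62, 27)` above selects a 61x26 pixel region. A tiny helper (my own illustrative function, not part of PIL) makes the corner-plus-size intent explicit:

```python
def crop_box(left, upper, width, height):
    # Convert a top-left corner plus a size into
    # PIL's (left, upper, right, lower) crop box.
    return (left, upper, left + width, upper + height)

# The rectangle used in the batch-capture loop, expressed as corner + size:
box = crop_box(1, 1, 61, 26)
```

The resulting tuple can be passed directly to `Image.crop()`.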
The crawled verification code
As the crawled verification codes above show, the characters are rotated, and the overlap caused by the rotation strongly affects subsequent recognition. I tried training a neural network on them, but because no feature vectors were extracted, the accuracy was nowhere near acceptable.
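One common preprocessing step before OCR or any training is to binarize the image so that gray anti-aliasing fringes around the rotated characters are discarded. A minimal sketch on a plain list of grayscale pixel rows (in practice you would run this on the pixel data PIL returns; the threshold of 128 is an arbitrary assumption):

```python
def binarize(rows, threshold=128):
    # Map each grayscale pixel to 1 (ink, dark) or 0 (background, light)
    # by comparing it against the threshold.
    return [[1 if px < threshold else 0 for px in row] for row in rows]

# A 2x4 toy "image": dark pixels become 1, light pixels become 0.
gray = [[30, 200, 120, 250],
        [10, 140, 90, 255]]
binary = binarize(gray)
```

Binarization alone does not undo the rotation, but it gives any later feature extraction a much cleaner input to work with.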
That is all I have to introduce about implementing verification-code handling in a Python crawler. I hope it is helpful to you!