Handling verification codes in a Python crawler: implementation details
Main functions:
- Log in to a webpage
- Wait dynamically for the page to load
- Download verification codes
A long time ago I had the idea of using a script to carry out a task automatically and save a lot of manual effort (I am rather lazy). I spent a few days writing the code, hoping to solve the problem at its root by recognizing the verification code, but that proved too difficult and the recognition accuracy was too low, so the plan came to a halt once again.
I hope sharing this experience is useful to you.
Opening a browser from Python
The urllib2 module is cumbersome to work with, and for some webpages saving cookies is inconvenient. So here I use the selenium module under Python 2.7 for webpage operations.
Test webpage: http://graduate.buct.edu.cn
Open the webpage (chromedriver must be downloaded first):
To support Chinese character output, we also import the sys module and change the default encoding to UTF-8.
<code class="hljs python">from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium import webdriver
from selenium import common
from PIL import Image
import pytesser
import sys

# Python 2 only: switch the default encoding to UTF-8 for Chinese output
reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)</code>
Wait for the webpage to load
Use selenium's WebDriverWait, which the code above has already imported.
<code class="hljs livecodeserver">url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5)  # Set the timeout to 5 s
# Wait here until the form field has loaded.
# xpathMenuCheck holds the XPath of the element to wait for.
elm = wait.until(lambda driver: driver.find_element_by_xpath(xpathMenuCheck))</code>
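To make the waiting step concrete, here is a minimal sketch of the polling idea behind an explicit wait: call a condition repeatedly until it returns a truthy value or the timeout expires. It is not selenium's implementation. The names `wait_until`, `poll_interval`, and the fake `page_ready` condition are assumptions made for illustration, as is the choice of `LookupError` to represent "element not found yet".

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # Treat "element not found yet" as "keep waiting" (assumed error type).
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Hypothetical stand-in for a page that finishes loading on the third check.
attempts = {"n": 0}


def page_ready():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LookupError("element not loaded yet")
    return "element"


found = wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print(found, attempts["n"])
```

The short poll interval keeps the script responsive, while the timeout bounds how long a page that never loads can block it.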
Locating elements and entering text
Next, we need to log in. I am using Chrome: right-click the field that needs to be filled in and choose Inspect, which jumps straight to the developer tools (also opened with F12). This feature is needed throughout the process to find the relevant elements.
In the inspected HTML we can see an option with value="1". Because this is a drop-down box, we only need to assign the desired value to UserRole.
Here, selenium's Select class is used to make the selection, and the controls are located with the find_element_by_* methods, which match elements one to one and are very convenient.
<code class="hljs sql">select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()</code>
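The `find_element_by_*` helpers used above were removed in Selenium 4.3; the replacement is `find_element(by, value)`. The sketch below shows the same typing and clicking steps in that style. So that it runs without a browser, it accepts any object with a `find_element` method, and `FakeElement` and `FakeDriver` are hypothetical stand-ins that are not part of selenium. With a real driver you would pass `By.ID` from `selenium.webdriver.common.by` (its value is the string "id") and still use `Select` for the drop-down, which is left out here.

```python
BY_ID = "id"  # same value as selenium's By.ID


def fill_login_form(driver, username, password):
    """Type the credentials and click the login button, located by element id."""
    driver.find_element(BY_ID, "username").send_keys(username)
    driver.find_element(BY_ID, "password").send_keys(password)
    driver.find_element(BY_ID, "btnLogin").click()


class FakeElement:
    """Hypothetical stand-in that records what was typed or clicked."""

    def __init__(self):
        self.typed = ""
        self.clicked = False

    def send_keys(self, text):
        self.typed += text

    def click(self):
        self.clicked = True


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, holding one element per id."""

    def __init__(self):
        self.elements = {}

    def find_element(self, by, value):
        assert by == BY_ID
        return self.elements.setdefault(value, FakeElement())


driver = FakeDriver()
fill_login_form(driver, "test", "test")
print(driver.elements["username"].typed, driver.elements["btnLogin"].clicked)
```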
The script fills in the form automatically, and after the login button is clicked the site jumps to the next page.
What I want here is to sign up for academic reports (lectures) automatically.
Right-click an existing report and inspect it to find the information related to the event. Because no report is currently available, only the header row is displayed, but a valid report is identified in the same way.
To locate elements, I chose XPath first. In my tests it pinpoints an element's position uniquely, which is very useful.
<code class="hljs perl">//*[@id="dgData00"]/tbody/tr/td[2]</code> (the XPath obtained above)
Crawling Information
The next step is to crawl the existing valid reports:
<code class="hljs axapta"># Find the valid reports
flag = 1
count = 2
count_valid = 0
while flag:
    try:
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
        count += 1
    except common.exceptions.NoSuchElementException:
        break

# Obtain report information
flag = 1
for currentLecture in range(2, count):
    # Category
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
    # Name
    name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
    # Unit
    unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
    # Start time
    startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
    # End time
    endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text</code>
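The row scan can also be tried offline. The following sketch applies the same idea (probe row 2, 3, and so on until a row is missing, then read the five cells) to a small made-up table, using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The sample HTML, its contents, and the helper `read_reports` are assumptions for illustration; only the table id `dgData00` and the column order come from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: row 1 is the header, rows 2+ are reports.
SAMPLE = """
<html><body><table id="dgData00"><tbody>
<tr><td>Category</td><td>Name</td><td>Unit</td><td>Start</td><td>End</td></tr>
<tr><td>A</td><td>Talk one</td><td>Dept X</td><td>09:00</td><td>10:00</td></tr>
<tr><td>B</td><td>Talk two</td><td>Dept Y</td><td>14:00</td><td>15:30</td></tr>
</tbody></table></body></html>
"""

FIELDS = ("category", "name", "unitsPublish", "startTime", "endTime")


def read_reports(root):
    """Collect one dict per report row, starting at row 2 (row 1 is the header)."""
    reports = []
    row = 2
    while True:
        # XPath positions are 1-based, so tr[2] is the first data row.
        cells = root.findall(".//*[@id='dgData00']/tbody/tr[%d]/td" % row)
        if not cells:
            break  # no such row: the table has ended
        reports.append({field: cell.text for field, cell in zip(FIELDS, cells)})
        row += 1
    return reports


reports = read_reports(ET.fromstring(SAMPLE))
for report in reports:
    print(report["name"], report["startTime"], report["endTime"])
```

Here an empty result plays the role that `NoSuchElementException` plays in the selenium loop: it signals that the last row has been passed.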
Crawling the verification code
Inspecting the verification code element on the page reveals a link, IdentifyingCode.apsx. We load that page and fetch verification codes in batches.
The approach is to have selenium take a screenshot of the current page (only the visible part) and save it locally. If you need to scroll the page and capture a specific region, look into
broswer.set_window_position(**) and related functions. Then the position of the verification code is located by hand, and the PIL module crops it out and saves it.
Finally, pytesser, a Python wrapper around Google's Tesseract OCR engine, is called for character recognition. However, this site's verification codes contain a lot of noise and the characters are rotated, so only some of the characters are recognized.
<code class="hljs livecodeserver"># Obtain and recognize a verification code (only one)
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # obtain the verification code address
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
i = Image.open('text.png')
frame4 = i.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()</code>
<code class="hljs axapta"># Obtain verification codes in batches
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # obtain the verification code address
# Obtain the learning samples
for count in range(10):
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (1, 1, 62, 27)
    i = Image.open('text.png')
    frame4 = i.crop(rangle)
    frame4.save('authcode' + str(count) + '.png')
    print 'count:' + str(count)
    broswer.refresh()
broswer.quit()</code>
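The crop box passed to PIL has the form `(left, upper, right, lower)` in pixels, with the right and lower edges excluded, so a box of `(1, 1, 62, 27)` produces a 61 x 26 image. The sketch below mimics that rule on a nested list so that it runs without PIL. The function `crop_rows` and the synthetic 64 x 28 "screenshot" are made up for illustration.

```python
def crop_rows(pixels, box):
    """Return the region of a row-major pixel grid selected by a PIL-style box."""
    left, upper, right, lower = box
    return [row[left:right] for row in pixels[upper:lower]]


# Synthetic 64 x 28 "screenshot": each pixel stores its own (x, y) position.
width, height = 64, 28
screenshot = [[(x, y) for x in range(width)] for y in range(height)]

captcha = crop_rows(screenshot, (1, 1, 62, 27))
print(len(captcha[0]), len(captcha))   # width and height of the cropped region
print(captcha[0][0], captcha[-1][-1])  # first and last pixel kept
```

Trimming one pixel from each side, as the batch loop does, is a simple way to drop a border around the image before recognition.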
The crawled verification codes
Source images of some verification codes:
As the verification codes above show, the characters are rotated, and the overlap caused by the rotation seriously hampers recognition. I tried training a neural network, but because no feature vectors were extracted, the accuracy was far from good enough.
That concludes the details of handling verification codes in a Python crawler. I hope it is helpful to you!