Main features implemented:
- Logging in to the site
- Waiting dynamically for web pages to load
- Downloading verification-code (captcha) images
The original idea was to have a script perform a task automatically and save a lot of manual effort (I am lazy). I spent a few days writing it, hoping to solve the problem at its root by fully automating captcha recognition, but the difficulty was too high and the recognition accuracy too low, so I have shelved that part for now.
I hope sharing this experience is useful to you.
Opening the browser with Python
The urllib2 module is troublesome to work with for this task: some pages also require cookies to be saved, which is very inconvenient. So I use the selenium module under Python 2.7 to drive the web pages instead.
Test page: http://graduate.buct.edu.cn
Opening the page (this requires downloading ChromeDriver first):
To support Chinese character output, we also call the sys module and change the default encoding to UTF-8:
from selenium import webdriver
from selenium import common
from selenium.webdriver.support.ui import Select, WebDriverWait
from PIL import Image
import pytesser
import sys

reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
Waiting for the web page to finish loading
For this we use WebDriverWait from selenium; with the imports above already loaded, the wait is set up as follows (xpathMenuCheck is the XPath of an element that only appears once the page has finished loading):
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5)  # set the timeout to 5 s
# the code that fills in and submits the form goes here
elm = wait.until(lambda driver: driver.find_element_by_xpath(xpathMenuCheck))
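Under the hood, WebDriverWait simply polls a condition until it returns something truthy or the timeout expires. A minimal plain-Python sketch of that idea (the helper name and error type below are my own, not Selenium's):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Mirrors the core idea of Selenium's WebDriverWait: call the condition
    repeatedly, return its result as soon as it is truthy, raise on timeout.
    (Sketch only; names are hypothetical, not Selenium's API.)
    """
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            value = condition()
            if value:
                return value
        except Exception as exc:  # e.g. NoSuchElementException in Selenium
            last_error = exc
        time.sleep(poll)
    raise RuntimeError('condition not met within %.1f s (last error: %r)'
                       % (timeout, last_error))
```

Selenium's real implementation additionally accepts a set of exceptions to ignore and raises its own TimeoutException.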
Element positioning and text input
Next we need to sign in. I use Chrome here: right-click the part of the page that needs to be filled in and choose Inspect, and the browser jumps straight to that element in the developer tools (F12). This is how we find the attributes of the relevant controls.
[Screenshot of the inspected form element: http://www.jb51.net/uploadfile/Collfiles/20160414/20160414092144893.png]
Here we see a value="1". Given how a drop-down box works, we just have to find a way to assign the right value to the UserRole control.
This is done with Selenium's Select class; the controls are located with the find_element_by_** methods, which map directly onto the attributes seen in the inspector, very convenient.
select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()
This is the form being filled out automatically by the script, after which it jumps to the next page.
What I need here is a function that automatically registers for academic reports.
Right-click a report entry to find the information related to that activity. Since no report is open at the moment, only the titles are displayed, but identifying subsequent valid reports works the same way.
For locating elements I prefer XPath; in my tests it uniquely locates an element's position, which is very useful.
//*[@id="dgData00"]/tbody/tr/td[2] (this is the XPath)
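Every cell in the report table follows this same pattern, so a small helper can build the locator for any row and column (a hypothetical helper, not part of the original script; XPath indices are 1-based):

```python
def row_xpath(row, col, table_id='dgData00'):
    """Build the XPath for cell (row, col) of the report table.

    Hypothetical helper: it just fills the row and column indices into the
    XPath pattern used throughout this post.
    """
    return '//*[@id="%s"]/tbody/tr[%d]/td[%d]' % (table_id, row, col)
```

For example, row_xpath(2, 1) yields the locator of the first cell in the first data row, '//*[@id="dgData00"]/tbody/tr[2]/td[1]'.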
Crawling information
The next step is to crawl the reports that are currently valid:
# find the valid reports
flag = 1
count = 2
count_valid = 0
while flag:
    try:
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
        count += 1
    except common.exceptions.NoSuchElementException:
        break

# fetch the report information
flag = 1
for currentLecture in range(2, count):
    # category
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
    # name
    name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
    # organizer
    unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
    # start time
    startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
    # end time
    endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
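The five lookups per row can also be driven by a single field table. The sketch below is a hypothetical refactoring, not the original script: get_text stands in for broswer.find_element_by_xpath(...).text, so the logic can be shown without a live browser.

```python
# column order matches the td[1]..td[5] cells of the report table
FIELDS = ['category', 'name', 'unitsPublish', 'startTime', 'endTime']

def read_report_row(get_text, row):
    """Return one report row as a dict mapping field name -> cell text.

    `get_text` is any callable taking an XPath and returning the cell text;
    with Selenium it would be: lambda xp: broswer.find_element_by_xpath(xp).text
    """
    xpath = '//*[@id="dgData00"]/tbody/tr[%d]/td[%d]'
    return dict((field, get_text(xpath % (row, col + 1)))
                for col, field in enumerate(FIELDS))
```

Collecting all rows is then just a loop: [read_report_row(get_text, r) for r in range(2, count)].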
Crawling the verification code
Inspecting the verification-code element on the page, we find the link behind it, identifyingcode.apsx. We load that page directly and fetch verification codes in bulk.
The crawling idea is to use selenium to screenshot the current page (only what is visible) and save it locally. For pages that need scrolling, or for capturing a specific region, look into the broswer.set_window_position(**) family of functions. Then locate the verification code manually, and crop and save it with the PIL module.
Finally, we call pytesser (the Python wrapper around Google's Tesseract) for character recognition. However, the site's verification code has a lot of interference, and the characters are rotated, so only some of the characters are recognized.
# fetch the verification code and recognize it (a single image)
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # URL of the captcha image
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
i = Image.open('text.png')
frame4 = i.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()
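Instead of hard-coding the crop rectangle rangle, it could be derived from the captcha element's location and size attributes, which Selenium exposes on every element. A sketch (the helper is my own, and it assumes the screenshot uses the same pixel coordinates as the page, i.e. no device-pixel scaling):

```python
def crop_box(location, size):
    """Compute a PIL crop box (left, upper, right, lower) from a Selenium
    element's `location` ({'x': ..., 'y': ...}) and `size`
    ({'width': ..., 'height': ...}) dictionaries.

    Hypothetical alternative to a hard-coded rectangle; assumes no
    device-pixel scaling between the page and the screenshot.
    """
    left = location['x']
    upper = location['y']
    return (left, upper, left + size['width'], upper + size['height'])
```

Usage with Selenium would look like: elem = broswer.find_element_by_xpath(...); frame4 = i.crop(crop_box(elem.location, elem.size)).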
# fetch verification codes in bulk
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # URL of the captcha image
# collect training samples
for count in range(10):
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (1, 1, 62, 27)
    i = Image.open('text.png')
    frame4 = i.crop(rangle)
    frame4.save('authcode' + str(count) + '.png')
    print 'count:' + str(count)
    broswer.refresh()
broswer.quit()
The crawled verification codes
Some of the original captcha images:
As the verification codes above show, the characters are rotated, and the overlap caused by the rotation has a significant effect on subsequent recognition. I tried training a neural network on them, but without extracting feature vectors the recognition accuracy was far too low.
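A first preprocessing step that usually precedes feature extraction is binarization: mapping the grayscale image to pure foreground/background pixels so interference can be cleaned up. A minimal sketch on a plain list-of-rows image (real code would operate on PIL images, e.g. via Image.point, and would also need to deskew the rotated characters):

```python
def binarize(pixels, threshold=128):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Pixels at or above `threshold` become background (0); darker pixels
    become foreground (1).  Minimal sketch of the usual first step before
    feature extraction / recognition.
    """
    return [[0 if p >= threshold else 1 for p in row] for row in pixels]
```

For example, binarize([[200, 30], [120, 255]]) gives [[0, 1], [1, 0]].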
That concludes this detailed walkthrough of crawling verification codes with a Python crawler. I hope it helps you!