This article describes in detail how to implement verification-code (CAPTCHA) handling in a Python crawler.
Main functions:
- Log in to the web page
- Wait dynamically for the page to load
- Download the verification code
A long time ago I had the idea of automating this task with a script to save a lot of manual effort (I am rather lazy), and it took a few days to write the code. Recognizing the verification code automatically would have solved the problem at its root, but the difficulty was too high and the recognition accuracy too low, so that plan was shelved once again.
I hope this experience is useful to you.
Opening a browser from Python
The urllib2 module is cumbersome for this kind of interaction, and saving cookies is inconvenient for some web pages, so here I use the selenium module under Python 2.7 to drive the browser instead.
Test webpage: http://graduate.buct.edu.cn
Open the web page (chromedriver needs to be downloaded first).
To support Chinese character output, we need to reload the sys module and change the default encoding to UTF-8:
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium import webdriver
from selenium import common
from PIL import Image
import pytesser
import sys

reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
Wait for the webpage to load
Use WebDriverWait from selenium; the required imports are already included in the code above.
url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5)  # set the timeout to 5 seconds
# wait here until the form element has loaded (xpathMenuCheck is the XPath of that element)
elm = wait.until(lambda driver: broswer.find_element_by_xpath(xpathMenuCheck))
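Conceptually, `wait.until` simply polls a condition until it returns a truthy value or a timeout expires. As a rough pure-Python sketch of that behaviour (the `timeout` and `poll` parameter names and the `until` helper are my own illustration, not selenium's actual implementation):

```python
import time

def until(condition, timeout=5.0, poll=0.1):
    # Poll `condition` until it returns a truthy value,
    # or raise once `timeout` seconds have elapsed.
    end = time.time() + timeout
    while time.time() < end:
        value = condition()
        if value:
            return value
        time.sleep(poll)
    raise RuntimeError('condition not met within %.1f seconds' % timeout)

# Example: a fake "page loaded" check that becomes true on the third poll.
state = {'loaded': False, 'polls': 0}

def fake_page_loaded():
    state['polls'] += 1
    if state['polls'] >= 3:
        state['loaded'] = True
    return state['loaded']

result = until(fake_page_loaded, timeout=2.0, poll=0.01)
```

This is why passing a lambda to `wait.until` works: selenium keeps calling it until it returns a found element instead of raising or returning a falsy value.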
Element positioning and character input
Next, we need to log in. I am using Chrome here: right-click the field that needs to be filled in and choose Inspect, which jumps straight to the developer tools (F12). This feature is needed throughout the process to find the relevant elements.
Here, selenium's Select class handles the drop-down selection, and elements are located with the find_element_by_* family of methods, which match one-to-one and are very convenient.
select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()
This is the result of the script filling in the form automatically; the browser then jumps to the next page.
Crawling information
The next step is to crawl the existing valid reports:
# search for the valid reports
flag = 1
count = 2
count_valid = 0
while flag:
    try:
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
        count += 1
    except common.exceptions.NoSuchElementException:
        break

# obtain the report information
flag = 1
for currentLecture in range(2, count):
    # category
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
    # report name
    name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
    # publishing unit
    unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
    # start time
    startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
    # end time
    endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text
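Since the loop above rebuilds almost the same XPath for every cell, a small helper keeps the string-building in one place. This is a sketch with a made-up name (`row_cell_xpath`); the table id `dgData00` is the one used on the page above:

```python
def row_cell_xpath(row, col, table_id='dgData00'):
    # Build the XPath for cell (row, col) of the results table,
    # matching the pattern used in the crawling loop above.
    return '//*[@id="%s"]/tbody/tr[%d]/td[%d]' % (table_id, row, col)

# e.g. the category cell of the second table row:
xpath = row_cell_xpath(2, 1)
```

Each `find_element_by_xpath(...)` call in the loop could then take `row_cell_xpath(currentLecture, n)` instead of concatenating strings inline.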
Crawl verification code
# obtain and recognize a single verification code
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification-code image address
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
I = Image.open('text.png')
frame4 = I.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()  # recognize the characters with pytesser
# obtain verification codes in batches
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification-code image address
# collect learning samples
for count in range(10):
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (1, 1, 62, 27)
    I = Image.open('text.png')
    frame4 = I.crop(rangle)
    frame4.save('authcode' + str(count) + '.png')
    print 'Count:' + str(count)
    broswer.refresh()
broswer.quit()
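PIL's `crop()` takes a `(left, upper, right, lower)` box with the right and lower edges exclusive, so the tuple `(1, 1, 62, 27)` above selects a 61x26 pixel region. A tiny helper (my own illustrative function, not part of PIL) makes the corner-plus-size intent explicit:

```python
def crop_box(left, upper, width, height):
    # Convert a top-left corner plus a size into
    # PIL's (left, upper, right, lower) crop box.
    return (left, upper, left + width, upper + height)

# The rectangle used in the batch-capture loop, expressed as corner + size:
box = crop_box(1, 1, 61, 26)
```

The resulting tuple can be passed directly to `Image.crop()`.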
The crawled verification code
As the crawled verification codes above show, the characters are rotated, and the overlap caused by the rotation strongly affects subsequent recognition. I tried training a neural network on them, but because no feature vectors were extracted, the accuracy was nowhere near acceptable.
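One common preprocessing step before OCR or any training is to binarize the image so that gray anti-aliasing fringes around the rotated characters are discarded. A minimal sketch on a plain list of grayscale pixel rows (in practice you would run this on the pixel data PIL returns; the threshold of 128 is an arbitrary assumption):

```python
def binarize(rows, threshold=128):
    # Map each grayscale pixel to 1 (ink, dark) or 0 (background, light)
    # by comparing it against the threshold.
    return [[1 if px < threshold else 0 for px in row] for row in rows]

# A 2x4 toy "image": dark pixels become 1, light pixels become 0.
gray = [[30, 200, 120, 250],
        [10, 140, 90, 255]]
binary = binarize(gray)
```

Binarization alone does not undo the rotation, but it gives any later feature extraction a much cleaner input to work with.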
That is all I have to introduce about implementing verification-code handling in a Python crawler. I hope it is helpful to you!