Python Crawler Verification Code Handling: Implementation Details

Source: Internet
Author: User


Main functions:

- Log in to the webpage
- Wait dynamically for the page to load
- Download the verification code

For a long time I had wanted a script that performs these steps automatically, to save manual effort (I am admittedly lazy), and it took a few days to write the code. Recognizing the verification code would have solved the problem fundamentally, but the difficulty was too high and the recognition accuracy too low, so that plan was shelved in the end.
I hope this experience can still be useful to you.

Opening a Browser from Python

The urllib2 module is troublesome to operate, and for some webpages saving cookies is inconvenient. So here I use the selenium module under Python 2.7 to drive the browser.

Test webpage: http://graduate.buct.edu.cn

Open the webpage (download chromedriver first):

To support Chinese character output, we need to call the sys module and change the default encoding to UTF-8:

<code class="hljs python">from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium import webdriver
from selenium import common
from PIL import Image
import pytesser
import sys

reload(sys)
sys.setdefaultencoding('utf8')

broswer = webdriver.Chrome()
broswer.maximize_window()
username = 'test'
password = 'test'
url = 'http://graduate.buct.edu.cn'
broswer.get(url)</code>

Wait for the webpage to load

Use WebDriverWait from selenium. The page itself was already loaded by the code above:

<code class="hljs livecodeserver">url = 'http://graduate.buct.edu.cn'
broswer.get(url)
wait = WebDriverWait(broswer, 5)  # set the timeout to 5 s
# wait until the form element identified by xpathMenuCheck has loaded
elm = wait.until(lambda broswer: broswer.find_element_by_xpath(xpathMenuCheck))</code>

Element Positioning and Character Input

Next, we need to log in. Here I use Chrome: right-click the field that needs to be filled in and choose Inspect, and the browser jumps to developer mode (F12). This technique is needed throughout the process to find the relevant elements.

[Image: screenshot of the drop-down element in Chrome developer tools – http://www.bkjia.com/uploadfile/Collfiles/20160414/20160414092144893.png]

Here we can see an option with value="1". Given how a drop-down box works, we just need to assign the appropriate value to the UserRole control.
Selenium's Select module is used for the selection, and controls are located with the find_element_by_* methods, which match one-to-one and are very convenient.

<code class="hljs sql">select = Select(broswer.find_element_by_id('UserRole'))
select.select_by_value('2')
name = broswer.find_element_by_id('username')
name.send_keys(username)
pswd = broswer.find_element_by_id('password')
pswd.send_keys(password)
btnlg = broswer.find_element_by_id('btnLogin')
btnlg.click()</code>

This is the result of the script filling in the form automatically; the browser then jumps to the next page.


Here, what I need is to automatically register for the academic report.


Right-click an existing report to find the information related to the activity. Since there is currently no report, only the header row is displayed, but identifying a valid report works the same way.


For element positioning, I chose xpath first. In testing, it locates the position of an element uniquely, which is very useful.

<code class="hljs perl">//*[@id="dgData00"]/tbody/tr/td[2]  (the xpath obtained above)</code>

Crawling Information

The next step is to crawl the existing valid reports:

<code class="hljs axapta"># find the valid reports
flag = 1
count = 2
count_valid = 0
while flag:
    try:
        category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(count) + ']/td[1]').text
        count += 1
    except common.exceptions.NoSuchElementException:
        break

# obtain the report information
flag = 1
for currentLecture in range(2, count):
    # category
    category = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[1]').text
    # name
    name = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[2]').text
    # publishing unit
    unitsPublish = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[3]').text
    # start time
    startTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[4]').text
    # end time
    endTime = broswer.find_element_by_xpath('//*[@id="dgData00"]/tbody/tr[' + str(currentLecture) + ']/td[5]').text</code>
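The repeated xpath strings above differ only in the row and column index, so they can be generated by a small helper. This is a sketch; the table id `dgData00` and the column order are taken from the snippet above, while the field names are my own labels:

```python
def cell_xpath(row, col):
    # xpath of the cell at 1-based (row, col) in the dgData00 table
    return '//*[@id="dgData00"]/tbody/tr[%d]/td[%d]' % (row, col)

# column meanings, in the order used by the snippet above
FIELDS = ['category', 'name', 'unit', 'start_time', 'end_time']

def report_xpaths(row):
    # map each field name to the xpath of its cell in the given row
    return dict((field, cell_xpath(row, col))
                for col, field in enumerate(FIELDS, 1))

print(report_xpaths(2)['name'])  # -> //*[@id="dgData00"]/tbody/tr[2]/td[2]
```

With this, the crawl loop only needs `broswer.find_element_by_xpath(xp).text` for each xpath in `report_xpaths(currentLecture)`, instead of five hand-written strings.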

Crawl Verification Code

Inspecting the verification code element on the webpage, we find a link, IdentifyingCode.aspx. We load this page directly and obtain verification codes in batches.

The crawling method is to use selenium to capture the current page (only the visible part) and save it locally. To scroll the page and capture a specific location, study the broswer.set_window_position(**) family of functions. Then manually locate the verification code's position, and extract and save it with the PIL module.
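The manual-crop step can be sketched with PIL alone. The box coordinates below are illustrative placeholders for the position you locate by hand, and the blank image stands in for the screenshot selenium saves:

```python
from PIL import Image

# (left, upper, right, lower) of the captcha inside the screenshot;
# these values are illustrative and must be found manually
rangle = (1, 1, 62, 27)

# stand-in for the screenshot selenium saved as 'text.png'
screenshot = Image.new('RGB', (1024, 768), 'white')
frame = screenshot.crop(rangle)
print(frame.size)  # -> (61, 26)
```

In the real script, `screenshot` is simply `Image.open('text.png')` after `broswer.save_screenshot('text.png')`.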

Finally, pytesser (a Python wrapper around Google's Tesseract engine) is called for character recognition. However, this site's verification codes contain a lot of interference, and with the character rotation only some of the characters can be recognized.

<code class="hljs livecodeserver"># obtain and recognize the verification code (a single one)
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification code address
broswer.get(authCodeURL)
broswer.save_screenshot('text.png')
rangle = (0, 0, 64, 28)
I = Image.open('text.png')
frame4 = I.crop(rangle)
frame4.save('authcode.png')
qq = Image.open('authcode.png')
text = pytesser.image_to_string(qq).strip()</code>

<code class="hljs axapta"># obtain verification codes in batches
authCodeURL = broswer.find_element_by_xpath('//*[@id="Table2"]/tbody/tr[2]/td/p/img').get_attribute('src')  # verification code address
# collect learning samples
for count in range(10):
    broswer.get(authCodeURL)
    broswer.save_screenshot('text.png')
    rangle = (1, 1, 62, 27)
    I = Image.open('text.png')
    frame4 = I.crop(rangle)
    frame4.save('authcode' + str(count) + '.png')
    print 'count:' + str(count)
    broswer.refresh()
broswer.quit()</code>
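Before handing a captured image to pytesser, a simple binarization pass removes some of the interference. This is a sketch only; the threshold value is an assumption and should be tuned against real samples from the site:

```python
from PIL import Image

THRESHOLD = 140  # assumed cutoff; tune it against real captcha samples

def binarize(img, threshold=THRESHOLD):
    # convert to grayscale, then force every pixel to pure black or white
    gray = img.convert('L')
    return gray.point(lambda p: 255 if p > threshold else 0)

# demo on a synthetic light-gray image: every pixel ends up white
demo = binarize(Image.new('RGB', (64, 28), (200, 200, 200)))
print(sorted(set(demo.getdata())))  # -> [255]
```

In the pipeline above you would call `binarize(qq)` before `pytesser.image_to_string`; binarization helps with background noise but does not undo the character rotation.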

The crawled Verification Code

Source images of some verification codes:

As the verification codes above show, the characters are rotated, and the overlap caused by the rotation strongly affects later recognition. I tried training a neural network on them, but because no feature vectors were extracted, the accuracy was far from adequate.
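For reference, one simple way to obtain feature vectors before training is per-cell ink density: binarize the image, split it into a grid, and record the fraction of black pixels in each cell. A minimal sketch (the 4×7 grid is an arbitrary choice, not what the site requires):

```python
from PIL import Image

def density_features(img, rows=4, cols=7):
    # binarize, split into rows x cols cells, and use the fraction of
    # black pixels per cell as a crude feature vector
    bw = img.convert('L').point(lambda p: 0 if p < 128 else 255)
    w, h = bw.size
    feats = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            pixels = list(bw.crop(box).getdata())
            feats.append(pixels.count(0) / float(len(pixels)))
    return feats

# an all-white image yields an all-zero feature vector of length rows*cols
print(density_features(Image.new('L', (56, 28), 255)))
```

Density features alone are too crude for rotated, overlapping characters, but they illustrate the kind of input a classifier needs instead of raw pixels.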

That concludes this walkthrough of the Python crawler's verification code handling. I hope it is helpful to you!

