Python crawler (scrapy framework): logging in to Zhihu by manually recognizing the inverted-text verification code and the digit/letter verification code

At the moment, Zhihu's login uses a click-based verification code made of Chinese characters, some of which are printed upside down:

 

You have to click the inverted characters in the image before you can log in.

This makes things harder for a crawler. After a day of tinkering I finally managed to reach a successful login by identifying the verification code manually, and I'll walk through the approach with you below.

 

To write the crawler, we first need to know which fields the browser sends to the server (I use Safari for the demonstration; Chrome and Firefox work just as well).

For this attempt, I clicked the first and second characters in the captcha:

Then right-click, choose "Inspect Element" to open the developer tools, and click the sign-in button. The following information shows up:

From the panel on the right you can see that the request is sent to https://www.zhihu.com/login/phone_num.

That is easy to understand: the login distinguishes between phone number and email, and here we are logging in with a phone number. I tested the email case as well; its URL is https://www.zhihu.com/login/email.
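To make the choice of endpoint concrete, here is a minimal sketch; choose_login_url is my own helper name rather than anything Zhihu provides, and the example accounts are made up.

import re

def choose_login_url(account):
    # Hypothetical helper: pick the login endpoint from the account format.
    if re.match(r'1\d{10}$', account):   # an 11-digit Chinese mobile number
        return 'https://www.zhihu.com/login/phone_num'
    if '@' in account:                   # anything with an @ is treated as an email
        return 'https://www.zhihu.com/login/email'
    raise ValueError('unrecognized account format: %s' % account)

print(choose_login_url('13812345678'))        # -> .../login/phone_num
print(choose_login_url('someone@example.com'))  # -> .../login/email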

Scroll down through the request details on the right:

Besides phone_num (the user name) and password, the form data contains several other important fields: _xsrf, captcha, and captcha_type.

So what do they mean? After quite a bit of digging, here is what I found:

_xsrf: CSRF/XSRF (Cross Site Request Forgery) protection is a well-known security mechanism. The first time you visit the Zhihu homepage, www.zhihu.com, the server sends your browser an _xsrf value and binds it to your session; from then on, every request your browser makes to Zhihu carries this field. The server checks that the _xsrf you send back is the one it issued to you; if the value is wrong or missing, the request is rejected. Practically every website sets an XSRF/CSRF token like this to block forged cross-site requests.

Since the browser attaches the _xsrf field to every request it sends to Zhihu, the crawler has to do the same: visit the Zhihu homepage the way a browser would, extract the _xsrf value, and submit it together with the login form.
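As a focused sketch of just this step (the same logic appears as get_xsrf in the full listing further down), something like the following works, assuming the token still sits in a hidden form field named _xsrf; the exact attribute order in Zhihu's HTML may differ from the regular expression used here.

import re
import requests

header = {
    'Host': 'www.zhihu.com',
    'Referer': 'https://www.zhihu.com',
    'User-Agent': 'Mozilla/5.0',   # any realistic browser UA string will do
}
session = requests.session()

# Fetch the homepage through the session (so its cookies are kept), then pull
# the _xsrf value out of the hidden form field with a regular expression.
response = session.get('https://www.zhihu.com', headers=header)
match_obj = re.search(r'name="_xsrf" value="(.*?)"', response.text)
xsrf = match_obj.group(1) if match_obj else ''
print('_xsrf =', xsrf)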

captcha: the img_size field in this value is the same every time ([200, 44] in my capture; it should be the size of the captcha image). The input_points that follow are the coordinates of the inverted characters in the verification code. Because the positions of the seven characters in the captcha are fixed, we only need to click each character once, try the login, and inspect the request to record the coordinate of each one; the crawler can then replay those clicks (clever, right?). Clicking through them in this way, I collected the seven coordinates, in order: [22.796875, 22], [42.796875, 22], [63.796875, 21], [84.796875, 20], [107.796875, 20], [129.796875, 22], [150.796875, 22].
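Here is a small sketch of how those fixed coordinates can be turned into the captcha field the form expects; build_captcha_field is a hypothetical helper of mine, and the [200, 44] image size is simply what the captured request showed.

import json

# Fixed coordinates of the seven characters, as measured above.
points = [[22.796875, 22], [42.796875, 22], [63.796875, 21], [84.796875, 20],
          [107.796875, 20], [129.796875, 22], [150.796875, 22]]

def build_captcha_field(seq, img_size=(200, 44)):
    # seq is a string such as "24": the 2nd and 4th characters are the inverted ones.
    input_points = [points[int(i) - 1] for i in seq]
    return json.dumps({"img_size": list(img_size), "input_points": input_points})

print(build_captcha_field('24'))
# -> {"img_size": [200, 44], "input_points": [[42.796875, 22], [84.796875, 20]]}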

captcha_type: this field is interesting, and it gives us a trick for switching to the digit/letter verification code. Set it to "cn" and you get the inverted-text captcha; set it to "en" (or leave the field out entirely) and you get the ordinary digit/letter captcha instead. There are plenty of write-ups online about cracking the digit/letter captcha, so you can search for those yourself; I won't cover that here.
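To make the switch concrete, here is a hedged sketch. The captcha_url helper is my own name; the article only demonstrates lang=cn for the captcha image, so treating lang=en as the digit/letter variant is an assumption on my part.

import time

def captcha_url(lang):
    # Request the captcha image; the r parameter is just a millisecond timestamp.
    # lang='cn' returns the inverted-text version; lang='en' is assumed to
    # return the digit/letter version.
    return 'https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=%s' % (
        int(time.time() * 1000), lang)

print(captcha_url('cn'))   # inverted-text captcha
print(captcha_url('en'))   # digit/letter captcha (assumption, see above)

# The matching form field: 'cn' for the inverted-text flow; set it to 'en'
# or simply omit it for the digit/letter flow, per the observations above.
post_data_cn = {'captcha_type': 'cn'}   # plus _xsrf, phone_num, password, captcha
post_data_en = {'captcha_type': 'en'}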

The idea is now clear: once we have these three fields figured out, we know exactly what the crawler has to send in order to log in. Without further ado, here is the code:

import requests
try:
    import cookielib                      # Python 2
except ImportError:
    import http.cookiejar as cookielib    # Python 3
import re
import time


def get_xsrf():
    # Fetch the Zhihu homepage and extract the _xsrf token from the hidden form field
    response = session.get('https://www.zhihu.com', headers=header)
    match_obj = re.match(r'[\s\S]*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    return ''


def get_captcha():
    # Request the inverted-text captcha (lang=cn); the r parameter is a millisecond timestamp
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn' % (int(time.time() * 1000))
    response = session.get(captcha_url, headers=header)
    # Save the verification code to the current directory
    with open('captcha.gif', 'wb') as f:
        f.write(response.content)
    # Automatically open the downloaded verification code so we can look at it
    from PIL import Image
    try:
        img = Image.open('captcha.gif')
        img.show()
        img.close()
    except:
        pass
    # Fixed coordinates of the seven characters in the captcha image
    points = [[22.796875, 22], [42.796875, 22], [63.796875, 21], [84.796875, 20],
              [107.796875, 20], [129.796875, 22], [150.796875, 22]]
    seq = input('Enter the positions of the inverted characters\n> ')
    s = ''
    for i in seq:
        s += str(points[int(i) - 1]) + ', '
    # Strip the trailing ", " and wrap everything in the JSON format Zhihu expects
    return '{"img_size": [200, 44], "input_points": [%s]}' % s[:-2]


def zhihu_login(account, password):
    # Log in with a mobile phone number (11 digits starting with 1)
    if re.match(r'1\d{10}', account):
        print('login by phone number')
        post_url = 'https://www.zhihu.com/login/phone_num'
        post_data = {
            'captcha_type': 'cn',
            '_xsrf': get_xsrf(),
            'phone_num': account,
            'password': password,
            'captcha': get_captcha(),
        }
        response_text = session.post(post_url, data=post_data, headers=header)
        session.cookies.save()


if __name__ == '__main__':
    agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
    header = {
        'Host': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com',
        'User-Agent': agent,
    }
    session = requests.session()
    # Use a file-backed cookie jar so session.cookies.save() above can persist the cookies
    session.cookies = cookielib.LWPCookieJar(filename='cookies.txt')
    zhihu_login('Enter the logon mobile phone number', 'Enter the logon password')

I'm sure you can understand the code.

One important point: the verification code must be requested with a session, and the same session must then be used to submit the login form. Why not plain requests? A session represents one continuous conversation with the site: after you access a website through a session, every later request made through that session automatically carries back the cookies (and any other per-session values) the site handed to us. Those cookies matter here, because Zhihu sets some of them whether or not we are logged in. Let's inspect the session with PyCharm's debugger:

You can see that the session already holds quite a few cookies. The cookies the server gave us when we fetched the verification code must be sent back to the Zhihu server with the login request, otherwise authentication fails. If we used plain requests for the login call, it would start a brand-new session, the cookies that came with the verification code would never reach the server, and the login would be rejected.
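As a minimal sketch of that point (cookies.txt is my own choice of filename for persisting the jar), this is how one shared session ties the captcha request and the login POST together:

import requests
try:
    import cookielib                      # Python 2
except ImportError:
    import http.cookiejar as cookielib    # Python 3

session = requests.session()
# Give the session a file-backed cookie jar so cookies survive between runs.
session.cookies = cookielib.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)   # reuse cookies from an earlier login
except Exception:
    print('No saved cookies yet, starting a fresh session.')

# Because the captcha download and the login POST both go through this one
# session, the cookies Zhihu set while serving the captcha are sent back
# automatically with the login request; two independent requests.get()/post()
# calls would each start with an empty cookie jar and the check would fail.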

And that's it! Just type the positions of the inverted characters: for example, if the second and fourth characters are inverted, type 24 and press Enter, and the coordinates are filled in automatically.

 
