Zhihu now uses a CAPTCHA in which you click the inverted characters in an image:
Users must click the inverted characters in the image to sign in.
This makes things harder for a crawler, but not impossible. After a day of patient searching I can finally identify the CAPTCHA manually and log in successfully, and I will share how below.
The first thing to learn when writing a spider is which fields the browser sends to the server. (I use Safari for the demo; Chrome and Firefox work too, of course.)
Click the first and second characters:
Right-click and choose Inspect Element, then click Log in to see:
The panel on the right shows that the request is sent to the URL: https://www.zhihu.com/login/phone_num
It is easy to see that login distinguishes between phone number and email. We log in with a phone number here; in my test the URL for email login is https://www.zhihu.com/login/email, which is not covered here.
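As a minimal sketch, picking the endpoint from the account string could look like this. The phone pattern is the one used in the full code further down (applied here with the stricter fullmatch); the helper name login_endpoint and the 'email' form field name are my own assumptions, since email login is not tested in this article.

```python
import re

PHONE_LOGIN_URL = 'https://www.zhihu.com/login/phone_num'
EMAIL_LOGIN_URL = 'https://www.zhihu.com/login/email'


def login_endpoint(account):
    """Return (post_url, account_field) for a phone number or an email address."""
    # 11 digits starting with 1, the same pattern as in the article's code.
    if re.fullmatch(r'1\d{10}', account):
        return PHONE_LOGIN_URL, 'phone_num'
    if '@' in account:
        # Assumption: the email endpoint expects a form field named 'email'.
        return EMAIL_LOGIN_URL, 'email'
    raise ValueError('account is neither a phone number nor an email address')


print(login_endpoint('13800000000'))
```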
Scroll down in the resource panel on the right:
Besides phone_num (the user name) and password (the password), we see several other important fields: _xsrf, captcha and captcha_type.
So what do they mean? After a lot of searching, here is the explanation:
_xsrf: the name comes from "cross-site request forgery" (CSRF/XSRF), and the field is a token that guards against it. When you first visit the home page www.zhihu.com, the server automatically sends an _xsrf value to your browser and binds it to your host. On every later request your browser sends this value back, so the server can confirm that the _xsrf you sent is the one it gave you; if _xsrf is wrong or missing, the request is refused. This is a security mechanism, and most sites set an XSRF/CSRF field to protect against attacks.
Since the browser sends the _xsrf field with every request, our crawler must imitate the browser: visit the home page first to get _xsrf, then submit it at login in order to log in successfully.
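Pulling the token out of the home page can be sketched as below. The function name extract_xsrf and the sample HTML fragment are hypothetical; the sketch assumes the token sits in a hidden input named _xsrf, as the regular expression in the full code implies.

```python
import re


def extract_xsrf(html):
    """Return the value of the hidden _xsrf input, or '' if it is absent."""
    match = re.search(r'name="_xsrf"\s+value="(.*?)"', html)
    return match.group(1) if match else ''


# Hypothetical fragment standing in for the real home page.
sample_html = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
print(extract_xsrf(sample_html))  # abc123
```

In the crawler the html argument would be the text of the home page response, fetched before the login request is sent.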
captcha: the "img_size" field is fixed at [200,44] every time and is presumably the image size. "input_points" holds the coordinates of the inverted characters you clicked. Because the seven character positions in the CAPTCHA are fixed, we can click each character in turn, log in, and inspect the request to find the coordinates of each one; after that the clicks can be simulated (a bit of an aha moment). This step has to be done by hand. The coordinates of the seven characters from my test are, in order: [22.796875,22], [42.796875,22], [63.796875,21], [84.796875,20], [107.796875,20], [129.796875,22], [150.796875,22].
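Turning a list of positions into the captcha field might look like the sketch below. It uses the seven coordinates measured above; the helper name build_captcha is my own, and it assumes the server accepts any valid JSON rather than one exact spacing.

```python
import json

# Coordinates measured in the author's test for the seven character positions.
POINTS = [[22.796875, 22], [42.796875, 22], [63.796875, 21], [84.796875, 20],
          [107.796875, 20], [129.796875, 22], [150.796875, 22]]


def build_captcha(positions):
    """Build the captcha field from 1-based positions such as '24'."""
    clicked = []
    for ch in positions:
        index = int(ch)
        if not 1 <= index <= len(POINTS):
            raise ValueError('position out of range: %s' % ch)
        clicked.append(POINTS[index - 1])
    return json.dumps({'img_size': [200, 44], 'input_points': clicked},
                      separators=(',', ':'))


print(build_captcha('24'))
# {"img_size":[200,44],"input_points":[[42.796875,22],[84.796875,20]]}
```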
captcha_type: this field is interesting, because it offers a small trick for switching to the alphanumeric CAPTCHA. Set to "cn" it gives the inverted-text CAPTCHA; set to "en" it gives the alphanumeric one. I set it to "cn" here; there is plenty of material online about alphanumeric CAPTCHAs, so you can look that up yourself. (In my test, leaving this field out also gives the alphanumeric CAPTCHA.)
At this point the idea is clear: with these three pieces of information we know how to make the crawler log in. Without further ado, here is the code:
import re
import time

import requests

try:
    import cookielib
except ImportError:
    import http.cookiejar as cookielib


def get_xsrf():
    # Get the _xsrf code from the home page
    response = requests.get('https://www.zhihu.com', headers=header)
    # print(response.text)
    match_obj = re.match(r'[\s\S]*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    return ''


def get_captcha():
    # The captcha URL is named with a millisecond timestamp
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn' % (int(time.time() * 1000))
    response = session.get(captcha_url, headers=header)
    # Save the captcha image to the current directory
    with open('captcha.gif', 'wb') as f:
        f.write(response.content)
    # Automatically open the captcha image that was just downloaded
    from PIL import Image
    try:
        img = Image.open('captcha.gif')
        img.show()
        img.close()
    except Exception:
        pass
    points = [[22.796875, 22], [42.796875, 22], [63.796875, 21], [84.796875, 20],
              [107.796875, 20], [129.796875, 22], [150.796875, 22]]
    seq = input('Please enter the positions of the inverted characters\n>')
    s = ''
    for i in seq:
        s += str(points[int(i) - 1]) + ','
    return '{"img_size":[200,44],"input_points":[%s]}' % s[:-1]


def zhihu_login(account, password):
    # Zhihu login
    if re.match(r'1\d{10}', account):
        print('Mobile phone number login')
        post_url = 'https://www.zhihu.com/login/phone_num'
        post_data = {
            'captcha_type': 'cn',
            '_xsrf': get_xsrf(),
            'phone_num': account,
            'password': password,
            'captcha': get_captcha(),
        }
        response_text = session.post(post_url, data=post_data, headers=header)
        session.cookies.save()


if __name__ == '__main__':
    agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
    header = {
        'HOST': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com',
        'User-Agent': agent,
    }
    session = requests.session()
    # session.cookies.save() needs a file-backed jar; the file name here is an illustrative choice
    session.cookies = cookielib.LWPCookieJar(filename='cookies.txt')
    zhihu_login('Enter the phone number of the login', 'Enter password for login')
The code should be readable as it stands, so I will only point out one important thing.
It is essential to request the CAPTCHA with the session and then submit the form with the same session. Why not call requests directly? A session is a conversation: if you visit a site with a session and then use that same session for the next request, it sends back all the cookies and session fields the site gave us. These cookies matter a lot. Whether or not we are logged in, the server can put values in the response headers. Inspecting the session in the PyCharm debugger shows:
You can see that it holds a lot of cookies. The server sends them when we fetch the CAPTCHA, and they must be sent back to the server for verification to succeed. If you call requests directly at login, it creates a new session that does not carry the cookies from the CAPTCHA request, and verification fails.
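The difference between a kept cookie jar and a fresh one can be sketched offline with the standard library's http.cookiejar, the same kind of jar that a requests session keeps internally. This is an illustration rather than the article's code: the cookie name captcha_token and its value are hypothetical, and make_cookie and cookie_header_for are helper names of my own.

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain):
    """Build a simple session cookie for the given domain."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_header_for(jar, url):
    """Return the Cookie header the jar would attach to a request for url."""
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)
    return request.get_header('Cookie')


login_url = 'https://www.zhihu.com/login/phone_num'

# Jar kept across requests: it still holds what the captcha response set.
kept_jar = CookieJar()
kept_jar.set_cookie(make_cookie('captcha_token', 'abc', 'www.zhihu.com'))  # hypothetical cookie

# Fresh jar: what a brand-new session starts with.
fresh_jar = CookieJar()

print(cookie_header_for(kept_jar, login_url))   # captcha_token=abc
print(cookie_header_for(fresh_jar, login_url))  # None
```

The kept jar sends the cookie back with the login request, while the fresh jar sends nothing, which is the situation a direct requests call at login would be in.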
All right, done! Just enter the positions of the inverted characters. For example, if the second and fourth characters are inverted, type 24 and press Enter, and the coordinates are added automatically. Pretty neat!
Python crawler, Scrapy framework: manually recognizing Zhihu's inverted-text CAPTCHA and alphanumeric CAPTCHA