When writing a crawler, some pages are blocked until you log in, so the crawler needs to simulate the login process. This article walks through how to simulate a login with a Python crawler in detail; readers who need it can follow along below.
Objective
Those who often write crawlers know that some pages are blocked before logging in. For example, Zhihu's topic pages require users to log in before they can be accessed, and "login" is inseparable from HTTP Cookie technology.
Login principle
The principle behind cookies is simple. HTTP is a stateless protocol, so to maintain session state on top of it and let the server know which client it is currently dealing with, Cookie technology was introduced. A cookie is equivalent to an identity card that the server assigns to the client:

When the browser initiates an HTTP request for the first time, it does not carry any Cookie information.
The server responds with an HTTP message that returns a Cookie to the browser.
On its second request, the browser sends the Cookie returned by the server back to the server.
The server receives the HTTP request, finds a Cookie field in the request headers, and knows it has dealt with this user before. A short sketch right after this list illustrates the handshake.
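To make the handshake concrete, here is a minimal sketch using the requests library (introduced later in this article). It is not part of the original tutorial: it assumes network access to httpbin.org, a public echo service used here only as a demo endpoint.

import requests

session = requests.session()
# First request: no Cookie header is sent; the server's response sets one.
# httpbin.org simply echoes cookies back, which makes the flow visible.
session.get("https://httpbin.org/cookies/set?sessionid=abc123")
# Second request: the session automatically carries the stored cookie.
resp = session.get("https://httpbin.org/cookies")
print(resp.json())  # {'cookies': {'sessionid': 'abc123'}}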
Actual application
On Zhihu, you can log in by providing a user name and password plus a verification code. Of course, that is just what we see with our eyes; the hidden technical details need to be explored with the help of a browser. Let's use Chrome to see what happens when we fill out and submit the form.
(If you are already logged in, log out first.) Go to the login page www.zhihu.com/#signin, open the Chrome developer tools (press F12), and first try entering an incorrect verification code to observe how the browser sends the request.
Several key pieces of information can be found in the browser's request:
The login URL is https://www.zhihu.com/login/email
Logging in requires four form fields: user name (email), password (password), verification code (captcha), and _xsrf.
The URL for fetching the verification code is https://www.zhihu.com/captcha.gif?r=1490690391695&type=login (the r parameter is a millisecond timestamp).
What is _xsrf? If you are familiar with CSRF (cross-site request forgery) attacks, then you must know what it does: _xsrf is a pseudo-random value used to prevent cross-site request forgery. It usually lives in a form tag on the page. To confirm this, you can search the page source for "xsrf"; sure enough, _xsrf sits in a hidden input tag, something like <input type="hidden" name="_xsrf" value="..."> (the value shown here is illustrative).
Once you have figured out how the browser captures the data needed to log in, you can start writing code to simulate a browser login in Python. The two third-party libraries the login relies on are requests and BeautifulSoup; install them first:
pip install beautifulsoup4==4.5.3
pip install requests==2.13.0
The http.cookiejar module can be used to handle HTTP cookies automatically. Its LWPCookieJar object is an encapsulation of cookies that supports saving cookies to a file and loading them back from a file.
The requests session object provides cookie persistence and connection pooling, and requests can be sent through it.
The cookie information is loaded from the cookies.txt file first; because there is no cookie file on the first run, a LoadError exception occurs.
from http import cookiejar
import requests

session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except cookiejar.LoadError:
    print("load cookies failed")
Get _xsrf
The tag where _xsrf lives has been found above, and the BeautifulSoup find method makes it easy to get its value.
def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf
Get Verification Code
The verification code is returned through the /captcha.gif interface. Here we download the CAPTCHA image, save it to the current directory, and identify it by hand; of course, you can also use a third-party library such as pytesser to recognize it automatically.
def get_captcha():
    """Save the CAPTCHA image to the current directory and identify it manually."""
    t = str(int(time.time() * 1000))  # millisecond timestamp, matching the r parameter
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification Code: ")
    return captcha
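For readers who prefer automatic recognition over manual input, here is a minimal sketch using pytesseract, a maintained alternative to the pytesser library mentioned above (this substitution is my assumption, not part of the original tutorial). It requires the Tesseract OCR binary plus the Pillow and pytesseract packages, and accuracy on distorted CAPTCHAs is often poor, so manual input remains the fallback.

from PIL import Image
import pytesseract

def recognize_captcha(path='captcha.jpg'):
    # Run OCR on the saved CAPTCHA image and return the recognized text.
    return pytesseract.image_to_string(Image.open(path)).strip()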
Login
Once all the parameters are ready, you can request the login interface.
def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true'
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()
Once the request succeeds, the session automatically populates the session.cookies object with the cookie information returned by the server, and on subsequent requests the client automatically carries these cookies to access pages that require login.
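As a quick illustration (not part of the original tutorial), once cookies.txt holds a valid login cookie, the same session can fetch a login-only page directly. The URL below is a hypothetical example of such a page.

# Hypothetical login-only page, used purely for illustration.
resp = session.get("https://www.zhihu.com/settings/profile", headers=headers)
print(resp.status_code)  # 200 if the saved cookies are still valid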
auto_login.py Sample Code
#!/usr/bin/env python
# encoding: utf-8
"""
Author: Liuzhijun
"""
import time
from http import cookiejar

import requests
from bs4 import BeautifulSoup

headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87"
}

# Reuse saved login cookie information
session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except cookiejar.LoadError:
    print("No cookie information")


def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf


def get_captcha():
    """Save the CAPTCHA image to the current directory and identify it manually."""
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification Code: ")
    return captcha


def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true'
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()


if __name__ == '__main__':
    email = 'xxxx'
    password = 'xxxxx'
    login(email, password)
"Recommended"
1. Python Crawler Primer (4)--Detailed HTML text parsing library BeautifulSoup
2. Python Crawler Introduction (3)--using requests to build a knowledge API
3. Python crawler Primer (2)--http Library requests
4. Python crawler Primer (1)--Quick understanding of HTTP protocol
5. Python Crawler Primer (5)--Regular Expression Example tutorial