There are two common ways to sign in:
- Request the login page to obtain a CSRF token and a cookie, submit them together with the credentials, then receive the logged-in cookie
- Send a POST request directly and receive a cookie
That is only a brief summary. The rest of this post describes in detail how a crawler handles each of these two kinds of login.
First case
This case is the more common one: many websites now use the first login method. GitHub is used as the example here:
Analyzing the page
Get the authenticity_token
The login page submits an ordinary HTML form, which can be inspected with the developer tools in Google Chrome.
Inside the form there is a hidden authenticity_token field,
so before logging in, the code must first request the login page and read this authenticity_token value from it.
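As a rough illustration of that step, the token can be pulled out of the page HTML with nothing but the standard library. The sample HTML below is a made-up stand-in for the real login page, and `TokenParser` and `extract_token` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collects the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attributes = dict(attrs)
        if (tag == "input"
                and attributes.get("name") == "authenticity_token"
                and self.token is None):
            self.token = attributes.get("value")


def extract_token(html):
    """Return the authenticity_token found in html, or None if absent."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


# Made-up stand-in for the real login page (assumption, not GitHub's markup)
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

print(extract_token(SAMPLE_LOGIN_HTML))
```

Returning None when the field is missing lets the caller decide whether to retry or stop, instead of failing on a missing key.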
Get the cookie from the login page
The Set-Cookie header in the response to this request carries the cookie for the login page.
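To make the header concrete, here is a small sketch that turns one Set-Cookie value into a name/value dict using the standard library's `http.cookies`. The header string and the cookie name `login_page_cookie` are invented for illustration; a real response will carry different names and values.

```python
from http.cookies import SimpleCookie

# Invented header value (assumption); real responses differ
SAMPLE_SET_COOKIE = "login_page_cookie=abc123; path=/; secure; HttpOnly"


def cookie_dict_from_header(header_value):
    """Turn one Set-Cookie header value into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    # Attributes such as path and secure live on the morsel, not in the dict
    return {name: morsel.value for name, morsel in jar.items()}


print(cookie_dict_from_header(SAMPLE_SET_COOKIE))
```

This is roughly the shape of data the login code later passes along as its cookies.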
Analyze the login request to find the submission address
After entering a username and password and clicking submit, the address can be found in the captured request; it is the POST request that submits the form data.
Requested address: https://github.com/session
The request parameters are:
"commit": "Sign in",
"utf8": "✓",
"authenticity_token": "km6q0mm9fti95wysi/wu3bnambyrmv60c0ytqlzjbuauya193lp2gd8btcmqbsfvpfzrlk3/1tfonoggudy7ig==",
"login": "[email protected]",
"password": "123"
The submitted parameters include authenticity_token, and its value has to be obtained from the login page first.
After a successful login:
Visiting GitHub again shows two additional cookies; these are the ones added by the login.
So to log in from a program, the cookie information has to be collected again once the login succeeds,
and that cookie is then used to access other pages of the account, such as the personal profile settings page:
https://github.com/settings/profile
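The form body for that POST can be sketched with `urllib.parse.urlencode`, which encodes the fields the way a browser form submission does. The lower-case field names and the `✓` value for utf8 are assumptions about how the form names and fills them, and `build_login_body` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs


def build_login_body(token, username, password):
    """Encode the login form fields as application/x-www-form-urlencoded."""
    fields = {
        "commit": "Sign in",
        "utf8": "✓",  # assumed value of the utf8 field
        "authenticity_token": token,
        "login": username,
        "password": password,
    }
    return urlencode(fields)


# Placeholder credentials and token, not real ones
body = build_login_body("abc/123+x==", "user@example.com", "secret")
print(body)

# Decoding the body gives back the original token unchanged
print(parse_qs(body)["authenticity_token"])
```

The token often contains characters such as `/`, `+` and `=`, so it matters that they are percent-encoded rather than pasted into the body by hand.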
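A small sketch of what "using the cookie" amounts to: the cookies from before and after login are combined and sent back in a Cookie request header. The cookie names below are invented for illustration, and `merge_cookies` and `cookie_header` are hypothetical helpers.

```python
def merge_cookies(before_login, after_login):
    """Combine two cookie dicts; values set after login win on a name clash."""
    merged = dict(before_login)
    merged.update(after_login)
    return merged


def cookie_header(cookies):
    """Format a cookie dict as the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Invented names and values (assumption)
before = {"sess": "a"}
after = {"sess": "b", "user_session": "x", "logged_in": "yes"}

print(cookie_header(merge_cookies(before, after)))
```

An HTTP library normally builds this header itself when given a cookie dict; the sketch only shows what ends up on the wire.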
Code implementation
The following code logs in and then accesses https://github.com/settings/repositories
import requests
from bs4 import BeautifulSoup

base_url = "https://github.com/login"
login_url = "https://github.com/session"


def get_github_html(url):
    """
    Fetch the HTML of the login page together with its cookie.
    :param url: https://github.com/login
    :return: the HTML of the login page and the first cookie
    """
    response = requests.get(url)
    first_cookie = response.cookies.get_dict()
    return response.text, first_cookie


def get_token(html):
    """
    Extract the CSRF token from the login page HTML.
    :param html: HTML of the login page
    :return: the authenticity_token to submit with the login POST
    """
    soup = BeautifulSoup(html, 'lxml')
    res = soup.find("input", attrs={"name": "authenticity_token"})
    token = res["value"]
    return token


def github_login(url, token, cookie):
    """
    Log in and return the resulting cookie.
    :param url: https://github.com/session
    :param token: CSRF token
    :param cookie: cookie obtained from the first request to the login page
    :return: the cookie returned by the login request
    """
    data = {
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": token,
        "login": "your GitHub account",
        "password": "ru10150417521"
    }
    response = requests.post(url, data=data, cookies=cookie)
    print(response.status_code)
    cookies = response.cookies.get_dict()
    # GitHub used to require merging the cookies from the two requests,
    # which is what the commented-out line below did.
    # That is no longer needed; the cookie from the login response is enough.
    # cookie.update(second_cookie)
    return cookies


if __name__ == '__main__':
    html, cookie = get_github_html(base_url)
    token = get_token(html)
    cookie = github_login(login_url, token, cookie)
    response = requests.get("https://github.com/settings/repositories", cookies=cookie)
    print(response.text)
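As a possible variation, the same flow can be written against a session object that stores cookies itself, so no cookie dict has to be passed by hand. This is only a sketch: `login_with_session` is a hypothetical function, the regular expression assumes the `name` attribute comes directly before `value` in the input tag, and the `session` argument is assumed to offer `get` and `post` the way `requests.Session` does.

```python
import re

LOGIN_PAGE = "https://github.com/login"
LOGIN_POST = "https://github.com/session"

# Assumes name="..." is immediately followed by value="..." in the tag
TOKEN_RE = re.compile(r'name="authenticity_token"\s+value="([^"]+)"')


def login_with_session(session, username, password):
    """Run the fetch-token-then-post flow on a cookie-keeping session."""
    page = session.get(LOGIN_PAGE)
    match = TOKEN_RE.search(page.text)
    if match is None:
        raise ValueError("authenticity_token not found on the login page")
    data = {
        "commit": "Sign in",
        "authenticity_token": match.group(1),
        "login": username,
        "password": password,
    }
    # No cookies= argument: the session resends what it received earlier
    return session.post(LOGIN_POST, data=data)
```

With the requests library installed, calling `login_with_session(requests.Session(), ...)` and then issuing further `get` calls on that same session should carry the login cookie along automatically.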
The second case
Jobbole (Bole Online) is used as the example here. This case is simpler than the first: there is little to analyze. Send a POST request directly, take the cookie from the response, and use that cookie to access other pages. A code example follows:
http://www.jobbole.com/bookmark/ is a page that can only be accessed after login; otherwise the site sends the visitor straight back to the login page.
The key point: http://www.jobbole.com/wp-admin/admin-ajax.php is the address the login request is sent to, which can be seen in the captured traffic.
import requests


def login():
    # Address of the login request, taken from the captured traffic
    url = "http://www.jobbole.com/wp-admin/admin-ajax.php"
    data = {
        "action": "user_login",
        "user_login": "zhaofan1015",
        "user_pass": '******',
    }
    response = requests.post(url, data)
    cookie = response.cookies.get_dict()
    print(cookie)
    # This page is only available after login, so it needs the cookie
    url2 = "http://www.jobbole.com/bookmark/"
    response2 = requests.get(url2, cookies=cookie)
    print(response2.text)


login()
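For comparison, the direct-POST request can also be assembled with the standard library alone. The sketch below only builds the request and does not send it; `build_login_request` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "http://www.jobbole.com/wp-admin/admin-ajax.php"


def build_login_request(username, password):
    """Build (but do not send) the POST request for the direct-login case."""
    body = urlencode({
        "action": "user_login",
        "user_login": username,
        "user_pass": password,
    }).encode("ascii")
    # Supplying data makes urllib treat the request as a POST
    return Request(LOGIN_URL, data=body)


# Placeholder credentials, not a real account
req = build_login_request("someone", "secret")
print(req.get_method(), req.full_url)
print(req.data)
```

Sending it through an opener built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))` would keep the returned cookie for later requests, much as the cookie dict does in the requests version.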
Python crawler blog post about login