An example of simulating login with a Python crawler

Source: Internet
Author: User
Tags: http, cookie, chrome developer tools
When crawling, some pages are off limits until you log in, so the crawler has to simulate the login process. This article walks through how to simulate logging in with a Python crawler in detail; readers who need it can follow along.

Objective

Anyone who writes crawlers regularly knows that some pages are blocked before login. For example, Zhihu's topic pages require the user to be logged in before they can be accessed, and "login" is inseparable from HTTP Cookie technology.

Login principle

The principle behind cookies is simple. HTTP is a stateless protocol, so in order to maintain session state on top of it and let the server know which client it is currently dealing with, Cookie technology was invented: a cookie is essentially an identity badge that the server assigns to the client.

    • When the browser initiates an HTTP request for the first time, it does not carry any Cookie information

    • The server responds with an HTTP message with a Cookie that is returned to the browser

    • A second request from the browser sends the Cookie information returned by the server to the server

    • The server receives an HTTP request and discovers that there is a cookie field in the request header and knows that it has dealt with the user before.
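The four steps above can be sketched with the standard library's http.cookies module; the session_id cookie name and value here are invented for illustration:

```python
from http.cookies import SimpleCookie

# Step 2: the server's first response carries a Set-Cookie header
set_cookie_header = 'session_id=abc123; Path=/; HttpOnly'

# The browser parses the header and stores the cookie
jar = SimpleCookie()
jar.load(set_cookie_header)

# Step 3: on the next request the browser sends the stored value
# back to the server in a Cookie request header
cookie_header = '; '.join(f'{key}={morsel.value}' for key, morsel in jar.items())
print(cookie_header)  # session_id=abc123
```

A requests session does this bookkeeping automatically, which is why the rest of this article sends every request through one session object.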

Actual application

Zhihu lets you log in once you provide a user name, a password, and a verification code. Of course, that is just what we see on the surface; the technical details hidden behind it need to be explored with the help of a browser. Let's use Chrome to see what happens when we submit the form.

(If you are already logged in, log out first.) Go to the login page www.zhihu.com/#signin and open the Chrome developer tools (press F12), then deliberately enter a wrong verification code to watch how the browser sends the request.

Several key pieces of information can be found in the browser's requests:

    • The login URL is https://www.zhihu.com/login/email

    • Logging in requires 4 form fields: user name (email), password (password), verification code (captcha), and _xsrf.

    • The URL for fetching the verification code is https://www.zhihu.com/captcha.gif?r=1490690391695&type=login

What is _xsrf? If you are familiar with CSRF (cross-site request forgery) attacks, you already know its purpose: _xsrf is a pseudo-random string used to prevent cross-site request forgery. It usually lives in a form tag on the page. To confirm this, search the page source for "xsrf"; sure enough, _xsrf sits in a hidden input tag.
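As a quick check of this technique, BeautifulSoup can pull a hidden _xsrf value out of a form; the HTML fragment below is made up to mimic the login page's markup:

```python
from bs4 import BeautifulSoup

# A made-up fragment resembling the login form's hidden input
html = '''
<form>
  <input type="hidden" name="_xsrf" value="0a1b2c3d4e5f"/>
  <input type="text" name="email"/>
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
# find() locates the first <input> whose name attribute is "_xsrf"
xsrf = soup.find('input', attrs={'name': '_xsrf'}).get('value')
print(xsrf)  # 0a1b2c3d4e5f
```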

Once you have figured out how the browser captures the data needed to log in, you can start writing code that makes Python imitate the browser. Logging in relies on two third-party libraries, requests and BeautifulSoup, so install them first:

pip install beautifulsoup4==4.5.3
pip install requests==2.13.0

The http.cookiejar module automates the handling of HTTP cookies. Its LWPCookieJar object is a cookie container that supports saving cookies to a file and loading them back from it.

The session object provides cookie persistence and connection pooling, and requests can be sent through it.

Cookie information is first loaded from the cookies.txt file. On the first run the file does not exist yet, so the load raises an exception, which we catch.

from http import cookiejar

import requests

session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except (cookiejar.LoadError, FileNotFoundError):
    print("load cookies failed")

Get XSRF

The tag that holds _xsrf has already been located, and BeautifulSoup's find method makes it easy to get its value:


def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf

Get Verification Code

The verification code is returned by the /captcha.gif interface. Here we download the captcha image to the current directory and identify it manually; you could of course use a third-party library such as pytesser to recognize it automatically.

def get_captcha():
    """Save the captcha image to the current directory and identify it manually.
    :return: the verification code typed in by the user
    """
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification code: ")
    return captcha

Login

Once all the parameters are ready, you can request the login interface.

def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true'
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()

Once the request succeeds, the session automatically stores the cookie information returned by the server in the session.cookies object, and on subsequent requests the client automatically carries these cookies, so it can access pages that require login.
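The persistence half of this can be exercised without touching the network. The sketch below plants a made-up z_c0 cookie in one session's LWPCookieJar, saves the jar to a temporary file, and reloads it in a fresh session; in a real run the cookie would come from the server's Set-Cookie response header instead:

```python
import os
import tempfile
from http import cookiejar

import requests
from requests.cookies import create_cookie

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

# First session: plant a cookie by hand and save the jar to disk.
# The z_c0 name and value are invented for this demonstration.
session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename=path)
session.cookies.set_cookie(
    create_cookie(name='z_c0', value='demo-token', domain='www.zhihu.com'))
session.cookies.save(ignore_discard=True)

# Second session: reload the saved cookies from the same file
session2 = requests.session()
session2.cookies = cookiejar.LWPCookieJar(filename=path)
session2.cookies.load(ignore_discard=True)
print({c.name: c.value for c in session2.cookies})  # {'z_c0': 'demo-token'}
```

This round trip is exactly what lets auto_login.py skip the login form on every run after the first.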

auto_login.py Sample Code

#!/usr/bin/env python
# encoding: utf-8
"""
Author: Liuzhijun
"""
import time
from http import cookiejar

import requests
from bs4 import BeautifulSoup

headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87"
}

# Reuse saved login cookie information across runs
session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    print(session.cookies)
    session.cookies.load(ignore_discard=True)
except (cookiejar.LoadError, FileNotFoundError):
    print("No cookie information")


def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf


def get_captcha():
    """Save the captcha image to the current directory and identify it manually.
    :return: the verification code typed in by the user
    """
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification code: ")
    return captcha


def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true'
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()


if __name__ == '__main__':
    email = 'xxxx'
    password = 'xxxxx'
    login(email, password)

