Python crawler tutorial: simulating Zhihu login



Preface

As is well known, some pages cannot be crawled without logging in. For example, Zhihu's topic pages require users to log in before they can be accessed, and "login" is inseparable from the Cookie mechanism in HTTP.

Logon Principle

The principle behind Cookies is simple. HTTP is a stateless protocol, so to maintain session state on top of it and let the server know which client it is dealing with, Cookie technology was introduced: a Cookie is effectively an identifier the server assigns to the client.

  • When the browser initiates an HTTP request for the first time, it carries no Cookie information.
  • The server returns the HTTP response along with a Cookie to the browser.
  • On the second request, the browser sends the Cookie it received from the server back to the server.
  • When the server receives the request and finds a Cookie field in the request header, it knows it has dealt with this user before.
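As a minimal sketch of this exchange (the cookie name `session_id` and its value are invented for illustration), the standard-library `http.cookies` module can model both sides:

```python
from http.cookies import SimpleCookie

# Step 2: the server's response carries a Set-Cookie header
cookie = SimpleCookie()
cookie.load('session_id=abc123; Path=/; HttpOnly')
print(cookie['session_id'].value)  # abc123

# Step 3: on the next request, the browser echoes the pair back
# to the server in a "Cookie:" request header
header = '; '.join(f'{k}={morsel.value}' for k, morsel in cookie.items())
print(header)  # session_id=abc123
```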

Practical Application

Any Zhihu user knows that you can log in after providing a user name, password, and verification code. That, of course, is only what we see; the hidden technical details have to be dug out with a browser. Let's use Chrome to check what happens after we submit the form.

(If you are already logged in, log out first.) Open Zhihu's login page https://www.zhihu.com/#signin, open the Chrome developer tools (press F12), then deliberately enter an incorrect verification code and observe how the browser sends the request.

Several key pieces of information can be found in the browser's requests:

  • The login URL is https://www.zhihu.com/login/email
  • Four form fields are required for login: user name (email), password (password), verification code (captcha), and _xsrf.
  • The URL for obtaining the verification code is https://www.zhihu.com/captcha.gif?r=1490690391695&type=login

What is _xsrf? If you are familiar with CSRF (Cross-Site Request Forgery) attacks, you will recognize its role: _xsrf is a pseudo-random string used to prevent cross-site request forgery. It usually lives inside the page's form tag. To verify this, search for "xsrf" on the page; sure enough, _xsrf sits in a hidden input tag.

Now that we know how the browser obtains the data required for login, we can write code to simulate a browser login in Python. Login requires two third-party libraries, requests and BeautifulSoup. Install them with:

pip install beautifulsoup4==4.5.3
pip install requests==2.13.0

The http.cookiejar module can automatically handle HTTP cookies. Its LWPCookieJar object is a cookie container that supports saving cookies to a file and loading them back from it.
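To see the save/load cycle in isolation, here is a small sketch using only the standard library. The cookie name `z_c0` and its value are invented for illustration; in the real crawler the server's Set-Cookie response populates the jar, not hand-built Cookie objects:

```python
from http import cookiejar
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

jar = cookiejar.LWPCookieJar(filename=path)
# Build a cookie by hand (normally a Set-Cookie response does this).
c = cookiejar.Cookie(
    version=0, name='z_c0', value='token-value',
    port=None, port_specified=False,
    domain='www.zhihu.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)
jar.set_cookie(c)
jar.save(ignore_discard=True)  # write cookies.txt to disk

# A fresh jar (e.g. on the next run of the script) reloads it.
jar2 = cookiejar.LWPCookieJar(filename=path)
jar2.load(ignore_discard=True)
print([ck.name for ck in jar2])  # ['z_c0']
```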

The requests session object provides cookie persistence and connection pooling; requests can be sent through the session object.

First, load cookie information from the cookies.txt file. Because no cookie file exists on the first run, a LoadError exception is raised.

import requests
from http import cookiejar

session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except cookiejar.LoadError:
    print("load cookies failed")

Obtain xsrf

We have already located the tag that holds _xsrf; BeautifulSoup's find method makes it easy to obtain the value.

def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf
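The same extraction can be sketched offline with the standard-library html.parser, without a network request or BeautifulSoup. The HTML fragment and the token value below are invented to mimic the hidden input Zhihu serves:

```python
from html.parser import HTMLParser

class XsrfParser(HTMLParser):
    """Collect the value of <input name="_xsrf" value="..."> if present."""

    def __init__(self):
        super().__init__()
        self.xsrf = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == '_xsrf':
            self.xsrf = attrs.get('value')

# Invented page fragment standing in for the real login page.
html = '<form><input type="hidden" name="_xsrf" value="deadbeef1234"/></form>'
parser = XsrfParser()
parser.feed(html)
print(parser.xsrf)  # deadbeef1234
```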

Get Verification Code

The verification code is returned by the /captcha.gif interface. Here we download the captcha image and save it to the current directory for manual identification. Of course, a third-party library such as pytesser can be used for automatic recognition instead.

def get_captcha():
    """Save the verification code image to the current directory
    and identify it manually.
    :return: the captcha entered by the user
    """
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification Code: ")
    return captcha

Login

Once all the parameters are ready, you can call the login interface.

def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true'
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()

After a successful request, the session automatically stores the cookies returned by the server in its cookie object, and subsequent requests through the same session carry those cookies automatically, so pages that require login can be accessed.

Auto_login.py sample code

#!/usr/bin/env python
# encoding: utf-8
"""
author: liuzhijun
"""
import time
from http import cookiejar

import requests
from bs4 import BeautifulSoup

headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87",
}

# Reuse login cookie information across runs
session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    print(session.cookies)
    session.cookies.load(ignore_discard=True)
except cookiejar.LoadError:
    print("No cookie information")


def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf


def get_captcha():
    """Save the verification code image to the current directory
    and identify it manually.
    :return: the captcha entered by the user
    """
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification Code: ")
    return captcha


def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true'
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()


if __name__ == '__main__':
    email = "xxxx"
    password = "xxxxx"
    login(email, password)

Github Source Code address: https://github.com/lzjun567/crawler_html2pdf/blob/master/zhihu/auto_login.py

Summary

That is all there is to simulating a login with a Python crawler. I hope this article helps you learn or use Python. If you have any questions, please leave a message. Thank you for your support.
