Python crawler simulated login to Zhihu

I previously wrote a blog post about crawling Movie Heaven resources with a Python crawler, focusing on how to parse pages and improve crawler efficiency. Because every user has the same permission to fetch resources on Movie Heaven, no login verification is required. After writing that article, I spent some time studying simulated login in Python. There is a lot of material about this on the Internet, and many demos use Zhihu as the login example. The reason is that Zhihu's login is relatively simple: you only need to POST a few parameters and save the cookie, and the data is not encrypted yet, so it is well suited for learning. I am also a newbie, and after some fumbling I managed to log in successfully. I will share my experience in this article and hope it helps beginners like me.

First, let's look at the basic principle of a crawler's simulated login. I was just getting started and did not know much about the deeper details, but the cookie is the key concept. We all know that HTTP is a stateless protocol: after the browser client submits a request and the server sends back a response, the connection between them ends. As a result, when the same client sends another request, the server has no way of knowing that the two requests came from the same client. That obviously will not do, and this is where the cookie comes in. After the client sends a request, the server assigns it a cookie, which is saved on the client side; the next time the client sends a request, it sends the cookie along with it. When the server sees the cookie, it recognizes the client and hands back that client's data. To simulate a login, a crawler has to imitate this browser behavior: first send your login information to the specified URL, and after the server verifies it successfully, it returns a cookie; we can then use this cookie for subsequent crawling.
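To make that round trip concrete, here is a minimal sketch of my own (not from the original article) using the same urllib2/cookielib pair as the code further down. httpbin.org is only a stand-in test server that echoes back the cookies it receives; it has nothing to do with Zhihu.

# A minimal cookie round-trip sketch, assuming httpbin.org is reachable.
import cookielib
import urllib2

# Build an opener whose CookieJar remembers whatever cookies the server sets.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# First request: the server sets a cookie and the jar stores it.
opener.open('http://httpbin.org/cookies/set?session=abc123')

# Second request: the same opener sends the stored cookie back automatically,
# so the server can tell both requests came from the same client.
print opener.open('http://httpbin.org/cookies').read()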

I use Chrome developer tools to capture the packets here; you can also use Fiddler or Firebug, but as a front-end developer I have a soft spot for Chrome. With the tool ready, we open the login page https://www.zhihu.com/#signin and watch the network panel, where we can easily spot the request that sends the login information. I log in with a phone number here; if you log in with an email address instead, the last segment of the login URL is email rather than phone_num.

So we only need to POST the following data to this address:

  • phone_num: the login name (phone number)

  • password: the password

  • captcha_type: the verification code type (this parameter does not actually matter here)

  • remember_me: remember the password

  • _xsrf: a hidden form field used to defend against CSRF (cross-site request forgery). I found that this value stays fixed, so I hard-coded it here; if you are interested, you can write a regular expression to extract it from the page, which is more rigorous (see the sketch below).
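If you want to go the more rigorous route, here is a hedged sketch of pulling _xsrf out of the sign-in page with a regular expression. The exact markup (a hidden input named "_xsrf" with name appearing before value) is my assumption about the page's HTML, so the pattern may need adjusting to whatever you see in the page source.

import re
import urllib2

# Fetch the sign-in page; a browser-like User-Agent helps avoid being rejected outright.
request = urllib2.Request('https://www.zhihu.com/#signin',
                          headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()

# Assumed markup: <input type="hidden" name="_xsrf" value="...">
match = re.search(r'name="_xsrf"\s+value="([^"]+)"', html)
xsrf = match.group(1) if match else ''
print xsrf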

# -*- coding: utf-8 -*-
import urllib2
import urllib
import cookielib

# Login endpoint found with Chrome developer tools (phone-number login)
posturl = 'https://www.zhihu.com/login/phone_num'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36',
    'Referer': 'https://www.zhihu.com/'
}

value = {
    'password': '******',
    'remember_me': True,
    'phone_num': '******',
    '_xsrf': '******'
}
data = urllib.urlencode(value)

# Initialize a CookieJar to handle the cookie
cookieJar = cookielib.CookieJar()
cookie_support = urllib2.HTTPCookieProcessor(cookieJar)

# Instantiate a global opener that carries the cookie on every request
opener = urllib2.build_opener(cookie_support)

request = urllib2.Request(posturl, data, headers)
result = opener.open(request)
print result.read()

When you see this message returned by the server, it means you have logged in successfully:

{"R": 0, "msg": "\ u767b \ u5f55 \ u6210 \ u529f "}
# The words "Login successful" are translated
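As a small optional check (my addition, not the article's code), you can parse that reply instead of just printing it. This assumes the body is read into a variable once (it can only be read a single time) and keeps the r/msg layout shown above.

import json

body = result.read()          # read the response body once instead of printing it
reply = json.loads(body)
if reply.get('r') == 0:
    print u'login successful:', reply['msg']
else:
    print u'login failed:', body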

Then you can crawl pages on Zhihu under this identity.

page = opener.open("https://www.zhihu.com/people/yu-yi-56-70")
content = page.read().decode('utf-8')
print(content)

This works because the opener we instantiated holds the cookie information saved after the successful login, so requests made through the same opener are served the complete page that only this identity can see. More complicated cases, such as Weibo's login, encrypt the request data; I will write that up later and share it with you.
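One extension I find handy (my own addition, not something the article covers): persist the cookies with cookielib.MozillaCookieJar so later runs can reuse the login instead of posting the credentials again. The file name below is hypothetical.

import cookielib
import urllib2

cookie_file = 'zhihu_cookies.txt'            # hypothetical file name
jar = cookielib.MozillaCookieJar(cookie_file)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# After a successful login performed through this opener, write the cookies out:
jar.save(ignore_discard=True, ignore_expires=True)

# On a later run, restore them and keep crawling without logging in again:
jar.load(ignore_discard=True, ignore_expires=True)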

 
