Python crawler simulated login to Zhihu

I previously wrote a blog post about crawling Movie Heaven resources with a Python crawler, focusing on how to parse pages and improve crawler efficiency. Because every user has the same permission to fetch resources on Movie Heaven, no login verification is required. After writing that article, I spent some time studying simulated login in Python. There is a lot of material about this on the Internet, and many demos use Zhihu as the login example. The reason is that Zhihu's login is relatively simple: you only need to POST a few parameters and save the cookie, and the data is not encrypted yet, so it is well suited for learning. I am also a newbie, and after some fumbling I managed to log in successfully. I will share my experience in this article and hope it helps beginners like me.

First, let's look at the basic principle of a crawler's simulated login. I was just getting started and did not know much about the deeper details, but the cookie is the key concept. We all know that HTTP is a stateless protocol: after the browser client submits a request and the server sends back a response, the connection between them ends. As a result, when the same client sends another request, the server has no way of knowing that the two requests came from the same client. That obviously will not do, and this is where the cookie comes in. After the client sends a request, the server assigns it a cookie, which is saved on the client side; the next time the client sends a request, it sends the cookie along with it. When the server sees the cookie, it recognizes the client and hands back that client's data. To simulate a login, a crawler has to imitate this browser behavior: first send your login information to the specified URL, and after the server verifies it successfully, it returns a cookie; we can then use this cookie for subsequent crawling.
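To make that round trip concrete, here is a minimal sketch of my own (not from the original article) using the same urllib2/cookielib pair as the code further down. httpbin.org is only a stand-in test server that echoes back the cookies it receives; it has nothing to do with Zhihu.

# A minimal cookie round-trip sketch, assuming httpbin.org is reachable.
import cookielib
import urllib2

# Build an opener whose CookieJar remembers whatever cookies the server sets.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# First request: the server sets a cookie and the jar stores it.
opener.open('http://httpbin.org/cookies/set?session=abc123')

# Second request: the same opener sends the stored cookie back automatically,
# so the server can tell both requests came from the same client.
print opener.open('http://httpbin.org/cookies').read()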

I use Chrome developer tools to capture the packets here; you can also use Fiddler or Firebug, but as a front-end developer I have a soft spot for Chrome. With the tool ready, we open the login page https://www.zhihu.com/#signin and watch the network panel, where we can easily spot the request that sends the login information. I log in with a phone number here; if you log in with an email address instead, the last segment of the login URL is email rather than phone_num.

So we only need to POST the following data to this address:

  • phone_num: the login name (phone number)

  • password: the password

  • captcha_type: the verification code type (this parameter does not actually matter here)

  • remember_me: remember the password

  • _xsrf: a hidden form field used to defend against CSRF (cross-site request forgery). I found that this value stays fixed, so I hard-coded it here; if you are interested, you can write a regular expression to extract it from the page, which is more rigorous (see the sketch below).
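If you want to go the more rigorous route, here is a hedged sketch of pulling _xsrf out of the sign-in page with a regular expression. The exact markup (a hidden input named "_xsrf" with name appearing before value) is my assumption about the page's HTML, so the pattern may need adjusting to whatever you see in the page source.

import re
import urllib2

# Fetch the sign-in page; a browser-like User-Agent helps avoid being rejected outright.
request = urllib2.Request('https://www.zhihu.com/#signin',
                          headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()

# Assumed markup: <input type="hidden" name="_xsrf" value="...">
match = re.search(r'name="_xsrf"\s+value="([^"]+)"', html)
xsrf = match.group(1) if match else ''
print xsrf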

# -*- coding: utf-8 -*-
import urllib2
import urllib
import cookielib

# Login endpoint found with Chrome developer tools (phone-number login)
posturl = 'https://www.zhihu.com/login/phone_num'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36',
    'Referer': 'https://www.zhihu.com/'
}

value = {
    'password': '******',
    'remember_me': True,
    'phone_num': '******',
    '_xsrf': '******'
}
data = urllib.urlencode(value)

# Initialize a CookieJar to handle the cookie
cookieJar = cookielib.CookieJar()
cookie_support = urllib2.HTTPCookieProcessor(cookieJar)

# Instantiate a global opener that carries the cookie on every request
opener = urllib2.build_opener(cookie_support)

request = urllib2.Request(posturl, data, headers)
result = opener.open(request)
print result.read()

When you see this message returned by the server, it means you have logged in successfully:

{"R": 0, "msg": "\ u767b \ u5f55 \ u6210 \ u529f "}
# The words "Login successful" are translated
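As a small optional check (my addition, not the article's code), you can parse that reply instead of just printing it. This assumes the body is read into a variable once (it can only be read a single time) and keeps the r/msg layout shown above.

import json

body = result.read()          # read the response body once instead of printing it
reply = json.loads(body)
if reply.get('r') == 0:
    print u'login successful:', reply['msg']
else:
    print u'login failed:', body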

Then you can crawl pages on Zhihu under this identity.

page = opener.open("https://www.zhihu.com/people/yu-yi-56-70")
content = page.read().decode('utf-8')
print(content)

This works because the opener we instantiated holds the cookie information saved after the successful login, so requests made through the same opener are served the complete page that only this identity can see. More complicated cases, such as Weibo's login, encrypt the request data; I will write that up later and share it with you.
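One extension I find handy (my own addition, not something the article covers): persist the cookies with cookielib.MozillaCookieJar so later runs can reuse the login instead of posting the credentials again. The file name below is hypothetical.

import cookielib
import urllib2

cookie_file = 'zhihu_cookies.txt'            # hypothetical file name
jar = cookielib.MozillaCookieJar(cookie_file)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# After a successful login performed through this opener, write the cookies out:
jar.save(ignore_discard=True, ignore_expires=True)

# On a later run, restore them and keep crawling without logging in again:
jar.load(ignore_discard=True, ignore_expires=True)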

 
