Python crawler: logging in with cookies

Source: Internet
Author: User
Tags: urlencode

Objective: crawl pages that require a login by saving and reusing cookies.

What is a cookie?

Cookies are data (usually encrypted) stored on the user's local terminal by certain websites in order to identify the user and track the session.

For example, some sites require you to log in before certain pages can be accessed; trying to crawl those pages without logging in will fail. We can use the urllib library to save the cookies from our login and then reuse them to crawl the other pages, which is exactly what this article does.

I. Introduction to the urllib library

urllib is Python's built-in HTTP request library; official documentation: https://docs.python.org/3/library/urllib.html

The following modules are included:

>>>urllib.request: request module

>>>urllib.error: exception handling module

>>>urllib.parse: URL parsing module

>>>urllib.robotparser: robots.txt parsing module
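As a quick offline sketch of two of these modules (the URLs here are illustrative only, and the robots.txt rules are fed in directly rather than fetched from a site), urllib.parse and urllib.robotparser can be exercised like this:

```python
import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into its components and build a query string
parts = urllib.parse.urlsplit('http://www.baidu.com/s?wd=python')
print(parts.netloc)   # www.baidu.com
query = urllib.parse.urlencode({'word': 'hello'})
print(query)          # word=hello

# urllib.robotparser: check crawl permission against robots.txt rules
rp = urllib.robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'http://example.com/public/page'))   # True
print(rp.can_fetch('*', 'http://example.com/private/page'))  # False
```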

II. Introduction to urllib.request.urlopen

urlopen commonly takes three parameters:

urllib.request.urlopen(url, data, timeout)

A simple example:

1. Use of the url parameter (the URL requested)

response = urllib.request.urlopen('http://www.baidu.com')

2. Use of the data parameter (makes the request a POST request)

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

response = urllib.request.urlopen('http://www.baidu.com/post', data=data)

3. Use of the timeout parameter (sets a time-out for the request, so the program does not wait for a result forever)

response = urllib.request.urlopen('http://www.baidu.com/get', timeout=4)
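The three parameters above can be combined into a small helper that treats a time-out as a normal outcome instead of a crash. This is a minimal sketch, and returning None on time-out is just one possible convention:

```python
import socket
import urllib.error
import urllib.request

def fetch(url, data=None, timeout=4):
    """Return the response body as bytes, or None if the request times out."""
    try:
        with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # read() timed out after the connection was established
        return None
    except urllib.error.URLError as e:
        # urlopen wraps time-outs raised while connecting in a URLError
        if isinstance(e.reason, socket.timeout):
            return None
        raise
```

For example, fetch('http://www.baidu.com') downloads the page, and passing data as in example 2 makes it a POST.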

III. Constructing a Request

1. Passing data with POST and GET (example: here is a login request; define a dictionary named values whose parameters are email and password, then encode the dictionary with the urllib.parse.urlencode method and name the result data, and pass two parameters when building the request: url and data. Run the program, and the login succeeds.)

GET mode: accessed directly as a link, with all the parameters included in the link.

login_url = "http://fr*****.aflt.kiwisns.com/postlogin/"

values = {'email': '*******@user.com', 'password': '123456'}

data = urllib.parse.urlencode(values)

get_url = login_url + "?" + data

request = urllib.request.Request(get_url)

POST mode: the data parameter mentioned above is used here; we transmit the parameters in the request body.

login_url = 'http://fr*****.aflt.kiwisns.com/postlogin/'

values = {'email': '*******@user.com', 'password': '123456'}

data = urllib.parse.urlencode(values).encode()

request = urllib.request.Request(login_url, data)

2. Setting headers (some sites will not accept a program accessing them directly; if they detect a problem, the site simply does not respond. So, to fully simulate the work of a browser, we need to set some headers properties.)


(Figure: Fiddler capture of the request headers)

You can see the headers of the request, which contain a lot of information: Cache, Client, Transport, and so on. The User-Agent is the identity of the request; if no request identity is given, the server does not necessarily respond, so you can set the User-Agent in headers.

Example (this example just shows how to set headers):

user_agent = r'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'

headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}

request = urllib.request.Request(login_url, data, headers)

IV. Using cookies to log in

1. Get Login URL

Enter the URL that requires a login in the browser: 'http://fr*****.aflt.kiwisns.com/login' (note: this is not the site's real login URL), and use the packet-capture tool Fiddler (other tools work as well) to find the request that appears after logging in.

From that capture, the URL to submit the login to is determined: 'http://fr*****.aflt.kiwisns.com/postlogin/'


(Figure: the request URL for the login)

2. View the post data to be transferred

Find the WebForms information in the request captured after login; it lists the POST data used for the login, including email, password, and auth.


(Figure: WebForms information)

3. View headers Information

Find the headers information of the request seen after logging in, and note the User-Agent setting, the Connection setting, and so on.


(Figure: User-Agent setting, Connection setting)

4. Start coding and use cookies to log in to the website
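The article does not reproduce the code at this step, so here is a minimal sketch of one common way to do it with urllib plus http.cookiejar. The login URL is the one found above; the cookie.txt file name is an assumption, and the capture also showed an auth field, omitted here because its value is site-specific:

```python
import http.cookiejar
import urllib.parse
import urllib.request

def login_and_save_cookies(login_url, cookie_file='cookie.txt'):
    # A MozillaCookieJar can persist cookies to a Netscape-format text file
    cookie_jar = http.cookiejar.MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    # Headers copied from the Fiddler capture, as in section III
    opener.addheaders = [
        ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) '
                       'Gecko/20100101 Firefox/55.0'),
        ('Connection', 'keep-alive'),
    ]

    # POST data listed in the WebForms tab of the capture
    values = {'email': '*******@user.com', 'password': '123456'}
    data = urllib.parse.urlencode(values).encode()

    response = opener.open(login_url, data)
    # ignore_discard keeps session cookies that would otherwise be dropped
    cookie_jar.save(ignore_discard=True, ignore_expires=True)
    return response
```

After calling login_and_save_cookies('http://fr*****.aflt.kiwisns.com/postlogin/'), cookie.txt is left on disk for later runs.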

5. Repeatedly use cookies to log in

(In the code above we saved the cookies to a local file; with the code below we can load the cookies directly from that file to log in, with no need to build the login request again.)
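A minimal sketch of this step, assuming the cookies were saved to cookie.txt in Netscape format by step 4 (the file name is an assumption carried over from that step):

```python
import http.cookiejar
import urllib.request

def open_with_saved_cookies(url, cookie_file='cookie.txt'):
    """Open url with the cookies saved earlier, skipping the login POST."""
    cookie_jar = http.cookiejar.MozillaCookieJar()
    # The same ignore_* flags used when saving, so session cookies survive
    cookie_jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return opener.open(url)
```

Once cookie.txt exists, any page behind the login can be fetched this way directly, without building the login request again.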

That's all.

