Python crawler: logging in with cookies

Source: Internet
Author: User
Tags: urlencode

Objective: crawl pages that require a login by saving and reusing cookies.

What is a cookie?

Cookies are data (usually encrypted) stored on the user's local terminal by certain websites in order to identify the user and track the session.

For example, some sites require you to log in before certain pages can be accessed; trying to crawl those pages without logging in will fail. We can use the urllib library to save the cookies from our login and then reuse them to crawl the other pages, which is exactly what this article does.

I. Introduction to the urllib library

urllib is Python's built-in HTTP request library; official documentation: https://docs.python.org/3/library/urllib.html

The following modules are included:

>>>urllib.request: request module

>>>urllib.error: exception handling module

>>>urllib.parse: URL parsing module

>>>urllib.robotparser: robots.txt parsing module
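As a quick offline sketch of two of these modules (the URLs here are illustrative only, and the robots.txt rules are fed in directly rather than fetched from a site), urllib.parse and urllib.robotparser can be exercised like this:

```python
import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into its components and build a query string
parts = urllib.parse.urlsplit('http://www.baidu.com/s?wd=python')
print(parts.netloc)   # www.baidu.com
query = urllib.parse.urlencode({'word': 'hello'})
print(query)          # word=hello

# urllib.robotparser: check crawl permission against robots.txt rules
rp = urllib.robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'http://example.com/public/page'))   # True
print(rp.can_fetch('*', 'http://example.com/private/page'))  # False
```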

II. Introduction to urllib.request.urlopen

urlopen commonly takes three parameters:

urllib.request.urlopen(url, data, timeout)

A simple example:

1. Use of the url parameter (the URL requested)

response = urllib.request.urlopen('http://www.baidu.com')

2. Use of the data parameter (makes the request a POST request)

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

response = urllib.request.urlopen('http://www.baidu.com/post', data=data)

3. Use of the timeout parameter (sets a time-out for the request, so the program does not wait for a result forever)

response = urllib.request.urlopen('http://www.baidu.com/get', timeout=4)
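The three parameters above can be combined into a small helper that treats a time-out as a normal outcome instead of a crash. This is a minimal sketch, and returning None on time-out is just one possible convention:

```python
import socket
import urllib.error
import urllib.request

def fetch(url, data=None, timeout=4):
    """Return the response body as bytes, or None if the request times out."""
    try:
        with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # read() timed out after the connection was established
        return None
    except urllib.error.URLError as e:
        # urlopen wraps time-outs raised while connecting in a URLError
        if isinstance(e.reason, socket.timeout):
            return None
        raise
```

For example, fetch('http://www.baidu.com') downloads the page, and passing data as in example 2 makes it a POST.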

III. Constructing a Request

1. Passing data with POST and GET (example: here is a login request; define a dictionary named values whose parameters are email and password, then encode the dictionary with the urllib.parse.urlencode method and name the result data, and pass two parameters when building the request: url and data. Run the program, and the login succeeds.)

GET mode: accessed directly as a link, with all the parameters included in the link.

login_url = "http://fr*****.aflt.kiwisns.com/postlogin/"

values = {'email': '*******@user.com', 'password': '123456'}

data = urllib.parse.urlencode(values)

get_url = login_url + "?" + data

request = urllib.request.Request(get_url)

POST mode: the data parameter mentioned above is used here; we transmit the parameters in the request body.

login_url = 'http://fr*****.aflt.kiwisns.com/postlogin/'

values = {'email': '*******@user.com', 'password': '123456'}

data = urllib.parse.urlencode(values).encode()

request = urllib.request.Request(login_url, data)

2. Setting headers (some sites will not accept a program accessing them directly; if they detect a problem, the site simply does not respond. So, to fully simulate the work of a browser, we need to set some headers properties.)


(Figure: Fiddler capture of the request headers)

You can see the headers of the request, which contain a lot of information: Cache, Client, Transport, and so on. The User-Agent is the identity of the request; if no request identity is given, the server does not necessarily respond, so you can set the User-Agent in headers.

Example (this example just shows how to set headers):

user_agent = r'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'

headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}

request = urllib.request.Request(login_url, data, headers)

IV. Using cookies to log in

1. Get Login URL

Enter the URL that requires a login in the browser: 'http://fr*****.aflt.kiwisns.com/login' (note: this is not the site's real login URL), and use the packet-capture tool Fiddler (other tools work as well) to find the request that appears after logging in.

From that capture, the URL to submit the login to is determined: 'http://fr*****.aflt.kiwisns.com/postlogin/'


(Figure: the request URL for the login)

2. View the post data to be transferred

Find the WebForms information in the request captured after login; it lists the POST data used for the login, including email, password, and auth.


(Figure: WebForms information)

3. View headers Information

Find the headers information of the request seen after logging in, and note the User-Agent setting, the Connection setting, and so on.


(Figure: User-Agent setting, Connection setting)

4. Start coding and use cookies to log in to the website
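The article does not reproduce the code at this step, so here is a minimal sketch of one common way to do it with urllib plus http.cookiejar. The login URL is the one found above; the cookie.txt file name is an assumption, and the capture also showed an auth field, omitted here because its value is site-specific:

```python
import http.cookiejar
import urllib.parse
import urllib.request

def login_and_save_cookies(login_url, cookie_file='cookie.txt'):
    # A MozillaCookieJar can persist cookies to a Netscape-format text file
    cookie_jar = http.cookiejar.MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    # Headers copied from the Fiddler capture, as in section III
    opener.addheaders = [
        ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) '
                       'Gecko/20100101 Firefox/55.0'),
        ('Connection', 'keep-alive'),
    ]

    # POST data listed in the WebForms tab of the capture
    values = {'email': '*******@user.com', 'password': '123456'}
    data = urllib.parse.urlencode(values).encode()

    response = opener.open(login_url, data)
    # ignore_discard keeps session cookies that would otherwise be dropped
    cookie_jar.save(ignore_discard=True, ignore_expires=True)
    return response
```

After calling login_and_save_cookies('http://fr*****.aflt.kiwisns.com/postlogin/'), cookie.txt is left on disk for later runs.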

5. Repeatedly use cookies to log in

(In the code above we saved the cookies to a local file; with the code below we can load the cookies directly from that file to log in, with no need to build the login request again.)
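A minimal sketch of this step, assuming the cookies were saved to cookie.txt in Netscape format by step 4 (the file name is an assumption carried over from that step):

```python
import http.cookiejar
import urllib.request

def open_with_saved_cookies(url, cookie_file='cookie.txt'):
    """Open url with the cookies saved earlier, skipping the login POST."""
    cookie_jar = http.cookiejar.MozillaCookieJar()
    # The same ignore_* flags used when saving, so session cookies survive
    cookie_jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return opener.open(url)
```

Once cookie.txt exists, any page behind the login can be fetched this way directly, without building the login request again.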

That's all.

