Objective:
What is a cookie?
Cookies are data (often encrypted) stored on the user's local machine by certain websites in order to identify the user and track the session.
For example, some sites require you to log in before you can access certain pages; before logging in, crawling those pages is not allowed. We can use the urllib library to save the cookies from our login and then use them to crawl the other pages, which is exactly what this article does.
I. Introduction to the urllib library
urllib is Python's built-in HTTP request library; official documentation: https://docs.python.org/3/library/urllib.html
The following modules are included:
>>> urllib.request: request module
>>> urllib.error: exception handling module
>>> urllib.parse: URL parsing module
>>> urllib.robotparser: robots.txt parsing module
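The snippets below assume the following imports (http.cookiejar is not part of urllib, but it is used together with urllib.request for the cookie handling later on):
import urllib.request    # issuing HTTP requests
import urllib.parse      # encoding form data
import urllib.error      # exceptions raised during requests
import http.cookiejar    # cookie storage, used in section IV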
II. Introduction to urllib.request.urlopen
urlopen has three commonly used parameters:
urllib.request.urlopen(url, data, timeout)
Some simple examples:
1. The url parameter (the URL to request)
response = urllib.request.urlopen('http://www.baidu.com')
2. The data parameter (turns the request into a POST request)
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://www.baidu.com/post', data=data)
3. The timeout parameter (sets a time limit so the program does not wait forever for a result)
response = urllib.request.urlopen('http://www.baidu.com/get', timeout=4)
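Putting the three parameters together, a minimal sketch (the URL is only an example) that issues a request with a timeout and handles a timed-out request:
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=4)
    print(response.status)                   # HTTP status code, e.g. 200
    html = response.read().decode('utf-8')   # response body as text
except urllib.error.URLError as e:
    # a timeout surfaces as a URLError wrapping socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('request timed out')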
III. Constructing a Request
1. Sending data with POST and GET (example: here we simulate a login request. Define a dictionary named values with the parameters email and password, encode the dictionary with urllib.parse.urlencode and call the result data, then pass two arguments when building the request: url and data. Run the program and the login is performed.)
GET: the URL is accessed directly as a link, with all parameters included in the link.
login_url = 'http://fr*****.aflt.kiwisns.com/postlogin/'
values = {'email': '*******@user.com', 'password': '123456'}
data = urllib.parse.urlencode(values)
get_url = login_url + '?' + data
request = urllib.request.Request(get_url)
POST: here we use the data parameter described above and transmit the form data in the request body.
login_url = 'http://fr*****.aflt.kiwisns.com/postlogin/'
values = {'email': '*******@user.com', 'password': '123456'}
data = urllib.parse.urlencode(values).encode()
request = urllib.request.Request(login_url, data)
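Either way, the constructed request is then sent with urlopen; a minimal sketch:
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))  # body of the page returned after login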
2. Setting headers (some sites refuse to be accessed directly by a program; if they detect a problem they simply will not respond, so to fully simulate the way a browser works we need to set some header fields)
(Figure: Fiddler capture of the request headers)
You can see that the request headers carry a lot of information: Cache, Client, Transport, and so on. Among them, the agent (User-Agent) is the identity of the request; if no identity is sent, the server may not respond, so you can set the User-Agent in headers.
Example (this example only shows how to set headers):
user_agent = r'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'
headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
request = urllib.request.Request(login_url, data, headers)
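Headers can also be added one at a time after the Request object is built, which is convenient when they are assembled conditionally; a small sketch using the same placeholder variables:
request = urllib.request.Request(login_url, data)
request.add_header('User-Agent', user_agent)
request.add_header('Connection', 'keep-alive')
response = urllib.request.urlopen(request)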
IV. Using cookies to log in
1. Get Login URL
In the browser, open the URL that requires login: 'http://fr*****.aflt.kiwisns.com/login' (note: this is not the site's real login URL), then use the packet-capture tool Fiddler (other tools work too) to find the request issued after logging in.
From this we determine the login URL: 'http://fr*****.aflt.kiwisns.com/postlogin/'
(Figure: the login request URL)
2. View the POST data to be sent
In the captured login request, find the WebForms section, which lists the POST data used for login, including email, password, and auth.
(Figure: WebForms information)
3. View the headers information
In the headers of the captured login request, find the User-Agent setting, the Connection setting, and so on.
(Figure: User-Agent and Connection settings)
4. Start coding and use cookies to log in to the website
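A minimal sketch of this step, assuming the placeholder login_url, values, and headers from above and a hypothetical cookie file name (the auth field seen in WebForms would be added to values if the site requires it): log in through an opener that carries a cookie jar, then save the cookies to a local file.
import http.cookiejar
import urllib.parse
import urllib.request

# hypothetical local file to persist the cookies in
filename = 'cookie.txt'
# a cookie jar that can save to / load from a Mozilla-format file
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

login_url = 'http://fr*****.aflt.kiwisns.com/postlogin/'
values = {'email': '*******@user.com', 'password': '123456'}  # plus the auth field if required
data = urllib.parse.urlencode(values).encode()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0',
           'Connection': 'keep-alive'}
request = urllib.request.Request(login_url, data, headers)

response = opener.open(request)  # the server's Set-Cookie headers land in the jar
cookie.save(ignore_discard=True, ignore_expires=True)  # keep session cookies as well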
5. Reuse the saved cookies to log in
(In the code above we saved the cookies to a local file; with the code below we can load the cookies straight from that file and access the site as a logged-in user, with no need to build the login request again.)
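A matching sketch, again assuming the hypothetical cookie.txt and a placeholder URL for a page that normally requires login:
import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# any page that requires login; placeholder URL
page_url = 'http://fr*****.aflt.kiwisns.com/profile/'
response = opener.open(page_url)
print(response.read().decode('utf-8'))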
That concludes this walkthrough of Python crawling: logging in with cookies.