Python Cookie crawler processing

Source: Internet
Author: User
Tags: http cookie, http post, urlencode

Cookies

A cookie is a small piece of text that a website stores in the user's browser in order to identify the user and track the session; it can also keep login information for the user's next session with the server.

Cookie principle

HTTP is a stateless, connection-oriented protocol. To maintain state across requests, the cookie mechanism was introduced. A cookie is an attribute in the HTTP message header and includes the following fields:

the cookie name (Name), the cookie value (Value), the expiration time (Expires/Max-Age), the path the cookie applies to (Path), the domain the cookie belongs to (Domain), and whether the cookie requires a secure connection (Secure). The first two are required for a cookie to work at all. In addition there is the cookie size (Size); browsers differ in how many cookies they allow and how large each one may be.

A cookie consists of a variable name and a value. According to the Netscape specification, the cookie format is as follows:

Set-Cookie: NAME=VALUE;Expires=DATE;Path=PATH;Domain=DOMAIN_NAME;SECURE
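As a quick sanity check of this format, the standard library can parse such a header. The sketch below uses Python 3's http.cookies (the successor of Python 2's Cookie module); all attribute values are made-up examples:

```python
from http.cookies import SimpleCookie

# Parse a Set-Cookie style string into a name/value pair plus its attributes
# (the values here are invented for the demo)
cookie = SimpleCookie()
cookie.load("NAME=VALUE; Expires=Wed, 21 Oct 2026 07:28:00 GMT; "
            "Path=/; Domain=.example.com; Secure")

morsel = cookie["NAME"]
print(morsel.value)        # VALUE
print(morsel["path"])      # /
print(morsel["domain"])    # .example.com
print(morsel["secure"])    # True (the Secure flag has no value of its own)
```

Only the name/value pair is mandatory; the remaining attributes tell the browser when and where to send the cookie back.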

Cookie Application

The most typical use of cookies in crawlers is to determine whether a registered user is already logged in to a site. A site may also ask whether to remember the user's information, so that the next visit can skip the login step.

```python
# Simulate a login by sending a cookie that already carries login information
import urllib2

# 1. Build the headers of a logged-in user
headers = {
    "Host": "www.renren.com",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-cn,zh;q=0.8,en;q=0.6",
    # For easier reading in the terminal, compressed responses are not accepted:
    # "Accept-Encoding": "gzip, deflate, sdch",
    # Key point: this cookie belongs to a user who chose to save the password, so repeated
    # logins are unnecessary; it records the user name and password (usually RSA-encrypted)
    "Cookie": "ANONYMID=IXRNA3FYSUFNWV; DEPOVINCE=GW; _r01_=1; JSESSIONID=ABCMADHEDQILM7RIY5IMV; jebe_key=f6fb270b-d06d-42e6-8b53-e67c3156aa7e%7cc13c37f53bca9e1e7132d4b58ce00fa3%7c1484060607478%7c1%7c1484060607173; jebecookies=26fb58d1-cbe7-4fc3-a4ad-592233d1b42e|||||; ick_login=1f2b895d-34c7-4a1d-afb7-d84666fad409; _de=bf09ee3a28ded52e6b65f6a4705d973f1383380866d39ff5; p=99e54330ba9f910b02e6b08058f780479; ap=327550029; First_login_flag=1; [email protected]; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20140529/1055/h_main_9a3z_e0c300019f6a195a.jpg; t=214ca9a28f70ca6aa0801404dda4f6789; societyguester=214ca9a28f70ca6aa0801404dda4f6789; id=327550029; XNSID=745033C5; ver=7.0; Loginfrom=syshome",
}

# 2. Build the Request object with the header information (mainly the cookie)
request = urllib2.Request("http://www.renren.com/", headers=headers)

# 3. Access the Renren home page directly; the server decides from the headers
#    (mainly the cookie) that this is a logged-in user and returns the matching page
response = urllib2.urlopen(request)

# 4. Print the response content
print response.read()
```

  

But this is too cumbersome: we have to log in to the account in a browser, choose to save the password, and then capture the cookie with a packet sniffer. Is there a simpler, more convenient way?

The cookielib Library and the HTTPCookieProcessor Handler

Handling cookies in Python is generally done with the cookielib module together with the HTTPCookieProcessor handler class of the urllib2 module.

cookielib module: its main role is to provide objects for storing cookies.

HTTPCookieProcessor handler: its main role is to process these cookie objects and build a handler object.
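The wiring between the two pieces is unchanged in modern Python 3, where cookielib became http.cookiejar and urllib2 became urllib.request. The sketch below only shows the construction; no request is actually sent:

```python
import http.cookiejar
import urllib.request

# A CookieJar stores cookies; HTTPCookieProcessor wraps it as an opener handler
jar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(jar)
opener = urllib.request.build_opener(handler)

# The processor keeps a reference to the jar it manages
print(handler.cookiejar is jar)
```

Every request sent through this opener will store received cookies in the jar and attach them to later requests automatically.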

The cookielib Library

The main objects of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

  • CookieJar: an object that manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire set of cookies is stored in memory, and the cookies are lost after the CookieJar instance is garbage-collected.

  • FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; used to create a FileCookieJar instance, retrieve cookie information, and store cookies in a file. filename is the name of the file the cookies are stored in. When delayload is true, deferred access is supported: the file is read, or data is written to it, only when needed.

  • MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.

  • LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl standard Set-Cookie3 file format.

In fact, in most cases we only use CookieJar(); if we need to interact with local files, we use MozillaCookieJar() or LWPCookieJar().
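The save/load round trip of MozillaCookieJar can be tried without any network access by putting a hand-built cookie into the jar first. This sketch uses the Python 3 naming (http.cookiejar instead of cookielib), and all cookie values are invented:

```python
import http.cookiejar  # called cookielib in Python 2
import os
import tempfile
import time

# Build a cookie by hand (all values here are invented for the demo)
cookie = http.cookiejar.Cookie(
    version=0, name="BAIDUID", value="4327A58E63A92B73",
    port=None, port_specified=False,
    domain=".baidu.com", domain_specified=True, domain_initial_dot=True,
    path="/", path_specified=True, secure=False,
    expires=int(time.time()) + 3600,  # not a session cookie, so save() keeps it
    discard=False, comment=None, comment_url=None, rest={},
)

filename = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save the jar in Mozilla cookies.txt format ...
jar = http.cookiejar.MozillaCookieJar()
jar.set_cookie(cookie)
jar.save(filename)

# ... and load it back into a fresh jar
jar2 = http.cookiejar.MozillaCookieJar()
jar2.load(filename)
print(sorted(c.name for c in jar2))   # ['BAIDUID']
```

Session cookies (discard=True, no expiry) are skipped by save() unless you pass ignore_discard=True, which is worth knowing when a "saved" cookie file comes out empty.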

Let's do a few cases:

1. Get the cookies and save them in a CookieJar() object

```python
# urllib2_cookielibtest1.py
import urllib2
import cookielib

# Build a CookieJar instance to hold the cookies
cookiejar = cookielib.CookieJar()

# Use HTTPCookieProcessor() to create a cookie-processor handler,
# with the CookieJar object as its argument
handler = urllib2.HTTPCookieProcessor(cookiejar)

# Build an opener via build_opener()
opener = urllib2.build_opener(handler)

# Request the page with a GET; afterwards the cookies are saved in the CookieJar automatically
opener.open("http://www.baidu.com")

# Print the saved cookies in the standard "name=value" format
cookiestr = ""
for item in cookiejar:
    cookiestr = cookiestr + item.name + "=" + item.value + ";"

# Strip the trailing semicolon
print cookiestr[:-1]
```

  

With the method above we save the cookies into a CookieJar object and then print their values, that is, the cookies obtained by visiting the Baidu home page.

The output looks like the following:

BAIDUID=4327A58E63A92B73FF7A297FB3B2B4D0:FG=1;BIDUPSID=4327A58E63A92B73FF7A297FB3B2B4D0;H_PS_PSSID=1429_21115_17001_21454_21409_21554_21398;PSTM=1480815736;BDSVRTM=0;BD_HOME=0
2. Visit a website to obtain cookies and save them to a cookie file
```python
# urllib2_cookielibtest2.py
import cookielib
import urllib2

# File name of the local cookie file on disk
filename = 'cookie.txt'

# Declare a MozillaCookieJar instance (which implements save()) to hold the cookies,
# so they can be written to the file later
cookiejar = cookielib.MozillaCookieJar(filename)

# Use HTTPCookieProcessor() to create a cookie-processor handler,
# with the CookieJar object as its argument
handler = urllib2.HTTPCookieProcessor(cookiejar)

# Build an opener via build_opener()
opener = urllib2.build_opener(handler)

# Make a request; same principle as urllib2.urlopen()
response = opener.open("http://www.baidu.com")

# Save the cookies to the local file
cookiejar.save()
```
3. Read the cookies from the file and use them as part of the request
```python
# urllib2_cookielibtest3.py
import cookielib
import urllib2

# Create a MozillaCookieJar instance (which implements load())
cookiejar = cookielib.MozillaCookieJar()

# Read the cookie contents from the file into the jar
cookiejar.load('cookie.txt')

# Use HTTPCookieProcessor() to create a cookie-processor handler,
# with the CookieJar object as its argument
handler = urllib2.HTTPCookieProcessor(cookiejar)

# Build an opener via build_opener()
opener = urllib2.build_opener(handler)

response = opener.open("http://www.baidu.com")
```

  

Logging in to Renren with cookielib and POST

```python
import urllib
import urllib2
import cookielib

# 1. Build a CookieJar instance to hold the cookies
cookiejar = cookielib.CookieJar()

# 2. Use HTTPCookieProcessor() to create a cookie-processor handler,
#    with the CookieJar object as its argument
cookie_handler = urllib2.HTTPCookieProcessor(cookiejar)

# 3. Build an opener via build_opener()
opener = urllib2.build_opener(cookie_handler)

# 4. addheaders takes a list in which each element is a (header-name, value) tuple;
#    the opener attaches these headers to every request
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")]

# 5. The account and password needed to log in
data = {"email": "[email protected]", "password": "alaxxxxime"}

# 6. Transcode the form data with urlencode()
postdata = urllib.urlencode(data)

# 7. Build the Request object containing the user name and password to send
request = urllib2.Request("http://www.renren.com/PLogin.do", data=postdata)

# 8. Send this request via the opener and obtain the post-login cookie value
opener.open(request)

# 9. The opener now carries the logged-in user's cookie, so pages that
#    require login can be accessed directly
response = opener.open("http://www.renren.com/410043129/profile")

# 10. Print the response content
print response.read()
```

  

There are several points to note when simulating logins:

  1. A login usually starts with an HTTP GET, used to pull some information and obtain a cookie, followed by an HTTP POST that performs the actual login.
  2. The URL of the HTTP POST login request may be dynamic, obtained from the information the GET returns.
  3. Some passwords are sent in plaintext and some are sent encrypted. Some websites even use dynamic encryption, mixing in a lot of other encrypted data, so the encryption algorithm can only be recovered by reading the JS source code, and cracking it is very difficult.
  4. Most websites follow a broadly similar login flow, but the details may differ, so there is no guarantee that the same code will succeed on other sites.
In this test case, to let everyone grasp the point quickly, the Renren login interface we use is a hidden interface of the old version of the site (shh...), which makes logging in easier. Of course, we could also send the account and password directly to the regular login interface to simulate a login, but when a page uses dynamic JavaScript techniques it is very easy for the site to block this kind of HttpClient-style simulated login; some sites can even tell from the characteristics of your mouse activity whether a real person is operating. So a general-purpose simulated login has to rely on other techniques, such as a crawler with a built-in browser engine (keywords: Selenium, PhantomJS), which we will learn later.
