Python Crawler Practice: Simulated Login

Source: Internet
Author: User
Tags: subdomain name

Some websites allow their content to be crawled only after logon. The method used here is to simulate logon using cookies, as a browser does.
Web browser access to the server
When a user accesses a webpage, whether by entering a URL (a domain name or IP address) or by clicking a link, the browser sends an HTTP request to the web server. After the web server receives the request from the client browser, it sends back the corresponding response (HTTP response). The browser's parsing engine analyzes the returned content and presents it to the user. This is how a web application interacts with the server: HTTP requests and responses are both sent as structured messages.



HTTP message
When a browser sends a request to the server, it sends an HTTP request message. When the server returns data, it sends an HTTP response message. Both types of message consist of a start line, message headers, an empty line that marks the end of the headers, and an optional message body. In an HTTP request message, the start line includes the request method, the requested resource, and the HTTP version number. The message body contains data; a GET request has no message body, so nothing follows the empty line after the headers. In an HTTP response message, the start line includes the HTTP protocol version, the status code, and the status text. The message headers contain various attributes, and the message body contains the data returned by the server.
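The message layout described above can be sketched in a few lines of Python 3. The two raw messages below are made up for illustration (they are not the Fiddler capture), and parse_http_message is a hypothetical helper name, not part of any library.

```python
def parse_http_message(raw):
    """Split a raw HTTP message into start line, headers, and body."""
    # The first empty line (CRLF CRLF) marks the end of the headers.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: a repeated header name overwrites the earlier one.
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Illustrative messages, invented for this sketch.
request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Cookie: sessionid=abc123\r\n"
    "\r\n"
)
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Set-Cookie: sessionid=abc123; Path=/\r\n"
    "\r\n"
    "<html>hello</html>"
)

req_start, req_headers, req_body = parse_http_message(request)
method, resource, version = req_start.split(" ")

resp_start, resp_headers, resp_body = parse_http_message(response)
resp_version, status_code, reason = resp_start.split(" ", 2)

print(method, resource, version, repr(req_body))
print(resp_version, status_code, reason, resp_headers["Set-Cookie"])
```

The GET request ends right after the empty line, so its body is an empty string, while the response carries its data after the empty line.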


For example, in HTTP requests and HTTP responses captured with Fiddler, a GET request has no content, so nothing follows the empty line after the message headers.


The server then sends its response message, which the browser receives as an HTTP message.


As the captured messages show, the cookie is an important header field in both the HTTP request and the HTTP response.
What is a cookie? When a user accesses a domain name through a browser for the first time, the web server sends data to the client so that state can be maintained between the server and the client. This data is a cookie. It is created by the website and stored on the user's local terminal, often in encoded or encrypted form, to identify the user. Cookies are kept in the browser's memory or as small text files on the hard disk. When the user visits the website again, the browser reads the cookie information for that website and sends it along, which improves the browsing experience. Generally, once a cookie is saved on a computer, only the website that created it can read it.


Why cookies?
HTTP is a stateless protocol built on top of the connection-oriented TCP/IP layer. After the client establishes a connection with the server, the TCP connection may be kept alive for a while; how long it is retained is set on the server side. When the client accesses the server again within that time, the existing connection can be reused. However, because HTTP is stateless, the web server does not know whether two requests come from the same client; each request is independent. To solve this problem, web applications introduced the cookie mechanism to maintain state. Cookies can record the user's logon status: generally, after the user logs on successfully, the web server issues a signed token that marks the session as valid, so the user does not have to authenticate again on every request. Cookies can also record the user's access status.
Cookie types
Session cookie: valid only during the session and held in the browser's memory. It is created when the user accesses the website and deleted by the browser when the browser is closed.
Persistent cookie: remains valid across user sessions for a longer time. If the cookie's Max-Age attribute is set to one month, the cookie is included in every HTTP request to the relevant URL during that month. It can therefore record user preferences or custom information, such as the time of the first logon and a remembered logon status.
Secure cookie: sent only over HTTPS, which ensures that the cookie is always encrypted when it is transmitted from the client to the server.
HttpOnly cookie: transmitted only in HTTP (or HTTPS) requests and not accessible to client-side scripts, which helps protect against cross-site scripting attacks.
First-party cookie: a cookie generated under the domain name or subdomain name currently being accessed.
Third-party cookie: a cookie created by a third-party domain name.
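As a rough illustration of these categories, the Python 3 sketch below parses the value of a Set-Cookie header with the standard http.cookies module and labels the cookie. The helper name describe_cookie and the sample cookie strings are assumptions made for this example; first-party versus third-party depends on the page being visited, so it is not covered here.

```python
from http.cookies import SimpleCookie


def describe_cookie(set_cookie_value):
    """Return the cookie name and the type labels that apply to it."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    name, morsel = next(iter(parsed.items()))
    labels = []
    # A cookie with Max-Age or Expires outlives the browser session.
    if morsel["max-age"] or morsel["expires"]:
        labels.append("persistent")
    else:
        labels.append("session")
    if morsel["secure"]:
        labels.append("secure")
    if morsel["httponly"]:
        labels.append("httponly")
    return name, labels


# Sample header values, invented for this sketch.
print(describe_cookie("sid=abc123; Path=/"))
print(describe_cookie("remember=1; Max-Age=2592000; Path=/"))  # 30 days in seconds
print(describe_cookie("token=xyz; Secure; HttpOnly"))
```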
Cookie composition
A cookie is carried in an HTTP message header and consists of: the cookie name (Name), the cookie value (Value), the expiration time (Expires/Max-Age), the path (Path), the domain (Domain), and whether a secure connection is required (Secure). Only the first two, the name and the value, are required. Browsers also limit the size and number of cookies, and these limits differ between browsers.
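To make the parts concrete, here is a minimal Python 3 sketch that assembles a cookie from these components with http.cookies.SimpleCookie and renders it as a Set-Cookie header line. The cookie name sessionid, its value, and the domain example.com are placeholders chosen for the example.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["sessionid"] = "abc123"                 # Name and Value (the required parts)
cookie["sessionid"]["max-age"] = 3600          # lifetime in seconds
cookie["sessionid"]["path"] = "/"              # URL path the cookie applies to
cookie["sessionid"]["domain"] = "example.com"  # placeholder domain
cookie["sessionid"]["secure"] = True           # only send over HTTPS

header_line = cookie.output()   # text of a "Set-Cookie: ..." header line
morsel = cookie["sessionid"]

print(header_line)
print(morsel.key, morsel.value, morsel["path"], morsel["domain"])
```

The optional attributes are appended after the name=value pair, separated by semicolons.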
Python simulated logon
The steps are: set up a cookie-handling object, which is responsible for adding cookies to HTTP requests and extracting cookies from HTTP responses; build a request to the website's login page, including the login URL, the POST data, and the HTTP headers; and use urllib2.urlopen to send the request and receive the web server's response. Before writing the code, check the source code of the login page to find the form's URL and field names.


When urllib2 opens a URL, it actually works through a urllib2.OpenerDirector instance, which calls handlers for the various tasks involved, such as handling protocols, opening URLs, and processing cookies. The urlopen function uses the default opener. The basic urlopen() function does not support authentication, cookies, or other advanced HTTP features. To support these features, use the build_opener() function to create your own custom opener object.
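The difference a cookie-aware opener makes can be seen without touching a real website. The sketch below is written for Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar. It starts a toy server on 127.0.0.1 (DemoHandler, the visited cookie, and fetch_twice are all invented for this example) and requests it twice with a plain opener and twice with an opener built around HTTPCookieProcessor.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: greets clients that send back the cookie it set."""

    def do_GET(self):
        returning = "visited=yes" in self.headers.get("Cookie", "")
        body = b"welcome back" if returning else b"new visitor"
        self.send_response(200)
        if not returning:
            self.send_header("Set-Cookie", "visited=yes; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_twice(opener, url):
    """Request the same URL twice and return both response bodies."""
    results = []
    for _ in range(2):
        with opener.open(url, timeout=5) as resp:
            results.append(resp.read().decode())
    return results


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

no_proxy = urllib.request.ProxyHandler({})  # ignore any system proxy settings
plain = urllib.request.build_opener(no_proxy)
jar = CookieJar()
with_cookies = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(jar))

plain_results = fetch_twice(plain, url)
cookie_results = fetch_twice(with_cookies, url)
server.shutdown()
server.server_close()

print(plain_results)
print(cookie_results)
```

With the plain opener the server treats both requests as coming from a new visitor; with the cookie-aware opener the second request carries the cookie from the first response, so the server recognizes the client.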
The cookielib module defines classes for automatically handling HTTP cookies, for accessing websites that require cookie data. It includes CookieJar, FileCookieJar, CookiePolicy, DefaultCookiePolicy, and Cookie, plus the FileCookieJar subclasses MozillaCookieJar and LWPCookieJar. A CookieJar object manages HTTP cookies. MozillaCookieJar creates a FileCookieJar instance compatible with the Mozilla cookies.txt file format, and LWPCookieJar creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format. The cookie files written by LWPCookieJar are easy to read. FileCookieJar itself does not implement save(); MozillaCookieJar and LWPCookieJar do. Therefore, use MozillaCookieJar or LWPCookieJar to save cookies to a file.
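Saving and reloading cookies with LWPCookieJar might look like the following Python 3 sketch (http.cookiejar is the Python 3 name of cookielib). The cookie is built by hand only so that the example runs offline; in a real crawl the server's responses fill the jar. The helper names make_cookie and save_and_reload, and the cookie values, are assumptions for this example.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain="example.com"):
    """Build a session cookie by hand (normally the server sets these)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_and_reload(cookies):
    """Save cookies to an LWP-format file and load them into a fresh jar."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        jar = LWPCookieJar(path)
        for c in cookies:
            jar.set_cookie(c)
        # Session cookies are skipped by default; ignore_discard keeps them.
        jar.save(ignore_discard=True)
        reloaded = LWPCookieJar(path)
        reloaded.load(ignore_discard=True)
        return {c.name: c.value for c in reloaded}
    finally:
        os.remove(path)


restored = save_and_reload([make_cookie("sessionid", "abc123")])
print(restored)
```

Note that saving is an explicit call: the jar does not write the file on its own.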
#!/usr/bin/env python
# coding: utf-8
import sys
import urllib2
import urllib
import cookielib

# Work around encoding errors with Chinese text (Python 2 only)
reload(sys)
sys.setdefaultencoding("utf8")

#####################################################
# Log on to Renren
loginurl = 'http://www.renren.com/PLogin.do'
logindomain = 'renren.com'


class Login(object):

    def __init__(self):
        self.name = ''
        self.pwd = ''
        self.domain = ''
        # The cookie jar stores the cookies; the opener adds them to requests
        self.cj = cookielib.LWPCookieJar()
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        urllib2.install_opener(self.opener)

    def setLoginInfo(self, username, password, domain):
        '''Set the user logon information.'''
        self.name = username
        self.pwd = password
        self.domain = domain

    def login(self):
        '''Log on to the website.'''
        loginparams = {'domain': self.domain, 'email': self.name, 'password': self.pwd}
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        req = urllib2.Request(loginurl, urllib.urlencode(loginparams), headers=headers)
        # Send the login request once; the cookies from the response land in self.cj
        self.operate = self.opener.open(req)
        thePage = self.operate.read()
        return thePage


if __name__ == '__main__':
    userlogin = Login()
    username = 'username'
    password = 'password'
    domain = logindomain
    userlogin.setLoginInfo(username, password, domain)
    userlogin.login()
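The script above targets Python 2. A rough Python 3 equivalent is sketched below, using urllib.request in place of urllib2, urllib.parse.urlencode in place of urllib.urlencode, and http.cookiejar in place of cookielib. The login URL and form field names are taken from the script above; whether that endpoint still accepts them has not been verified. The function names, the shortened User-Agent string, and the sample credentials are assumptions for this sketch, and only build_login_request is executed here, so nothing is sent over the network.

```python
import urllib.parse
import urllib.request
from http.cookiejar import LWPCookieJar

LOGIN_URL = "http://www.renren.com/PLogin.do"  # from the script above
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1)"     # shortened for the sketch


def build_login_request(username, password, domain="renren.com"):
    """Build the POST request for the login form without sending it."""
    form = {"domain": domain, "email": username, "password": password}
    # In Python 3 the POST body must be bytes, not str.
    data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(
        LOGIN_URL, data=data, headers={"User-Agent": USER_AGENT})


def make_opener(jar):
    """Create an opener that stores and resends cookies via the given jar."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def login(username, password):
    """Send the login request once; the jar keeps any cookies the server sets."""
    jar = LWPCookieJar()
    opener = make_opener(jar)
    with opener.open(build_login_request(username, password), timeout=10) as resp:
        page = resp.read()
    return opener, jar, page


# Placeholder credentials; this only builds the request object.
req = build_login_request("user@example.com", "secret")
print(req.get_method(), req.full_url)
print(req.data)
```

Calling login(username, password) would perform the actual request; the returned opener can then be reused for pages that require the logon cookies.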


