"Reprint" Python Crawler Practice: Simulated Login

Source: Internet
Author: User

Some sites restrict access, so their content can only be crawled after logging in. The main way to simulate a login today is to use browser cookies to impersonate a logged-in session. When a browser accesses a web page, whether by entering a domain name or IP address as a URL or by clicking a link, it sends an HTTP request to the web server. The web server receives the request, processes it, and sends back the corresponding HTTP response. The browser's parsing engine and layout engine then parse the returned content and render it for the user. Throughout this interaction with the server, HTTP requests and responses are sent as structured messages.
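To make that message structure concrete, here is a minimal sketch that splits a hand-written request and a hand-written response into start line, headers, and body. The helper name `parse_http_message`, the host `www.example.com`, and the cookie value are assumptions made for illustration; real traffic would come from a capture tool or from the HTTP library itself.

```python
def parse_http_message(raw):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: a repeated header name overwrites the earlier value.
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Hypothetical GET request: nothing follows the blank line.
request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Cookie: session=abc123\r\n"
    "\r\n"
)

# Hypothetical response: the body follows the blank line.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Set-Cookie: session=abc123; Path=/\r\n"
    "\r\n"
    "<html>hello</html>"
)

start_line, headers, body = parse_http_message(request)
method, resource, version = start_line.split(" ")
print(method, resource, version)      # request method, resource, protocol version
print(headers["Cookie"], repr(body))  # the GET request has an empty body

start_line, headers, body = parse_http_message(response)
version, status_code, reason = start_line.split(" ", 2)
print(version, status_code, reason)   # protocol version, status code, status text
print(headers["Set-Cookie"], body)
```

The request carries its cookie in a `Cookie` header and the response sets one with `Set-Cookie`; these are the two header fields that cookie handling revolves around.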

HTTP message

When the browser sends a request to the server, it issues an HTTP request message; when the server returns data, it issues an HTTP response message. Both types of message consist of a start line, message headers, a blank line marking the end of the headers, and an optional message body.

In an HTTP request message, the start line contains the request method, the requested resource, and the HTTP protocol version. The message headers carry various properties, and the message body carries the data. A GET request has no message body, so nothing follows the blank line after the headers.

In an HTTP response message, the start line contains the HTTP protocol version, the HTTP status code, and the status text. The message headers carry various properties, and the message body contains the data returned by the server.

For example, in an HTTP request and response captured with Fiddler, the GET request has no content, so nothing follows the blank line after the message headers. The server sends back a response message, which the browser receives normally. As these messages show, cookies appear in the header information of both the HTTP request and the HTTP response; the cookie is a very important property of the message headers.

What is a cookie?
When a user first accesses a domain through a browser, the web server sends data to the client in order to maintain state between the server and the client. This data is a cookie: data created by a website to identify the user and stored on the user's local device. The information in a cookie is often encoded or encrypted. Cookies are kept in the browser's cache or on the hard disk, where each one is a small text file. When you visit a site, the browser reads the cookie information for that site. Cookies are an effective way to improve the online experience. In general, once a cookie is saved on a computer, only the site that created it can read it.

Why cookies are needed

HTTP is a stateless protocol that runs on top of the connection-oriented TCP/IP protocol layer. After the client has established a connection with the server, the TCP connection between them can be kept open; how long it stays open is configured on the server side. When the client accesses the server again, it can continue to use the last established connection. But because HTTP is stateless, the web server does not know whether two requests come from the same client, and the two requests are independent of each other. To solve this problem, web applications introduced the cookie mechanism to maintain state. A cookie can record the user's login status: after a successful login, the web server usually sends a signature that marks the session as valid, which saves the user from authenticating and logging in repeatedly. Cookies can also record the user's access status.

Types of cookie

Session cookies: this type of cookie is only valid for the duration of the session and is stored in the browser's cache. It is created when the user visits the site and is deleted by the browser when the browser is closed.
Persistent cookies: this type of cookie stays valid across user sessions for a long time. If the cookie's max-age attribute is set to one month, the cookie is sent in the HTTP request for every relevant URL during that month. It can therefore record a lot of user initialization or customization information, such as the time of the first login and a weak login state.

Secure cookies: a secure cookie is a cookie mode used under HTTPS access, which ensures that the cookie is always encrypted as it is passed from the client to the server.

HttpOnly cookies: this type of cookie is only passed in HTTP (HTTPS) requests and is not accessible to client-side scripting languages, which helps protect against cross-site scripting attacks.

First-party and third-party cookies: a first-party cookie is generated under the domain or subdomain that is currently being visited. A third-party cookie is created by a third-party domain.

Cookie composition

A cookie is an attribute in the HTTP message headers. It consists of the cookie name, the cookie value, the expiration time (Expires/Max-Age), the path the cookie applies to (Path), the domain the cookie belongs to (Domain), and whether the cookie is restricted to secure connections (Secure). The first two are required. A cookie also has a size; the limits on the number and size of cookies vary between browsers.

Python simulated login

Set up a cookie-processing object that is responsible for adding cookies to HTTP requests and extracting cookies from HTTP responses. Then send a request to the site's login page: build the request from the login URL and the data to POST, wrap the HTTP headers, send it with urllib2.urlopen, and receive the web server's response. First, we check the source code of the login page.

When we process a URL with urllib2, the work is actually done by an instance of urllib2.OpenerDirector, which calls the handlers for operations such as protocol handling, opening URLs, and cookie processing. The urlopen method uses the default opener, and the basic urlopen() function does not support authentication, cookies, or other advanced HTTP features. To support these features, you must use the build_opener() function to create your own custom opener object.

The cookielib module defines classes that automatically handle HTTP cookies, for accessing web sites that require cookie data. It includes CookieJar, FileCookieJar, CookiePolicy, DefaultCookiePolicy, Cookie, and the FileCookieJar subclasses MozillaCookieJar and LWPCookieJar. A CookieJar object manages HTTP cookies, adding them to HTTP requests and extracting them from HTTP responses. A FileCookieJar object mainly reads cookies from a file or writes them to one. MozillaCookieJar creates a FileCookieJar instance that is compatible with the Mozilla browser's cookies.txt. LWPCookieJar creates a FileCookieJar instance that is compatible with the libwww-perl Set-Cookie3 file format, and the cookie file it saves is easy for humans to read. FileCookieJar itself does not implement the save function, whereas MozillaCookieJar and LWPCookieJar do, so you can use either of them to save cookies automatically.
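The cookie jar behavior described above can be tried without touching a real site. The following is a minimal sketch in Python 3, where cookielib is named http.cookiejar and urllib2 is named urllib.request. `FakeResponse`, `cookie_round_trip`, the host `www.example.com`, and the cookie value are hypothetical stand-ins introduced only for this illustration; a real crawler would let HTTPCookieProcessor call the jar for it.

```python
import os
import tempfile
import urllib.request
from email.message import Message
from http.cookiejar import LWPCookieJar


class FakeResponse:
    """Hypothetical stand-in for a server response; the jar only calls info()."""

    def __init__(self, set_cookie_values):
        self._headers = Message()
        for value in set_cookie_values:
            # Assigning to a Message appends a header instead of replacing it.
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


def cookie_round_trip(cookie_file):
    # 1. Extract the cookie from a (fake) login response into the jar.
    #    Max-Age makes it a persistent cookie, so save() will keep it.
    login_request = urllib.request.Request("http://www.example.com/login")
    login_response = FakeResponse(["session=abc123; Path=/; Max-Age=3600"])
    jar = LWPCookieJar(cookie_file)
    jar.extract_cookies(login_response, login_request)

    # 2. Save the jar to disk in the human-readable Set-Cookie3 format.
    jar.save()

    # 3. Load the file into a fresh jar, as a later crawler run would.
    restored = LWPCookieJar(cookie_file)
    restored.load()

    # 4. Let the jar add the Cookie header to a new request for the same site.
    page_request = urllib.request.Request("http://www.example.com/home")
    restored.add_cookie_header(page_request)
    return page_request.get_header("Cookie")


with tempfile.TemporaryDirectory() as tmp:
    print(cookie_round_trip(os.path.join(tmp, "cookies.lwp")))
```

A session cookie (one without Max-Age or Expires) would be skipped by `save()` unless `ignore_discard=True` is passed, which matches the distinction between session and persistent cookies made earlier.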
```python
#!/usr/bin/env python
# coding: utf-8
# Python 2 script: urllib2 and cookielib only exist under these names in Python 2.
import sys
import urllib
import urllib2
import cookielib

# Python 2 workaround for encoding errors with Chinese text
reload(sys)
sys.setdefaultencoding("utf8")

# Log in to Renren
loginurl = 'http://www.renren.com/PLogin.do'
logindomain = 'renren.com'


class Login(object):

    def __init__(self):
        self.name = ''
        self.pwd = ''
        self.domain = ''
        # The cookie jar keeps cookies from responses and adds them to later requests
        self.cj = cookielib.LWPCookieJar()
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        # Make urllib2.urlopen use this cookie-aware opener as well
        urllib2.install_opener(self.opener)

    def setlogininfo(self, username, password, domain):
        '''Set the user login information.'''
        self.name = username
        self.pwd = password
        self.domain = domain

    def login(self):
        '''Log in to the website and return the response body.'''
        loginparams = {'domain': self.domain, 'email': self.name, 'password': self.pwd}
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        # Passing form data as the second argument makes this a POST request
        req = urllib2.Request(loginurl, urllib.urlencode(loginparams), headers=headers)
        # Send the request once, through the cookie-aware opener
        response = self.opener.open(req)
        thepage = response.read()
        return thepage


if __name__ == '__main__':
    userlogin = Login()
    username = 'username'
    password = 'password'
    domain = logindomain
    userlogin.setlogininfo(username, password, domain)
    userlogin.login()
```
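For readers on Python 3, where urllib2 was split into urllib.request and urllib.parse and cookielib was renamed http.cookiejar, the same flow might look like the sketch below. The function names `build_opener_with_cookies`, `build_login_request`, and `login` are assumptions made for illustration, and the demo credentials are placeholders. Whether the login URL from the article still accepts this form is not verified here, so the demo lines at the bottom only build the request and do not send it.

```python
import urllib.parse
import urllib.request
from http.cookiejar import LWPCookieJar

LOGIN_URL = "http://www.renren.com/PLogin.do"  # login URL taken from the article
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
)


def build_opener_with_cookies():
    """Return (opener, jar); the jar collects cookies from every response."""
    jar = LWPCookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(username, password, domain):
    """Build the POST request that carries the login form fields."""
    params = {"domain": domain, "email": username, "password": password}
    # In Python 3 the POST body must be bytes, so the URL-encoded form
    # string is encoded to UTF-8 before it is attached to the request.
    data = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        LOGIN_URL, data=data, headers={"User-Agent": USER_AGENT}
    )


def login(username, password, domain):
    """Send the login request; return the page bytes and the filled cookie jar."""
    opener, jar = build_opener_with_cookies()
    with opener.open(build_login_request(username, password, domain)) as response:
        page = response.read()
    return page, jar


# Build (but do not send) a request with placeholder credentials.
request = build_login_request("user@example.com", "secret", "renren.com")
print(request.get_method(), request.full_url)
print(request.data)
```

Returning the jar from `login` lets later requests reuse the same opener and cookies, which is the point of the simulated login: once the session cookie is in the jar, pages behind the login can be fetched through the same opener.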
