Using the mechanize module in Python to simulate browser behavior
This article describes how to use the mechanize module in Python to simulate browser behavior, including cookie and proxy settings.
It is often useful to know how to quickly instantiate a browser from the command line or a Python script. Whenever I need to automate a web task, I use this Python code to simulate a browser:
import mechanize
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but does not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
Now you have a browser instance, the br object. With this object you can open a page using code like the following:
# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://google.com')
html = r.read()

# Show the source
print html
# or
print br.response().read()

# Show the html title
print br.title()

# Show the response headers
print r.info()
# or
print br.response().info()

# Show the available forms
for f in br.forms():
    print f

# Select the first (index zero) form
br.select_form(nr=0)

# Let's search
br.form['q'] = 'weekend codes'
br.submit()
print br.response().read()

# Looking at some results in link format
for l in br.links(url_regex='stockrt'):
    print l
If the website you are visiting requires authentication (HTTP basic auth), then:
# If the protected site didn't receive the authentication data you would
# end up with a 401 error in your face
br.add_password('http://safe-site.domain', 'username', 'password')
br.open('http://safe-site.domain')
Because a CookieJar was set up earlier, you do not have to manage the website's login session yourself. That is, after you POST a username and password once, the site asks your browser to store a session cookie so you do not have to log in again, and your subsequent requests carry that cookie. The CookieJar handles all of this, storing the session cookie and resending it, as the sketch below shows.
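Here is a minimal sketch of that pattern with mechanize. The login URL and the form field names ('username', 'password') are hypothetical placeholders, not from the original article; a real site will use its own:

import mechanize
import cookielib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)

# POST the credentials once; the server answers with a Set-Cookie header
br.open('http://example.com/login')      # hypothetical login page
br.select_form(nr=0)
br.form['username'] = 'user'             # assumed form field names
br.form['password'] = 'secret'
br.submit()

# The CookieJar now holds the session cookie and resends it automatically
print br.open('http://example.com/private').read()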
You can also navigate the browser history:
# Testing presence of link (if the link is not found you would have to
# handle a LinkNotFoundError exception)
br.find_link(text='weekend codes')

# Actually clicking the link
req = br.click_link(text='weekend codes')
br.open(req)
print br.response().read()
print br.geturl()

# Back
br.back()
print br.response().read()
print br.geturl()
Download an object:
# Download
f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
print f
fh = open(f)
Setting an HTTP proxy:
# Proxy and user/password
br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})
# Proxy
br.set_proxies({"http": "myproxy.example.com:3128"})
# Proxy password
br.add_proxy_password("joe", "password")
However, if you only want to open web pages and do not need all of the magic above, you can:
# Simple open?
import urllib2
print urllib2.urlopen('http://stockrt.github.com').read()

# With password?
import urllib
opener = urllib.FancyURLopener()
print opener.open('http://user:password@stockrt.github.com').read()
You can learn more on the official websites of mechanize and ClientForm.
From: http://reyoung.me/index.php/2012/08/08/%E7%BF%BB%E8%AF%91%E4%BD%BF%E7%94%A8python%E6%A8%A1%E4%BB%BF%E6%B5%8F%E8%A7%88%E5%99%A8%E8%A1%8C%E4%B8%BA/
------------------------------
Finally, let's talk about a very important concept and technique for accessing pages through code: cookies.
We all know that HTTP is a stateless protocol, yet the client and server often need to maintain some shared information, which is what cookies are for. With cookies, the server can tell that a user has just logged in to the website and can allow that client to access pages that require a login.
For example, to use Sina Weibo in a browser, you must first log in; only after a successful login can you visit the other pages. When you log in to Sina Weibo or any other site with authentication from a program, the key point is to save the cookie obtained at login and then send it along with subsequent requests to achieve the same effect.
Here we need Python's cookielib and urllib2 modules to cooperate: bind cookielib's cookie handling to urllib2 so that a cookie is attached to each page request.
The first step is to use the HttpFox plug-in for Firefox: browse to the Sina Weibo home page, log in, and observe the URL of the request sent at each step. Then simulate that process in Python: use urllib2 to send the username and password to the login page, obtain the cookie set after login, and then visit the other pages with that cookie to fetch the Weibo data.
The main job of the cookielib module is to provide objects that can store cookies, for use together with the urllib2 module when accessing Internet resources. For example, a CookieJar object from this module can capture cookies and resend them on subsequent requests. The cookielib module mainly provides the following objects: CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar; a small persistence sketch follows below.
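As an illustration of the difference between these classes, LWPCookieJar (unlike the plain CookieJar) can save cookies to a file and load them back, so a session can survive between runs of a script. This is only a sketch; the URL and file name are placeholders:

import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('http://example.com/')  # the server may set cookies here

# Save the captured cookies to disk...
cj.save('cookies.txt', ignore_discard=True, ignore_expires=True)

# ...and load them back in a later run
cj2 = cookielib.LWPCookieJar()
cj2.load('cookies.txt', ignore_discard=True, ignore_expires=True)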
The urllib2 module is similar to the urllib module: both are used to open a URL and read data from it. Unlike urllib, however, urllib2 not only offers the urlopen() function but also lets you build a custom opener for accessing web pages. Note that the urlretrieve() function exists only in the urllib module, not in urllib2. In practice the two are hard to separate: when you use urllib2, you usually still need urllib, because POST data must be encoded with the urllib.urlencode() function. A small sketch contrasting the two follows below.
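To illustrate that division of labor, here is a small sketch (the URLs are placeholders, not from the original article): urllib.urlencode() prepares the POST body, urllib2 sends the request, and urllib.urlretrieve() handles a plain file download:

import urllib
import urllib2

# POST: urllib encodes the data, urllib2 sends the request
post_data = urllib.urlencode({'q': 'python'})
print urllib2.urlopen('http://example.com/search', post_data).read()

# Download to a local file: urlretrieve() exists only in urllib
urllib.urlretrieve('http://example.com/logo.gif', 'logo.gif')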
The cookielib module is generally used together with the urllib2 module: the CookieJar is wrapped in a urllib2.HTTPCookieProcessor, which is then passed as a parameter to the urllib2.build_opener() function. The following code logs in to Renren:
#!/usr/bin/env python
# coding=utf-8
import urllib2
import urllib
import cookielib

# Login username and password
data = {"email": "username", "password": "password"}
post_data = urllib.urlencode(data)

# Bind a CookieJar to the opener so the session cookie is kept
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

headers = {"User-agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
req = urllib2.Request("http://www.renren.com/PLogin.do", post_data, headers)
content = opener.open(req)
print content.read().decode("UTF-8").encode("gbk")
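Once the login above succeeds, the CookieJar inside the opener holds the session cookie, so the same opener can fetch pages that require a login. A minimal continuation of the script (the URL is a hypothetical placeholder):

# Reuse the same opener: the session cookie is attached automatically
page = opener.open("http://www.renren.com/home")  # hypothetical logged-in page
print page.read()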