Using the Mechanize module in Python to simulate browser capabilities

Source: Internet
Author: User
Tags: urlencode, python script

This article introduces how to use the Mechanize module in Python to simulate browser functionality, including handling cookies and setting up a proxy. Readers with similar needs may find it a useful reference.

It is often useful to know how to quickly instantiate a browser in a command line or in a Python script.

Every time I need to automate a task on the web, I use this Python code to simulate a browser:


```python
import mechanize
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
```

Now you have a browser instance in the br object. With this object, you can open a page with code similar to the following:


```python
# Open some site, let's pick a random one, the first that pops in mind
# (the URL was stripped from the source; Google fits the 'q' search form used below)
r = br.open('http://google.com')
html = r.read()

# Show the source
print html
# or
print br.response().read()

# Show the HTML title
print br.title()

# Show the response headers
print r.info()
# or
print br.response().info()

# Show the available forms
for f in br.forms():
    print f

# Select the first (index zero) form
br.select_form(nr=0)

# Let's search
br.form['q'] = 'weekend codes'
br.submit()
print br.response().read()

# Looking at some results in link format
for l in br.links(url_regex='stockrt'):
    print l
```

If you visit a website that requires authentication (HTTP basic auth), then:


```python
# If the protected site didn't receive the authentication data you would
# end up with a 410 error in your face
br.add_password('http://safe-site.domain', 'username', 'password')
br.open('http://safe-site.domain')
```

Because a cookie jar was attached earlier, you do not have to manage the login session for the site yourself; in particular, there is no need to re-POST the username and password on every request.

Usually, after you log in the site asks your browser to store a session cookie, and your subsequent requests are expected to carry that cookie along.

Storing and re-sending this session cookie is all handled by the cookie jar. Cool, huh?
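To make the store-and-resend behaviour concrete, here is a minimal offline sketch using the Python 3 standard library (cookielib became http.cookiejar in Python 3). The cookie name, value, and domain are made-up placeholders, and the cookie is planted by hand instead of arriving in a real login response:

```python
import http.cookiejar
import urllib.request

cj = http.cookiejar.CookieJar()

# Pretend the server's login response set a session cookie (values are made up).
cookie = http.cookiejar.Cookie(
    version=0, name='session_id', value='abc123', port=None,
    port_specified=False, domain='example.com', domain_specified=True,
    domain_initial_dot=False, path='/', path_specified=True, secure=False,
    expires=None, discard=True, comment=None, comment_url=None, rest={})
cj.set_cookie(cookie)

# On the next request to the same domain, the jar re-sends the cookie for us.
req = urllib.request.Request('http://example.com/profile')
cj.add_cookie_header(req)
print(req.get_header('Cookie'))
```

No network access happens here; the point is only that the jar, not your code, decides when a stored cookie matches a request and attaches it.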

At the same time, you can manage your browser history:


```python
# Testing presence of link (if the link is not found you would have to
# handle a LinkNotFoundError exception)
br.find_link(text='weekend codes')

# Actually clicking the link
req = br.click_link(text='weekend codes')
br.open(req)
print br.response().read()
print br.geturl()

# Back
br.back()
print br.response().read()
print br.geturl()
```

Download a file:


```python
# Download (only the file name survives in the source; the full URL was stripped)
f = br.retrieve('logo.gif')[0]
print f
fh = open(f)
```

Setting up proxies for HTTP


```python
# Proxy and user/password (the proxy addresses were stripped from the source;
# the host:port values below are placeholders)
br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})
# Proxy
br.set_proxies({"http": "myproxy.example.com:3128"})
# Proxy password
br.add_proxy_password("joe", "password")
```

But if you just want to open a page without all the magical features shown above, you can:


```python
# Simple open? (the URLs were stripped from the source; example.com is a placeholder)
import urllib2
print urllib2.urlopen('http://example.com').read()

# With password?
import urllib
opener = urllib.FancyURLopener()
print opener.open('http://user:password@example.com').read()
```

You can learn more from the mechanize official website, the mechanize documentation, and the ClientForm documentation.


Finally, let's discuss a very important concept and technique for accessing pages from code: cookies.

We all know that HTTP is a stateless, connectionless protocol, yet the client and server need to maintain some shared information, such as cookies. With cookies, the server can tell that a user has just logged in and grant the client permission to access certain pages.

For example, to use Sina Weibo in a browser you must log in first; only after a successful login can you open the other pages. To log in to Sina Weibo or any other authenticated site from a program, the key point is to save the cookies and then carry them along on subsequent visits to the site.

This calls for Python's cookielib and urllib2 modules: binding a cookielib cookie jar to urllib2 lets each page request carry the cookies along.
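The binding itself is just a few lines. Here is the same construction in Python 3 spelling (cookielib became http.cookiejar and urllib2 became urllib.request); no request is actually made, this only wires the jar into an opener:

```python
import http.cookiejar
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Every page opened through `opener` will now store cookies into cj and
# re-send them on later requests automatically.
has_cookie_handler = any(
    isinstance(h, urllib.request.HTTPCookieProcessor) for h in opener.handlers)
print(has_cookie_handler)
```

Opening pages through this opener instead of the bare urlopen() is what makes the cookie round-trip happen.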

The first step is to use the HttpFox plug-in for Firefox: browse to the Sina Weibo home page, log in, and inspect the HttpFox log to see what data each step sends and which URL it requests. Then simulate that process in Python: use urllib2.urlopen to send the username and password to the login page, obtain the cookie set after login, and carry it along when visiting other pages to fetch microblog data.

The primary role of the cookielib module is to provide objects that can store cookies, so that Internet resources can be accessed in conjunction with the urllib2 module. For example, an object of this module's CookieJar class can capture cookies and resend them on subsequent connection requests. The main objects in the cookielib module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.
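The difference between the classes is persistence: CookieJar keeps cookies in memory only, while the FileCookieJar subclasses can save them to disk and load them back. A small sketch using LWPCookieJar (the class used in the mechanize snippet at the top of this article), written in Python 3 spelling where cookielib is http.cookiejar; the file path is a throwaway temp file:

```python
import http.cookiejar  # cookielib in Python 2
import os
import tempfile

# LWPCookieJar persists cookies in the libwww-perl "Set-Cookie3" format.
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
cj = http.cookiejar.LWPCookieJar(path)

cj.save()  # writes the jar (empty here) to disk
with open(path) as fh:
    first_line = fh.read().splitlines()[0]
print(first_line)  # the LWP cookie-file format header
```

A later run can call cj.load() on the same path to resume the saved session.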

The urllib2 module is similar to the urllib module in that both open URLs and fetch data from them. Unlike the urllib module, urllib2 can use a customized opener as well as the urlopen() function to access web pages. Also note that the urlretrieve() function exists only in the urllib module, not in urllib2. In practice, however, urllib2 is rarely used without urllib, because POST data must be encoded with the urllib.urlencode() function.
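The encoding step is easy to see in isolation. In Python 3, urllib.urlencode moved to urllib.parse.urlencode; it turns a dict of form fields into the application/x-www-form-urlencoded string that goes in a POST body:

```python
from urllib.parse import urlencode  # urllib.urlencode in Python 2

# Placeholder credentials, matching the login example below.
data = {"email": "username", "password": "password"}
post_data = urlencode(data)
print(post_data)
```

The resulting "key=value&key=value" string (with any special characters percent-escaped) is what gets passed as the data argument of the request.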

The cookielib module is generally used together with urllib2, mainly by passing urllib2.HTTPCookieProcessor() as an argument to the urllib2.build_opener() function. Usage is illustrated by the following login code for Renren:


```python
#!/usr/bin/env python
#coding=utf-8
import urllib2
import urllib
import cookielib

data = {"email": "username", "password": "password"}  # login username and password
post_data = urllib.urlencode(data)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
headers = {"User-agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
# The login URL was stripped from the source; fill in Renren's login address here
req = urllib2.Request("http://", post_data, headers)
response = opener.open(req)
print response.read().decode("utf-8").encode("gbk")
```