It is often useful to know how to quickly instantiate a browser on the command line or in a Python script.
Every time I need to do any automated task on the web, I use this Python code to emulate a browser.
    import mechanize
    import cookielib

    # Browser
    br = mechanize.Browser()

    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Follows refresh 0 but does not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # Want debugging messages?
    #br.set_debug_http(True)
    #br.set_debug_redirects(True)
    #br.set_debug_responses(True)

    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
Now you have a browser instance, the br object. With this object you can open a page using code like the following:
    # Open some site, let's pick a random one, the first that pops in mind:
    r = br.open('http://google.com')
    html = r.read()

    # Show the source
    print html
    # or
    print br.response().read()

    # Show the html title
    print br.title()

    # Show the response headers
    print r.info()
    # or
    print br.response().info()

    # Show the available forms
    for f in br.forms():
        print f

    # Select the first (index zero) form
    br.select_form(nr=0)

    # Let's search
    br.form['q'] = 'weekend codes'
    br.submit()
    print br.response().read()

    # Looking at some results in link format
    for l in br.links(url_regex='stockrt'):
        print l
If you visit a website that requires authentication (HTTP basic auth), then:
    # If the protected site didn't receive the authentication data you would
    # end up with a 401 error in your face
    br.add_password('http://safe-site.domain', 'username', 'password')
    br.open('http://safe-site.domain')
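For reference, HTTP basic auth simply sends the credentials base64-encoded in an Authorization header. The sketch below is written for Python 3 (unlike the Python 2 code in this post), and basic_auth_header is a hypothetical helper name used only for illustration; it is not part of mechanize:

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP basic auth."""
    # The credentials are joined with a colon and base64-encoded (not encrypted).
    raw = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("username", "password")
print(header)
```

Since base64 is trivially reversible, basic auth only protects the password when the connection itself is encrypted (HTTPS).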
Thanks to the cookie jar added earlier, you do not need to manage the login session yourself on sites where you log in by submitting a form (a POST of user name and password).
Such sites usually ask your browser to store a session cookie and expect the browser to send that same cookie back on later requests, so that you do not have to log in again.
All of this, storing and re-sending the session cookie, is taken care of by the cookie jar. Cool, right?
You can also work with the browser history:
    # Testing presence of link (if the link is not found you would have to
    # handle a LinkNotFoundError exception)
    br.find_link(text='Weekend codes')

    # Actually clicking the link
    req = br.click_link(text='Weekend codes')
    br.open(req)
    print br.response().read()
    print br.geturl()

    # Back
    br.back()
    print br.response().read()
    print br.geturl()
To download a file:
    # Download
    f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
    print f
    fh = open(f)
To set a proxy for HTTP requests:
    # Proxy and user/password
    br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})

    # Proxy
    br.set_proxies({"http": "myproxy.example.com:3128"})
    # Proxy password
    br.add_proxy_password("joe", "password")
But if you just want to open a web page without all the magic shown above, you can:
    # Simple open?
    import urllib2
    print urllib2.urlopen('http://stockrt.github.com').read()

    # With password?
    import urllib
    opener = urllib.FancyURLopener()
    print opener.open('http://user:password@stockrt.github.com').read()
You can learn more from the official mechanize website, the mechanize documentation, and the ClientForm documentation.
Originally from: http://reyoung.me/index.php/2012/08/08/%E7%BF%BB%E8%AF%91%E4%BD%BF%E7%94%A8python%E6%A8%A1%E4%BB%BF%E6%B5%8F%E8%A7%88%E5%99%A8%E8%A1%8C%E4%B8%BA/
——————————————————————————————
Finally, let's talk about a very important concept and technique for accessing pages through code: cookies.
HTTP is a stateless protocol, but the client and the server still need to share some information, and cookies are one way to do that. With a cookie, the server can tell that the user has just logged in to the site and grant the client access to certain pages.
For example, when you use Sina Weibo in a browser you must log in first, and only after a successful login can you open the other pages. To log in to Sina Weibo or any other authenticated site from a program, the key is to save the cookie and then send it along with every later request.
This is where Python's cookielib and urllib2 come in: you bind cookielib to urllib2 so that cookies are included whenever a web page is requested.
In practice, the first step is to use the HttpFox add-on for Firefox: open the Sina Weibo home page in the browser, log in, and check the HttpFox log to see which data each step sends to which URL. Then simulate the same process in Python: use urllib2.urlopen to send the user name and password to the login page, obtain the cookie after logging in, and then visit the other pages to fetch the Weibo data.
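The login flow described above (post the credentials, keep the cookie, then request other pages) can be sketched offline against a small local server. This is a Python 3 sketch, so it uses http.cookiejar and urllib.request, the Python 3 names for cookielib and urllib2. DemoHandler, the /login and /data paths, and the session=abc123 cookie are all made up for the demo and have nothing to do with the real Sina Weibo login:

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site with a login form."""

    def do_POST(self):
        # Read the form body and set a session cookie if the login is valid.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        ok = form.get("email") == ["username"] and form.get("password") == ["password"]
        self.send_response(200 if ok else 403)
        if ok:
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Only serve the page when the session cookie is sent back.
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        body = b"private data" if logged_in else b"please log in"
        self.send_response(200 if logged_in else 403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Start the demo server on a free local port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:{}".format(server.server_port)

# Bind a cookie jar to the opener; ProxyHandler({}) disables any system proxy.
jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),
    urllib.request.HTTPCookieProcessor(jar),
)

# Step 1: post the credentials; the jar captures the Set-Cookie header.
post_data = urllib.parse.urlencode(
    {"email": "username", "password": "password"}
).encode("ascii")
opener.open(base + "/login", post_data).close()

# Step 2: request another page; the jar re-sends the session cookie.
page = opener.open(base + "/data").read()
print(page)

server.shutdown()
server.server_close()
```

Using the same opener for both requests is what makes this work: a plain urlopen() call without a cookie processor would not send the cookie back on the second request.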
The main role of the cookielib module is to provide objects that store cookies, for use together with the urllib2 module when accessing Internet resources. For example, you can use a CookieJar object from this module to capture cookies and resend them on subsequent requests. The main classes in the cookielib module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.
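The file-based jars can also persist cookies between runs. Here is a minimal Python 3 sketch (where cookielib is named http.cookiejar) that saves a cookie with LWPCookieJar and loads it back. The cookie is built by hand, with made-up values, only so that the example runs without a network connection:

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar

# Hypothetical cookie; normally the jar would capture it from a response.
cookie = Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=int(time.time()) + 3600, discard=False,
    comment=None, comment_url=None, rest={},
)

# Store the jar in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")

jar = LWPCookieJar(path)
jar.set_cookie(cookie)
jar.save()  # writes the file in the libwww-perl Set-Cookie3 format

# A fresh jar pointed at the same file gets the cookie back.
restored = LWPCookieJar(path)
restored.load()
names = [c.name for c in restored]
print(names)
```

By default save() skips session cookies (those marked discard) and expired cookies, which is why the demo cookie is given an expiry time in the future.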
The urllib2 module is similar to the urllib module: both open a URL and fetch data from it. Unlike urllib, urllib2 can access web pages not only through the urlopen() function but also through a custom opener. Also note that the urlretrieve() function lives in the urllib module and does not exist in urllib2. Even so, urllib2 is rarely used without urllib, because POST data has to be encoded with the urllib.urlencode() function.
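As a small illustration of the encoding step, here is a Python 3 sketch, where urllib.urlencode() has become urllib.parse.urlencode() and urllib2.Request has become urllib.request.Request. The URL is a placeholder, and nothing is sent over the network because the request is only built, never opened:

```python
import urllib.parse
import urllib.request

# Form fields to post, encoded as application/x-www-form-urlencoded bytes.
form = {"email": "username", "password": "password"}
post_data = urllib.parse.urlencode(form).encode("ascii")

# Hypothetical URL; the request is built here but not sent.
req = urllib.request.Request(
    "http://example.com/login",
    data=post_data,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(post_data)          # b'email=username&password=password'
print(req.get_method())   # POST
```

Supplying data is what turns the request into a POST; without it the same Request would be sent as a GET.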
The cookielib module is generally used together with the urllib2 module, mainly by passing a cookie jar to urllib2.HTTPCookieProcessor(), which in turn is passed as an argument to the urllib2.build_opener() function. For example, the following code logs in to Renren:
    #!/usr/bin/env python
    #coding=utf-8
    import urllib2
    import urllib
    import cookielib

    # Login user name and password
    data = {"email": "username", "password": "password"}
    post_data = urllib.urlencode(data)

    # Opener that stores cookies and sends them back automatically
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    headers = {"User-agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
    req = urllib2.Request("http://www.renren.com/PLogin.do", post_data, headers)
    content = opener.open(req)
    print content.read().decode("utf-8").encode("gbk")
For details, please refer to:
http://www.crazyant.net/796.html Using Python's cookielib and urllib2 to simulate logging in to Sina Weibo and crawl data
http://my.oschina.net/duhaizhang/blog/69342 The urllib2 module
https://docs.python.org/2/library/cookielib.html cookielib: Cookie handling for HTTP clients