I recently read about web crawlers and simulated login, and came across a package called mechanize [ˈmekənaɪz]. The name means "to mechanize", that is, to automate.
mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

Any URL can be opened, not just http:
mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
Easy HTML form filling.
Convenient link parsing and following.
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of robots.txt.
Automatic handling of HTTP-EQUIV and Refresh.
That is to say, both mechanize.Browser and mechanize.UserAgentBase implement urllib2.OpenerDirector, so any protocol can be opened, not just HTTP. In addition, mechanize provides a simpler way to configure behavior than creating a new OpenerDirector each time, and it supports operations on forms, browsing history and page reloading, link following, robots.txt checking, and so on.
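The robots.txt checking that mechanize does automatically is something you would otherwise code by hand. As a rough illustration of what that check involves, here is a sketch using the standard library's robot-exclusion parser (shown with Python 3's urllib.robotparser; the module was called robotparser in Python 2). The rules string is a made-up example, not from mechanize.

```python
import urllib.robotparser

# A made-up robots.txt body, for illustration only.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
# Normally you would call rp.set_url(...) and rp.read(); parsing a literal
# string keeps this sketch offline.
rp.parse(rules.splitlines())

# mechanize performs an equivalent check before each fetch when
# robots.txt handling is enabled.
print(rp.can_fetch("*", "http://www.example.com/private/secret.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/public/index.html"))    # True
```

When mechanize's robots.txt handling is on and a fetch is disallowed, the browser refuses the request instead of silently ignoring the rule.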
import re
import mechanize

(1) Instantiate a Browser object:

    br = mechanize.Browser()

(2) Open a URL:

    br.open("http://www.example.com/")

(3) Follow the second link on the page whose text matches text_regex:

    # follow second link with element text matching regular expression
    response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
    assert br.viewing_html()

(4) The page title:

    print br.title()

(5) Print the page's URL:

    print response1.geturl()

(6) The page headers:

    print response1.info()  # headers

(7) The page body:

    print response1.read()  # body

(8) Select the form named "order":

    br.select_form(name="order")
    # Browser passes through unknown attributes (including methods)
    # to the selected HTMLForm.

(9) Assign a value to the form control named "cheeses":

    br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
    # Submit current form.  Browser calls .close() on the current response
    # on navigation, so this closes response1.

(10) Submit:

    response2 = br.submit()
    # print currently selected form (don't call .submit() on this, use br.submit())
    print br.form

(11) Go back:

    response3 = br.back()  # back to cheese shop (same data as response1)
    # the history mechanism returns cached response objects
    # we can still use the response, even though it was .close()d
    response3.get_data()  # like .seek(0) followed by .read()

(12) Reload the page:

    response4 = br.reload()  # fetches from server

(13) List all the forms on the page:

    for form in br.forms():
        print form
    # .links() optionally accepts the keyword args of .follow_/.find_link()
    for link in br.links(url_regex="python.org"):
        print link
        br.follow_link(link)  # takes EITHER Link instance OR keyword args
        br.back()
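The text_regex and url_regex arguments in steps (3) and (13) are ordinary Python regular expressions applied to each link. As a standalone illustration (independent of mechanize) of how the pattern r"cheese\s*shop" from the example behaves:

```python
import re

pattern = re.compile(r"cheese\s*shop")

# \s* allows any amount of whitespace between the two words.
print(bool(pattern.search("visit the cheese shop today")))  # True
print(bool(pattern.search("visit the cheeseshop today")))   # True
print(bool(pattern.search("visit the Cheese Shop today")))  # False (case-sensitive)
```

If you need a case-insensitive match, compile the pattern with re.IGNORECASE.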
This is the example provided in the mechanize documentation; the basic explanation is given in the code comments.
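The form handling in steps (8)-(10) is the part mechanize adds over plain urllib2: it parses the page's HTML, finds the named form, and tracks each control's value for you. A minimal sketch of that idea using only the standard library's html.parser (the class and names here are my own, not mechanize's API):

```python
from html.parser import HTMLParser

class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> controls -- a tiny slice of what
    mechanize's select_form()/__setitem__ machinery manages for you."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                # Record the control's current value (empty if none given).
                self.fields[attrs["name"]] = attrs.get("value", "")

collector = FormFieldCollector()
collector.feed('<form name="order">'
               '<input name="cheeses" value="mozzarella">'
               '<input type="submit" name="go">'
               '</form>')
print(collector.fields)  # {'cheeses': 'mozzarella', 'go': ''}
```

mechanize goes much further, of course: it also handles select/checkbox/radio controls, encodes the submission, and issues the request.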
You may control the Browser's policy by using the methods of mechanize.Browser's base class, mechanize.UserAgent. The following code, also taken from the documentation, shows how:
br = mechanize.Browser()

# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:[email protected]") and port number (":3128") are optional.
br.set_proxies({"http": "joe:[email protected]:3128",
                "ftp": "proxy.example.com",
                })

# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using the "joe:[email protected]" form above)
br.add_proxy_password("joe", "password")

# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("http://example.com/protected/", "joe", "password")

# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)

# Ignore robots.txt.  Do not do this without thought and consideration.
br.set_handle_robots(False)

# Don't add Referer (sic) header
br.set_handle_referer(False)

# Don't handle Refresh redirections
br.set_handle_refresh(False)

# Don't handle cookies
br.set_cookiejar()

# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
br.set_cookiejar(cj)

# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)

# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)

# Print HTTP headers.
br.set_debug_http(True)

# To make sure you're seeing all debug output:
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response()  # this is a copy of response
headers = response.info()  # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
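The last few lines above patch up a server's bad output before handing it back to the Browser. The comment-repair step is plain Python string replacement; here is a self-contained sketch of just that step (the broken markup is invented for illustration):

```python
# A malformed comment opener "<!---" confuses some HTML parsers.
bad_html = "<html><!--- navigation --><body>hello</body></html>"

# Same fix as in the mechanize example: rewrite "<!---" to "<!--".
good_html = bad_html.replace("<!---", "<!--")
print(good_html)  # <html><!-- navigation --><body>hello</body></html>
```

In the mechanize example this cleaned-up body is pushed back into the browser with response.set_data() and br.set_response(), so later form and link parsing sees the repaired HTML.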
In addition, there are some web-page interaction modules similar to mechanize, and there are several wrappers around mechanize designed for functional testing of web applications. In the final analysis, they all encapsulate urllib2, so pick whichever module suits you best!
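To see what "they all encapsulate urllib2" means, here is a rough sketch of the manual equivalent of mechanize's cookie handling, written with Python 3's urllib.request (the successor of urllib2). mechanize saves you from wiring this up yourself; the user-agent string is a made-up example.

```python
import http.cookiejar
import urllib.request

# mechanize keeps a CookieJar for you; with plain urllib you attach one by hand.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj),
)
opener.addheaders = [("User-agent", "my-crawler/0.1")]

# The opener is itself an OpenerDirector -- the very interface the article
# says mechanize.Browser implements -- so this check needs no network access.
print(isinstance(opener, urllib.request.OpenerDirector))  # True

# opener.open("http://www.example.com/") would now send and store cookies
# automatically, much like br.open() does -- but with none of mechanize's
# form filling, link following, or history.
```

Everything mechanize adds (forms, links, history, robots.txt) sits on top of an OpenerDirector like this one.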