Getting Started with Crawlers: mechanize

Source: Internet
Author: User


mechanize

 

This article is just a set of study notes. Discussion and corrections are welcome.

The following are basic operations:

 

    import mechanize

    # Create a browser object
    br = mechanize.Browser()

    # Some basic settings follow.
    # Whether to process HTML http-equiv headers. When a browser receives a file
    # from the server, it first reads the name/value pairs that describe the file.
    # For example, <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
    # declares the encoding of the HTML.
    br.set_handle_equiv(True)
    # Whether to send a Referer header with each request. Referer is part of the
    # HTTP request header; its value is the URL from which the current page was reached.
    br.set_handle_referer(True)
    # Whether to obey the robots protocol (robots.txt)
    br.set_handle_robots(False)
    # Whether to follow redirects. A redirect sends a request on to another location.
    # It happens when a site is reorganized, a page moves to a new address,
    # or a page's file extension changes.
    br.set_handle_redirect(True)
    # Whether to handle gzip transfer encoding. gzip is a data compression format.
    # gzip usually compresses by a factor of 3 to 10: a 90 KB page is transmitted
    # as only about 28 to 30 KB, which saves a lot of server bandwidth. If the
    # application itself responds quickly, the bottleneck becomes network transfer
    # speed, so compressing the content makes pages load noticeably faster.
    br.set_handle_gzip(True)

    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # Debugging options
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(True)
    # Set the HTTP headers so the site sees a Mozilla browser rather than a crawler
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    # Open the URL
    br.open("http://www.baidu.com")
    # The opened page contains forms. You can list them and then operate on them
    # to carry out more complex interactions.
    for form in br.forms():
        print(form)
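The gzip comment in the settings above makes a claim about compression ratios. A minimal sketch of how to measure this yourself with Python's standard `gzip` module follows. The sample page is invented for illustration and is far more repetitive than a typical real page, so the ratio it prints should not be read as representative.

```python
import gzip

# A made-up, highly repetitive HTML page standing in for a real response body.
rows = "".join(
    "<tr><td>item %d</td><td>description of item %d</td></tr>\n" % (i, i)
    for i in range(500)
)
page = ("<html><body><table>\n" + rows + "</table></body></html>").encode("utf-8")

# Compress the body the way a server would before sending it.
compressed = gzip.compress(page)
ratio = len(page) / len(compressed)

print("original:   %d bytes" % len(page))
print("compressed: %d bytes" % len(compressed))
print("ratio:      %.1fx" % ratio)
```

To check a real site, the same measurement can be applied to the bytes of any downloaded page.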

 

 

We can see that there is only one form, named f. Sometimes a form has no name, and then it can only be selected by its position on the page: the first form is nr = 0, the second is nr = 1, and so on.

    # Select the form named f
    br.select_form(name='f')
    # Put the search term into the form's wd field
    br.form['wd'] = 'python'
    # Submit; this is equivalent to clicking the search button on the Baidu page
    br.submit()
    # Check whether the page that opened is the expected one
    print(br.title())
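The nr numbering is simply the position of each form in the page. The sketch below is not mechanize itself: it uses the standard library's `html.parser` on an invented two-form page (`SAMPLE_HTML` and the `FormLister` class are hypothetical) to show how a named form and an unnamed form line up with nr = 0 and nr = 1.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect each <form> in document order with its name and field names."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form, in page order
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "nr": len(self.forms),
                "name": attrs.get("name"),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Only named controls can be filled in by field name.
            if attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Invented page: one named search form and one unnamed login form.
SAMPLE_HTML = """
<form name="f" action="/s"><input name="wd"><input type="submit" value="Search"></form>
<form action="/login"><input name="username"><input name="password" type="password"></form>
"""

lister = FormLister()
lister.feed(SAMPLE_HTML)
for form in lister.forms:
    print(form)
```

On a page like this, the first form could be selected in mechanize with br.select_form(name='f'), while the second has no name and would need br.select_form(nr=1).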

 

The running result is as follows:

 
    # You can also print the HTML text
    print(br.response().read())
    # The output is too long, so it is omitted here
    # Print every link on this page
    for link in br.links():
        print("text: %s, url: %s" % (link.text, link.url))
    # That is, the text and URL of each link
    # You can also pick one of the links and open it
    newUrl = br.click_link(text='What magical and interesting things can be done using the Python programming language? -Zhihu')
    br.open(newUrl)
    # The operations above let you interact with pages to get the content you want
    # You can also go back to the previous page
    br.back()
    # Check the current URL to confirm that we went back
    print(br.geturl())
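The URLs printed by the loop over br.links() are taken from the page as written, so some of them may be relative. A small sketch, using only the standard library, of how relative links resolve against the page they came from (the base URL and hrefs here are invented; mechanize link objects are also believed to expose an absolute_url attribute, but check your installed version):

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were found on.
base_url = "https://www.example.com/search/results?q=python"

# Hypothetical href values as they might appear in the page source.
hrefs = ["/question/1", "detail.html", "https://other.example.org/a", "../about"]

resolved = [urljoin(base_url, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print("%-30s -> %s" % (href, absolute))
```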
 

 

    # Log in to a site with a user name and password
    br.add_password('http://the-site-you-want-to-log-in-to.com', 'username', 'password')
    # And then
    br.open('http://the-site-you-want-to-log-in-to.com')
    # You can also open the site, print its forms, and log in by filling in
    # the form fields, e.g. br.form['username'] = ''


    # Obtain cookies and use them to access a site. You must log in to the
    # site first in order to obtain its cookies.
    # The following example uses zhihu.
    import cookielib, mechanize  # cookielib is the Python 2 name; on Python 3 it is http.cookiejar
    br = mechanize.Browser()
    c = cookielib.LWPCookieJar()
    # Attach the cookie jar before making requests so that cookies are stored in it
    br.set_cookiejar(c)
    br.open('https://www.zhihu.com/question/41532365/answer/246810982')
    # Later requests send the stored cookies back automatically
    br.open('https://www.zhihu.com/question/41532365/answer/246810982')

    # Set a proxy
    br.set_proxies({'https': 'xxx.xxx.xxx.xx:xxx'})
    br.add_proxy_password('username', 'password')
    # Or
    br.set_proxies({'http': 'username:password@xxx.xxx.xxx.xx:xxx'})
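Cookies are only useful for logging in again later if they survive between runs. The sketch below, written for Python 3 where the module is `http.cookiejar`, shows one way an LWPCookieJar can be saved to disk and loaded back. The cookie is built by hand as a stand-in for one set by a real login; the name `session_id`, its value, and the domain are all invented.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (stands in for one set by a real login)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_and_reload(cookie_path):
    """Save a jar to disk, then load it into a fresh jar and return its cookies."""
    jar = LWPCookieJar(cookie_path)
    jar.set_cookie(make_cookie("session_id", "abc123", "example.com"))
    # Session cookies are skipped on save unless ignore_discard is set.
    jar.save(ignore_discard=True)

    restored = LWPCookieJar(cookie_path)
    restored.load(ignore_discard=True)
    return {cookie.name: cookie.value for cookie in restored}


cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.txt")
restored_cookies = save_and_reload(cookie_file)
print(restored_cookies)
```

A jar loaded this way could then be handed to br.set_cookiejar() before opening the site.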

 

When Browser() is initialized, if you do not pass it a history object as a parameter, it sets one up in the default way, which keeps the history of every operation. Memory use then grows with each request, and the crawler gets slower and slower.

Solution: define a custom NoHistory object and pass it in:

    class NoHistory(object):
        def add(self, *a, **k): pass
        def clear(self): pass

    b = mechanize.Browser(history=NoHistory())
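A slightly more defensive variant of the NoHistory class is sketched below. The extra close() method is an assumption: it is included in case the browser calls close() on its history object when the browser itself is closed, which is worth checking against your installed mechanize version. The loop only exercises the class on its own, without mechanize, to show that nothing accumulates.

```python
class NoHistory(object):
    """History stand-in that remembers nothing."""

    def add(self, *args, **kwargs):
        # Discard every request/response pair instead of storing it.
        pass

    def clear(self):
        pass

    def close(self):
        # Assumption: present in case the browser closes its history object.
        pass


history = NoHistory()
for i in range(1000):
    history.add("request %d" % i, "response %d" % i)
history.clear()
history.close()

# Nothing was stored on the object, however many pages were "visited".
print(vars(history))

# With mechanize installed, it would be passed in as shown in the text:
# br = mechanize.Browser(history=NoHistory())
```

Since this object keeps no pages, br.back() should not be expected to work on a browser created with it.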

 
