Writing a crawler in Python with the mechanize module: key points

mechanize replaces some of urllib2's functionality: it simulates browser behavior more faithfully and gives complete control over web access. Combined with the BeautifulSoup and re modules, it parses web pages effectively. I prefer this approach.
Below I summarize how to simulate a browser with mechanize, with several examples (a Google search, a Baidu search, and logging in to Renren).
1. Initialize and create a browser object
Install mechanize with easy_install if it is not already present. The following code creates a browser object and applies some initialization settings; adjust them as needed. In fact, the default settings are enough for basic tasks.

#!/usr/bin/env python
import sys, mechanize

# Browser
br = mechanize.Browser()

# options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# debugging?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]


2. Simulate browser behavior
Once created and initialized, the browser object is ready to use. The following examples continue from the code above.
Obtain a web page:
Print each item separately to inspect the details one by one.

r = br.open(sys.argv[1])
html = r.read()
print html
print br.response().read()
print br.title()
print r.info()
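For a sense of what br.title() does under the hood, here is a stdlib-only sketch that pulls the <title> text out of a page with html.parser (Python 3 names are used here; the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser  # stdlib; the "HTMLParser" module in Python 2

class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data
            self.in_title = False

sample = "<html><head><title>Example Page</title></head><body></body></html>"
p = TitleParser()
p.feed(sample)
print(p.title)  # Example Page
```

mechanize does considerably more (encoding detection, entity handling), but the idea is the same: parse the response body and keep the title text.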


Simulate Google and Baidu queries
Print the forms to find the right one, select it, fill in the corresponding key/value, and submit (a POST) to complete the query.

for f in br.forms():
    print f
br.select_form(nr=0)

Query Google for "football":

br.form['q'] = 'football'
br.submit()
print br.response().read()


Query Baidu for "football":

br.form['wd'] = 'football'
br.submit()
print br.response().read()


The field names ('q' for Google, 'wd' for Baidu) can be found by printing the forms, as shown above.

Going back:
A simple check is to print the URL to verify that the browser actually went back.

# Back
br.back()
print br.geturl()

3. Basic HTTP authentication

br.add_password('http://xxx.com', 'username', 'password')
br.open('http://xxx.com')
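mechanize's add_password builds on the same password-manager idea as the standard library. As a stdlib-only sketch of the mechanism (Python 3 module names; xxx.com is the placeholder from above), credentials are registered against a URI and handed out when a server answers 401:

```python
import urllib.request

# Password manager with a default realm, as used for basic auth
mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'http://xxx.com', 'username', 'password')

# The handler consults the manager when the server sends a 401
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(mgr))

# Credentials are looked up by URI
print(mgr.find_user_password(None, 'http://xxx.com'))  # ('username', 'password')
```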

4. Form authentication
Take logging in to Renren as an example. Print the forms to find the username and password field names.

br.select_form(nr=0)
br['email'] = username
br['password'] = password
resp = br.submit()

5. Cookie support
By importing the cookielib module and attaching a cookie jar to the browser, actions that have already been authenticated do not need to be authenticated again: save the session cookie and reuse it on later visits. The cookie jar handles this.

#!/usr/bin/env python
import mechanize, cookielib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
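The point of LWPCookieJar is that it can be saved to disk and reloaded, so a later run can reuse the session without logging in again. A stdlib-only sketch of the save/reload round trip (the module is http.cookiejar in Python 3, cookielib in Python 2):

```python
import os
import tempfile
import http.cookiejar  # "cookielib" in Python 2

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# First run: save the jar (still empty here); ignore_discard also
# keeps session cookies that would otherwise be thrown away
cj = http.cookiejar.LWPCookieJar(path)
cj.save(ignore_discard=True)

# Later run: reload the same file before opening any pages
cj2 = http.cookiejar.LWPCookieJar()
cj2.load(path, ignore_discard=True)
print(len(cj2))  # 0 cookies here, but the round trip works
```

In a real crawler you would call cj.save() after logging in and cj.load() at startup, then hand the jar to br.set_cookiejar(cj).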

6. Proxy settings
Set an HTTP proxy:

# Proxy
br.set_proxies({"http": "proxy.com:8888"})
br.add_proxy_password("username", "password")

# Proxy with user/password in the URL
br.set_proxies({"http": "username:password@proxy.com:8888"})
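For comparison, the standard library expresses the same proxy configuration with a ProxyHandler (Python 3 names; proxy.com:8888 is the placeholder address from above). Merely building the opener does not touch the network:

```python
import urllib.request

# Route plain-http traffic through a proxy, credentials in the URL
handler = urllib.request.ProxyHandler(
    {"http": "http://username:password@proxy.com:8888"})
opener = urllib.request.build_opener(handler)

# The handler keeps the mapping it was given
print(handler.proxies["http"])
```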

7. High memory usage

I wrote a crawler script using mechanize to crawl about 300,000 images from a website.
 
The overall process:
1. Get the target page address.
2. Collect the URLs of all images on the first few pages of the target address.
3. Download the images from these URLs and save the index data to a MySQL database.
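Step 2 above, collecting image URLs from a fetched page, can be sketched with the re module mentioned earlier (a crude pattern; the sample HTML is made up, and a real crawler should prefer a proper parser such as BeautifulSoup):

```python
import re

# Crude pattern for the src="..." attribute of <img> tags
IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)

sample = ('<div><img src="http://example.com/a.jpg">'
          '<img alt="x" src="http://example.com/b.png"></div>')
urls = IMG_SRC.findall(sample)
print(urls)  # ['http://example.com/a.jpg', 'http://example.com/b.png']
```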


The script downloads roughly one image per second (the network, at only about 200 KB/s, is the bottleneck).
After about 15,000 images, downloads became slower and slower.
Checking with ps aux showed the process sleeping, which was strange.
Running free showed very little memory left (the machine has 4 GB in total).
After some searching online, I found that mechanize keeps the simulated operation history by default, so memory usage keeps growing:
http://stackoverflow.com/questions/2393299/how-do-i-disable-history-in-python-mechanize-module
 
For convenience, here is a translation:
When Browser() is initialized, if you do not pass it a history object as a parameter, it creates one in the default way (which saves the operation history). You can pass in any object that satisfies the history interface, such as a custom NoHistory object:
 

class NoHistory(object):
    def add(self, *a, **k): pass
    def clear(self): pass

b = mechanize.Browser(history=NoHistory())
