Key points of writing a web crawler in Python with the mechanize module
mechanize replaces part of urllib2's functionality: it simulates browser behavior more faithfully and gives complete control over web access. Combined with the BeautifulSoup and re modules, it makes parsing web pages straightforward. It is the approach I prefer.
What follows summarizes how to simulate a browser with mechanize, along with several examples (a Google search, a Baidu search, and logging in to Renren).
1. Initialize and create a browser object
If mechanize is not installed yet, install it with easy_install. The code below creates a browser object and applies some initial settings; adjust them as needed. In fact, the default settings are enough for basic tasks.
#!/usr/bin/env python
import sys, mechanize

# Browser
br = mechanize.Browser()

# options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# debugging?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
2. Simulate browser behavior
Once created and initialized, the browser object is ready to use. The following examples continue from the code above.
Obtain web pages:
Print each item separately to inspect the details one by one.
r = br.open(sys.argv[1])
html = r.read()
print html
print br.response().read()
print br.title()
print r.info()
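As mentioned earlier, the fetched HTML can then be parsed with the re module. A minimal sketch (the HTML snippet and the pattern are illustrative, not part of the original script):

```python
import re

# illustrative HTML; in practice this would be br.open(url).read()
html = '<a href="/page1.html">one</a><a href="/page2.html">two</a>'

# pull out every link target on the page
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/page1.html', '/page2.html']
```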
Simulate Google and Baidu queries
Print the forms to inspect them, select one, fill in the corresponding key/value, and submit it as a POST to complete the query.
for f in br.forms():
    print f
br.select_form(nr=0)
Search Google for "football":
br.form['q'] = 'football'
br.submit()
print br.response().read()
Search Baidu for "football":
br.form['wd'] = 'football'
br.submit()
print br.response().read()
The field names ('q' for Google, 'wd' for Baidu) can be found by printing the forms.
Going back

A simple operation: print the URL to verify that the browser really went back.
# Back
br.back()
print br.geturl()
3. Basic HTTP authentication
br.add_password('http://xxx.com', 'username', 'password')
br.open('http://xxx.com')
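For reference, basic auth is nothing more than a base64-encoded `username:password` pair sent in the Authorization header after the server's 401 challenge. The header can be reproduced with the standard library (the credentials are the placeholders from the snippet above):

```python
import base64

# the Authorization header a client sends for HTTP basic auth
credentials = 'username:password'
token = base64.b64encode(credentials.encode('ascii')).decode('ascii')
auth_header = 'Basic ' + token
print(auth_header)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```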
4. Form authentication
Take logging in to Renren as an example. Print the forms to find the field names used for the username and password.
br.select_form(nr=0)
br['email'] = username
br['password'] = password
resp = br.submit()
5. Cookie support
By importing the cookielib module and attaching a cookie jar to the browser, actions that required authentication once do not need to authenticate again: the session cookie is saved and reused on later visits. The CookieJar takes care of this.
#!/usr/bin/env python
import mechanize, cookielib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
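Building on the snippet above, the cookies can also be persisted to disk so that a later run of the script reuses the login session. A minimal sketch using only the standard library (the file path is illustrative; the try/except covers the Python 2/3 rename of cookielib):

```python
try:
    import cookielib                    # Python 2 module name
except ImportError:
    import http.cookiejar as cookielib  # Python 3 equivalent

import os, tempfile

# store cookies in a file so later runs can reuse the session
path = os.path.join(tempfile.gettempdir(), 'cookies.txt')
cj = cookielib.LWPCookieJar(path)

# ... attach to the browser and log in here: br.set_cookiejar(cj) ...

# persist to disk, including session cookies that would otherwise be discarded
cj.save(ignore_discard=True, ignore_expires=True)

# on the next run, load them back before opening any page
cj2 = cookielib.LWPCookieJar()
cj2.load(path, ignore_discard=True, ignore_expires=True)
```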
6. Proxy settings
Set an HTTP proxy:
# Proxy
br.set_proxies({"http": "proxy.com:8888"})
br.add_proxy_password("username", "password")

# Proxy with user/password
br.set_proxies({"http": "username:password@proxy.com:8888"})
7. High memory usage
I wrote a crawler script with mechanize to crawl about 300,000 images from a website.
The entire process is:
1. Get the target page address
2. Obtain the URLs of all images on the first few pages of the target address.
3. Download the images at these URLs and save the index data to a MySQL database.
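The three steps above can be sketched as a small pipeline. Here, `fetch` and `save_index` are stand-ins for the real mechanize download and MySQL insert, and the img-tag regex is illustrative:

```python
import re

def extract_image_urls(html):
    # step 2: collect the src of every img tag on a page
    return re.findall(r'<img[^>]+src="([^"]+)"', html)

def crawl(page_urls, fetch, save_index):
    # steps 1-3: fetch each target page, then record every image URL found
    for page_url in page_urls:
        html = fetch(page_url)
        for img_url in extract_image_urls(html):
            save_index(page_url, img_url)

# usage with stand-ins, so the structure can be run anywhere
index = []
fake_fetch = lambda url: '<img src="%s/1.jpg"><img src="%s/2.jpg">' % (url, url)
crawl(['http://example.com/p1'], fake_fetch, lambda page, img: index.append(img))
print(index)  # ['http://example.com/p1/1.jpg', 'http://example.com/p1/2.jpg']
```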
The script downloaded roughly one image per second (the network, at only about 200 KB/s, was the bottleneck).
After about 15,000 images had been downloaded, downloading became slower and slower.
Checking with ps aux showed the process was sleeping, which was strange.
Running free showed that only a few MB of free memory were left (the system has 4 GB in total).
After some searching online, I found the cause: mechanize keeps a history of every simulated operation by default, so memory usage keeps growing:
http://stackoverflow.com/questions/2393299/how-do-i-disable-history-in-python-mechanize-module
For convenience, a translation follows:
When Browser() is initialized, if you do not pass it a history object as an argument, it creates one with the default behavior (which saves the operation history). You can pass it any object that implements the history interface instead, for example a custom NoHistory object:
class NoHistory(object):
    def add(self, *a, **k):
        pass
    def clear(self):
        pass

b = mechanize.Browser(history=NoHistory())
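To see why this works without even installing mechanize, here is a toy model of the interface. The History class and the browse loop are purely illustrative: the default history retains every response, so memory grows with each page opened, while NoHistory discards everything.

```python
class History(object):
    """Toy stand-in for the default history: it keeps every entry."""
    def __init__(self):
        self.items = []
    def add(self, request, response):
        self.items.append((request, response))
    def clear(self):
        self.items = []

class NoHistory(object):
    """Same interface, but every entry is silently discarded."""
    def add(self, *a, **k):
        pass
    def clear(self):
        pass

def browse(history, n_pages):
    # simulate opening n_pages pages, each recorded as Browser.open() would
    for i in range(n_pages):
        history.add('request %d' % i, 'response %d' % i)
    return history

kept = browse(History(), 1000)
print(len(kept.items))  # 1000 entries retained -> memory grows with the crawl

discarded = browse(NoHistory(), 1000)
# NoHistory stores nothing, so memory use stays flat
```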