Writing Python Crawlers with urllib2


This article sorts out the details of using urllib2 to write Python crawlers.

1. Proxy Settings


By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy.

If you want to control the proxy explicitly in your program, without being affected by environment variables, you can use a ProxyHandler.

Create test14 to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

Note that urllib2.install_opener() sets the global opener of urllib2.

This makes subsequent use very convenient, but it does not allow fine-grained control, for example when you want to use two different proxy settings in the same program.

A better way is to call the opener's open method directly, instead of the global urlopen method.
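For example, a minimal sketch of this approach, assuming two made-up proxy addresses, keeps two openers side by side and calls their open methods directly:

import urllib2

# Two hypothetical proxies; replace with real addresses.
proxy_a_handler = urllib2.ProxyHandler({'http': 'http://proxy-a.example.com:8080'})
proxy_b_handler = urllib2.ProxyHandler({'http': 'http://proxy-b.example.com:8080'})

opener_a = urllib2.build_opener(proxy_a_handler)
opener_b = urllib2.build_opener(proxy_b_handler)

# Each request goes through its own proxy; the global urlopen is left untouched.
response_a = opener_a.open('http://www.baidu.com/')
response_b = opener_b.open('http://www.baidu.com/')
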
2. Timeout settings


In older versions of Python (before Python 2.6), the urllib2 API does not expose a timeout setting. To set a timeout, you can only change the global timeout value of the socket module.

import urllib2
import socket

socket.setdefaulttimeout(10)  # timeout after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way

After Python 2.6, timeout can be directly set through the timeout parameter of urllib2.urlopen.

import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)

3. Add a specific Header to the HTTP Request
To add a header, you must use the Request object:

import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()

Pay special attention to the following headers, which the server may check:

User-Agent: some servers or proxies use this value to determine whether the request was sent by a browser.

Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content in the HTTP body. Common values include:

application/xml: used in XML RPC and RESTful/SOAP calls

application/json: used in JSON RPC calls

application/x-www-form-urlencoded: used when a browser submits a web form

When calling a RESTful or SOAP service, an incorrect Content-Type setting may cause the server to refuse the request.
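As an illustration of the point above, here is a minimal sketch of a JSON POST with an explicit Content-Type; the endpoint and payload are made up:

import json
import urllib2

# Hypothetical REST endpoint and payload, for illustration only.
data = json.dumps({'name': 'test', 'value': 1})
request = urllib2.Request('http://api.example.com/items', data=data)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()
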
4. Redirect


By default, urllib2 automatically follows redirects for HTTP 3xx return codes, without manual configuration. To check whether a redirect has occurred, simply compare the response URL with the request URL.

import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect happened
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect happened
print redirected

If you do not want redirects to be followed automatically, besides falling back to the lower-level httplib library, you can also customize the HTTPRedirectHandler class.

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')

5. Cookie


urllib2 handles cookies automatically when an HTTPCookieProcessor is installed. To obtain the value of a cookie, you can do this:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

After running this, the cookie values set when accessing Baidu are printed.
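If you want to keep those cookies between runs, cookielib also offers MozillaCookieJar, which can save them to a file; a minimal sketch (the file name is arbitrary):

import cookielib
import urllib2

# Save the received cookies to a Mozilla-format file so they can be reloaded later.
cookie = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# In a later run:
# cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)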


6. Use the PUT and DELETE methods of HTTP


urllib2 only supports the HTTP GET and POST methods. To use HTTP PUT and DELETE you would normally have to drop down to the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request in the following way:

import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)

7. Get the HTTP return code


For 200 OK, you only need to call the getcode() method of the response object returned by urlopen to obtain the HTTP return code. For other return codes, however, urlopen raises an exception, and you then need to check the code attribute of the exception object:

import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
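Keep in mind that connection-level failures raise urllib2.URLError rather than HTTPError, so a slightly more defensive sketch of the same check handles both; HTTPError must be caught first because it is a subclass of URLError:

import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
    print response.getcode()
except urllib2.HTTPError, e:
    # The server responded, but with an error status code.
    print e.code
except urllib2.URLError, e:
    # No valid HTTP response at all (DNS failure, connection refused, ...).
    print e.reason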

8. Debug Log


When using urllib2, you can enable the debug log as follows, so that the content of the packets being sent and received is printed to the screen, which is convenient for debugging and can sometimes save you the trouble of capturing packets.

import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')

In this way, we can see the content of the packets being transmitted.


9. Form Processing


What if you need to fill in a form to log in?

First, use a tool to capture the content of the form you need to fill in.

For example, I usually use Firefox with the HttpFox plug-in to see what packets I actually sent.

Take VeryCD as an example. First, find the POST request you send and its POST form fields.

You can see that for VeryCD you need to send the username, password, continueURI, fk, and login_submit fields. fk looks randomly generated (in fact it is not random; it appears to be produced by a simple encoding of the epoch time), and it has to be taken from the webpage. That is to say, you must first fetch the page and use regular expressions or similar tools to extract the fk value from the returned data. continueURI, as the name suggests, can be anything. login_submit is fixed, which can be seen from the page source. Then there are username and password, which are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'wangxiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
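As noted above, fk has to be scraped from the login page first. A rough sketch of that step follows; the regular expression is only a guess at the page's markup (a hidden input named fk) and would need to be checked against the real page source:

import re
import urllib2

# Fetch the login page and try to pull out the hidden 'fk' field.
# The pattern below is an assumption about the markup, not the real thing.
html = urllib2.urlopen('http://secure.verycd.com/signin').read()
match = re.search(r"name=['\"]fk['\"][^>]*value=['\"]([^'\"]*)['\"]", html)
fk = match.group(1) if match else ''
print fk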

10. Disguised as browser access


Some websites dislike crawler visits, so they reject requests from crawlers.

In this case, we need to pretend to be a browser, which can be done by modifying the headers of the HTTP request.
#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
#...

11. Deal with "anti-leeching"


Some sites have so-called anti-leeching settings. Put simply, the server checks whether the Referer header of the request you send points to the site itself, so all we need to do is set the Referer in headers to that website. Take cnbeta as an example:

#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...

headers is a dict, so you can put in any header you like for a bit of disguise.

For example, some websites like to read X-Forwarded-For from the headers to find out the client's real IP address, and you can simply change X-Forwarded-For to something else.
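Putting the pieces together, here is a sketch of a headers dict that fakes User-Agent, Referer, and X-Forwarded-For at the same time; the IP address is made up purely for illustration:

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',  # made-up address, for illustration only
}
req = urllib2.Request('http://www.cnbeta.com/articles', headers=headers)
response = urllib2.urlopen(req)
print response.getcode()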

Reference Source:
Writing Python Crawlers from Scratch: A Guide to urllib2
http://www.lai18.com/content/384669.html

Additional reading

Collected technical articles from the "Writing Python Crawlers from Scratch" series:

1. The definition of a Python crawler and the composition of a URL
2. Using urllib2 to fetch webpage content
3. A guide to writing Python crawlers with urllib2
4. Two important concepts in urllib2: Openers and Handlers
5. HTTP exception handling
6. Crawling Baidu Tieba: code sharing
7. Regular expressions for crawlers
8. A complete record of writing a crawler
9. Installing and configuring the Scrapy crawler framework
10. Packaging a crawler into an exe file
11. Crawling Baidu Tieba and saving posts to a local txt file (ultimate edition)
12. Scraping Qiushibaike (joke encyclopedia): code sharing
13. Writing crawlers with the Scrapy framework

