Using urllib2 to Write Python Crawlers
This post collects the usage details of urllib2.
1. Proxy Settings
By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy.
If you want to control the proxy explicitly in your program, without being affected by environment variables, you can use a ProxyHandler.
Create test14 to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
Note that urllib2.install_opener() sets urllib2's global opener.
Later calls become convenient this way, but the control is not fine-grained, for example when you want to use two different proxy settings within the same program.
A better practice is not to install a global opener and instead call the opener's open method directly in place of the global urlopen method.
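For example, a minimal sketch with two independent openers (the proxy addresses are made-up placeholders):

import urllib2

# two independent openers with different proxy settings (placeholder addresses)
opener_a = urllib2.build_opener(urllib2.ProxyHandler({"http": 'http://proxy-a.example.com:8080'}))
opener_b = urllib2.build_opener(urllib2.ProxyHandler({"http": 'http://proxy-b.example.com:8080'}))

# call each opener's open() directly; no global opener is installed
response_a = opener_a.open('http://www.baidu.com')
response_b = opener_b.open('http://www.baidu.com')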
2. Timeout settings
In old versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting. To set a timeout, you could only change the global timeout of the socket module.
import urllib2
import socket

socket.setdefaulttimeout(10)          # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do the same thing
Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen.
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
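If the timeout expires, urlopen raises an exception instead of returning, so it is worth catching it; a minimal sketch, with the same example URL:

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.google.com', timeout=10)
    print response.read()
except urllib2.URLError, e:
    # a connection that times out usually surfaces as a URLError wrapping socket.timeout
    print 'request failed:', e.reason
except socket.timeout:
    # a timeout while reading the response body may raise socket.timeout directly
    print 'request timed out'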
3. Add a specific Header to the HTTP Request
To add a header, you must use the Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to some of the headers, because the server checks them:
User-Agent: some servers or proxies use this value to determine whether the request was sent by a browser.
Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content in the HTTP body. Common values include:
application/xml: used for XML RPC, such as RESTful/SOAP calls
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when the browser submits a web form
When calling a RESTful or SOAP service provided by a server, a wrong Content-Type setting may cause the server to refuse the request.
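As an illustration, here is a minimal sketch of a JSON POST with an explicit Content-Type; the endpoint URL and payload are made up for the example:

import json
import urllib2

# hypothetical endpoint and payload, for illustration only
payload = json.dumps({'name': 'test', 'value': 1})
request = urllib2.Request('http://example.com/api/items', data=payload)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()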
4. Redirect
By default, urllib2 automatically follows redirects for HTTP 3XX status codes, without any manual configuration. To check whether a redirect has occurred, you only need to check whether the response URL and the request URL are the same.
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if the URL changed, i.e. a redirect happened
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if the URL changed, i.e. a redirect happened
print redirected
If you do not want redirects to be followed automatically, then besides using the lower-level httplib library, you can also customize an HTTPRedirectHandler class.
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookie
urllib2 can also handle cookies for you automatically, through an HTTPCookieProcessor. To obtain the value of a cookie, you can do the following:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running this, the cookie values set when visiting Baidu are printed.
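Because the CookieJar is attached to the opener, later requests made through the same opener send the stored cookies back automatically; a minimal sketch, again using Baidu as the example:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

opener.open('http://www.baidu.com')             # the first request stores the cookies set by the server
response = opener.open('http://www.baidu.com')  # the second request sends them back automatically
print response.getcode()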
6. Use the PUT and DELETE methods of HTTP
urllib2 only supports the HTTP GET and POST methods. To use HTTP PUT or DELETE, you would normally have to fall back to the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request in the following way:
import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
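A small usage sketch of the same trick (the resource URL and request body are hypothetical placeholders):

import urllib2

# hypothetical resource URL and request body, for illustration only
uri = 'http://example.com/api/items/1'
data = 'name=test&value=1'

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # override the method urllib2 would otherwise pick
response = urllib2.urlopen(request)
print response.getcode()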
7. Get the HTTP return code
For 200 OK, you only need to call the getcode() method of the response object returned by urlopen to obtain the HTTP status code. For other status codes, however, urlopen raises an exception, and you need to check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
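Combining both cases in one minimal sketch, with the same example URL:

import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
    print response.getcode()   # 200 for a successful request
except urllib2.HTTPError, e:
    print e.code               # the status code of the error response, e.g. 404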
8. Debug Log
When using urllib2, you can turn on the debug log with the following method. The content of the packets sent and received is then printed to the screen, which is convenient for debugging and sometimes saves you the work of capturing packets.
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
In this way, we can see the content of the packets being transmitted.
9. Form Processing
What if you need to fill in a form in order to log in?
First, use a tool to capture the content of the form you need to fill in.
For example, I usually use Firefox with the HttpFox plug-in to see which packets I have actually sent.
Take verycd as an example: first find the POST request you send and the POST form fields.
You can see that for verycd you need to submit username, password, continueURI, fk, and login_submit, where fk is generated randomly (actually it is not that random; it looks as if it is produced from the epoch time by a simple encoding). This means you must obtain fk from the web page first: fetch the page and use a regular expression or a similar tool to extract the fk field from the returned data (a sketch of this step follows the login code below). continueURI, as the name suggests, can be anything, while login_submit is fixed, as can be seen from the page source. Then there are username and password, which are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'wangxiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
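A minimal sketch of the "fetch the page first, then extract fk" step described above; the regular expression is only a hypothetical pattern and must be adjusted to whatever the real page source actually contains:

import re
import urllib2

# fetch the login page first
page = urllib2.urlopen('http://secure.verycd.com/signin').read()

# hypothetical pattern, for illustration only; inspect the real page source to find the right one
match = re.search(r"fk['\"]?\s*[:=]\s*['\"]([^'\"]+)['\"]", page)
fk = match.group(1) if match else ''
print fk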
10. Disguising as a Browser
Some websites dislike being visited by crawlers and reject all requests from them.
In this case, we need to pretend to be a browser, which can be done by modifying the headers in the HTTP packet:
#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
#...
11. Dealing with "Anti-Leeching"
Some sites have so-called anti-leeching settings. Put simply, the server checks whether the Referer in the headers of the request you sent points to the site itself, so we only need to set the Referer in the headers to that website. Take cnbeta as an example:
#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...
headers is a dict data structure, and you can put in any header you want in order to disguise the request.
For example, some websites like to read X-Forwarded-For from the headers to find out the client's real IP address, so you can change X-Forwarded-For directly.
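For example (a minimal sketch; the IP address and URL are placeholders):

import urllib2

# placeholder header values, for illustration only
headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',
}
req = urllib2.Request('http://www.cnbeta.com/articles', headers=headers)
response = urllib2.urlopen(req)
print response.getcode()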
Reference source:
Writing Python Crawlers from Scratch: A Guide to Using urllib2
http://www.lai18.com/content/384669.html