Advanced usage of the Python crawler urllib library
1. Set Headers
Some websites do not allow programs to access them directly in the way shown above; if the request is identified as coming from a script rather than a browser, the site will not respond at all. Therefore, to fully simulate the way a browser works, we need to set some header attributes.
First, open your browser and press F12 to bring up the developer tools; I use Chrome's network panel. For example, if you log in to a site while recording, you will see the page change after login and a whole series of new entries appear. A page actually contains a lot of content that is not loaded all at once; in essence, many requests are executed. Generally, the HTML file is requested first, and then the JS and CSS files are loaded. Only after many requests are the skeleton and muscles of the web page complete, and the effect of the entire page is displayed.
Looking at these requests one by one, consider just the first one. You can see a Request URL and the request headers, followed by the response; try inspecting them in your own browser. The headers contain a lot of information, such as the file encoding, the accepted compression, and the user agent.
The User-Agent identifies who is making the request. If it is not set, the server may not respond, so you can set it in the headers. The following example only shows how to set headers; let's look at the format.
import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username': 'cqc', 'password': 'XXXX'}
headers = {'User-Agent': user_agent}           # identify ourselves as a browser
data = urllib.urlencode(values)                # encode the form fields for POST
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
page = response.read()
In this way, we build a headers dict and pass it in when the request is constructed, so the headers are sent along with the request. If the server believes the request was sent by a browser, it will return a response.
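As a quick check, you can print the status code and response headers to confirm the server actually answered. A minimal sketch, reusing the placeholder login URL and User-Agent string from above:

import urllib2

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request('http://www.server.com/login', headers=headers)
response = urllib2.urlopen(request)
print response.getcode()   # HTTP status code, e.g. 200
print response.info()      # the headers the server sent back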
In addition, there is "anti-leeching": the server checks whether the Referer in the headers points to itself, and if not, some servers will not respond. So we can also add a Referer to the headers.
For example, we can build the following headers:
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}
In the same way as above, the headers are passed in as a parameter when the Request is built, and this handles the anti-leeching check; a short sketch follows.
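Putting it together, a minimal sketch using the example zhihu URL from above:

import urllib2

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}
request = urllib2.Request('http://www.zhihu.com/articles', headers=headers)
response = urllib2.urlopen(request)
page = response.read()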
In addition, pay special attention to the following header attributes:
User-Agent: some servers or proxies use this value to determine whether the request was sent by a real browser.
Content-Type: when using a REST interface, the server checks this value to decide how to parse the content in the HTTP body. Common values are:
application/xml: used for XML-RPC calls, such as RESTful/SOAP services
application/json: used for JSON-RPC calls
application/x-www-form-urlencoded: used when the browser submits a web form
When using the RESTful or SOAP services provided by a server, a wrong Content-Type setting may cause the server to refuse the request.
For other headers, you can inspect what the browser sends and write the same values when building the request.
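For instance, here is a minimal sketch of a JSON POST with an explicit Content-Type; the endpoint URL is a made-up placeholder:

import json
import urllib2

url = 'http://www.server.com/api'   # hypothetical endpoint
payload = json.dumps({'username': 'cqc'})
headers = {'Content-Type': 'application/json'}
request = urllib2.Request(url, payload, headers)
response = urllib2.urlopen(request)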
2. Proxy Settings
By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy. Some websites detect the number of visits from a given IP address within a certain period, and if there are too many, they will block your access. So you can set up some proxy servers to help you do the work: switch to a different proxy every so often, and the website has no idea who is playing tricks. This is cool!
The following code shows how to set a proxy:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
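Note that install_opener installs the proxy globally for every subsequent urllib2.urlopen call. If you only want the proxy for a single request, you can call the opener directly instead of installing it. A minimal sketch, reusing the proxy address from above:

import urllib2

proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
opener = urllib2.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')   # only this request goes through the proxy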
3. Timeout Settings
The urlopen method was already mentioned in the previous section: its third parameter is the timeout, which sets how long to wait before giving up, to deal with websites that respond too slowly.
In the first example below, the second parameter data is omitted, so the timeout must be specified as a keyword argument. If data is passed in, as in the second example, the timeout can simply be the third positional argument.
import urllib2
response = urllib2.urlopen('http://www.baidu.com', timeout=10)
import urllib2
response = urllib2.urlopen('http://www.baidu.com', data, 10)   # data as built earlier
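If the server does not answer within the timeout, urlopen raises an exception. A minimal sketch of catching it, using the example URL from above:

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com', timeout=10)
except urllib2.URLError as e:
    # a timeout usually surfaces as a URLError whose reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print 'request timed out'
    else:
        raise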
4. Use the PUT and DELETE methods of HTTP
The six main HTTP request methods are GET, HEAD, PUT, DELETE, POST, and OPTIONS. Sometimes we need to use PUT or DELETE requests.
PUT: this method is rarely seen, and HTML forms do not support it either. In essence, PUT and POST are very similar: both send data to the server. The important difference is that PUT usually specifies where the resource should be stored, while with POST, the storage location is decided by the server.
DELETE: deletes a resource. This is also rare, but it does get used; for example, Amazon's S3 cloud service uses it to delete resources.
urllib2 does not support PUT and DELETE directly; for that, you would normally drop down to the lower-level httplib library. Even so, we can still make urllib2 send a PUT or DELETE request with the trick below, although the occasions to do so are indeed few. We mention it here anyway.
import urllib2

request = urllib2.Request(uri, data=data)   # uri and data defined elsewhere
request.get_method = lambda: 'PUT'          # or 'DELETE'
response = urllib2.urlopen(request)
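This works because urllib2.Request decides the HTTP verb through its get_method() method (GET without data, POST with data), so overriding it changes the verb that gets sent. For example, a DELETE request might look like this sketch, where the URL is a made-up placeholder:

import urllib2

request = urllib2.Request('http://www.server.com/resource/1')
request.get_method = lambda: 'DELETE'
response = urllib2.urlopen(request)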
5. Use DebugLog
You can turn on the debug log with the method below, which prints the contents of the packets being sent and received to the screen, making debugging easier. It is not used very often; we just mention it here.
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
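With debuglevel=1, the underlying httplib prints each request line, header, and response line as it goes over the wire, so you can see exactly which headers urllib2 is sending on your behalf.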