Python Crawler (4): Advanced Usage of the Urllib Library
1. Set headers
Some sites refuse requests that come straight from a program: if they detect that the client is not a browser, they simply do not respond. So, to fully simulate how a browser works, we need to set some header fields on our requests.
First of all, open the browser's developer tools (press F12; I use Chrome here) and switch to the network monitoring panel, as shown below. Take logging in as an example: after you log in, you will find that the page has changed and a new interface appears. That page actually contains a lot of content, and this content is not loaded in one go. Behind the scenes the browser issues quite a few requests, generally fetching the HTML document first and then loading the JS, CSS and other resources. Only after these many requests do the skeleton and muscle of the web page come together and the full page take effect.
If we split these requests apart and look only at the first one, we can see the request URL and the headers, followed by the response (the screenshot does not show everything, so readers can experiment with this themselves). These headers carry a lot of information: the accepted file encoding, compression, the User-Agent of the request, and so on.
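As an illustration of what the Network panel shows, the request header fields typically look roughly like the following. The values here are placeholders, not the exact headers from the screenshot; your browser will send its own.

```python
# Typical request header fields as seen in the browser's Network panel.
# All values below are placeholder examples.
sample_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # identifies the client
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",    # compression the client can handle
    "Accept-Language": "en-US,en;q=0.5",   # preferred languages
    "Referer": "http://www.example.com/",  # the page the request came from
}
```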
Among these, the User-Agent is the identity of the request. If a request carries no such identity, the server will not necessarily respond, so you can set the User-Agent in the headers, as in the example below. The example only demonstrates how to set the headers; just take note of the format.
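The original example is not reproduced here, so below is a minimal sketch in the same spirit, using Python 3's urllib.request (the successor of urllib2). The login URL, username and password are hypothetical placeholders; the point is how the headers dictionary is attached to the Request object.

```python
import urllib.parse
import urllib.request

# Hypothetical login endpoint and credentials, used only to show the format.
url = "http://www.example.com/login"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

values = {"username": "demo", "password": "xxxx"}
headers = {"User-Agent": user_agent}

# Encode the form data and pass the headers when building the request.
data = urllib.parse.urlencode(values).encode("utf-8")
request = urllib.request.Request(url, data, headers)

response = urllib.request.urlopen(request)
page = response.read().decode("utf-8")
print(page)
```

With the User-Agent set this way, the request looks like it comes from a normal browser, which is often enough for servers that ignore bare programmatic requests.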