Webpage capturing means reading the network resources identified by a URL from the network stream and saving them to the local machine.
It is similar to simulating the IE browser in a program: the URL is sent to the server as part of an HTTP request, and the server's response is then read.
In Python, we use the urllib2 module to capture webpages.
Urllib2 is a Python module for fetching URLs (Uniform Resource Locators).
It provides a very simple interface in the form of the urlopen function.
The simplest urllib2 application code requires only four lines.
Create a new urllib2_test01.py file to check the role of urllib2:
import urllib2
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
Press F5 to view the running result:
If we open the Baidu homepage in a browser, right-click, and choose to view the page source (in either Firefox or Google Chrome), we will find exactly the same content.
In other words, the above four lines of code print out all the HTML we received when accessing Baidu.
This is the simplest urllib2 example.
In addition to "http:", URLs can also be replaced by "ftp:", "file:", and so on.
HTTP is based on the request and response mechanisms:
The client initiates a request and the server provides a response.
Urllib2 uses a Request object to represent your HTTP request.
In its simplest form, you create a Request object with the address you want to request.
Calling urlopen and passing in the Request object returns a response object for that request.
This response object behaves like a file object, so you can call read() on the response.
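The file-like behavior can be seen without touching the network (again a Python 3 sketch; in Python 3, urllib2 became urllib.request, and urlopen also understands inline "data:" URLs, which carry their payload in the URL itself):

```python
import urllib.request  # Python 3 home of the old urllib2 functionality

# A "data:" URL carries its payload inline, so no network access is needed.
response = urllib.request.urlopen('data:text/plain;charset=utf-8,Hello')

# The response object supports file-style partial and full reads.
first = response.read(2)  # read the first two bytes
rest = response.read()    # read the remainder
print(first + rest)
```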
Let's create a new urllib2_test02.py file to feel it:
import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
The output content is the same as test01.
Urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows:
req = urllib2.Request('ftp://example.com/')
In HTTP requests, you can additionally do two things.
1. Sending form data
This will be familiar to anyone who has worked with the Web:
sometimes you want to send some data to a URL (usually a URL associated with a CGI [Common Gateway Interface] script, or some other web application).
In HTTP, this is often sent using well-known POST requests.
This is usually done by your browser when you submit an HTML form.
Not all POSTs come from forms; you can use POST to submit arbitrary data to your own program.
For an ordinary HTML form, the data must be encoded in a standard format and then passed to the Request object as the data parameter.
The encoding is done with a function from urllib, not urllib2.
Let's create a new urllib2_test03.py file to feel it:
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name' : 'WHY', 'location' : 'SDU', 'language' : 'Python'}

data = urllib.urlencode(values)   # do the encoding
req = urllib2.Request(url, data)  # send the request with the form data
response = urllib2.urlopen(req)   # accept the response
the_page = response.read()        # read the response
If the data parameter is not passed, urllib2 uses a GET request.
One difference between GET and POST requests is that POST requests often have "side effects":
they change the state of the system in some way (for example, placing an order that gets something delivered to your door).
Data can also be encoded in the GET request URL.
import urllib
import urllib2

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values  # e.g. name=WHY&language=Python&location=SDU

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
data = urllib2.urlopen(full_url)

In this way, data is transmitted via a GET request.
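The same query-string construction in Python 3, where urllib.urlencode moved to urllib.parse.urlencode (a sketch; the example.com address is the same placeholder used above):

```python
from urllib.parse import urlencode

# A list of pairs keeps the parameter order deterministic.
data = [('name', 'WHY'), ('location', 'SDU'), ('language', 'Python')]
url_values = urlencode(data)
print(url_values)

# Append the encoded values to the URL after a "?" to form a GET request URL.
full_url = 'http://www.example.com/example.cgi' + '?' + url_values
print(full_url)
```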
2. Setting headers on HTTP requests
Some websites do not like to be accessed by programs (as opposed to humans), or send different content to different browsers.
By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, such as Python-urllib/2.7),
and this identity may confuse the site or simply not work.
A browser identifies itself through the User-Agent header. When you create a Request object, you can give it a dictionary containing the header data.
The following example sends the same content as above, but simulates itself as Internet Explorer.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

values = {'name' : 'WHY', 'location' : 'SDU', 'language' : 'Python'}
headers = {'User-Agent' : user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
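For comparison, here is a sketch of the same request under Python 3's urllib.request; since the server address is only the tutorial's placeholder, the Request object is built and inspected but not actually sent:

```python
import urllib.request
from urllib.parse import urlencode

url = 'http://www.someserver.com/cgi-bin/register.cgi'  # placeholder host from the tutorial
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urlencode(values).encode('ascii')  # Python 3 wants the body as bytes
req = urllib.request.Request(url, data, headers)

print(req.get_method())              # a request with a body defaults to POST
print(req.get_header('User-agent'))  # header names are stored normalized
```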