Webpage capturing means reading the network resources specified by a URL from the network stream and saving them to the local device.
Note on versions: this tutorial is written for Python 2.7.5; Python 3 changed these modules significantly (urllib2 was folded into urllib.request), so consult a Python 3 tutorial if that is your version.
Much like a program simulating the IE browser, the URL is sent to the server as the content of an HTTP request, and the server's response resources are then read back.
In Python, we use the urllib2 module to capture webpages.
urllib2 is a Python component for fetching URLs (Uniform Resource Locators).
It provides a very simple interface via the urlopen function.
The simplest urllib2 application code requires only four lines.
Create a new urllib2_test01.py file to check the role of urllib2:
import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
Press F5 to view the running result:
Let's create a new urllib2_test02.py file to feel it:
import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
The output content is the same as test01.
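As the version note at the top says, Python 3 moved urllib2's functionality into urllib.request. A minimal Python 3 sketch of the same Request construction (building the object does not contact the server, so no network is needed here):

```python
import urllib.request

# Python 3 equivalent of urllib2.Request; urlopen() would send it.
req = urllib.request.Request('http://www.baidu.com')
print(req.full_url)      # http://www.baidu.com
print(req.get_method())  # GET, since no data is attached
```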
urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows:
req = urllib2.Request('ftp://example.com/')
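The same uniform interface carries over to Python 3. A small sketch showing that the Request object records the URL scheme (here 'ftp') without opening any connection:

```python
import urllib.request

# The scheme is parsed out of the URL when the Request is built.
req = urllib.request.Request('ftp://example.com/')
print(req.type)  # ftp
```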
HTTP requests let you do two extra things.
1. Sending form data
This should be familiar to anyone who has worked on the web side: sometimes you want to send data to a URL (usually one that points to a CGI (Common Gateway Interface) script or some other web application).
In HTTP, this is often sent using well-known POST requests.
This is usually done by your browser when you submit an HTML form.
Not all POSTs come from forms; you can use POST to submit arbitrary data to your own program.
For ordinary HTML forms, the data must be encoded in the standard format and then passed to the Request object as the data parameter.
The encoding is done with a function from urllib, not urllib2.
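The encoding step can be looked at on its own. In Python 2 the encoder is urllib.urlencode; in Python 3 it moved to urllib.parse.urlencode. A sketch of just the encoding, using the same example values as below:

```python
from urllib.parse import urlencode  # Python 3; in Python 2 use urllib.urlencode

values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}
data = urlencode(values)
print(data)  # name=WHY&location=SDU&language=Python
```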
Let's create a new urllib2_test03.py file to feel it:
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}
data = urllib.urlencode(values)      # do the encoding work
req = urllib2.Request(url, data)     # build the request, passing the form data
response = urllib2.urlopen(req)      # send it and accept the response
the_page = response.read()           # read the response body
If the data parameter is not transmitted, urllib2 uses the GET request.
The difference between GET and POST requests is that POST requests usually have "side effects":
they may change the system state in some way (for example, placing an order that gets something delivered to your door).
Data can also be encoded in the Get request URL.
import urllib2
import urllib

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values    # e.g. name=WHY&location=SDU&language=Python

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)
This achieves transmission of the data via GET.
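The GET pattern above can be sketched in Python 3 as well (urllib.parse.urlencode plus string concatenation; the URL is the example's placeholder, and the request itself is not sent here):

```python
from urllib.parse import urlencode

data = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}
url_values = urlencode(data)

# Query-string parameters are appended after a '?'.
full_url = 'http://www.example.com/example.cgi' + '?' + url_values
print(full_url)
# http://www.example.com/example.cgi?name=WHY&location=SDU&language=Python
```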
2. Setting headers on the HTTP request
Some websites dislike being visited by programs (as opposed to humans), or send different content to different browsers.
By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python versions, e.g. Python-urllib/2.7),
and this identity may confuse the site, or the site may simply refuse to work with it.
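You can inspect this default identity yourself. In Python 3 the default User-Agent is attached by the opener rather than the module, but the "Python-urllib/x.y" convention is the same (a sketch, checked without any network access):

```python
import urllib.request

# A default opener carries a ('User-agent', 'Python-urllib/x.y') header.
opener = urllib.request.build_opener()
print(dict(opener.addheaders)['User-agent'])
```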
A browser identifies itself through the User-Agent header. When you create a Request object, you can give it a dictionary containing the header data.
The following example sends the same content as above, but simulates itself as Internet Explorer.
(Thanks to a reader for pointing out that this demo site is no longer available; the principle, however, remains the same.)
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
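A Python 3 sketch of the same header trick (the server URL and the MSIE string are just the example's placeholders; here we only build the request and read the header back, without sending anything):

```python
import urllib.request
from urllib.parse import urlencode

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}

data = urlencode(values).encode('ascii')   # POST bodies are bytes in Python 3
req = urllib.request.Request(url, data, {'User-Agent': user_agent})
print(req.get_header('User-agent'))        # the identity we just set
print(req.get_method())                    # POST, because data was supplied
```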
The above is [Python] Web Crawler (2): using urllib2 to fetch webpage content from a specified URL. For more information, see the PHP Chinese website (www.php1.cn).