1, use the Python library urllib2, use the Urlopen and the request method.
2, Method Urlopen prototype
urllib2. Urlopen (url[, data][, timeout])
url represents the destination Web page address, which can be a string, It can also be requested object request
data represents the parameters submitted to the target server by post
timeout indicates a time-out setting of
example:
[python] view plaincopy
- >>> IMPORT URLLIB2  
- >>> socket = urllib2.urlopen ( "http://www.baidu.com")
- >>> content = socket.read ()
- >>> socket.close ()
In this way, the contents of the Web page crawl down, but some sites prohibit crawlers, if the direct request will be the following error:
Urllib2. Httperror:http Error 403:forbidden
The workaround is to use the request method to add header information to the browser, masquerading as the access behavior of the viewer:
3. method Request prototype
urllib2. Request (url[, data][, headers][, origin_req_host][, unverifiable])
which
The URL represents the destination Web page address, which can be a string or request object
Data represents the parameters that the Post method submits to the target server
Headers represents a user ID, is a dictionary type of data, some do not allow the script to crawl, so the need for user agents, such as the agent of the Firefox browser is similar: mozilla/5.0 (X11; U Linux i686) The standard UA format for gecko/20071127 firefox/2.0.0.11 browser is: Browser identification (operating system identity; encryption level identification; browser language) The rendering engine identifies version information, Headers default is python-urllib/2.6
Origin_req_host represents the host domain name or IP address of the requester
See an example:
[Python]View Plaincopy
- >>> headers = {' user-agent ':' mozilla/5.0 (X11; U Linux i686) gecko/20071127 firefox/2.0.0.11 '}
- >>> req = urllib2. Request (url="Http://blog.csdn.net/deqingguo", Headers=headers)
- >>> socket = Urllib2.urlopen (req)
- >>> content = Socket.read ()
- >>> Socket.close ()
2:
Import urllib2 as Ulurl = ' http://www.dd.com/products?selected.classification=primary+antibodies& Selected.researchareas=metabolism--types+of+disease--cancer ' headers = {' user-agent ': ' mozilla/5.0 (X11; U Linux i686) gecko/20071127 firefox/2.0.0.11 '} req = ul. Request (url,headers=headers) f = ul.urlopen (req) content = F.read ();p rint f.getcode ();
[Python]View Plaincopy
- <pre></pre>
- <p></p>
- <pre></pre>
Python crawls Web pages using URLLIB2