Over the past couple of days I have been learning how to fetch web resources with Python 3. There turn out to be many ways to do it, so here are a few notes.
1. The simplest
import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()
2. Use Request
import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
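A Request object can be inspected before any network traffic happens, which is handy for debugging; a small offline check (no connection is opened):

```python
import urllib.request

# Building a Request does not contact the server; it only records
# the URL, method, and headers for a later urlopen() call.
req = urllib.request.Request('http://python.org/')
print(req.full_url)      # the URL the request will target
print(req.get_method())  # 'GET', since no data was attached
```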
3. Send data
#!/usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'Login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456'
}

# In Python 3 the POST body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode('utf8'))
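Attaching data also switches the request method from GET to POST, and urlencode percent-encodes special characters in the form fields; a quick offline sketch (same hypothetical form values as above):

```python
import urllib.parse
import urllib.request

values = {'act': 'Login', 'login[email]': 'yzhang@i9i8.com'}
# Brackets and '@' are percent-encoded by urlencode.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request('http://localhost/login.php', data)
# A request that carries a body is sent as POST.
print(req.get_method())
```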
4. Send data and headers
#!/usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'Login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456'
}
headers = {'User-Agent': user_agent}

# In Python 3 the POST body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode('utf8'))
5. HTTP Error
#!/usr/bin/env python3

import urllib.request
import urllib.error

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode('utf8'))
6. Exception Handling 1
#!/usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request('http://twitter.com/')
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print('good!')
    print(response.read().decode('utf8'))
7. Exception Handling 2
#!/usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request('http://twitter.com/')
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    elif hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
else:
    print('good!')
    print(response.read().decode('utf8'))
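This second form works because HTTPError is a subclass of URLError, so a single `except URLError` clause catches both cases, and the hasattr checks then tell them apart. A minimal demonstration using the reserved `.invalid` TLD, which is guaranteed never to resolve:

```python
import urllib.request
import urllib.error

# HTTPError subclasses URLError, so one handler covers both.
assert issubclass(urllib.error.HTTPError, urllib.error.URLError)

caught = None
try:
    # The .invalid TLD is reserved (RFC 2606) and never resolves,
    # so this always fails with a URLError.
    urllib.request.urlopen('http://nonexistent.invalid/')
except urllib.error.URLError as e:
    caught = e
print('Reason:', caught.reason)
```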
8. HTTP Authentication
#!/usr/bin/env python3

import urllib.request

# Create a password manager.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = 'https://cms.tetx.com/'
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (OpenerDirector instance).
opener = urllib.request.build_opener(handler)

# Use the opener to fetch a URL.
a_url = 'https://cms.tetx.com/'
x = opener.open(a_url)
print(x.read())

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
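The password manager itself can be exercised offline: add_password stores the credentials against the URL, and with the default-realm variant, find_user_password returns them whatever realm the server later announces (the realm string below is just a placeholder):

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# None means "use these credentials for any realm at this URL".
password_mgr.add_password(None, 'https://cms.tetx.com/', 'yzhang', 'cccddd')

# The lookup matches regardless of the realm name queried.
user, password = password_mgr.find_user_password('any-realm', 'https://cms.tetx.com/')
print(user, password)
```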
9. Using a proxy
#!/usr/bin/env python3

import urllib.request

# Note: urllib natively handles only 'http'/'https' proxy schemes;
# SOCKS proxies need third-party support.
proxy_support = urllib.request.ProxyHandler({'sock5': 'localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen('http://g.cn').read().decode('utf8')
print(a)
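The mapping passed to ProxyHandler is keyed by URL scheme, and an empty mapping disables proxying even when *_proxy environment variables are set. A small offline sketch (localhost:8080 is a placeholder address, not a real proxy):

```python
import urllib.request

# Keys are URL schemes; requests for each scheme are routed
# through the corresponding proxy address.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:8080'})
opener = urllib.request.build_opener(proxy_support)

# An empty dict turns proxying off entirely.
no_proxy = urllib.request.ProxyHandler({})
print(proxy_support.proxies)
```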
10. Timeout
#!/usr/bin/env python3

import socket
import urllib.request

# Timeout in seconds.
timeout = 2
socket.setdefaulttimeout(timeout)

# urllib.request.urlopen now uses the default timeout
# we have set in the socket module.
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
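Besides the process-wide socket default above, urlopen also accepts a per-call timeout argument, which is usually the safer choice because it leaves other sockets untouched; a sketch (the wrapper function name is my own, not part of urllib):

```python
import socket
import urllib.request
import urllib.error

def fetch_with_timeout(url, seconds=2):
    """Fetch a URL, giving up after `seconds` for this call only."""
    try:
        # Per-call timeout: narrower than socket.setdefaulttimeout,
        # which changes every new socket in the process.
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read()
    except urllib.error.URLError as e:
        # A timeout surfaces as a URLError whose reason is a timeout.
        print('Request failed:', e.reason)
        return None

# The process-wide default remains independently settable/inspectable:
socket.setdefaulttimeout(2)
print(socket.getdefaulttimeout())
```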