Python urllib library
urllib in Python 2 and Python 3
urllib is a high-level web communication library that supports basic web protocols such as HTTP, FTP, and Gopher. It also supports access to local files.
Specifically, the urllib module uses these protocols to download data from the Internet, a LAN, or the local host.
The lower-level modules httplib, ftplib, and gopherlib are needed only if you require lower-layer functionality. (In Python 3, httplib is named http.client and gopherlib no longer exists.)
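As a minimal sketch of the local-file support mentioned above, the example below reads a file through a file:// URL with the Python 3 function urllib.request.urlopen(). The temporary file, its name sample.txt, and its contents are assumptions made up for illustration so that the example runs without network access.

```python
import pathlib
import tempfile
import urllib.request

# Create a small local file so the example does not need network access.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = pathlib.Path(tmp_dir) / "sample.txt"  # hypothetical file
    sample_path.write_text("hello from a local file", encoding="utf-8")

    # as_uri() turns the absolute path into a file:// URL.
    file_url = sample_path.as_uri()

    # urlopen() accepts file:// URLs in the same way as http:// or ftp:// URLs.
    with urllib.request.urlopen(file_url) as response:
        local_data = response.read().decode("utf-8")

print(local_data)
```

The same urlopen() call would take an http:// or ftp:// URL instead; only the URL scheme changes.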
Python 2 contains urllib, urlparse, urllib2, and other related modules. In Python 3, these modules are all consolidated into a single package named urllib.
The functions of urllib and urllib2 are merged mainly into the urllib.request module (quoting helpers such as quote() and urlencode() moved to urllib.parse), and urlparse becomes urllib.parse.
The urllib package in Python 3 also contains the submodules response, error, and robotparser.
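The reorganization can be illustrated with a short Python 3 sketch that imports the submodules and notes where a few familiar Python 2 names now live. The variable names on the left are arbitrary and exist only for this example.

```python
# Python 3: everything lives under the urllib package.
from urllib import error, parse, request, robotparser

# A few names that were spread across urllib, urllib2, urlparse,
# and robotparser in Python 2.
url_opener = request.urlopen              # Python 2: urllib2.urlopen
url_parser = parse.urlparse               # Python 2: urlparse.urlparse
url_quoter = parse.quote                  # Python 2: urllib.quote
http_error = error.HTTPError              # Python 2: urllib2.HTTPError
robots_cls = robotparser.RobotFileParser  # Python 2: robotparser.RobotFileParser

print(url_parser.__module__)
```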
URL format
prot_sch://net_loc/path;params?query#frag
Components of a URL (web address)
prot_sch    network protocol or download scheme
net_loc     server location (may also contain user information)
path        slash-delimited (/) path to a file or CGI application
params      optional parameters
query       key-value pairs joined by ampersands (&)

net_loc can be further split into several components, some required and others optional:

user:passwd@host:port

user        user name or login
passwd      user password
host        name or address of the machine running the web server (required)
port        port number (if not the default, 80)
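To make the components concrete, here is a small sketch that splits one URL with urllib.parse.urlparse() and reads each part, including the sub-components of net_loc. The URL, user name, and password are hypothetical values chosen only so that every component is present.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component.
sample_url = "http://alice:secret@www.example.com:8080/cgi-bin/app.py;type=a?x=1&y=2#top"
parts = urlparse(sample_url)

print(parts.scheme)    # prot_sch
print(parts.netloc)    # net_loc
print(parts.path)      # path
print(parts.params)    # params
print(parts.query)     # query
print(parts.fragment)  # frag

# Sub-components of net_loc: user, passwd, host, port.
print(parts.username, parts.password, parts.hostname, parts.port)
```

Note that port is returned as an integer, and it is None when the URL does not state a port explicitly.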
The urllib.parse module was called urlparse in Python 2; Python 3 renamed it urllib.parse.
The urllib.parse module provides basic functions for processing URL strings, including urlparse(), urlunparse(), and urljoin().
urlparse() parses urlstr into a 6-tuple (prot_sch, net_loc, path, params, query, frag):
Syntax: urlparse(urlstr, defProtSch=None, allowFrag=None)

>>> urllib.parse.urlparse("https://www.smelond.com?cat=6")
ParseResult(scheme='https', netloc='www.smelond.com', path='', params='', query='cat=6', fragment='')
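The value returned by urlparse() is a named tuple, so it can be handled either as a plain 6-tuple or through its field names. The following sketch reuses the URL from the session above; the variable names are arbitrary.

```python
from urllib.parse import urlparse

result = urlparse("https://www.smelond.com?cat=6")

# ParseResult unpacks like a plain 6-tuple...
scheme, netloc, path, params, query, fragment = result

# ...and its fields can also be read by index or by name.
first_field = result[0]
query_field = result.query

print(scheme, netloc, repr(path), query)
```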
urlunparse() is the opposite of urlparse(). It takes urltup, the 6-tuple (prot_sch, net_loc, path, params, query, frag) produced by urlparse(), concatenates the components into a URL, and returns it:
Syntax: urlunparse(urltup)

>>> result = urllib.parse.urlparse("https://www.smelond.com")
>>> print(result)
ParseResult(scheme='https', netloc='www.smelond.com', path='', params='', query='', fragment='')
>>> urllib.parse.urlunparse(result)
'https://www.smelond.com'
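A common reason to pair the two functions is to change one component and rebuild the URL. The sketch below assumes the named tuple's _replace() method for the modification step, which the original text does not cover; the replacement path and query are illustrative values.

```python
from urllib.parse import urlparse, urlunparse

original = urlparse("https://www.smelond.com?cat=6")

# Named tuples are immutable; _replace() returns a modified copy.
changed = original._replace(path="/test/abc.html", query="cat=7")
rebuilt_url = urlunparse(changed)

# urlunparse() also accepts an ordinary 6-tuple built by hand.
manual_url = urlunparse(("https", "www.smelond.com", "/abc", "", "", ""))

print(rebuilt_url)
print(manual_url)
```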
When we need to process several related URLs, we use the urljoin() function. For example, a single web page may contain links to a whole series of related pages.
urljoin() takes the base of baseurl (net_loc plus the full path up to, but not including, the final file) and joins it with newurl.
Syntax: urljoin(baseurl, newurl, allowFrag=None)

>>> urllib.parse.urljoin("https://www.smelond.com?cat=6", "?cat=7")
'https://www.smelond.com?cat=7'
>>> urllib.parse.urljoin("https://www.smelond.com?cat=6", "abc")
'https://www.smelond.com/abc'
>>> urllib.parse.urljoin("https://www.smelond.com?cat=6", "/test/abc.html")
'https://www.smelond.com/test/abc.html'
>>> urllib.parse.urljoin("https://www.smelond.com", "abc.html")
'https://www.smelond.com/abc.html'
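The "series of page URLs" case might look like the sketch below, which resolves several relative links against the URL of the page they were found on. The base page /blog/index.html and the list of links are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page URL and the relative links found on it.
base_url = "https://www.smelond.com/blog/index.html"
relative_links = ["page2.html", "../about.html", "/test/abc.html", "?cat=7"]

# The final file (index.html) is dropped from the base before joining,
# unless the new URL consists of a query string only.
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

A link that starts with a slash replaces the whole path, while a bare file name replaces only the last path segment of the base.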
Core functions in the urllib.parse module
urlparse(urlstr, defProtSch=None, allowFrag=None)
    Parses urlstr into its components. If urlstr specifies no protocol or scheme, defProtSch is used; allowFrag determines whether URL fragments are allowed.
urlunparse(urltup)
    Combines a tuple of URL data (urltup) into a single URL string.
urljoin(baseurl, newurl, allowFrag=None)
    Combines the base of baseurl with newurl into a complete URL. allowFrag plays the same role as in urlparse().
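The two optional arguments can be sketched as follows. Note that in the Python 3 standard library the keyword arguments corresponding to defProtSch and allowFrag are named scheme and allow_fragments; the URLs and variable names here are illustrative.

```python
from urllib.parse import urlparse

# No scheme in the string itself: the default passed via scheme= is used.
no_scheme = urlparse("//www.smelond.com/abc.html", scheme="https")

# By default the part after '#' is split off as the fragment.
with_frag = urlparse("https://www.smelond.com/abc.html#top")

# With allow_fragments=False the '#top' text stays in the path instead.
without_frag = urlparse("https://www.smelond.com/abc.html#top", allow_fragments=False)

print(no_scheme.scheme, no_scheme.netloc)
print(repr(with_frag.path), repr(with_frag.fragment))
print(repr(without_frag.path), repr(without_frag.fragment))
```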