1. urlparse()
Belongs to urllib.parse.
In urlparse's view, a standard URL has the following format:
scheme://netloc/path;params?query#fragment
So for url = 'http://www.baidu.com/index.html;user?id=5#comment',
urlparse() splits it into 6 parts:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
Here's how to use it:

from urllib.parse import urlparse

res = urlparse('https://www.baidu.com/baidu?wd=query&tn=monline_dg&ie=utf-8')
print(res)
The full signature of urlparse() is:
res = urlparse(urlstring, scheme='', allow_fragments=True)
scheme is the default protocol: if urlstring carries no protocol, the one given in scheme is used; if urlstring does carry one, its own protocol still wins.
allow_fragments controls whether the fragment is ignored: if False, the fragment is not split off and is parsed as part of the path, params, or query instead.
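A small sketch demonstrating both parameters on the example URLs above:

```python
from urllib.parse import urlparse

# scheme only acts as a fallback: this URL has no protocol of its own
res1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(res1.scheme)  # https  (fallback applied; note netloc stays empty here)

# a protocol inside urlstring always wins over the scheme argument
res2 = urlparse('http://www.baidu.com/index.html', scheme='https')
print(res2.scheme)  # http

# with allow_fragments=False, the fragment is folded into the path
res3 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(res3.path, res3.fragment)  # /index.html#comment  (fragment is empty)
```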
2. urlunparse()
Belongs to urllib.parse.
As the name implies, urlunparse() is the inverse of urlparse(): it takes an iterable of exactly 6 components. For example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

This constructs the complete urlstring: http://www.baidu.com/index.html;user?a=6#comment
3. urlsplit()
from urllib.parse import urlsplit
Similar to urlparse(), but urlsplit() divides the urlstring into 5 parts: there is no params slot, as params stays merged into the path.

res = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(res)
4. urlunsplit()
Usage is similar to urlunparse(), except it takes 5 components instead of 6.
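A minimal sketch, reusing the data from the urlunparse() example minus the params slot:

```python
from urllib.parse import urlunsplit

# 5 components: scheme, netloc, path, query, fragment (no params)
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))  # http://www.baidu.com/index.html?a=6#comment
```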
5. urljoin()
Belongs to urllib.parse.
urljoin() is another way to generate a urlstring. It takes two links: a base_url and a new link. It analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever the new link is missing; any part the new link already has is kept as-is and not overridden. The result is the resolved new link. For example:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'wd=query&tn=monline_dg&ie=utf-8'))

The return result is:
http://www.baidu.com/wd=query&tn=monline_dg&ie=utf-8
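A few more cases illustrate the "new link wins" rule described above (example.com here is just a placeholder domain):

```python
from urllib.parse import urljoin

# parts present in the new link override the base_url entirely
print(urljoin('http://www.baidu.com', 'https://example.com/FAQ.html'))
# https://example.com/FAQ.html

# a relative path is resolved against the base_url's directory
print(urljoin('http://www.baidu.com/about/intro.html', 'faq.html'))
# http://www.baidu.com/about/faq.html

# only scheme, netloc and path of the base are used; its query is dropped
print(urljoin('http://www.baidu.com/?limit=1', 'index.php'))
# http://www.baidu.com/index.php
```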
6. urlencode()
from urllib.parse import urlencode
urlencode() converts a dictionary into URL query parameters. For example:

param = {'name': 'Lihua', 'age': '23'}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(param)
print(url)
7. parse_qs()
parse_qs() is the inverse of urlencode() (why the two names differ so much, I cannot explain).
from urllib.parse import parse_qs

query = 'wd=query&tn=monline_dg&ie=utf-8'
print(parse_qs(query))

The output is: {'wd': ['query'], 'tn': ['monline_dg'], 'ie': ['utf-8']}
So the query string is converted back into a dictionary.
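A quick round trip shows how the pair fits together, and why parse_qs wraps every value in a list:

```python
from urllib.parse import urlencode, parse_qs

param = {'name': 'Lihua', 'age': '23'}
query = urlencode(param)   # name=Lihua&age=23
print(parse_qs(query))     # {'name': ['Lihua'], 'age': ['23']}

# values come back as lists because a key may repeat in a query string
print(parse_qs('tag=a&tag=b'))  # {'tag': ['a', 'b']}
```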
8. parse_qsl()
from urllib.parse import parse_qsl
parse_qsl() converts the parameters to a list of tuples instead:

query = 'wd=query&tn=monline_dg&ie=utf-8'
print(parse_qsl(query))

Output: [('wd', 'query'), ('tn', 'monline_dg'), ('ie', 'utf-8')]
9. quote()
The quote() method converts content to URL-encoded format. URLs containing Chinese characters can otherwise become garbled, which is where quote() comes in:
from urllib.parse import quote

keyword = '美女'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output: https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
10. unquote()
Decodes a URL:
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'
print(unquote(url))

Output: https://www.baidu.com/s?wd=美女
This implements the decoding.
Python3 Crawler (4): Parsing Links