1. urlparse()
Belongs to urllib.parse.
In urlparse's view, a standard URL has the following format:
scheme://netloc/path;params?query#fragment
So for url = 'http://www.baidu.com/index.html;user?id=5#comment',
urlparse() splits it into 6 parts:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
Here's how to use it:

from urllib.parse import urlparse

res = urlparse('https://www.baidu.com/baidu?wd=query&tn=monline_dg&ie=utf-8')
print(res)
The full signature of urlparse() is:
res = urlparse(urlstring, scheme='', allow_fragments=True)
scheme is the default protocol: if urlstring carries no protocol, the one given in scheme is used; if urlstring does carry one, its own protocol still wins.
allow_fragments controls whether the fragment is ignored: if False, the fragment is not split off and is parsed as part of the path, params, or query instead.
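A small sketch demonstrating both parameters on the example URLs above:

```python
from urllib.parse import urlparse

# scheme only acts as a fallback: this URL has no protocol of its own
res1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(res1.scheme)  # https  (fallback applied; note netloc stays empty here)

# a protocol inside urlstring always wins over the scheme argument
res2 = urlparse('http://www.baidu.com/index.html', scheme='https')
print(res2.scheme)  # http

# with allow_fragments=False, the fragment is folded into the path
res3 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(res3.path, res3.fragment)  # /index.html#comment  (fragment is empty)
```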
2. urlunparse()
Belongs to urllib.parse.
As the name implies, urlunparse() is the inverse of urlparse(): it takes an iterable of exactly 6 components. For example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

This constructs the complete urlstring: http://www.baidu.com/index.html;user?a=6#comment
3. urlsplit()
from urllib.parse import urlsplit
Similar to urlparse(), but urlsplit() divides the urlstring into 5 parts: there is no params slot, as params stays merged into the path.

res = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(res)
4. urlunsplit()
Usage is similar to urlunparse(), except it takes 5 components instead of 6.
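A minimal sketch, reusing the data from the urlunparse() example minus the params slot:

```python
from urllib.parse import urlunsplit

# 5 components: scheme, netloc, path, query, fragment (no params)
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))  # http://www.baidu.com/index.html?a=6#comment
```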
5. urljoin()
Belongs to urllib.parse.
urljoin() is another way to generate a urlstring. It takes two links: a base_url and a new link. It analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever the new link is missing; any part the new link already has is kept as-is and not overridden. The result is the resolved new link. For example:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'wd=query&tn=monline_dg&ie=utf-8'))

The return result is:
http://www.baidu.com/wd=query&tn=monline_dg&ie=utf-8
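A few more cases illustrate the "new link wins" rule described above (example.com here is just a placeholder domain):

```python
from urllib.parse import urljoin

# parts present in the new link override the base_url entirely
print(urljoin('http://www.baidu.com', 'https://example.com/FAQ.html'))
# https://example.com/FAQ.html

# a relative path is resolved against the base_url's directory
print(urljoin('http://www.baidu.com/about/intro.html', 'faq.html'))
# http://www.baidu.com/about/faq.html

# only scheme, netloc and path of the base are used; its query is dropped
print(urljoin('http://www.baidu.com/?limit=1', 'index.php'))
# http://www.baidu.com/index.php
```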
6. urlencode()
from urllib.parse import urlencode
urlencode() converts a dictionary into URL query parameters. For example:

param = {'name': 'Lihua', 'age': '23'}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(param)
print(url)
7. parse_qs()
parse_qs() is the inverse of urlencode() (why the two names differ so much, I cannot explain).
from urllib.parse import parse_qs

query = 'wd=query&tn=monline_dg&ie=utf-8'
print(parse_qs(query))

The output is: {'wd': ['query'], 'tn': ['monline_dg'], 'ie': ['utf-8']}
So the query string is converted back into a dictionary.
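A quick round trip shows how the pair fits together, and why parse_qs wraps every value in a list:

```python
from urllib.parse import urlencode, parse_qs

param = {'name': 'Lihua', 'age': '23'}
query = urlencode(param)   # name=Lihua&age=23
print(parse_qs(query))     # {'name': ['Lihua'], 'age': ['23']}

# values come back as lists because a key may repeat in a query string
print(parse_qs('tag=a&tag=b'))  # {'tag': ['a', 'b']}
```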
8. parse_qsl()
from urllib.parse import parse_qsl
parse_qsl() converts the parameters to a list of tuples instead:

query = 'wd=query&tn=monline_dg&ie=utf-8'
print(parse_qsl(query))

Output: [('wd', 'query'), ('tn', 'monline_dg'), ('ie', 'utf-8')]
9. quote()
The quote() method converts content to URL-encoded format. URLs containing Chinese characters can otherwise become garbled, which is where quote() comes in:
from urllib.parse import quote

keyword = '美女'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output: https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
10. unquote()
Decodes a URL:
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'
print(unquote(url))

Output: https://www.baidu.com/s?wd=美女
This implements the decoding.
Python3 Crawler (4): Parsing Links