A common way to split a string in Python is to call the string's own str.split method, but it accepts only one delimiter. To split a string on several different delimiters, use re.split, the split function of the regular expression module.
str.split
The split method of a string has the following signature, where sep is the delimiter and maxsplit is the maximum number of splits (-1, the default, means no limit):
    str.split(sep=None, maxsplit=-1)
When no delimiter is specified, the string is split on runs of whitespace characters (spaces, newlines, tabs, and so on):
    >>> s = 'a b\tc\nd'
    >>> s.split()
    ['a', 'b', 'c', 'd']
In this mode the resulting list contains no empty strings, even when the string ends with several whitespace characters:
    >>> s = 'a b\tc\nd\n\n'
    >>> s.split()
    ['a', 'b', 'c', 'd']
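The absence of empty strings is specific to splitting without a delimiter. As a minimal sketch (the variable names are illustrative only), compare it with passing an explicit '\n' delimiter on the same string:

```python
s = 'a b\tc\nd\n\n'

# No delimiter: a run of whitespace counts as one separator, and
# leading or trailing whitespace produces no empty strings.
by_whitespace = s.split()

# Explicit delimiter: every '\n' is a split point, so consecutive
# and trailing newlines leave empty strings in the result.
by_newline = s.split('\n')

print(by_whitespace)  # ['a', 'b', 'c', 'd']
print(by_newline)     # ['a b\tc', 'd', '', '']
```

If empty strings matter for your input (for example, blank lines), the explicit form preserves them; otherwise the no-argument form is usually more convenient.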
Specifying a delimiter:
    >>> s = 'www.google.com'
    >>> s.split('.')
    ['www', 'google', 'com']
    >>> s = 'aa||bb||cc||dd'
    >>> s.split('||')
    ['aa', 'bb', 'cc', 'dd']
Specifying the maximum number of splits:
    >>> s = 'www.google.com'
    >>> s.split('.', 1)
    ['www', 'google.com']
    >>> s = 'aa||bb||cc||dd'
    >>> s.split('||', 2)
    ['aa', 'bb', 'cc||dd']
Thus, when maxsplit is specified, the resulting list has at most maxsplit+1 elements.
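The relationship between maxsplit and the length of the result can be checked directly. This is a small sketch; the `lengths` dictionary is only a helper for the illustration:

```python
s = 'aa||bb||cc||dd'

# Map each maxsplit value to the length of the resulting list.
lengths = {}
for maxsplit in (1, 2, 10):
    parts = s.split('||', maxsplit)
    lengths[maxsplit] = len(parts)
    print(maxsplit, parts)

# The length equals maxsplit + 1 only while enough delimiters remain;
# with maxsplit=10 the string simply runs out of delimiters.
print(lengths)  # {1: 2, 2: 3, 10: 4}
```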
However, str.split accepts only one delimiter. Consider the following string:
    s = 'aaaa,bbbb:cccc;dddd'
To split it on commas, colons, and semicolons at the same time, str.split is not enough; use the split function of the regular expression module instead.
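To see why a single call cannot handle this, note that a multi-character sep is treated as one delimiter, not as a set of alternatives. The sketch below also shows one possible workaround that stays within string methods (normalizing the delimiters with str.replace first); it is an illustration, not a recommendation over the regular expression approach:

```python
s = 'aaaa,bbbb:cccc;dddd'

# ',:;' is one three-character delimiter. It never occurs in s,
# so the string comes back unsplit.
attempt = s.split(',:;')
print(attempt)  # ['aaaa,bbbb:cccc;dddd']

# Workaround: rewrite every delimiter to a comma, then split once.
normalized = s.replace(':', ',').replace(';', ',').split(',')
print(normalized)  # ['aaaa', 'bbbb', 'cccc', 'dddd']
```

The workaround needs one replace call per extra delimiter, which becomes unwieldy quickly.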
re.split
The split function of the regular expression module has the following signature, where pattern is the regular expression that matches the delimiters, string is the string to split, maxsplit is the maximum number of splits (0, the default, means no limit), and flags holds the regular expression flags:
    re.split(pattern, string, maxsplit=0, flags=0)
Example:
    >>> import re
    >>> s = 'aaaa,bbbb:cccc;dddd'
    >>> re.split(r'[,:;]', s)
    ['aaaa', 'bbbb', 'cccc', 'dddd']
If the pattern contains a capturing group, that is, a part enclosed in parentheses, the resulting list also contains the captured text:
    >>> import re
    >>> s = 'aaaa,bbbb:cccc;dddd'
    >>> re.split(r'([,:;])', s)
    ['aaaa', ',', 'bbbb', ':', 'cccc', ';', 'dddd']
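Keeping the delimiters is useful when you need to know which one separated each field, or when you want to reassemble the string later. A minimal sketch, assuming the pattern has exactly one capturing group (the names `fields`, `delimiters`, and `rebuilt` are illustrative):

```python
import re

s = 'aaaa,bbbb:cccc;dddd'
parts = re.split(r'([,:;])', s)

# With one capturing group, fields sit at even indexes and the
# delimiters between them at odd indexes.
fields = parts[0::2]
delimiters = parts[1::2]
print(fields)      # ['aaaa', 'bbbb', 'cccc', 'dddd']
print(delimiters)  # [',', ':', ';']

# Nothing was discarded, so joining the pieces restores the input.
rebuilt = ''.join(parts)
print(rebuilt == s)  # True
```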
If you want to group part of the pattern with parentheses but do not want the delimiters to appear in the result, use a non-capturing group, (?:...), as in the following example:
    >>> import re
    >>> s = 'aaaa,bbbb:cccc;dddd'
    >>> re.split(r'(?:[,:;])', s)
    ['aaaa', 'bbbb', 'cccc', 'dddd']
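For a single character class the group is optional, so a non-capturing group matters more when the delimiters are alternatives that must be grouped. The sketch below uses a hypothetical input string in which a comma, a semicolon, or the word "and" separates the fields, each with optional surrounding whitespace:

```python
import re

# Hypothetical input, chosen only to illustrate alternation.
s = 'aaaa, bbbb and cccc; dddd'

# The group is needed so that \s* applies around the whole choice.
# (?:...) groups without adding the matched delimiter to the result.
non_capturing = re.split(r'\s*(?:,|;|\band\b)\s*', s)
print(non_capturing)  # ['aaaa', 'bbbb', 'cccc', 'dddd']

# The same pattern with a capturing group keeps the delimiters.
capturing = re.split(r'\s*(,|;|\band\b)\s*', s)
print(capturing)  # ['aaaa', ',', 'bbbb', 'and', 'cccc', ';', 'dddd']
```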
Specifying the maximum number of splits:
    >>> import re
    >>> s = 'aaaa,bbbb:cccc;dddd'
    >>> re.split(r'[,:;]', s, maxsplit=1)
    ['aaaa', 'bbbb:cccc;dddd']
    >>> re.split(r'[,:;]', s, maxsplit=2)
    ['aaaa', 'bbbb', 'cccc;dddd']
Thus, when maxsplit is specified and the pattern has no capturing groups, the resulting list has at most maxsplit+1 elements.
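Note that maxsplit counts splits, not list elements. The following sketch compares the same split with and without a capturing group (the names `plain` and `captured` are illustrative):

```python
import re

s = 'aaaa,bbbb:cccc;dddd'

# Two splits, no capturing group: three elements.
plain = re.split(r'[,:;]', s, maxsplit=2)
print(len(plain), plain)  # 3 ['aaaa', 'bbbb', 'cccc;dddd']

# Two splits with a capturing group: each split also contributes
# its captured delimiter, so the list has five elements.
captured = re.split(r'([,:;])', s, maxsplit=2)
print(len(captured), captured)  # 5 ['aaaa', ',', 'bbbb', ':', 'cccc;dddd']
```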
Specifying regular expression flags with the flags argument:
    >>> import re
    >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
    ['0', '3', '9']
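The effect of a flag is easiest to see on input that mixes upper and lower case. A minimal sketch comparing the same pattern with and without re.IGNORECASE (the variable names are illustrative):

```python
import re

s = '0a3B9'

# Without the flag, [a-f] matches only lower-case letters,
# so the upper-case 'B' is not treated as a delimiter.
case_sensitive = re.split('[a-f]+', s)
print(case_sensitive)  # ['0', '3B9']

# With re.IGNORECASE, 'B' matches as well.
case_insensitive = re.split('[a-f]+', s, flags=re.IGNORECASE)
print(case_insensitive)  # ['0', '3', '9']
```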
Original link: http://www.revotu.com/python-split-string-methods.html
Summary of Python string segmentation method