In my view, regular expressions have two main uses: (1) finding specific information, and (2) finding and editing specific information, which is the familiar find-and-replace. Pressing Ctrl+F in Word, Notepad, and similar programs to find or replace a particular string is the same kind of task that regular expressions handle.
Regular expressions are very powerful, especially for processing text data. In R, the functions grep, grepl, sub, gsub, regexpr, gregexpr, and regexec all match using regular expression rules. Their prototypes are as follows:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Here is an explanation of the parameters.
| Parameter | Description |
| --- | --- |
| pattern | The regular expression to match |
| x, text | A character vector, or an object that can be coerced to one. From R 3.0.0 onward, long vectors (2^31 or more elements) are supported. |
| ignore.case | Defaults to FALSE, meaning case-sensitive matching; set it to TRUE for case-insensitive matching. |
| perl | Whether to use Perl-compatible regular expressions |
| value | Defaults to FALSE, in which case grep returns the indices of the matching elements; if TRUE, it returns the matching elements themselves. |
| fixed | If TRUE, pattern is a string to be matched as is. Overrides all conflicting parameters. |
| useBytes | Defaults to FALSE; if TRUE, matching is done byte by byte rather than character by character. |
| invert | If TRUE, grep returns the indices or values of the elements that do not match. |
| replacement | The replacement for the matched pattern in sub and gsub; elements with no match are returned unchanged. |
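As a minimal sketch of how these parameters change the result of grep, consider the following. The vector x is hypothetical sample data, not taken from the original examples.

```r
# Hypothetical sample data (an assumption for illustration)
x <- c("Apple pie", "banana", "apple juice", "a.b", "axb")

idx        <- grep("apple", x)                      # indices of matching elements: 3
vals       <- grep("apple", x, value = TRUE)        # the elements themselves: "apple juice"
idx_nocase <- grep("apple", x, ignore.case = TRUE)  # case-insensitive: 1 3
idx_inv    <- grep("apple", x, invert = TRUE)       # elements that do NOT match: 1 2 4 5
idx_regex  <- grep("a.b", x)                        # "." is a metacharacter: 4 5
idx_fixed  <- grep("a.b", x, fixed = TRUE)          # pattern taken literally: 4
```

With fixed = TRUE the period loses its special meaning, so only the element that literally contains "a.b" is reported.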
Next, let's look at how these functions differ.
| Function | Role |
| --- | --- |
| grep() | Search. Returns the indices of the matching elements, or the elements themselves when value = TRUE. |
| grepl() | Search. Returns a logical vector: TRUE for each element that matches, FALSE otherwise. |
| sub() | Replaces only the first match found in each element. Otherwise behaves like gsub(). |
| gsub() | Replaces all matches found and returns the replaced text; elements with no match are returned unchanged. |
| regexpr() | Returns an integer vector of the same length as text, giving the starting position of the first match, or -1 if there is none. The attribute "match.length" is an integer vector giving the length of the matched text (or -1). Match positions and lengths are in characters. |
| gregexpr() | Returns a list of the same length as text. Each element has the same form as the return value of regexpr, except that it gives the starting positions of every (disjoint) match. |
| regexec() | Returns a list of the same length as text. Each element is either -1 if there is no match, or a sequence of integers giving the starting positions of the match and of all substrings corresponding to the parenthesized subexpressions of pattern, with the attribute "match.length" giving the lengths of the matches (or -1 for no match). |

Some additional notes on these functions:
In older versions of R, regexec was the one function that did not support Perl-style regular expressions.
The main effect of useBytes is to avoid errors and warnings about invalid input and spurious matches in multibyte locales, but for regexpr it also changes the interpretation of the output. It inhibits the conversion of inputs with marked encodings, and it is forced if any input is found to be marked as "bytes" (see Encoding).
Case-insensitive matching does not make much sense for bytes in a multibyte locale, so with useBytes = TRUE you should expect it to work only for ASCII characters.
regexpr and gregexpr with perl = TRUE allow Python-style named captures, but not for long vector inputs.
Invalid inputs in the current locale produce at most 5 warnings.
For non-ASCII characters, case-insensitive matching with perl = TRUE depends on the PCRE library having been compiled with Unicode property support, which an external library may lack.
If you do a lot of regular expression matching, including on very long strings, PCRE (perl = TRUE) is generally faster than the default regular expression engine, and fixed = TRUE is faster still (especially when each pattern is matched only a few times).
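The differences between these functions can be sketched by running all of them with one pattern. The strings in s and the date pattern are assumptions made up for this illustration. regmatches() is the base R helper that extracts the matched text from the positions these functions return.

```r
# Hypothetical input strings and pattern (assumptions for illustration)
s   <- c("2016-10-27 report", "no date here", "from 2015-01-02 to 2015-03-04")
pat <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

idx        <- grep(pat, s)            # indices of matching elements: 1 3
found      <- grepl(pat, s)           # TRUE FALSE TRUE
first_only <- sub(pat, "<date>", s)   # replaces the first match in each element
every_one  <- gsub(pat, "<date>", s)  # replaces every match in each element

m_first <- regexpr(pat, s)            # start of the first match, -1 if none
m_all   <- gregexpr(pat, s)           # starts of all matches, one list entry per element
m_parts <- regexec("([0-9]{4})-([0-9]{2})-([0-9]{2})", s)  # whole match plus groups

first_dates <- regmatches(s, m_first)  # "2016-10-27" "2015-01-02"
all_dates   <- regmatches(s, m_all)    # list; third entry holds both dates
date_parts  <- regmatches(s, m_parts)  # list; first entry: whole match, year, month, day
```

Here sub() leaves the second date in the third string untouched, while gsub() replaces both.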
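Below is my summary of regular expression escape characters. Note that in R string literals the backslash is itself an escape character, so in patterns passed to grep, grepl, sub, gsub, regexpr, and gregexpr a single "\" does not work: it must be doubled, for example "\\d" rather than "\d".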
Regular expression escape characters

Whitespace metacharacters
| Metacharacter | Description |
| --- | --- |
| [\b] | Backspace (moves back and deletes one character) |
| \f | Form feed (page break) |
| \n | Line feed (newline) |
| \r | Carriage return |
| \t | Tab |
| \v | Vertical tab |

Note: \r\n is the line terminator used by Windows; Unix and Linux end a line of text with just a newline character (\n).

Matching digits and non-digits
| Metacharacter | Description |
| --- | --- |
| \d | Any digit, equivalent to [0-9] |
| \D | Any non-digit character, equivalent to [^0-9] |

Matching alphanumeric and non-alphanumeric characters
| Metacharacter | Description |
| --- | --- |
| \w | Any letter (uppercase or lowercase), digit, or underscore, equivalent to [a-zA-Z0-9_] |
| \W | Any character that is not a letter, digit, or underscore, equivalent to [^a-zA-Z0-9_] |

Matching whitespace characters
| Metacharacter | Description |
| --- | --- |
| \s | Any whitespace character, equivalent to [ \f\n\r\t\v] |
| \S | Any non-whitespace character, equivalent to [^ \f\n\r\t\v] |
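A short sketch of these shorthand classes in R follows. The string s is a hypothetical example. Because the backslash must be doubled inside an R string literal, the regex \d is written "\\d"; writing "\d" directly is rejected by R as an unrecognized escape.

```r
# Hypothetical sample string (an assumption); "\t" here is a real tab character
s <- "Order 66:\tship 3 crates_now"

digits_masked <- gsub("\\d", "#", s)   # every digit becomes "#"
digits_only   <- gsub("\\D", "", s)    # drop every non-digit: "663"
single_spaced <- gsub("\\s+", " ", s)  # collapse each whitespace run to one space
word_chars    <- gsub("\\W", "", s)    # keep only letters, digits, and underscores

pattern_length <- nchar("\\d")         # 2: the string holds a backslash and a "d"
```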
POSIX character classes
| Class | Description |
| --- | --- |
| [:alnum:] | Any letter or digit (equivalent to [a-zA-Z0-9]) |
| [:alpha:] | Any letter (equivalent to [a-zA-Z]) |
| [:blank:] | Space or tab (equivalent to [\t ]; note the space after \t) |
| [:cntrl:] | ASCII control characters (ASCII 0 to 31, plus ASCII 127) |
| [:digit:] | Any digit (equivalent to [0-9]) |
| [:graph:] | Same as [:print:], but excluding the space |
| [:lower:] | Any lowercase letter (equivalent to [a-z]) |
| [:print:] | Any printable character |
| [:punct:] | Any character that belongs to neither [:alnum:] nor [:cntrl:] |
| [:space:] | Any whitespace character, including the space (equivalent to [\f\n\r\t\v ]; note the space after \v) |
| [:upper:] | Any uppercase letter (equivalent to [A-Z]) |
| [:xdigit:] | Any hexadecimal digit (equivalent to [a-fA-F0-9]) |
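In R, a POSIX class only has its special meaning inside a bracket expression, so it is written with double brackets, for example [[:digit:]]. The sketch below uses a hypothetical sample string to show a few of the classes.

```r
# Hypothetical sample string (an assumption for illustration)
s <- "R 3.3.1, released 2016!"

no_digits    <- gsub("[[:digit:]]", "", s)    # "R .., released !"
no_punct     <- gsub("[[:punct:]]", "", s)    # "R 331 released 2016"
letters_only <- gsub("[^[:alpha:]]", "", s)   # "Rreleased"
underscored  <- gsub("[[:space:]]+", "_", s)  # "R_3.3.1,_released_2016!"
starts_upper <- grepl("^[[:upper:]]", s)      # TRUE
```

A class can be combined with other characters or negated inside the same brackets, as in [^[:alpha:]] above.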
Other metacharacters
| Metacharacter | Description |
| --- | --- |
| . | Matches any single character: a letter, a digit, or even the "." character itself. One regular expression may contain several "." characters. Depending on the engine, it may not match a line break. |
| \\ | Escape character. In an R string the backslash is doubled, so "\\." matches a literal period; to match a literal backslash, write "\\\\". |
| \| | Alternation (vertical bar): matches either the expression before it or the expression after it |
| ^ | Negation when placed first inside brackets, so [^0-9] matches anything except a digit. Outside brackets, it anchors the match at the start of a line. |
| $ | Placed at the end of a pattern; anchors the match at the end of a line |
| () | Groups and captures the matched string; for example, (\\s*) captures a run of consecutive whitespace characters |
| [] | Matches any one of the characters inside the brackets (for example, [0-2] and [012] are exactly equivalent, and [Rr] matches either R or r) |
| {} | The number of repetitions of the preceding character or expression. For example, {5,12} means at least 5 and at most 12 repetitions; otherwise there is no match. |
| * | Matches zero or more repetitions of the preceding character or character set |
| + | Matches one or more repetitions of the preceding character or character set |
| ? | Matches zero or one occurrence of the preceding character or character set |
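The metacharacters above can be tried out with a few one-liners. This is only a sketch: the vector words and the other inputs are hypothetical data chosen for illustration.

```r
# Hypothetical sample data (an assumption for illustration)
words <- c("cat", "cart", "dog", "catalog", "a.b", "colour", "color")

starts_cat  <- grep("^cat", words, value = TRUE)     # anchored at the start
ends_og     <- grep("og$", words, value = TRUE)      # anchored at the end
cat_or_dog  <- grep("cat|dog", words, value = TRUE)  # alternation
optional_u  <- grep("colou?r", words, value = TRUE)  # "?" makes the "u" optional
literal_dot <- grep("a\\.b", words, value = TRUE)    # escaped ".", matched literally

star <- grepl("ab*c", c("ac", "abc", "abbbc"))       # zero or more "b"
plus <- grepl("ab+c", c("ac", "abc", "abbbc"))       # one or more "b"
reps <- grepl("^[0-9]{5,12}$", c("1234", "12345", "1234567890123"))  # 5 to 12 digits

trimmed   <- sub("^(\\s*)(.*)$", "\\2", "   padded") # keep only the second group
backslash <- grepl("\\\\", "C:\\temp")               # matches one literal backslash
```

In the replacement string of sub() and gsub(), "\\1" and "\\2" refer back to the text captured by the first and second pair of parentheses.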
Now let's give some examples.
First, use the [] brackets to look for words that contain the combination "do".
text <- c("Don't", "aim", "for", "success", "if", "You", "want", "it", "just", "does", "what", "I", "Love", "and", "Believe", "in", "and", "it", "would", "Come", "naturally")
# Find words containing the combination "do"
grep("[Dd]o", text)  # matches either uppercase or lowercase D
grep("[D]o", text)   # D must be uppercase
grep("[d]o", text)   # d must be lowercase
The results of the operation are as follows:
> text <- c("Don't", "aim", "for", "success", "if", "You", "want", "it", "just", "does", "what", "I", "Love",
+           "and", "Believe", "in", "and", "it", "would", "Come", "naturally")
>
> # Find words containing the combination "do"
> grep("[Dd]o", text)  # matches either uppercase or lowercase D
[1]  1 10
> grep("[D]o", text)   # D must be uppercase
[1] 1
> grep("[d]o", text)   # d must be lowercase
[1] 10
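As a small variation on the same word list, the sketch below returns the matching words rather than their positions, and uses ignore.case instead of a bracket expression. The variable names are chosen here for illustration.

```r
text <- c("Don't", "aim", "for", "success", "if", "You", "want", "it", "just",
          "does", "what", "I", "Love", "and", "Believe", "in", "and", "it",
          "would", "Come", "naturally")

do_words  <- grep("[Dd]o", text, value = TRUE)   # the words themselves: "Don't" "does"
do_nocase <- grep("do", text, ignore.case = TRUE) # same positions as "[Dd]o": 1 10
```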
Matching an email address:
# Match an email address
text2 <- c("[email protected] is my email address.")
grepl("[0-9.*][email protected][a-z.*].[a-z.*]", text2)
> text2 <- c("[email protected] is my email address.")
> grepl("[0-9.*][email protected][a-z.*].[a-z.*]", text2)
[1] TRUE
The email address is found.
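Since the address above appears only as a placeholder, here is a minimal, self-contained sketch of the same idea. The address user123@example.com and the pattern email_pat are assumptions made for illustration. The pattern is deliberately simplified and is not a full validator of email syntax.

```r
# Hypothetical address and a simplified pattern (both are assumptions)
text3 <- c("user123@example.com is my email address.", "no address here")

# local part, "@", domain, a literal ".", then at least two letters
email_pat <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

has_email <- grepl(email_pat, text3)                     # TRUE FALSE
emails    <- regmatches(text3, regexpr(email_pat, text3)) # "user123@example.com"
```

grepl() only reports whether an address is present; combining regexpr() with regmatches() extracts the matched address itself.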
When reprinting, please cite the original CSDN link: http://blog.csdn.NET/wzgl__wh/article/details/52938475
Regular Expressions in the R Language