Regular expressions in the R language


In my opinion, regular expressions have two main uses: ① finding specific information, and ② finding and editing specific information, which is the replacement we so often use. For example, pressing Ctrl+F in Word, Notepad, and similar programs to find or replace a particular string is the same kind of task that regular expressions handle in a more general way.

Regular expressions are very powerful, especially for processing text data. In R, the functions grep, grepl, sub, gsub, regexpr, gregexpr, and regexec all match using regular expression rules. Their prototypes are as follows:

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

  

Here is an explanation of the parameters.

Parameter      Description
pattern        The regular expression to match.
x, text        A character vector, or an object that can be coerced to one. From R 3.0.0 onward, long vectors (2^31 or more elements) are supported.
ignore.case    Default FALSE, meaning matching is case-sensitive; if TRUE, case is ignored.
perl           Whether to use Perl-compatible regular expressions.
value          Default FALSE, in which case grep returns the indices of the matching elements; if TRUE, it returns the matching elements of x themselves.
fixed          If TRUE, pattern is a string to be matched literally. Overrides all conflicting arguments.
useBytes       Default FALSE; if TRUE, matching is done byte by byte rather than character by character.
invert         If TRUE, returns the indices or values of the elements that do not match.
replacement    The text that replaces a match in sub and gsub. Elements of x with no match are returned unchanged.
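To make the parameters concrete, here is a minimal sketch of how a few of them change the result. The vector x and the variable names are made up for illustration; only base R functions are used.

```r
# Illustrative input, not from the original article
x <- c("Apple", "banana", "a.b", "cherry")

idx      <- grep("a", x)                      # indices of elements containing a lowercase "a"
vals     <- grep("a", x, value = TRUE)        # the matching elements themselves
nocase   <- grep("a", x, ignore.case = TRUE)  # "Apple" now matches as well
inverted <- grep("a", x, invert = TRUE)       # indices of elements that do NOT match
literal  <- grepl(".", x, fixed = TRUE)       # "." taken literally, not as "any character"
replaced <- sub("a", "_", x)                  # elements without a match come back unchanged

print(idx)       # 2 3
print(vals)      # "banana" "a.b"
print(nocase)    # 1 2 3
print(inverted)  # 1 4
print(literal)   # FALSE FALSE TRUE FALSE
print(replaced)  # "Apple" "b_nana" "_.b" "cherry"
```

Without fixed = TRUE, the pattern "." would match every non-empty element, because an unescaped dot stands for any character.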


Next, we'll talk about the differences between these functions.

Function      Role
grep()        Searches x for matches. Returns the indices of the matching elements, or the elements themselves when value = TRUE.
grepl()       Searches x for matches. Returns a logical vector: TRUE for each element that matches, FALSE otherwise.
sub()         Replaces only the first match in each element; otherwise the same as gsub().
gsub()        Replaces every match in each element. Returns the text after replacement; elements with no match are returned unchanged.
regexpr()     Returns an integer vector of the same length as text giving the starting position of the first match, or -1 if there is none, with an attribute "match.length" that gives the length of the matched text (or -1 for no match). Match positions and lengths are in characters.
gregexpr()    Returns a list of the same length as text. Each element has the same form as the return value of regexpr, except that it gives the starting positions of every (disjoint) match.
regexec()     Returns a list of the same length as text. Each element is either -1 if there is no match, or a sequence of integers with the starting positions of the match and of all substrings corresponding to parenthesized subexpressions of pattern, with an attribute "match.length" giving the lengths of the matches (or -1 for no match).

Some notes on these functions:

In older versions of R, regexec was the one function here that did not support Perl-style regular expressions. The prototype above includes a perl argument, so check ?regexec for the version you are running.

The main effect of useBytes = TRUE is to avoid errors and warnings about invalid input and spurious matches in multibyte locales, but for regexpr it also changes the interpretation of the output. It inhibits the conversion of inputs with marked encodings, and it is forced if any input is found to be marked as "bytes" (see Encoding).

Caseless matching does not make much sense for bytes in a multibyte locale, so with useBytes = TRUE you should expect it to work only for ASCII characters.

regexpr and gregexpr with perl = TRUE allow Python-style named captures, but not for long vector inputs.

Invalid inputs in the current locale are warned about at most 5 times.

For non-ASCII characters, caseless matching with perl = TRUE depends on the PCRE library having been compiled with Unicode property support; an external library might not be.

If you are doing a lot of regular expression matching, including on very long strings, it is worth considering which engine to use. PCRE (perl = TRUE) is generally faster than the default regular expression engine, and fixed = TRUE is faster still (especially when each pattern is matched only a few times).
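The differences between these functions are easiest to see by running them on the same input. The following is a small sketch with made-up data; regmatches() is a base R helper, not mentioned above, that extracts the matched substrings from the positions these functions return.

```r
# Illustrative input, not from the original article
x <- c("apple pie", "banana", "cherry")

idx   <- grep("an", x)      # which elements match
hits  <- grepl("an", x)     # TRUE/FALSE for every element
first <- sub("a", "A", x)   # only the first "a" in each element is replaced
every <- gsub("a", "A", x)  # every "a" is replaced

m <- regexpr("an", x)       # position of the first match per element, -1 if none
g <- gregexpr("an", "banana")[[1]]   # positions of all matches in one string
e <- regexec("(b)(an)", "banana")    # whole match plus each parenthesized group
parts <- regmatches("banana", e)[[1]]

print(idx)                      # 2
print(hits)                     # FALSE TRUE FALSE
print(first)                    # "Apple pie" "bAnana" "cherry"
print(every)                    # "Apple pie" "bAnAnA" "cherry"
print(as.integer(m))            # -1 2 -1
print(attr(m, "match.length"))  # -1 2 -1
print(as.integer(g))            # 2 4
print(parts)                    # "ban" "b" "an"
```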

Below is my summary of regular expression metacharacters. Note that grep, grepl, sub, gsub, regexpr, and gregexpr in R do not accept a single "\" as the escape: inside an R string the backslash must itself be escaped, so write "\\" (for example "\\d" rather than "\d").

Regular expression metacharacters

Whitespace metacharacters
[\b]    Backspace (moves back and deletes one character)
\f      Form feed (page break)
\n      Line feed (newline)
\r      Carriage return
\t      Tab
\v      Vertical tab
Note: \r\n is the line terminator used by Windows; UNIX and Linux end a line of text with just a newline character.

Digits and non-digits
\d      Any digit (equivalent to [0-9])
\D      Any non-digit (equivalent to [^0-9])

Alphanumeric and non-alphanumeric characters
\w      Any letter (uppercase or lowercase), digit, or underscore (equivalent to [a-zA-Z0-9_])
\W      Any character that is not a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_])

Whitespace characters
\s      Any whitespace character (equivalent to [ \f\n\r\t\v])
\S      Any non-whitespace character (equivalent to [^ \f\n\r\t\v])

POSIX character classes (used inside a bracket expression, e.g. [[:alnum:]])
[:alnum:]     Any letter or digit (equivalent to [a-zA-Z0-9])
[:alpha:]     Any letter (equivalent to [a-zA-Z])
[:blank:]     Space or tab (equivalent to [\t ]; note the space after \t)
[:cntrl:]     ASCII control characters (ASCII 0 to 31, plus ASCII 127)
[:digit:]     Any digit (equivalent to [0-9])
[:graph:]     Same as [:print:], but does not include the space
[:lower:]     Any lowercase letter (equivalent to [a-z])
[:print:]     Any printable character
[:punct:]     Any punctuation character, i.e. one that is in neither [:alnum:] nor [:cntrl:] (and is not a space)
[:space:]     Any whitespace character, including the space (equivalent to [\f\n\r\t\v ]; note the space after \v)
[:upper:]     Any uppercase letter (equivalent to [A-Z])
[:xdigit:]    Any hexadecimal digit (equivalent to [a-fA-F0-9])

Other
.       Matches any single character: a letter, a digit, or even the . character itself. One regular expression may contain several . characters. Whether it matches line breaks depends on the engine; with perl = TRUE (PCRE) it does not by default.
\\      Escape character (a single \ in the regular expression, written "\\" in an R string). Put it before a metacharacter to match that character literally, e.g. "\\." matches a period.
|       Alternation: matches either the expression before it or the expression after it.
^       As the first character inside [], negates the set (e.g. [^0-9]); outside brackets, anchors the match to the start of the string.
$       Placed at the end of a pattern, anchors the match to the end of the string.
()      Groups and captures the matched text; for example (\\s*) captures a run of consecutive whitespace characters.
[]      Matches any one of the characters in the brackets (for example [0-2] and [012] are exactly equivalent, and [Rr] matches both the letter R and the letter r).
{}      The number of repetitions of the preceding character or expression. For example {5,12} means at least 5 and at most 12 repetitions; otherwise there is no match.
*       Matches the preceding character or set zero or more times, so it may also match nothing.
+       Matches the preceding character or set one or more times (at least once).
?       Matches the preceding character or set zero or one time.
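A few of the entries above, written the way R expects them. This is a minimal sketch with made-up inputs; note the doubled backslashes in the R strings and the double brackets around the POSIX classes.

```r
# Each line exercises one kind of metacharacter from the tables above
digits  <- gsub("\\d", "#", "a1b22")                        # \d: every digit replaced
upper   <- grepl("^[[:upper:]]", c("Hello", "world"))       # POSIX class inside []
squeeze <- gsub("[[:space:]]+", " ", "a \t b\n c")          # +: runs of whitespace collapsed
literal <- grepl("3\\.14", c("3.14", "3x14"))               # \\.: a literal period
counted <- grepl("^a{2,3}$", c("a", "aa", "aaa", "aaaa"))   # {}: two or three repetitions
either  <- grepl("^(cat|dog)$", c("cat", "dog", "cow"))     # |: alternation inside a group

print(digits)   # "a#b##"
print(upper)    # TRUE FALSE
print(squeeze)  # "a b c"
print(literal)  # TRUE FALSE
print(counted)  # FALSE TRUE TRUE FALSE
print(either)   # TRUE TRUE FALSE
```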

Now let's give some examples.

First, use square brackets [] to look for words that contain "do".

text <- c("Don't", "aim", "for", "success", "if", "You", "want", "it", "just", "does", "what", "I", "Love",
          "and", "Believe", "in", "and", "it", "would", "Come", "naturally")

# Find words that contain "do"
grep("[Dd]o", text)   # either uppercase or lowercase D
grep("[D]o", text)    # uppercase D only
grep("[d]o", text)    # lowercase d only

The results of the operation are as follows:

> text <- c("Don't", "aim", "for", "success", "if", "You", "want", "it", "just", "does", "what", "I", "Love",
+           "and", "Believe", "in", "and", "it", "would", "Come", "naturally")
>
> # Find words that contain "do"
> grep("[Dd]o", text)   # either uppercase or lowercase D
[1]  1 10
> grep("[D]o", text)    # uppercase D only
[1] 1
> grep("[d]o", text)    # lowercase d only
[1] 10

Matching an email address:

# Email matching
text2 <- c("[email protected] is my email address.")
grepl("[0-9.*][email protected][a-z.*].[a-z.*]", text2)

  

> text2 <- c("[email protected] is my email address.")
> grepl("[0-9.*][email protected][a-z.*].[a-z.*]", text2)
[1] TRUE

  

The pattern finds the email address.
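The address in the example above appears only as "[email protected]" in this copy, so here is a separate sketch with a made-up address. The pattern is a deliberately simple assumption (some letters, digits, dots, underscores, or hyphens, then @, then a domain ending in a dot and at least two letters); it is not a full validator for real-world addresses.

```r
# Hypothetical address and sentences, for illustration only
x <- c("user123@example.com is my email address.", "no address here")

# local part, "@", domain, ".", top-level domain of two or more letters
pattern <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

found     <- grepl(pattern, x)                    # does each element contain an address?
addresses <- regmatches(x, regexpr(pattern, x))   # extract the first address from each match

print(found)      # TRUE FALSE
print(addresses)  # "user123@example.com"
```

grepl only reports whether a match exists; combining regexpr with regmatches pulls out the matched text itself.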



When reprinting, please cite the original CSDN link: http://blog.csdn.NET/wzgl__wh/article/details/52938475
