R Language Learning notes: string processing

Source: Internet
Author: User
Tags true true fitbit2 stringr

I wanted to generate a file name for a graphics file in R: the prefix "fitbit", followed by the month, followed by ".jpg". Instead of searching Baidu first, I tried the syntax I knew from other languages, and none of it worked:

C#: "fitbit" + month + ".jpg"

VB: "fitbit" & month & ".jpg"

Haskell: "fitbit" ++ month ++ ".jpg"

I also tried concat and similar function names, with no luck. In the end I had to consult the help, which showed that the paste function is what is needed.

paste() and paste0(): concatenating strings

paste() not only concatenates multiple strings, it also converts other objects to strings automatically and works on vectors, which makes it quite powerful.

paste("fitbit", month, ".jpg", sep = "")

The one surprise is that the default separator is a space, so sep = "" must be specified in order to get a string such as fitbit10.jpg when month = 10.

There is also a paste0 function, which always uses an empty separator.

So paste0("fitbit", month, ".jpg") is a little more concise than the previous code.

To generate the fitbit file names for all 12 months:

paste("fitbit", 1:12, ".jpg", sep = "")

[1] "fitbit1.jpg" "fitbit2.jpg" "fitbit3.jpg" "fitbit4.jpg" "fitbit5.jpg" "fitbit6.jpg" "fitbit7.jpg"

[8] "fitbit8.jpg" "fitbit9.jpg" "fitbit10.jpg" "fitbit11.jpg" "fitbit12.jpg"

As you can see, when the arguments include vectors, the strings are concatenated element by element; if one vector is shorter, it is recycled automatically. The example below pairs the ten Heavenly Stems with the twelve Earthly Branches (written here in pinyin):

a <- c("Jia", "Yi", "Bing", "Ding", "Wu", "Ji", "Geng", "Xin", "Ren", "Gui")

b <- c("Zi", "Chou", "Yin", "Mao", "Chen", "Si", "Wu", "Wei", "Shen", "You", "Xu", "Hai")

paste0(a, b)

[1] "JiaZi" "YiChou" "BingYin" "DingMao" "WuChen" "JiSi" "GengWu" "XinWei" "RenShen" "GuiYou" "JiaXu" "YiHai"

paste also has a collapse parameter, which joins the resulting strings into one long string instead of returning them as a vector.

> paste("fitbit", 1:3, ".jpg", sep = "", collapse = ";")

[1] "fitbit1.jpg;fitbit2.jpg;fitbit3.jpg"
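The behaviour of sep, paste0 and collapse described above can be checked side by side. This is a minimal sketch; the variable names are only illustrative.

```r
month <- 10

# The default separator is a single space
with_space <- paste("fitbit", month, ".jpg")

# sep = "" removes the separator; paste0() does the same thing
no_space <- paste("fitbit", month, ".jpg", sep = "")
same <- paste0("fitbit", month, ".jpg")

# Scalar arguments are recycled along the longest vector
files <- paste0("fitbit", 1:3, ".jpg")

# collapse joins the resulting vector into a single string
joined <- paste0("fitbit", 1:3, ".jpg", collapse = ";")

print(with_space)  # "fitbit 10 .jpg"
print(no_space)    # "fitbit10.jpg"
print(joined)      # "fitbit1.jpg;fitbit2.jpg;fitbit3.jpg"
```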

nchar(): counting characters

nchar() returns the length of a string in characters, which is different from the result of length().

nchar(c("abc", "abcd")) # number of characters in each string; returns the vector c(3, 4)

length(c("abc", "abcd")) # returns 2, the number of elements in the vector
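A short sketch of the difference, using a made-up vector that also contains an empty string:

```r
words <- c("abc", "abcd", "")

chars <- nchar(words)     # characters per element: 3 4 0
elements <- length(words) # number of elements: 3

# nchar() coerces non-character input to character first
digits <- nchar(12345)    # 5

print(chars)
print(elements)
print(digits)
```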

tolower(x) and toupper(x): case conversion.

These need no explanation.

strsplit: splitting strings

strsplit("2014-10-30 2262 10367 7.4 18 1231 77 88 44", split = " ")

[[1]]

[1] "2014-10-30" "2262" "10367" "7.4" "18" "1231" "77" "88" "44"

In fact, the split argument of this function also accepts regular expressions, which makes it very powerful.
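To illustrate the regular-expression support mentioned above, here is a minimal sketch with an invented input line that mixes several delimiters. The fixed = TRUE argument is shown as well because a bare "." would otherwise be interpreted as a regular expression.

```r
line <- "2014-10-30  2262,10367;7.4"

# Split on one or more spaces, commas or semicolons
fields <- strsplit(line, "[ ,;]+")[[1]]

# fixed = TRUE treats the split string literally, so "." means a dot
parts <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# strsplit() always returns a list, hence the [[1]]
date_parts <- as.integer(strsplit(fields[1], "-", fixed = TRUE)[[1]])

print(fields)      # "2014-10-30" "2262" "10367" "7.4"
print(parts)       # "a" "b" "c"
print(date_parts)  # 2014 10 30
```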

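Now suppose we want to generate the October chart for the fitbit data automatically and save it as the file fitbit_month_10.jpg.

m <- 10

# Open a JPEG graphics device with the generated file name
jpeg(paste0("fitbit_month_", m, ".jpg"))

# Keep only the rows whose month equals m
monthdata <- fitbit[as.double(format(fitbit$date, "%m")) == m, ]

# Convert the day of the month to a number so it can be used as the x axis
plot(as.integer(format(monthdata$date, "%d")), monthdata$step, type = "l", xlab = "date", ylab = "steps", main = paste("2014-", m, " step chart", sep = ""))

# Close the device so the file is written
dev.off()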
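The month filter and the file name can be tried without the real data or a graphics device. The sketch below assumes a small made-up data frame named fitbit with date and step columns; the values are invented for illustration. It also shows sprintf as an optional way to zero-pad the month, which the original code does not do.

```r
# Hypothetical stand-in for the real fitbit data
fitbit <- data.frame(
  date = as.Date(c("2014-09-30", "2014-10-01", "2014-10-02", "2014-11-01")),
  step = c(8000, 2262, 10367, 5000)
)

m <- 10

# Same filter as in the plotting code: keep rows from month m
monthdata <- fitbit[as.integer(format(fitbit$date, "%m")) == m, ]

# File name built with paste0()
file_name <- paste0("fitbit_month_", m, ".jpg")

# Optional: zero-padded month, so that files sort correctly by name
padded_name <- sprintf("fitbit_month_%02d.jpg", 3L)

print(monthdata)
print(file_name)    # "fitbit_month_10.jpg"
print(padded_name)  # "fitbit_month_03.jpg"
```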

I have not organized the rest yet, so for now the material I found on the internet is collected below.

===============================================

Although R mainly deals with numbers, strings sometimes make up a significant part of a data analysis. With text mining becoming increasingly important, you need to be able to manipulate string objects fluently during data preprocessing. Of course, if you are more comfortable with another tool, such as Python, you can let it do the dirty work of the preliminary stage.

String extraction: substr() takes a substring of a given string object; its arguments are the start and end positions of the substring.

String substitution: gsub() searches a string for a given pattern and replaces every match with new content. The sub() function is similar, but replaces only the first match.

String matching: grep() searches a given string object for a pattern and returns the indices of the matching elements. The grepl() function is similar, but the "l" at the end means that it returns logical values.
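The five functions just described can be compared on one small vector. This is only a sketch; the sample strings are invented.

```r
x <- c("apple pie", "banana split", "cherry tart")

first5 <- substr(x, 1, 5)       # characters 1 to 5 of each element
one    <- sub("a", "A", x)      # replaces only the first "a" per element
all_a  <- gsub("a", "A", x)     # replaces every "a"
idx    <- grep("an", x)         # indices of matching elements
hit    <- grepl("an", x)        # one logical value per element

print(first5)  # "apple" "banan" "cherr"
print(one)     # "Apple pie" "bAnana split" "cherry tArt"
print(all_a)   # "Apple pie" "bAnAnA split" "cherry tArt"
print(idx)     # 2
print(hit)     # FALSE TRUE FALSE
```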

An example:
Let's look at an example of processing an email message in order to extract the sender's address from the text. The text can be downloaded here. The full text of the message is as follows:
----------------------------
Return-Path: [email protected]
Delivery-Date: Sat Sep 7 05:46:01 2002
From: [email protected] (Skip Montanaro)
Date: Fri, 6 Sep 2002 23:46:01 -0500
Subject: [Spambayes] Speed
Message-ID: <[email protected]>

If the frequency of my laptop's disk chirps is any indication, I'd say
hammie is about 3-5x faster than SpamAssassin.

Skip
----------------------------

# Use the readLines function to read the full message from a local file.
data <- readLines('data')
# Check the class of the object: it is a character vector in which each line of text is one element.
class(data)
# Find the line containing the string "From:" in this character vector
email <- data[grepl('From:', data)]
# Split it on spaces, giving a character vector of four elements.
from <- strsplit(email, ' ')
# The result above is a list, so convert it to a vector.
from <- unlist(from)
# Finally, search for the element that contains '@', which is the sender's email address.
from <- from[grepl('@', from)]
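Since the message file is not included here, the same steps can be tried on an in-memory character vector. This is a minimal sketch: the address sender@example.com and the header lines are placeholders, not the real message. The last step uses regmatches(), a base R helper that the walkthrough above does not use, as an alternative to splitting on spaces.

```r
# Placeholder message; each element stands for one line of the file
msg <- c(
  "Return-Path: sender@example.com",
  "From: sender@example.com (Example Sender)",
  "Subject: [Spambayes] Speed",
  "",
  "Body text here."
)

# Anchor the pattern so that only the header starting with From: matches
from_line <- msg[grepl("^From:", msg)]

# Split on spaces and keep the token that contains "@"
tokens <- unlist(strsplit(from_line, " ", fixed = TRUE))
sender <- tokens[grepl("@", tokens, fixed = TRUE)]

# Alternative: extract the address directly with a regular expression
sender_regex <- regmatches(from_line, regexpr("[^ ]+@[^ ]+", from_line))

print(tokens)
print(sender)  # "sender@example.com"
```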

Complex string operations usually involve regular expressions; for details, see the regex help page (?regex).

# String extraction:
substr(x, start, stop)
substring(text, first, last = 1000000)
substr(x, start, stop) <- value
substring(text, first, last = 1000000) <- value

# String replacement and case conversion:
chartr(old, new, x)
casefold(x, upper = FALSE)
# Matching-related functions:
Exact matching of characters
grep()
Approximate (inexact) matching of characters
agrep()
Character substitution
gsub()
# grep() and gsub() interpret the pattern as a regular expression by default; pass perl = TRUE to use Perl-compatible regular expressions.
grep(pattern, x, ignore.case = FALSE,
perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x,
ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x,
ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
See Also:

regular expression (aka 'regexp') for the details of the pattern
specification.

'glob2rx' to turn wildcard matches into regular expressions.

'agrep' for approximate matching.

'tolower', 'toupper' and 'chartr' for character translations.
'charmatch', 'pmatch', 'match'. 'apropos' uses regexps and has
nice examples.
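Several of the functions listed above come without an example, so here is a brief sketch of chartr, casefold, glob2rx and agrep. The inputs are invented, and the agrep call relies on its default distance settings.

```r
# chartr() translates characters one for one: a -> x, b -> y, c -> z
translated <- chartr("abc", "xyz", "aabbcc")

# casefold() is a wrapper around tolower()/toupper()
upper <- casefold("fitbit", upper = TRUE)

# glob2rx() turns a wildcard pattern into a regular expression
rx <- utils::glob2rx("*.exe")
is_exe <- grepl(rx, c("notepad.exe", "readme.txt"))

# agrep() allows approximate matches ("lasy" is close to "lazy")
approx_idx <- agrep("lasy", c("1 lazy 2", "completely different"))

print(translated)  # "xxyyzz"
print(upper)       # "FITBIT"
print(is_exe)      # TRUE FALSE
print(approx_idx)  # 1
```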

5 String query:
5.1 grep and grepl functions:
These two functions report matches at the level of vector elements; they do not give the detailed position of the match inside each string.

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Although the parameters look similar, the returned results are different. The example below lists all the files in the C:\windows directory and then uses grep and grepl to find the exe files:

files <- list.files("C:/windows")
grep("\\.exe$", files)

## [1] 8 28 30 35 36 58 69 99 100 102 111 112 115 117

grepl("\\.exe$", files)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
##  [34] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [100]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE

grep returns only the indices of the matching elements, whereas grepl returns a result for every element, as a logical vector indicating whether each one matched. For extracting a subset of the data, the two give the same result:

files[grep("\\.exe$", files)]

## [1] "Bfsvc.exe" "Explorer.exe" "Fveupdate.exe" "HelpPane.exe"
## [5] "hh.exe" "notepad.exe" "Regedit.exe" "Twunk_16.exe"
## [9] "Twunk_32.exe" "Uninst.exe" "WinHelp.exe" "WinHlp32.exe"
## [13] "Write.exe" "Xinstaller.exe"

files[grepl("\\.exe$", files)]

## [1] "Bfsvc.exe" "Explorer.exe" "Fveupdate.exe" "HelpPane.exe"
## [5] "hh.exe" "notepad.exe" "Regedit.exe" "Twunk_16.exe"
## [9] "Twunk_32.exe" "Uninst.exe" "WinHelp.exe" "WinHlp32.exe"
## [13] "Write.exe" "Xinstaller.exe"

5.2 regexpr, gregexpr and regexec
The results returned by these three functions contain the exact position of each match and the length of the matched string, which can be used to extract the string.

text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")
regexpr("Adam", text)

## [1] 9 5 14
## attr(,"match.length")
## [1] 4 4 4
## attr(,"useBytes")
## [1] TRUE

gregexpr("Adam", text)

## [[1]]
## [1] 9
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 5
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 14
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE

regexec("Adam", text)

## [[1]]
## [1] 9
## attr(,"match.length")
## [1] 4
##
## [[2]]
## [1] 5
## attr(,"match.length")
## [1] 4
##
## [[3]]
## [1] 14
## attr(,"match.length")
## [1] 4
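The position information returned by these three functions is usually passed on to an extraction step. One common way to do that is the base R helper regmatches(), which is not covered in the text above; the sketch below shows it with each of the three functions on a small sample vector.

```r
text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")

# regexpr(): first match per element, returned as a character vector
first_match <- regmatches(text, regexpr("Adam", text))

# gregexpr(): all matches per element, returned as a list
capitals <- regmatches(text, gregexpr("[A-Z]", text))

# regexec(): full match plus parenthesised groups for each element
groups <- regmatches(text[2], regexec("(\\w+), (\\w+)", text[2]))[[1]]

print(first_match)  # "Adam" "Adam" "Adam"
print(capitals)     # each element holds "H" and "A"
print(groups)       # "Hi, Adam" "Hi" "Adam"
```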

6 String substitution
6.1 sub and gsub functions
Although sub and gsub are the functions used for string substitution, strictly speaking R has no function that replaces a string in place, because no operation in R modifies the value of the argument passed to it.

text

## [1] "Hellow, Adam!" "Hi, Adam!" "How are you, Adam."

sub(pattern = "Adam", replacement = "world", text)

## [1] "Hellow, world!" "Hi, world!" "How are you, world."

text

## [1] "Hellow, Adam!" "Hi, Adam!" "How are you, Adam."

As you can see, the original strings are unchanged even though they were "replaced"; the only way to change the original variable is to assign the result back to it. The difference between sub and gsub is that sub makes only one replacement per string (no matter how many matches there are), whereas gsub replaces every match:

sub(pattern = "Adam|Ava", replacement = "world", text)

## [1] "Hellow, world!" "Hi, world!" "How are you, world."

gsub(pattern = "Adam|Ava", replacement = "world", text)

## [1] "Hellow, world!" "Hi, world!" "How are you, world."

The sub and gsub functions can use backreferences (escape character + number) in the replacement to keep only part of the match:

sub(pattern = ".*(Adam).*", replacement = "\\1", text)

## [1] "Adam" "Adam" "Adam"
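Because every sample string contains only one match, the difference between sub and gsub is easier to see on a string with several matches. A minimal sketch with an invented sentence:

```r
s <- "Adam met Ava; Adam waved."

once  <- sub("Adam|Ava", "world", s)   # only the first match is replaced
every <- gsub("Adam|Ava", "world", s)  # all matches are replaced

# Backreferences: \\1 and \\2 refer to the parenthesised groups
swapped <- gsub("(\\w+)@(\\w+)", "\\2 at \\1", "user@host")

# The input is never modified; assign the result to keep the change
unchanged <- s
s <- every

print(once)     # "world met Ava; Adam waved."
print(every)    # "world met world; world waved."
print(swapped)  # "host at user"
```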

7 String extraction
7.1 substr and substring functions
The substr and substring functions split or extract strings by position. They do not use regular expressions themselves, but combined with the regular-expression functions regexpr, gregexpr or regexec they make it easy to extract the required information from a large amount of text. The two take essentially the same parameters:

substr(x, start, stop)
substring(text, first, last = 1000000L)
x/text is the character vector to extract from
start/first is the vector of starting positions
stop/last is the vector of ending positions
However, the length (number of elements) of their return values differs:
substr returns as many strings as the length of its first argument
substring returns as many strings as the length of the longest of its three arguments; the shorter vectors are recycled.
An example in which the first argument (the character vector to extract from) has length 1:

x <- "123456789"
substr(x, c(2, 4), c(4, 5, 8))

## [1] "234"

substring(x, c(2, 4), c(4, 5, 8))

## [1] "234" "45" "2345678"

Because x is a vector of length 1, substr returns only one string; that is, only the first combination of the second and third arguments is used: start position 2, end position 4. For substring, the longest of the three arguments is c(4, 5, 8); following the recycling rule, the first argument is effectively c(x, x, x) and the second is c(2, 4, 2), so the extracted ranges are 2-4, 4-5 and 2-8.
Use these rules to interpret the results of the following statements:

x <- c("123456789", "abcdefghijklmnopq")
substr(x, c(2, 4), c(4, 5, 8))

## [1] "234" "de"

substring(x, c(2, 4), c(4, 5, 8))

## [1] "234" "de" "2345678"

The substring function makes it easy to split a DNA/RNA sequence into triplets (for protein translation):

bases <- c("A", "T", "G", "C")
dna <- paste(sample(bases, 12, replace = TRUE), collapse = "")
dna

## [1] "GCAGCGCATATG"

substring(dna, seq(1, 10, by = 3), seq(3, 12, by = 3))

## [1] "GCA" "GCG" "CAT" "ATG"

As an exercise, try using regexpr, gregexpr or regexec to get the position information and then extract the matching strings.
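One possible way to combine position information with substring is sketched below. The sample strings are invented, and the end position is computed as start plus match.length minus 1.

```r
text <- c("id=42;", "no digits", "x=7 y=123")

# regexpr() gives the start of the first match, or -1 if there is none
m <- regexpr("[0-9]+", text)
start <- as.integer(m)
len <- attr(m, "match.length")
found <- start > 0

# Extract the first number from each string that has one
first_numbers <- substring(text[found], start[found],
                           start[found] + len[found] - 1)

# gregexpr() gives every match; here applied to the third string only
g <- gregexpr("[0-9]+", text[3])[[1]]
g_start <- as.integer(g)
all_numbers <- substring(text[3], g_start,
                         g_start + attr(g, "match.length") - 1)

print(first_numbers)  # "42" "7"
print(all_numbers)    # "7" "123"
```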
8 Other:
8.1 strtrim function
strtrim(x, width) trims strings to a given display width; the character vector it returns has the same length as x. Because it only "trims", it can remove surplus characters but never adds any: if a string is shorter than width, the original string is returned, so do not expect it to be padded with spaces or other characters:

strtrim(c("abcdef", "abcdef", "abcdef"), c(1, 5, 10))

## [1] "a" "abcde" "abcdef"

strtrim(c(1, 123, 1234567), 4)

## [1] "1" "123" "1234"
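When padding is wanted as well, strtrim can be combined with a formatting function. The sketch below uses sprintf and formatC from base R, which are not discussed above, to produce fixed-width strings.

```r
x <- c("a", "abc", "abcdef")

# strtrim() only shortens; short strings are returned unchanged
trimmed <- strtrim(x, 4)

# sprintf("%-4s") pads on the right to at least 4 characters
# but does not shorten longer strings
padded <- sprintf("%-4s", x)

# Trimming first and padding afterwards gives exactly 4 characters each
fixed_width <- formatC(strtrim(x, 4), width = 4, flag = "-")

print(trimmed)      # "a" "abc" "abcd"
print(padded)       # "a   " "abc " "abcdef"
print(fixed_width)  # "a   " "abc " "abcd"
```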

8.2 strwrap function
This function treats a string as a paragraph of text (regardless of whether it contains newline characters), breaks it into lines at word boundaries according to the paragraph format (indentation and width), and returns each line as one string. For example:

str1 <- "Each character string in the input is first split into paragraphs\n(or lines containing whitespace only). The paragraphs are then\nformatted by breaking lines at word boundaries. The target\ncolumns for wrapping lines and the indentation of the first and\nall subsequent lines of a paragraph can be controlled\nindependently."
str2 <- rep(str1, 2)
strwrap(str2, width = 80, indent = 2)

##  [1] "  Each character string in the input is first split into paragraphs (or lines"
##  [2] "containing whitespace only). The paragraphs are then formatted by breaking"
##  [3] "lines at word boundaries. The target columns for wrapping lines and the"
##  [4] "indentation of the first and all subsequent lines of a paragraph can be"
##  [5] "controlled independently."
##  [6] "  Each character string in the input is first split into paragraphs (or lines"
##  [7] "containing whitespace only). The paragraphs are then formatted by breaking"
##  [8] "lines at word boundaries. The target columns for wrapping lines and the"
##  [9] "indentation of the first and all subsequent lines of a paragraph can be"
## [10] "controlled independently."

The simplify parameter specifies the form of the result. It defaults to TRUE, meaning that all lines are placed in order in a single character vector (as above); if it is FALSE, the result is a list. Another parameter, exdent, specifies the indentation of every line except the first:

strwrap(str1, width = 80, indent = 0, exdent = 2)

## [1] "Each character string in the input is first split into paragraphs (or lines"
## [2] "  containing whitespace only). The paragraphs are then formatted by breaking"
## [3] "  lines at word boundaries. The target columns for wrapping lines and the"
## [4] "  indentation of the first and all subsequent lines of a paragraph can be"
## [5] "  controlled independently."
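The effect of simplify is easier to see on a short input. A minimal sketch with an invented sentence and a narrow width:

```r
txt <- "The quick brown fox jumps over the lazy dog."

# Default: one character vector with one element per output line
lines <- strwrap(txt, width = 22)

# simplify = FALSE: a list with one character vector per input string
per_input <- strwrap(c(txt, txt), width = 22, simplify = FALSE)

# The wrapped lines can be joined again for printing
cat(paste(lines, collapse = "\n"), "\n")

print(lines)             # "The quick brown fox" "jumps over the lazy" "dog."
print(length(per_input)) # 2
```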

8.3 match and charmatch

match("xx", c("abc", "xx", "xxx", "xx"))

## [1] 2

match(2, c(3, 1, 2, 4))

## [1] 3

charmatch("xx", "xx")

## [1] 1

charmatch("xx", "xxa")

## [1] 1

charmatch("xx", "axx")

## [1] NA

match works element by element on vectors and returns the position of the first matching element (if any); it also works on non-character vectors. The charmatch function, which does partial matching from the start of the string only, is a real trap for the unwary. I have not looked at the other functions; in practice, regular expressions are enough.
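Part of what makes charmatch surprising is its return value for ambiguous partial matches. The sketch below compares it with pmatch and match on invented inputs.

```r
choices <- c("mean", "median")

# "me" is a prefix of both choices: charmatch() returns 0 for ambiguity
ambiguous <- charmatch("me", choices)

# "med" is a prefix of exactly one choice: its index is returned
unique_hit <- charmatch("med", choices)

# pmatch() returns NA instead of 0 in the ambiguous case
pm <- pmatch("me", choices)

# match() requires whole-element equality; %in% is its logical form
positions <- match(c("b", "z"), c("a", "b", "c"))
present <- c("b", "z") %in% c("a", "b", "c")

print(ambiguous)   # 0
print(unique_hit)  # 2
print(pm)          # NA
print(positions)   # 2 NA
print(present)     # TRUE FALSE
```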

=================================

There is also the stringr package, which wraps the original string-processing functions and unifies their names and arguments. On top of these enhancements, it handles vectorized data and is compatible with non-character data. The stringr package claims to reduce the time spent on string processing by 95%. Some of its main functions are listed below.

library(stringr)

# Concatenate strings
fruit <- c("apple", "banana", "pear", "pinapple")
res <- str_c(1:4, fruit, sep = ' ', collapse = ' ')
str_c('I want to buy ', res, collapse = ' ')

# Compute string lengths
str_length(c("i", "like", "programming R", 123, res))

# Take substrings by position
str_sub(fruit, 1, 3)
# Assign new values to substrings
capital <- toupper(str_sub(fruit, 1, 1))
str_sub(fruit, rep(1, 4), rep(1, 4)) <- capital

# Repeat strings
str_dup(fruit, c(1, 2, 3, 4))

# Pad with whitespace
str_pad(fruit, 10, "both")
# Remove whitespace
str_trim(fruit)

# Check for matches against a regular expression
str_detect(fruit, "a$")
str_detect(fruit, "[aeiou]")

# Find the position of a match
str_locate(fruit, "a")

# Extract the matching parts
str_extract(fruit, "[a-z]+")
str_match(fruit, "[a-z]+")

# Replace the matching parts
str_replace(fruit, "[aeiou]", "-")

# Split
str_split(res, " ")
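For readers without stringr installed, several of the calls above have rough base R counterparts. The pairing below is a sketch rather than an exact equivalence; the two families can differ in details such as the handling of NA values and argument order.

```r
fruit <- c("apple", "banana", "pear", "pinapple")

lengths_base  <- nchar(fruit)               # compare str_length()
prefix_base   <- substr(fruit, 1, 3)        # compare str_sub()
ends_in_a     <- grepl("a$", fruit)         # compare str_detect()
replaced_base <- sub("[aeiou]", "-", fruit) # compare str_replace()
repeated_base <- strrep(fruit[1:2], 2)      # compare str_dup()
trimmed_base  <- trimws("  pear  ")         # compare str_trim()

print(lengths_base)   # 5 6 4 8
print(prefix_base)    # "app" "ban" "pea" "pin"
print(ends_in_a)      # FALSE TRUE FALSE FALSE
print(replaced_base)  # "-pple" "b-nana" "p-ar" "p-napple"
```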
