Contents
- stringr Introduction
- stringr Installation
- stringr API Introduction
1. stringr Introduction
The stringr package is a consistent, easy-to-use toolset for working with strings. All of its functions and parameter definitions follow the same conventions; for example, NA values and zero-length vectors are handled the same way throughout.
String processing is not the main purpose of the R language, but it is still necessary: data cleaning, visualization, and many other tasks rely on it. The string functions in R's base package have accumulated over time and now suffer from inconsistent behavior, irregular naming, and non-standard parameter definitions, which makes them hard to pick up. String processing is convenient in many other languages, and R lags behind in this respect. The stringr package addresses this problem by providing a friendly interface that makes string processing easy.
stringr project home page: https://cran.r-project.org/web/packages/stringr/index.html
2. stringr Installation
The system environment used in this article:
- Win10 64bit
- R: 3.2.3 x86_64-w64-mingw32/x64 (64-bit)
stringr is a standard package published on CRAN, and it can be installed and loaded with two commands.
~ R
> install.packages('stringr')
> library(stringr)
3. stringr API Introduction
Version 1.0.0 of the stringr package provides 30 functions in total for processing strings. The commonly used functions all start with str_, which makes it easier to see at a glance what a function does. The functions can be grouped by how they are used:
String concatenation functions
- str_c: string concatenation.
- str_join: string concatenation, same as str_c.
- str_trim: remove leading and trailing whitespace, such as spaces and tabs (\t)
- str_pad: pad a string to a given length
- str_dup: duplicate strings
- str_wrap: control the output format of a string
- str_sub: extract substrings
- str_sub<-: extract a substring and assign a new value to it, with the same rules as str_sub
String calculation functions
- str_count: count matches in a string
- str_length: string length
- str_sort: sort string values
- str_order: return the sorted indexes of the strings, with the same rules as str_sort
String matching functions
- str_split: split a string
- str_split_fixed: split a string into a fixed number of pieces, with the same rules as str_split
- str_subset: return the strings that match
- word: extract words from text
- str_detect: check whether a string contains a match
- str_match: extract the matching groups from a string.
- str_match_all: extract all matching groups from a string, with the same rules as str_match
- str_replace: string substitution
- str_replace_all: replace all matches, with the same rules as str_replace
- str_replace_na: replace NA with the string "NA"
- str_locate: find the position of the match in a string.
- str_locate_all: find the positions of all matches, with the same rules as str_locate
- str_extract: extract the matching characters from a string
- str_extract_all: extract all matching characters from a string, with the same rules as str_extract
String transformation functions
- str_conv: character encoding conversion
- str_to_upper: convert a string to uppercase
- str_to_lower: convert a string to lowercase, with the same rules as str_to_upper
- str_to_title: capitalize the first letter of each word, with the same rules as str_to_upper
Parameter control functions, which are only used to build the arguments of other functions and cannot be used on their own.
- boundary: defines the boundary to match on
- coll: defines matching by standard string collation rules.
- fixed: defines a literal string to match, so that regular-expression metacharacters are treated as ordinary characters
- regex: defines a regular expression
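As a rough illustration of how these helper functions are passed in the pattern argument, here is a minimal sketch. The sample strings are invented for this example, and it assumes the ignore_case argument of regex() as found in stringr 1.0.0 and later.

```r
library(stringr)

# Sample inputs invented for this illustration.
x <- c("Apple", "a.b.c", "This is a sentence.")

# regex(): a regular expression with options, here case-insensitive matching.
has_apple <- str_detect(x[1], regex("apple", ignore_case = TRUE))

# fixed(): treat "." as a literal dot instead of "any character".
dot_count <- str_count(x[2], fixed("."))

# boundary(): count words instead of matching a character pattern.
word_count <- str_count(x[3], boundary("word"))

print(has_apple)   # TRUE
print(dot_count)   # 2
print(word_count)  # 4
```

Wrapping the pattern this way keeps the main function call the same while changing how the pattern is interpreted.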
3.1 String concatenation functions
3.1.1 str_c: string concatenation; identical to str_join, but not exactly the same as paste() in behavior.
Function definition:
str_c(..., sep = "", collapse = NULL)
str_join(..., sep = "", collapse = NULL)
Parameter list:
- ...: one or more input vectors
- sep: the separator placed between the input strings when they are joined.
- collapse: the separator used when collapsing the elements of the result vector into a single string.
Concatenating multiple strings into one string.
> str_c('a','b')
[1] "ab"
> str_c('a','b',sep='-')
[1] "a-b"
> str_c(c('a','a1'),c('b','b1'),sep='-')
[1] "a-b"   "a1-b1"
Collapsing vector arguments into a single string.
> str_c(head(letters), collapse = "")
[1] "abcdef"
> str_c(head(letters), collapse = ", ")
[1] "a, b, c, d, e, f"
# The collapse parameter has no effect when the inputs are single strings
> str_c('a','b',collapse = "-")
[1] "ab"
> str_c(c('a','a1'),c('b','b1'),collapse='-')
[1] "ab-a1b1"
When the input vector contains NA values, the corresponding results are also NA.
> str_c(c("a", NA, "b"), "-d")
[1] "a-d" NA    "b-d"
Comparing the str_c() function with the paste() function.
# Concatenating multiple strings: the default sep differs
> str_c('a','b')
[1] "ab"
> paste('a','b')
[1] "a b"
# Concatenating a vector: the collapse parameter behaves the same
> str_c(head(letters), collapse = "")
[1] "abcdef"
> paste(head(letters), collapse = "")
[1] "abcdef"
# Concatenating a vector that contains NA: NA is handled differently
> str_c(c("a", NA, "b"), "-d")
[1] "a-d" NA    "b-d"
> paste(c("a", NA, "b"), "-d")
[1] "a -d"  "NA -d" "b -d"
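If paste-like handling of NA is wanted from str_c(), one possible approach is to pass the input through str_replace_na() first. A minimal sketch using the same sample vector:

```r
library(stringr)

x <- c("a", NA, "b")

# str_c() propagates the missing value...
strict <- str_c(x, "-d")

# ...unless NA is first turned into the literal string "NA".
lenient <- str_c(str_replace_na(x), "-d")

print(strict)   # "a-d" NA "b-d"
print(lenient)  # "a-d" "NA-d" "b-d"
```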
3.1.2 str_trim: remove leading and trailing whitespace, such as spaces and tabs (\t)
Function definition:
str_trim(string, side = c("both", "left", "right"))
Parameter list:
- string: a string or string vector.
- side: which side to trim: both trims both sides, left trims the left side, right trims the right side
Removing spaces and tabs (\t) from a string
# Trim whitespace on the left only
> str_trim(" left space\t\n",side='left')
[1] "left space\t\n"
# Trim whitespace on the right only
> str_trim(" left space\t\n",side='right')
[1] " left space"
# Trim whitespace on both sides
> str_trim(" left space\t\n",side='both')
[1] "left space"
# Trim whitespace on both sides (the default)
> str_trim("\nno space\n\t")
[1] "no space"
3.1.3 str_pad: pad a string to a given length
Function definition:
str_pad(string, width, side = c("left", "right", "both"), pad = " ")
Parameter list:
- string: a string or string vector.
- width: the length of the string after padding
- side: the padding direction: both pads both sides, left pads on the left, right pads on the right
- pad: the character used for padding
Padding a string to a given length.
# Pad with spaces on the left until the string is 20 characters long
> str_pad("conan", 20, "left")
[1] "               conan"
# Pad with spaces on the right until the string is 20 characters long
> str_pad("conan", 20, "right")
[1] "conan               "
# Pad with spaces on both sides until the string is 20 characters long
> str_pad("conan", 20, "both")
[1] "       conan        "
# Pad with the character x on both sides until the string is 20 characters long
> str_pad("conan", 20, "both",'x')
[1] "xxxxxxxconanxxxxxxxx"
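A common use of str_pad() is zero-padding identifiers to a fixed width. The ID values below are hypothetical and only serve to sketch the idea:

```r
library(stringr)

# Hypothetical ID values of varying length.
ids <- c("7", "42", "123", "1234567")

# Left-pad with "0" to a width of 5; strings already longer than 5
# are returned unchanged rather than truncated.
padded <- str_pad(ids, width = 5, side = "left", pad = "0")

print(padded)  # "00007" "00042" "00123" "1234567"
```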
3.1.4 str_dup: duplicate strings
Function definition:
str_dup(string, times)
Parameter list:
- string: a string or string vector.
- times: the number of copies
Duplicating a string vector.
> val <- c("abca4", 123, "cba2")
# Duplicate each element 2 times
> str_dup(val, 2)
[1] "abca4abca4" "123123"     "cba2cba2"
# Duplicate each element according to its position
> str_dup(val, 1:3)
[1] "abca4"        "123123"       "cba2cba2cba2"
3.1.5 str_wrap: control the output format of a string
Function definition:
str_wrap(string, width = 80, indent = 0, exdent = 0)
Parameter list:
- string: a string or string vector.
- width: the width of each line.
- indent: the indentation of the first line of a paragraph
- exdent: the indentation of the remaining lines of a paragraph
# Sample Chinese text: a paragraph about the growth of the R language community
> txt<-'R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。'
# Set the width to 40 characters
> cat(str_wrap(txt, width = 40), "\n")
R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。
# Set the width to 60 characters, with the first line indented by 2 characters
> cat(str_wrap(txt, width = 60, indent = 2), "\n")
  R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。
# Set the width to 10 characters, with the non-first lines indented by 4 characters
> cat(str_wrap(txt, width = 10, exdent = 4), "\n")
R语言作为
    统计学一
    门语言,
    一直在小
    众领域闪
    耀着光芒。
    直到大数据
    的爆发,R
    语言变成了
    一门炙手可
    热的数据分
    析的利器。
    随着越来
    越多的工程
    背景的人的
    加入,R语
    言的社区在
    迅速扩大成
    长。现在已
    不仅仅是统
    计领域,教
    育,银行,
    电商,互联
    网….都在使
3.1.6 str_sub: extract substrings
Function definition:
str_sub(string, start = 1L, end = -1L)
Parameter list:
- string: a string or string vector.
- start: the starting position
- end: the ending position
Extracting a substring.
> txt <- "I am Conan."
# Extract the characters at positions 1 to 4
> str_sub(txt, 1, 4)
[1] "I am"
# Extract the characters at positions 1 to 6
> str_sub(txt, end=6)
[1] "I am C"
# Extract the characters from position 6 to the end
> str_sub(txt, 6)
[1] "Conan."
# Extract two substrings at once
> str_sub(txt, c(1, 4), c(6, 8))
[1] "I am C" "m Con"
# Extract using negative positions, counted from the end
> str_sub(txt, -3)
[1] "an."
> str_sub(txt, end = -3)
[1] "I am Cona"
Assigning a value to the extracted substring.
> x <- "AAABBBCCC"
# Replace the character at position 1 with 1
> str_sub(x, 1, 1) <- 1; x
[1] "1AABBBCCC"
# Replace the characters from position 2 to position -2 with 2345
> str_sub(x, 2, -2) <- "2345"; x
[1] "12345C"
3.2 String calculation functions
3.2.1 str_count: count matches in a string
Function definition:
str_count(string, pattern = "")
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
Counting the matching characters in a string
> str_count('aaa444sssddd', "a")
[1] 3
Counting the matching characters in a string vector
> fruit <- c("apple", "banana", "pear", "pineapple")
> str_count(fruit, "a")
[1] 1 3 1 1
> str_count(fruit, "p")
[1] 2 0 1 3
Counting the '.' character in a string: because '.' is a regular-expression metacharacter that matches any character, using it directly as the pattern gives the wrong count.
> str_count(c("a.", ".", ".a.",NA), ".")
[1]  2  1  3 NA
# Match the literal character with fixed()
> str_count(c("a.", ".", ".a.",NA), fixed("."))
[1]  1  1  2 NA
# Match the literal character by escaping it with \\
> str_count(c("a.", ".", ".a.",NA), "\\.")
[1]  1  1  2 NA
3.2.2 str_length: string length
Function definition:
str_length(string)
Parameter list:
- string: a string or string vector.
Calculating the length of a string:
> str_length(c("I", "am", "张丹", NA))
[1]  1  2  2 NA
3.2.3 str_sort: sort string values; str_order returns the sorted indexes
Function definition:
str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
Parameter list:
- x: a string or string vector.
- decreasing: the sort direction.
- na_last: where to place NA values; there are 3 options: TRUE puts them last, FALSE puts them first, NA removes them
- locale: the locale whose collation rules are used for sorting
Sorting string values.
# Sort in ASCII alphabetical order
> str_sort(c('a',1,2,'11'), locale = "en")
[1] "1"  "11" "2"  "a"
# Sort in descending order
> str_sort(letters,decreasing=TRUE)
 [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
[20] "g" "f" "e" "d" "c" "b" "a"
# Sort by pinyin
> str_sort(c('你','好','粉','丝','日','志'),locale = "zh")
[1] "粉" "好" "你" "日" "丝" "志"
Handling NA values when sorting
# Put NA last
> str_sort(c(NA,'1',NA),na_last=TRUE)
[1] "1" NA  NA
# Put NA first
> str_sort(c(NA,'1',NA),na_last=FALSE)
[1] NA  NA  "1"
# Remove NA values
> str_sort(c(NA,'1',NA),na_last=NA)
[1] "1"
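The examples above only call str_sort(). As a small sketch of how str_order() relates to it (the sample vector is made up for this illustration), the indexes it returns can be used to reorder the original vector:

```r
library(stringr)

x <- c("banana", "apple", "cherry")

# str_order() returns the positions that would sort x.
idx <- str_order(x, locale = "en")

# Indexing with those positions gives the same result as str_sort().
sorted_by_index <- x[idx]
sorted_directly <- str_sort(x, locale = "en")

print(idx)              # 2 1 3
print(sorted_by_index)  # "apple" "banana" "cherry"
```

The index form is useful when another object, such as the rows of a data frame, has to be reordered by a string column.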
3.3 String matching functions
3.3.1 str_split: split a string, together with str_split_fixed
Function definition:
str_split(string, pattern, n = Inf)
str_split_fixed(string, pattern, n)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
- n: the number of pieces to return
Splitting a string.
> val <- "abc,123,234,iuuu"
# Split on ,
> s1<-str_split(val, ",");s1
[[1]]
[1] "abc"  "123"  "234"  "iuuu"
# Split on , and keep 2 pieces
> s2<-str_split(val, ",",2);s2
[[1]]
[1] "abc"          "123,234,iuuu"
# The result of str_split() is a list
> class(s1)
[1] "list"
# Split with str_split_fixed(); the result is a matrix
> s3<-str_split_fixed(val, ",",2);s3
     [,1]  [,2]
[1,] "abc" "123,234,iuuu"
> class(s3)
[1] "matrix"
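The difference between the two return types becomes clearer with a vector input. The records below are hypothetical, and the sketch only contrasts the shapes of the two results:

```r
library(stringr)

# Hypothetical records with a varying number of comma-separated fields.
records <- c("a,b,c", "d,e")

# str_split() returns a list whose elements can differ in length.
as_list <- str_split(records, ",")

# str_split_fixed() always returns n columns, padding short rows with "".
as_matrix <- str_split_fixed(records, ",", 3)

print(lengths(as_list))  # 3 2
print(dim(as_matrix))    # 2 3
print(as_matrix[2, ])    # "d" "e" ""
```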
3.3.2 str_subset: return the strings that match
Function definition:
str_subset(string, pattern)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
> val <- c("abc", 123, "cba")
# Match anywhere in the string
> str_subset(val, "a")
[1] "abc" "cba"
# Match at the start of the string
> str_subset(val, "^a")
[1] "abc"
# Match at the end of the string
> str_subset(val, "a$")
[1] "cba"
3.3.3 word: extract words from text
Function definition:
word(string, start = 1L, end = start, sep = fixed(" "))
Parameter list:
- string: a string or string vector.
- start: the starting position.
- end: the ending position.
- sep: the separator between words.
> val <- c("I am Conan.", "http://fens.me, ok")
# Split on spaces by default and take the word at the first position
> word(val, 1)
[1] "I"               "http://fens.me,"
> word(val, -1)
[1] "Conan." "ok"
> word(val, 2, -1)
[1] "am Conan." "ok"
# Split on , and take the word at the first position
> val<-'111,222,333,444'
> word(val, 1, sep = fixed(','))
[1] "111"
> word(val, 3, sep = fixed(','))
[1] "333"
3.3.4 str_detect: check whether a string contains a match
Function definition:
str_detect(string, pattern)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
> val <- c("abca4", 123, "cba2")
# Check whether each string contains a
> str_detect(val, "a")
[1]  TRUE FALSE  TRUE
# Check whether each string starts with a
> str_detect(val, "^a")
[1]  TRUE FALSE FALSE
# Check whether each string ends with a
> str_detect(val, "a$")
[1] FALSE FALSE FALSE
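The logical vector returned by str_detect() is typically used for filtering. A minimal sketch with the same sample values, showing that this gives the same result as str_subset() from 3.3.2:

```r
library(stringr)

val <- c("abca4", "123", "cba2")

# str_detect() gives a logical vector that can be used for subsetting...
hits <- str_detect(val, "^[a-z]")
by_detect <- val[hits]

# ...which is what str_subset() returns directly.
by_subset <- str_subset(val, "^[a-z]")

print(hits)       # TRUE FALSE TRUE
print(by_detect)  # "abca4" "cba2"
```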
3.3.6 str_match: extract the matching groups from a string
Function definition:
str_match(string, pattern)
str_match_all(string, pattern)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
Extracting the matching groups from a string
> val <- c("abc", 123, "cba")
# Match the character a and return the matched text
> str_match(val, "a")
     [,1]
[1,] "a"
[2,] NA
[3,] "a"
# Match a single digit 0-9 and return the matched text
> str_match(val, "[0-9]")
     [,1]
[1,] NA
[2,] "1"
[3,] NA
# Match any number of digits 0-9 and return the matched text
> str_match(val, "[0-9]*")
     [,1]
[1,] ""
[2,] "123"
[3,] ""
Extracting all matching groups from a string; the result is a list of character matrices
> str_match_all(val, "a")
[[1]]
     [,1]
[1,] "a"

[[2]]
     [,1]

[[3]]
     [,1]
[1,] "a"

> str_match_all(val, "[0-9]")
[[1]]
     [,1]

[[2]]
     [,1]
[1,] "1"
[2,] "2"
[3,] "3"

[[3]]
     [,1]
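The patterns above contain no parentheses, so only the full match is returned. As a sketch of what the matching groups look like when the pattern does contain capture groups (the date strings are hypothetical):

```r
library(stringr)

# Hypothetical date strings; the last one does not match the pattern.
dates <- c("2016-01-15", "2015-12-31", "n/a")

# Three capture groups: year, month, day.
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")

# Column 1 holds the full match, columns 2 to 4 hold the groups.
print(dim(m))  # 3 4
print(m[, 1])  # "2016-01-15" "2015-12-31" NA
print(m[, 2])  # "2016" "2015" NA
```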
3.3.7 str_replace: string substitution
Function definition:
str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
- replacement: the string used as the replacement.
> val <- c("abc", 123, "cba")
# Replace the first a or b in each string with -
> str_replace(val, "[ab]", "-")
[1] "-bc" "123" "c-a"
# Replace every a or b in each string with -
> str_replace_all(val, "[ab]", "-")
[1] "--c" "123" "c--"
# Replace every a with an escaped character; "\1" in an R string is the control character \001
> str_replace_all(val, "[a]", "\1\1")
[1] "\001\001bc" "123"        "cb\001\001"
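To refer back to the matched text in the replacement, the backslash has to be doubled so that the replacement string contains \1 rather than a control character. A minimal sketch, using a capture group that is added here for illustration:

```r
library(stringr)

val <- c("abc", "123", "cba")

# "\\1" in the replacement refers to the first capture group,
# so each matched a or b is doubled.
doubled <- str_replace_all(val, "([ab])", "\\1\\1")

print(doubled)  # "aabbc" "123" "cbbaa"
```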
3.3.8 str_replace_na: replace NA with the string "NA"
Function definition:
str_replace_na(string, replacement = "NA")
Parameter list:
- string: a string or string vector.
- replacement: the string used as the replacement.
Replacing NA with a string
> str_replace_na(c(NA,'NA',"abc"),'x')
[1] "x"   "NA"  "abc"
3.3.9 str_locate: find the position of the pattern in a string.
Function definition:
str_locate(string, pattern)
str_locate_all(string, pattern)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
> val <- c("abca", 123, "cba")
# Find the position of a in each string
> str_locate(val, "a")
     start end
[1,]     1   1
[2,]    NA  NA
[3,]     3   3
# Match with a vector of patterns
> str_locate(val, c("a", 12, "b"))
     start end
[1,]     1   1
[2,]     1   2
[3,]     2   2
# Return all matches as a list of matrices
> str_locate_all(val, "a")
[[1]]
     start end
[1,]     1   1
[2,]     4   4

[[2]]
     start end

[[3]]
     start end
[1,]     3   3

# Match a or b and return all matches as a list of matrices
> str_locate_all(val, "[ab]")
[[1]]
     start end
[1,]     1   1
[2,]     2   2
[3,]     4   4

[[2]]
     start end

[[3]]
     start end
[1,]     2   2
[2,]     3   3
3.3.10 str_extract: extract the text that matches a pattern from a string
Function definition:
str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)
Parameter list:
- string: a string or string vector.
- pattern: the pattern to match.
- simplify: the return type; TRUE returns a character matrix, FALSE returns a list of character vectors
> val <- c("abca4", 123, "cba2")
# Return the first matching digit
> str_extract(val, "\\d")
[1] "4" "1" "2"
# Return the first run of matching letters
> str_extract(val, "[a-z]+")
[1] "abca" NA     "cba"
> val <- c("abca4", 123, "cba2")
> str_extract_all(val, "\\d")
[[1]]
[1] "4"

[[2]]
[1] "1" "2" "3"

[[3]]
[1] "2"

> str_extract_all(val, "[a-z]+")
[[1]]
[1] "abca"

[[2]]
character(0)

[[3]]
[1] "cba"
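The simplify parameter is listed above but not exercised in the examples. A minimal sketch of the matrix form, using the same sample values:

```r
library(stringr)

val <- c("abca4", "123", "cba2")

# With simplify = TRUE the result is a character matrix, one row per input,
# padded with "" where an input has fewer matches than the longest row.
digits <- str_extract_all(val, "\\d", simplify = TRUE)

print(dim(digits))  # 3 3
print(digits[1, ])  # "4" "" ""
print(digits[2, ])  # "1" "2" "3"
```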
3.4 String transformation functions
3.4.1 str_conv: character encoding conversion
Function definition:
str_conv(string, encoding)
Parameter list:
- string: a string or string vector.
- encoding: the name of the encoding.
Converting the encoding of Chinese text.
# Convert the Chinese characters to raw bytes
> x <- charToRaw('你好');x
[1] c4 e3 ba c3
# The default character set on this Windows system is GBK, and GB2312 is a subset of GBK, so the conversion works
> str_conv(x, "GBK")
[1] "你好"
> str_conv(x, "GB2312")
[1] "你好"
# Converting from UTF-8 fails
> str_conv(x, "UTF-8")
[1] "???"
Warning messages:
1: In stri_conv(string, encoding, "UTF-8") :
  input data \xffffffc4 in current source encoding could not be converted to Unicode
2: In stri_conv(string, encoding, "UTF-8") :
  input data \xffffffe3\xffffffba in current source encoding could not be converted to Unicode
3: In stri_conv(string, encoding, "UTF-8") :
  input data \xffffffc3 in current source encoding could not be converted to Unicode
Converting Unicode escapes to UTF-8
> x1 <- "\u5317\u4eac"
> str_conv(x1, "UTF-8")
[1] "北京"
3.4.2 str_to_upper: string case conversion.
Function definition:
str_to_upper(string, locale = "")
str_to_lower(string, locale = "")
str_to_title(string, locale = "")
Parameter list:
- string: a string.
- locale: the locale whose case-conversion rules are used
String case conversions:
> val <- "I am conan. Welcome to my blog! http://fens.me"
# All uppercase
> str_to_upper(val)
[1] "I AM CONAN. WELCOME TO MY BLOG! HTTP://FENS.ME"
# All lowercase
> str_to_lower(val)
[1] "i am conan. welcome to my blog! http://fens.me"
# Capitalize the first letter of each word
> str_to_title(val)
[1] "I Am Conan. Welcome To My Blog! Http://Fens.Me"
Strings come up constantly in everyday data processing, where they need to be split, joined, transformed, and so on. The stringr package introduced in this article is a flexible string-processing library that can noticeably improve coding efficiency. With good tools, working with strings in R becomes convenient.
Reproduced from: R Language String Processing Package stringr
---------------------------------
Common features:
# Merge strings
fruit <- c("apple", "banana", "pear", "pinapple")
res <- str_c(1:4, fruit, sep = ' ', collapse = ' ')
str_c('I want to buy ', res, collapse = ' ')
# Calculate the string length
str_length(c("i", "like", "programming R", 123, res))
# Take a substring by position
str_sub(fruit, 1, 3)
# Reassign a substring
capital <- toupper(str_sub(fruit, 1, 1))
str_sub(fruit, rep(1, 4), rep(1, 4)) <- capital
# Repeat strings
str_dup(fruit, c(1, 2, 3, 4))
# Add whitespace
str_pad(fruit, 10, "both")
# Remove whitespace
str_trim(fruit)
# Check for a match according to a regular expression
str_detect(fruit, "a$")
str_detect(fruit, "[aeiou]")
# Find the position of the matching string
str_locate(fruit, "a")
# Extract the matching part
str_extract(fruit, "[a-z]+")
str_match(fruit, "[a-z]+")
# Replace the matching part
str_replace(fruit, "[aeiou]", "-")
# Split
str_split(res, " ")