R: An Introduction to the stringr Package

Source: Internet
Author: User
Tags index sort stringr

Contents

    1. stringr Introduction
    2. stringr Installation
    3. stringr API Introduction

1. stringr Introduction

The stringr package is designed as a consistent, easy-to-use set of string tools. All functions and parameter definitions follow the same conventions; for example, NA values and zero-length vectors are handled the same way by every function.

String processing is not the main purpose of the R language, but it is still necessary: data cleaning, visualization, and many other tasks depend on it. The string functions in R's base package have accumulated over time, and they now suffer from inconsistencies, irregular naming, and non-standard parameter definitions, which makes them hard to pick up at a glance. String processing is convenient in many other languages, and R lags behind in this respect. The stringr package was written to solve this problem: it provides a friendly string-manipulation interface that makes string processing easy to use.

stringr project home page: https://cran.r-project.org/web/packages/stringr/index.html

2. stringr Installation

The system environment used in this article:

    • Win10 64bit
    • R: 3.2.3 x86_64-w64-mingw32/x64 (64-bit)

stringr is a standard package published on CRAN and is easy to install with 2 commands.

~ R
> install.packages('stringr')
> library(stringr)

3. stringr API Introduction

Version 1.0.0 of the stringr package provides 30 functions in total for string processing. The commonly used string functions all start with str_, which makes it easier to tell at a glance what a function does. The functions can be grouped by how they are used:

String concatenation functions

    • str_c: string concatenation
    • str_join: string concatenation, same as str_c (deprecated in favor of str_c)
    • str_trim: removes leading and trailing whitespace, including tabs (\t)
    • str_pad: pads a string to a given length
    • str_dup: duplicates strings
    • str_wrap: controls the output format of a string
    • str_sub: extracts a substring
    • str_sub<-: assigns a value to a substring; same arguments as str_sub

String calculation functions

    • str_count: counts matches in a string
    • str_length: string length
    • str_sort: sorts string values
    • str_order: returns the sort order as indexes; same rules as str_sort

String matching functions

    • str_split: string splitting
    • str_split_fixed: string splitting, same as str_split
    • str_subset: returns the strings that match
    • word: extracts words from text
    • str_detect: checks whether a string matches a pattern
    • str_match: extracts matched groups from a string
    • str_match_all: extracts matched groups from a string, same as str_match
    • str_replace: string replacement
    • str_replace_all: string replacement, same as str_replace
    • str_replace_na: replaces NA with the string "NA"
    • str_locate: finds the position of a match in a string
    • str_locate_all: finds the positions of matches in a string, same as str_locate
    • str_extract: extracts the matching text from a string
    • str_extract_all: extracts the matching text from a string, same as str_extract

String transformation functions

    • str_conv: character encoding conversion
    • str_to_upper: converts a string to upper case
    • str_to_lower: converts a string to lower case; same rules as str_to_upper
    • str_to_title: capitalizes the first letter of each word; same rules as str_to_upper

Parameter control functions, which are used only to construct arguments for the functions above and cannot be used on their own.

    • boundary: defines matching by boundaries
    • coll: defines matching by standard string collation rules
    • fixed: defines a literal match, in which regular-expression escape characters are treated as plain characters
    • regex: defines a regular expression
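The parameter control functions have no example of their own in this article, so here is a minimal sketch of how they change the way a pattern is interpreted. The sample vector x and the sample sentence are made up for illustration, and the calls assume stringr 1.0.0 or later is installed.

```r
library(stringr)

x <- c("a.b", "A.B", "axb")

# regex() is the default interpretation: "." matches any character
str_detect(x, regex("a.b"))
# [1]  TRUE FALSE  TRUE

# regex() with ignore_case = TRUE also matches the upper-case string
str_detect(x, regex("a.b", ignore_case = TRUE))
# [1] TRUE TRUE TRUE

# fixed() compares literally, so "." only matches a real dot
str_detect(x, fixed("a.b"))
# [1]  TRUE FALSE FALSE

# boundary("word") counts words instead of matching a pattern
str_count("one two, three", boundary("word"))
# [1] 3
```

Each helper wraps the pattern argument of another function, which is why none of them does anything useful when called alone.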

3.1 String concatenation functions

3.1.1 str_c, string concatenation; identical to str_join, but not exactly the same as paste().

Function definition:

str_c(..., sep = "", collapse = NULL)
str_join(..., sep = "", collapse = NULL)

Parameter list:

    • ...: one or more input vectors
    • sep: the separator placed between the inputs when they are joined element by element
    • collapse: the separator used when the elements of the resulting vector are joined into a single string

Concatenating multiple strings into one string:

> str_c('a','b')
[1] "ab"
> str_c('a','b',sep='-')
[1] "a-b"
> str_c(c('a','a1'),c('b','b1'),sep='-')
[1] "a-b"   "a1-b1"

Collapsing a vector into one string:

> str_c(head(letters), collapse = "")
[1] "abcdef"
> str_c(head(letters), collapse = ", ")
[1] "a, b, c, d, e, f"
# collapse has no effect when the inputs are single strings
> str_c('a','b',collapse = "-")
[1] "ab"
> str_c(c('a','a1'),c('b','b1'),collapse='-')
[1] "ab-a1b1"

When a string vector containing NA values is concatenated, the NA elements stay NA:

> str_c(c("a", NA, "b"), "-d")
[1] "a-d" NA    "b-d"

Comparing the str_c() and paste() functions:

# Concatenating multiple strings: the default sep differs
> str_c('a','b')
[1] "ab"
> paste('a','b')
[1] "a b"
# Collapsing a vector: collapse behaves the same
> str_c(head(letters), collapse = "")
[1] "abcdef"
> paste(head(letters), collapse = "")
[1] "abcdef"
# Concatenating a vector that contains NA: NA is handled differently
> str_c(c("a", NA, "b"), "-d")
[1] "a-d" NA    "b-d"
> paste(c("a", NA, "b"), "-d")
[1] "a -d"  "NA -d" "b -d"

3.1.2 str_trim: removes leading and trailing whitespace, including tabs (\t)

Function definition:

str_trim(string, side = c("both", "left", "right"))

Parameter list:

    • string: a string or string vector.
    • side: which side to trim; both trims both sides, left trims the left side, right trims the right side

Removing whitespace, including tabs (\t), from a string:

# Trim whitespace on the left only
> str_trim("  left space\t\n",side='left')
[1] "left space\t\n"
# Trim whitespace on the right only
> str_trim("  left space\t\n",side='right')
[1] "  left space"
# Trim whitespace on both sides
> str_trim("  left space\t\n",side='both')
[1] "left space"
# Trim whitespace on both sides (the default)
> str_trim("\nno space\n\t")
[1] "no space"

3.1.3 str_pad: pads a string to a given length

Function definition:

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

Parameter list:

    • string: a string or string vector.
    • width: the length of the string after padding
    • side: which side to pad; both pads both sides, left pads the left side, right pads the right side
    • pad: the character used for padding

Padding a string to a given length:

# Pad with spaces on the left until the string length is 20
> str_pad("conan", 20, "left")
[1] "               conan"
# Pad with spaces on the right until the string length is 20
> str_pad("conan", 20, "right")
[1] "conan               "
# Pad with spaces on both sides until the string length is 20
> str_pad("conan", 20, "both")
[1] "       conan        "
# Pad with the character x on both sides until the string length is 20
> str_pad("conan", 20, "both",'x')
[1] "xxxxxxxconanxxxxxxxx"
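A common practical use of str_pad is zero-padding identifiers so that they line up and sort correctly as text. The following is a small sketch; the ids vector is invented for illustration.

```r
library(stringr)

ids <- c("7", "42", "305")

# Pad on the left with "0" until every string has length 5
str_pad(ids, width = 5, side = "left", pad = "0")
# [1] "00007" "00042" "00305"

# A string already longer than width is returned unchanged
str_pad("123456", width = 5, pad = "0")
# [1] "123456"
```

Note that width is a minimum: str_pad never truncates.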

3.1.4 str_dup: duplicates strings

Function definition:

str_dup(string, times)

Parameter list:

    • string: a string or string vector.
    • times: the number of copies

Duplicating a string vector:

> val <- c("abca4", 123, "cba2")
# Duplicate each element twice
> str_dup(val, 2)
[1] "abca4abca4" "123123"     "cba2cba2"
# Duplicate each element according to its position
> str_dup(val, 1:3)
[1] "abca4"        "123123"       "cba2cba2cba2"

3.1.5 str_wrap, controlling the output format of a string

Function definition:

str_wrap(string, width = 80, indent = 0, exdent = 0)

Parameter list:

    • string: a string or string vector.
    • width: the line width.
    • indent: indentation of the first line of each paragraph
    • exdent: indentation of the remaining lines of each paragraph

> txt <- 'R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。'
# Set the width to 40 characters
> cat(str_wrap(txt, width = 40), "\n")
R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。
# Set the width to 60 characters and indent the first line by 2 characters
> cat(str_wrap(txt, width = 60, indent = 2), "\n")
  R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。
# Set the width to 10 characters and indent the remaining lines by 4 characters
> cat(str_wrap(txt, width = 10, exdent = 4), "\n")
R语言作为
    统计学一
    门语言,
    一直在小
    众领域闪
    耀着光芒。
    直到大数据
    的爆发,R
    语言变成了
    一门炙手可
    热的数据分
    析的利器。
    随着越来
    越多的工程
    背景的人的
    加入,R语
    言的社区在
    迅速扩大成
    长。现在已
    不仅仅是统
    计领域,教
    育,银行,
    电商,互联
    网….都在使
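The effect of width, indent, and exdent is easier to see on space-separated text. The sketch below uses an English sentence made up for this purpose; the exact line breaks depend on the wrapping algorithm, so no output is shown.

```r
library(stringr)

txt_en <- "R began as a language for statisticians. It is now widely used for data analysis in many fields."

# Wrap at about 30 characters, indent the first line by 2 spaces
# and every following line by 4 spaces
wrapped <- str_wrap(txt_en, width = 30, indent = 2, exdent = 4)

# str_wrap returns one string with embedded "\n"; cat() renders the lines
cat(wrapped, "\n")
```

Wrapping only inserts line breaks and indentation: collapsing the whitespace in wrapped gives back the original sentence.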

3.1.6 str_sub, extracting substrings

Function definition:

str_sub(string, start = 1L, end = -1L)

Parameter list:

    • string: a string or string vector.
    • start: the starting position
    • end: the ending position

Extracting a substring:

> txt <- "I am Conan."
# Extract the characters at index positions 1 to 4
> str_sub(txt, 1, 4)
[1] "I am"
# Extract the characters at index positions 1 to 6
> str_sub(txt, end=6)
[1] "I am C"
# Extract the characters from index position 6 to the end
> str_sub(txt, 6)
[1] "Conan."
# Extract two segments at once
> str_sub(txt, c(1, 4), c(6, 8))
[1] "I am C" "m Con"
# Extract using negative positions
> str_sub(txt, -3)
[1] "an."
> str_sub(txt, end = -3)
[1] "I am Cona"

Assigning a value to a substring:

> x <- "AAABBBCCC"
# Assign 1 to position 1 of the string
> str_sub(x, 1, 1) <- 1; x
[1] "1AABBBCCC"
# Assign 2345 to positions 2 through -2 of the string
> str_sub(x, 2, -2) <- "2345"; x
[1] "12345C"

3.2 String calculation functions

3.2.1 str_count, counting matches

Function definition:

str_count(string, pattern = "")

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.

Counting the matches in a string:

> str_count('aaa444sssddd', "a")
[1] 3

Counting the matches in a string vector:

> fruit <- c("apple", "banana", "pear", "pineapple")
> str_count(fruit, "a")
[1] 1 3 1 1
> str_count(fruit, "p")
[1] 2 0 1 3

When counting the '.' character, note that '.' is a regular-expression metacharacter that matches any character, so using it directly as the pattern gives the wrong count.

> str_count(c("a.", ".", ".a.",NA), ".")
[1]  2  1  3 NA
# Match the character literally with fixed()
> str_count(c("a.", ".", ".a.",NA), fixed("."))
[1]  1  1  2 NA
# Match the character by escaping it with \\
> str_count(c("a.", ".", ".a.",NA), "\\.")
[1]  1  1  2 NA

3.2.2 str_length, string length

Function definition:

str_length(string)

Parameter list:

    • string: a string or string vector.

Calculating the length of a string:

> str_length(c("I", "am", "张丹", NA))
[1]  1  2  2 NA

3.2.3 str_sort, sorting string values; str_order, sorting by index

Function definition:

str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)

Parameter list:

    • x: a string or string vector.
    • decreasing: the sort direction.
    • na_last: where to place NA values; one of 3 values: TRUE puts them last, FALSE puts them first, NA drops them
    • locale: the locale whose collation rules are used for sorting

Sorting string values:

# Sort alphabetically (English locale)
> str_sort(c('a',1,2,'11'), locale = "en")
[1] "1"  "11" "2"  "a"
# Sort in descending order
> str_sort(letters,decreasing=TRUE)
 [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
[20] "g" "f" "e" "d" "c" "b" "a"
# Sort by pinyin (Chinese locale)
> str_sort(c('你','好','粉','丝','日','志'),locale = "zh")
[1] "粉" "好" "你" "日" "丝" "志"

Handling NA values when sorting:

# Put NA values last
> str_sort(c(NA,'1',NA),na_last=TRUE)
[1] "1" NA  NA
# Put NA values first
> str_sort(c(NA,'1',NA),na_last=FALSE)
[1] NA  NA  "1"
# Drop NA values
> str_sort(c(NA,'1',NA),na_last=NA)
[1] "1"
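The examples above only call str_sort. As a minimal sketch of str_order, the code below shows that it returns positions rather than values, which makes it possible to reorder a second vector in step with the first. The fruit names and prices are invented for illustration.

```r
library(stringr)

x <- c("pear", "apple", "banana")
prices <- c(3.5, 1.2, 0.8)  # hypothetical values, parallel to x

# str_order returns the index that would sort x
idx <- str_order(x, locale = "en")
idx
# [1] 2 3 1

# Indexing with idx gives the same result as str_sort
x[idx]
# [1] "apple"  "banana" "pear"

# The same index keeps the parallel vector aligned
prices[idx]
# [1] 1.2 0.8 3.5
```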

3.3 String matching functions

3.3.1 str_split, string splitting; same as str_split_fixed

Function definition:

str_split(string, pattern, n = Inf)
str_split_fixed(string, pattern, n)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.
    • n: the number of pieces to return

Splitting a string:

> val <- "abc,123,234,iuuu"
# Split on ","
> s1<-str_split(val, ",");s1
[[1]]
[1] "abc"  "123"  "234"  "iuuu"

# Split on "," and keep 2 pieces
> s2<-str_split(val, ",",2);s2
[[1]]
[1] "abc"          "123,234,iuuu"

# The result of str_split() is a list
> class(s1)
[1] "list"
# Split with str_split_fixed(); the result is a matrix
> s3<-str_split_fixed(val, ",",2);s3
     [,1]  [,2]
[1,] "abc" "123,234,iuuu"
> class(s3)
[1] "matrix"

3.3.2 str_subset: returns the strings that match

Function definition:

str_subset(string, pattern)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.

> val <- c("abc", 123, "cba")
# Match anywhere in the string
> str_subset(val, "a")
[1] "abc" "cba"
# Match at the start of the string
> str_subset(val, "^a")
[1] "abc"
# Match at the end of the string
> str_subset(val, "a$")
[1] "cba"

3.3.3 word, extracting words from text

Function definition:

word(string, start = 1L, end = start, sep = fixed(" "))

Parameter list:

    • string: a string or string vector.
    • start: the starting position.
    • end: the ending position.
    • sep: the separator between words.

> val <- c("I am Conan.", "http://fens.me, ok")
# Split on spaces by default and take the first word
> word(val, 1)
[1] "I"               "http://fens.me,"
> word(val, -1)
[1] "Conan." "ok"
> word(val, 2, -1)
[1] "am Conan." "ok"
# Split on "," and take the first piece
> val<-'111,222,333,444'
> word(val, 1, sep = fixed(','))
[1] "111"
> word(val, 3, sep = fixed(','))
[1] "333"

3.3.4 str_detect, checking whether a string matches a pattern

Function definition:

str_detect(string, pattern)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.

> val <- c("abca4", 123, "cba2")
# Check whether each string contains a
> str_detect(val, "a")
[1]  TRUE FALSE  TRUE
# Check whether each string starts with a
> str_detect(val, "^a")
[1]  TRUE FALSE FALSE
# Check whether each string ends with a
> str_detect(val, "a$")
[1] FALSE FALSE FALSE
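Because str_detect returns a logical vector, it is typically used for filtering or counting. A minimal sketch, reusing the article's sample vector (with 123 written as a string):

```r
library(stringr)

val <- c("abca4", "123", "cba2")

# Logical index of the elements that contain "a"
has_a <- str_detect(val, "a")

# Keep the matching elements; equivalent to str_subset(val, "a")
val[has_a]
# [1] "abca4" "cba2"

# Negate the index to keep the non-matching elements
val[!has_a]
# [1] "123"

# Count how many elements match
sum(has_a)
# [1] 2
```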

3.3.6 str_match, extracting matched groups from a string

Function definition:

str_match(string, pattern)
str_match_all(string, pattern)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.

Extracting matched groups from a string:

> val <- c("abc", 123, "cba")
# Match the character a and return it
> str_match(val, "a")
     [,1]
[1,] "a"
[2,] NA
[3,] "a"
# Match a single digit 0-9 and return it
> str_match(val, "[0-9]")
     [,1]
[1,] NA
[2,] "1"
[3,] NA
# Match any number of digits 0-9 and return them
> str_match(val, "[0-9]*")
     [,1]
[1,] ""
[2,] "123"
[3,] ""

Extracting all matched groups from a string, returned as a list of character matrices:

> str_match_all(val, "a")
[[1]]
     [,1]
[1,] "a"

[[2]]
     [,1]

[[3]]
     [,1]
[1,] "a"

> str_match_all(val, "[0-9]")
[[1]]
     [,1]

[[2]]
     [,1]
[1,] "1"
[2,] "2"
[3,] "3"

[[3]]
     [,1]
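The patterns above contain no parentheses, so the result has a single column. The sketch below, using made-up date strings, shows what str_match adds over str_extract: each capture group in the pattern gets its own column after the full match.

```r
library(stringr)

dates <- c("2016-01-15", "no date", "1999-12-31")

# Three capture groups: year, month, day
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
m
#      [,1]         [,2]   [,3] [,4]
# [1,] "2016-01-15" "2016" "01" "15"
# [2,] NA           NA     NA   NA
# [3,] "1999-12-31" "1999" "12" "31"

# Column 1 is the full match; column 2 is the first group (the year)
m[, 2]
# [1] "2016" NA     "1999"
```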

3.3.7 str_replace, string replacement

Function definition:

str_replace(string, pattern, replacement)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.
    • replacement: the replacement text.

> val <- c("abc", 123, "cba")
# Replace the first occurrence of a or b in each string with -
> str_replace(val, "[ab]", "-")
[1] "-bc" "123" "c-a"
# Replace every occurrence of a or b in each string with -
> str_replace_all(val, "[ab]", "-")
[1] "--c" "123" "c--"
# Replace every a with escaped characters ("\1" is parsed by R as the control character \001)
> str_replace_all(val, "[a]", "\1\1")
[1] "\001\001bc" "123"        "cb\001\001"
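In the last example the replacement "\1\1" is an R string escape, not a reference to the match. To refer back to a captured group, the backslash itself has to be escaped ("\\1") and the pattern needs parentheses. A minimal sketch; the "first last" string is invented for illustration:

```r
library(stringr)

val <- c("abc", "123", "cba")

# "\\1" inserts the text captured by the first group, so each a or b is doubled
str_replace_all(val, "([ab])", "\\1\\1")
# [1] "aabbc" "123"   "cbbaa"

# Two groups can be swapped by referring to them in reverse order
str_replace("first last", "(\\w+) (\\w+)", "\\2 \\1")
# [1] "last first"
```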

3.3.8 str_replace_na, replacing NA with the string "NA"

Function definition:

str_replace_na(string, replacement = "NA")

Parameter list:

    • string: a string or string vector.
    • replacement: the replacement text.

Replacing NA with a string:

> str_replace_na(c(NA,'NA',"abc"),'x')
[1] "x"   "NA"  "abc"
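str_replace_na is most useful together with str_c, which otherwise propagates NA. A short sketch, reusing the vector from the str_c section:

```r
library(stringr)

x <- c("a", NA, "b")

# str_c propagates NA
str_c(x, "-d")
# [1] "a-d" NA    "b-d"

# Converting NA to the string "NA" first reproduces the paste() behavior
str_c(str_replace_na(x), "-d")
# [1] "a-d"  "NA-d" "b-d"

# Any other placeholder can be used instead
str_c(str_replace_na(x, "?"), "-d")
# [1] "a-d" "?-d" "b-d"
```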

3.3.9 str_locate, finding the position of a pattern in a string

Function definition:

str_locate(string, pattern)
str_locate_all(string, pattern)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.

> val <- c("abca", 123, "cba")
# Find the position of a in each string
> str_locate(val, "a")
     start end
[1,]     1   1
[2,]    NA  NA
[3,]     3   3
# Match with a vector of patterns
> str_locate(val, c("a", 12, "b"))
     start end
[1,]     1   1
[2,]     1   2
[3,]     2   2
# Return all matches as a list of matrices
> str_locate_all(val, "a")
[[1]]
     start end
[1,]     1   1
[2,]     4   4

[[2]]
     start end

[[3]]
     start end
[1,]     3   3

# Match a or b and return all matches as a list of matrices
> str_locate_all(val, "[ab]")
[[1]]
     start end
[1,]     1   1
[2,]     2   2
[3,]     4   4

[[2]]
     start end

[[3]]
     start end
[1,]     2   2
[2,]     3   3
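The start and end columns returned by str_locate can be passed straight to str_sub. The sketch below, on the same sample vector, shows that locating and then cutting gives the same result as str_extract.

```r
library(stringr)

val <- c("abca", "123", "cba")

# Locate the first run of digits in each string
loc <- str_locate(val, "[0-9]+")

# Use the positions to cut out the match; no match gives NA
str_sub(val, loc[, "start"], loc[, "end"])
# [1] NA    "123" NA

# str_extract does both steps in one call
str_extract(val, "[0-9]+")
# [1] NA    "123" NA
```

The positions are worth keeping when the match itself is not the goal, for example when the matched span has to be overwritten with str_sub<-.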

3.3.10 str_extract, extracting the text that matches a pattern

Function definition:

str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)

Parameter list:

    • string: a string or string vector.
    • pattern: the pattern to match.
    • simplify: the return type; TRUE returns a character matrix, FALSE returns a list of character vectors

> val <- c("abca4", 123, "cba2")
# Return the first matching digit
> str_extract(val, "\\d")
[1] "4" "1" "2"
# Return the first matching run of letters
> str_extract(val, "[a-z]+")
[1] "abca" NA     "cba"
> val <- c("abca4", 123, "cba2")
> str_extract_all(val, "\\d")
[[1]]
[1] "4"

[[2]]
[1] "1" "2" "3"

[[3]]
[1] "2"

> str_extract_all(val, "[a-z]+")
[[1]]
[1] "abca"

[[2]]
character(0)

[[3]]
[1] "cba"
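The simplify parameter is listed above but not demonstrated. As a minimal sketch on the same sample vector, simplify = TRUE returns one matrix row per input string, padded with empty strings where a string has fewer matches:

```r
library(stringr)

val <- c("abca4", "123", "cba2")

# One row per input string, one column per match
m <- str_extract_all(val, "\\d", simplify = TRUE)
m
#      [,1] [,2] [,3]
# [1,] "4"  ""   ""
# [2,] "1"  "2"  "3"
# [3,] "2"  ""   ""
```

The matrix form is convenient when the result goes into a data frame; the list form is safer when the number of matches varies widely.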

3.4 String transformation functions

3.4.1 str_conv: character encoding conversion

Function definition:

str_conv(string, encoding)

Parameter list:

    • string: a string or string vector.
    • encoding: the name of the encoding.

Converting the encoding of Chinese text:

# Convert the Chinese characters to raw bytes
> x <- charToRaw('你好');x
[1] c4 e3 ba c3
# The default character set on Windows is GBK, and GB2312 is a subset of GBK, so the conversion succeeds
> str_conv(x, "GBK")
[1] "你好"
> str_conv(x, "GB2312")
[1] "你好"
# Converting as UTF-8 fails
> str_conv(x, "UTF-8")
[1] "???"
Warning messages:
1: In stri_conv(string, encoding, "UTF-8") :
  input data \xffffffc4 in current source encoding could not be converted to Unicode
2: In stri_conv(string, encoding, "UTF-8") :
  input data \xffffffe3\xffffffba in current source encoding could not be converted to Unicode
3: In stri_conv(string, encoding, "UTF-8") :
  input data \xffffffc3 in current source encoding could not be converted to Unicode

Converting Unicode escapes to UTF-8:

> x1 <- "\u5317\u4eac"
> str_conv(x1, "UTF-8")
[1] "北京"
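The charToRaw example above depends on the session running in a GBK locale. The sketch below is an assumption-labeled variant that should not depend on the locale: it builds the same four GBK bytes explicitly and declares their encoding when converting. The variable names are illustrative.

```r
library(stringr)

# The GBK bytes shown in the article for the two characters
gbk_bytes <- as.raw(c(0xc4, 0xe3, 0xba, 0xc3))

# Treat the bytes as a string of unknown encoding
x_gbk <- rawToChar(gbk_bytes)

# Declare the source encoding; the result is UTF-8
x_utf8 <- str_conv(x_gbk, "GBK")

# Compare against the Unicode code points U+4F60 U+597D
identical(x_utf8, "\u4f60\u597d")
```

The general rule is that the encoding argument names the encoding the input bytes are in, not the encoding wanted for the output.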

3.4.2 str_to_upper, string case conversion

Function definition:

str_to_upper(string, locale = "")
str_to_lower(string, locale = "")
str_to_title(string, locale = "")

Parameter list:

    • string: a string.
    • locale: the locale whose case-conversion rules are used

String case conversion:

> val <- "I am conan. Welcome to my blog! http://fens.me"
# All upper case
> str_to_upper(val)
[1] "I AM CONAN. WELCOME TO MY BLOG! HTTP://FENS.ME"
# All lower case
> str_to_lower(val)
[1] "i am conan. welcome to my blog! http://fens.me"
# Capitalize the first letter of each word
> str_to_title(val)
[1] "I Am Conan. Welcome To My Blog! Http://Fens.Me"
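The locale parameter matters because case rules differ between languages. A well-known case is Turkish, where the upper-case form of "i" is a dotted capital İ (U+0130). A minimal sketch:

```r
library(stringr)

# English rules: "i" becomes "I"
upper_en <- str_to_upper("i", locale = "en")

# Turkish rules: "i" becomes the dotted capital I, U+0130
upper_tr <- str_to_upper("i", locale = "tr")

upper_en == "I"
# [1] TRUE
identical(upper_tr, "\u0130")
# [1] TRUE
```

For text in a single known language, passing the locale explicitly avoids depending on the default of the installed stringr version.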

Strings come up constantly in everyday data processing: they need to be split, joined, transformed, and so on. This article introduced stringr, a flexible string-processing library that can noticeably improve coding efficiency. With a good tool, working with strings in R becomes convenient.

Reproduced from: R Language String Processing Package stringr

---------------------------------

Common features:

# Merge strings
fruit <- c("apple", "banana", "pear", "pinapple")
res <- str_c(1:4, fruit, sep = ' ', collapse = ' ')
str_c('I want to buy ', res, collapse = ' ')
# Calculate the string length
str_length(c("i", "like", "programming R", 123, res))
# Take a substring by position
str_sub(fruit, 1, 3)
# Reassign a substring
capital <- toupper(str_sub(fruit, 1, 1))
str_sub(fruit, rep(1, 4), rep(1, 4)) <- capital
# Repeat strings
str_dup(fruit, c(1, 2, 3, 4))
# Add whitespace
str_pad(fruit, 10, "both")
# Remove whitespace
str_trim(fruit)
# Check for a match according to a regular expression
str_detect(fruit, "a$")
str_detect(fruit, "[aeiou]")
# Find the position of the matching string
str_locate(fruit, "a")
# Extract the matching part
str_extract(fruit, "[a-z]+")
str_match(fruit, "[a-z]+")
# Replace the matching part
str_replace(fruit, "[aeiou]", "-")
# Split
str_split(res, " ")
