The apply function family in R

Source: Internet
Author: User

Objective

When you first start working with R, you hear all kinds of usage tips. The most important one is to avoid loops, because they are slow, and to use vectorized computation instead.

Why is that? R's for and while loops are evaluated by the R interpreter itself, whereas vectorized operations are implemented by underlying C functions, so there is a noticeable performance gap. When no directly vectorized function is available, the usual way to replace an explicit loop is the apply family of functions: apply, sapply, tapply, mapply, lapply, rapply, vapply, eapply and so on. Several of them, such as lapply, run their iteration in C, although each call to the supplied function is still evaluated by R.
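
As a small illustration of the difference, the sketch below computes the same sum of squares once with an explicit for loop and once with a vectorized expression. The variable names (v, total_loop, total_vec) are made up for this example and do not come from the article.

```r
# Hypothetical input vector, chosen only for illustration
v <- 1:1000

# Explicit loop: every iteration is evaluated by the R interpreter
total_loop <- 0
for (value in v) {
  total_loop <- total_loop + value^2
}

# Vectorized form: the squaring and the summation each run in compiled code
total_vec <- sum(v^2)

# Both approaches give the same answer
print(total_loop == total_vec)
```

The two versions are equivalent in result; only the place where the iteration happens differs.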

Directory

    1. The apply function family
    2. apply function
    3. lapply function
    4. sapply function
    5. vapply function
    6. mapply function
    7. tapply function
    8. rapply function
    9. eapply function

1. The apply function family

The apply family is a core set of data processing functions in R. With them we can loop, group, filter, control types and more. However, because these functions work quite differently from loops in other languages, many users never get comfortable with them.

Many newcomers to R write a great deal of for-loop code rather than spend a little time learning the apply functions, and end up writing R that looks like C. I take a dim view of R programmers who only ever write for loops.

The apply functions exist to solve the problem of looping over data. To handle different data types and different return values, they form a family of 8 functions. Some of them are very similar to each other, others less so.

The ones I use most are apply and sapply. The definitions and usage of all 8 functions are described below.

2. apply function

The apply function is the most commonly used replacement for a for loop. It iterates over the rows or columns of a matrix, data frame or array (two-dimensional or higher), passes each row or column as an argument to a user-supplied function FUN, and returns the combined results.

function definition:

apply(X, MARGIN, FUN, ...)

Parameter list:

    • X: array, matrix or data frame
    • MARGIN: the dimension to iterate over; 1 means by row, 2 means by column, and for arrays with more dimensions 3 means the third dimension, and so on
    • FUN: the function to call
    • ...: further optional arguments, passed on to FUN

For example, to sum each row of a matrix, use apply to do the loop.

> x <- matrix(1:12, ncol = 3)
> apply(x, 1, sum)
[1] 15 18 21 24
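
The same matrix can also be processed by column, and extra arguments can be forwarded to FUN. The following is a minimal sketch; the names col_sums and col_medians are made up for illustration.

```r
# Same 4 x 3 matrix as in the example above
x <- matrix(1:12, ncol = 3)

# MARGIN = 2 iterates over columns instead of rows
col_sums <- apply(x, 2, sum)
print(col_sums)

# Extra arguments after FUN are forwarded to it through '...';
# here probs = 0.5 is passed on to quantile() to get each column's median
col_medians <- apply(x, 2, quantile, probs = 0.5)
print(col_medians)
```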

Here is a slightly more complex example: loop over the rows, add 1 to the x1 column, and compute the mean of the x1 and x2 columns.

# Build the data (cbind returns a matrix here)
> x <- cbind(x1 = 3, x2 = c(4:1, 2:5)); x
     x1 x2
[1,]  3  4
[2,]  3  3
[3,]  3  2
[4,]  3  1
[5,]  3  2
[6,]  3  3
[7,]  3  4
[8,]  3  5

# Custom function myFUN: the first argument x is the data,
# the second and third are custom arguments that can be passed in through apply's '...'
> myFUN <- function(x, c1, c2) {
+   c(sum(x[c1], 1), mean(x[c2]))
+ }

# Loop over the rows, passing each row to myFUN; c1 and c2 are myFUN's second and third arguments
> apply(x, 1, myFUN, c1 = 'x1', c2 = c('x1', 'x2'))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]  4.0    4  4.0    4  4.0    4  4.0    4
[2,]  3.5    3  2.5    2  2.5    3  3.5    4

The custom function myFUN above thus carries out an ordinary loop calculation.

Implemented directly with a for loop, the code looks like this:

# Define a data frame to hold the result
> df <- data.frame()

# The for loop
> for (i in 1:nrow(x)) {
+   row <- x[i, ]                                          # values of the current row
+   df <- rbind(df, rbind(c(sum(row[1], 1), mean(row))))   # compute and append to the result data frame
+ }

# Print the result data frame
> df
  V1  V2
1  4 3.5
2  4 3.0
3  4 2.5
4  4 2.0
5  4 2.5
6  4 3.0
7  4 3.5
8  4 4.0

The calculation is easy enough to implement with a for loop, but some extra work has to be handled by hand: building the loop body, defining a result data set, and appending the result of each iteration to it.

There is also a third approach to the same task, which takes full advantage of R: vectorized computation.

> data.frame(x1 = x[, 1] + 1, x2 = rowMeans(x))
  x1  x2
1  4 3.5
2  4 3.0
3  4 2.5
4  4 2.0
5  4 2.5
6  4 3.0
7  4 3.5
8  4 4.0

A single line completes the entire calculation.

Next, we compare the performance overhead of the 3 approaches.

# Clear the workspace
> rm(list = ls())

# Wrap the apply version as fun1
> fun1 <- function(x) {
+   myFUN <- function(x, c1, c2) {
+     c(sum(x[c1], 1), mean(x[c2]))
+   }
+   apply(x, 1, myFUN, c1 = 'x1', c2 = c('x1', 'x2'))
+ }

# Wrap the for loop version as fun2
> fun2 <- function(x) {
+   df <- data.frame()
+   for (i in 1:nrow(x)) {
+     row <- x[i, ]
+     df <- rbind(df, rbind(c(sum(row[1], 1), mean(row))))
+   }
+ }

# Wrap the vectorized version as fun3
> fun3 <- function(x) {
+   data.frame(x1 = x[, 1] + 1, x2 = rowMeans(x))
+ }

# Generate the data set
> x <- cbind(x1 = 3, x2 = c(400:1, 2:500))

# Measure the CPU time of each of the 3 methods
> system.time(fun1(x))
   user  system elapsed
   0.01    0.00    0.02
> system.time(fun2(x))
   user  system elapsed
   0.19    0.00    0.18
> system.time(fun3(x))
   user  system elapsed
      0       0

In terms of CPU time, the for loop takes the longest, the loop implemented with apply takes much less, and R's built-in vectorized operations take almost no time at all. The test suggests an order of preference for the same computation: first consider R's built-in vectorized calculation; when a loop is unavoidable, use the apply functions; and avoid for, while and similar constructs where possible.
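
Before comparing speed it is worth confirming that the three approaches agree. The sketch below does this on the small 8-row input from earlier. The names by_apply, by_loop and by_vector are made up for illustration, and the loop here writes into a preallocated matrix rather than growing a data frame with rbind, which is a different choice from the loop in the article.

```r
# Small input in the same shape as the article's data
x <- cbind(x1 = 3, x2 = c(4:1, 2:5))

# 1. apply over rows; transpose so that each row holds one result pair
by_apply <- t(apply(x, 1, function(row) c(row["x1"] + 1, mean(row))))

# 2. for loop writing into a preallocated matrix
by_loop <- matrix(NA_real_, nrow = nrow(x), ncol = 2)
for (i in seq_len(nrow(x))) {
  by_loop[i, ] <- c(x[i, "x1"] + 1, mean(x[i, ]))
}

# 3. fully vectorized
by_vector <- cbind(x[, "x1"] + 1, rowMeans(x))

# Compare values only, ignoring row and column names
print(all.equal(unname(by_apply), by_loop))
print(all.equal(by_loop, unname(by_vector)))
```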

3. lapply function

The lapply function is one of the most basic looping functions. It loops over a list or data.frame and returns a list of the same length as X as its result; the leading 'l' in lapply indicates the type of the returned result.

function definition:

lapply(X, FUN, ...)

Parameter list:

    • X: a list or data.frame
    • FUN: the function to call
    • ...: more parameters, optional

For example, compute the quantiles of the data stored under each key of a list.

# Build a list x with three keys: a, b and c
> x <- list(a = 1:10, b = rnorm(6, 10, 5), c = c(TRUE, FALSE, FALSE, TRUE)); x
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1]  0.7585424 14.3662366 13.3772979 11.6658990  9.7011387 21.5321427

$c
[1]  TRUE FALSE FALSE  TRUE

# Compute the quantiles of the data under each key
> lapply(x, fivenum)
$a
[1]  1.0  3.0  5.5  8.0 10.0

$b
[1]  0.7585424  9.7011387 12.5215985 14.3662366 21.5321427

$c
[1] 0.0 0.0 0.5 1.0 1.0

lapply makes it easy to iterate over a list, and it loops over a data.frame column by column. If the input is a vector or matrix object, however, using lapply directly will not give the desired result.

For example, sum the columns of a matrix.

# Create a matrix
> x <- cbind(x1 = 3, x2 = c(2:1, 4:5))
> x; class(x)
     x1 x2
[1,]  3  2
[2,]  3  1
[3,]  3  4
[4,]  3  5
[1] "matrix"
# (R 4.0.0 and later print "matrix" "array" here)

# Sum
> lapply(x, sum)
[[1]]
[1] 3

[[2]]
[1] 3

[[3]]
[1] 3

[[4]]
[1] 3

[[5]]
[1] 2

[[6]]
[1] 1

[[7]]
[1] 4

[[8]]
[1] 5

lapply loops over every single value in the matrix instead of grouping by row or column.

Now sum the columns of a data frame instead.

> lapply(data.frame(x), sum)
$x1
[1] 12

$x2
[1] 12

lapply automatically splits the data frame by column and then performs the calculation on each column.
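
If the data has to stay a matrix, one possible workaround is to iterate over the column indices rather than over the matrix itself. This is only a sketch of that idea; the name col_sums is made up for illustration.

```r
# Same matrix as above
x <- cbind(x1 = 3, x2 = c(2:1, 4:5))

# Iterate over column indices, so each call to sum() sees one whole column
col_sums <- lapply(seq_len(ncol(x)), function(j) sum(x[, j]))

# Carry the column names over to the result list
names(col_sums) <- colnames(x)
print(col_sums)
```

For plain column sums, apply(x, 2, sum) or colSums(x) is shorter; the index-based form is mainly useful when a list result is wanted.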

4. sapply function

The sapply function is a simplified version of lapply. It adds 2 parameters, simplify and USE.NAMES, mainly to make the output friendlier: where possible it returns a vector instead of a list object.

function definition:

sapply(X, FUN, ..., simplify=TRUE, USE.NAMES = TRUE)

Parameter list:

    • X: vector, list or data frame (a matrix or array is processed element by element)
    • FUN: the function to call
    • ...: further optional arguments
    • simplify: whether to simplify the result; when set to 'array', the results are combined into an array
    • USE.NAMES: if X is a character vector, TRUE uses the strings as the names of the result, FALSE sets no names

We reuse the calculation from the lapply section above.

> x <- cbind(x1 = 3, x2 = c(2:1, 4:5))

# On a matrix, the calculation proceeds as with lapply
> sapply(x, sum)
[1] 3 3 3 3 2 1 4 5

# On a data frame
> sapply(data.frame(x), sum)
x1 x2 
12 12 

# Check the result types: sapply returns a vector, lapply returns a list
> class(lapply(x, sum))
[1] "list"
> class(sapply(x, sum))
[1] "numeric"

With simplify=FALSE and USE.NAMES=FALSE, sapply is exactly equivalent to lapply.

> lapply(data.frame(x), sum)
$x1
[1] 12

$x2
[1] 12

> sapply(data.frame(x), sum, simplify = FALSE, USE.NAMES = FALSE)
$x1
[1] 12

$x2
[1] 12

As for simplify='array', the following example builds a three-dimensional array whose first two dimensions form a square matrix.

> a <- 1:2

# Combine the results into an array
> sapply(a, function(x) matrix(x, 2, 2), simplify = 'array')
, , 1

     [,1] [,2]
[1,]    1    1
[2,]    1    1

, , 2

     [,1] [,2]
[1,]    2    2
[2,]    2    2

# By default, the results are merged automatically
> sapply(a, function(x) matrix(x, 2, 2))
     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2
[4,]    1    2

For character vectors, names for the result can also be generated automatically.

> val <- head(letters)

# By default the names are set
> sapply(val, paste, USE.NAMES = TRUE)
  a   b   c   d   e   f 
"a" "b" "c" "d" "e" "f" 

# With USE.NAMES=FALSE no names are set
> sapply(val, paste, USE.NAMES = FALSE)
[1] "a" "b" "c" "d" "e" "f"
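
One consequence of automatic simplification is that the return type of sapply depends on what FUN returns. The sketch below, with made-up variable names, shows sapply falling back to a list when the results differ in length.

```r
# Every result has length 1, so sapply simplifies to a vector
same_length <- sapply(1:3, function(i) i * 2)
print(class(same_length))   # "numeric"

# Results have lengths 1, 2 and 3, so no simplification is possible
# and sapply silently returns a list instead
varying_length <- sapply(1:3, function(i) seq_len(i))
print(class(varying_length))   # "list"
```

Code that assumes a vector result should therefore check the type, or use a function with a fixed return type.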
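
5. vapply function

vapply is similar to sapply but adds a FUN.VALUE parameter: a template that specifies the type and length of the value FUN must return for each element. If a result does not match the template, vapply stops with an error, which makes programs more robust. Names given in the template are used as the row names of the result.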

function definition:

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

Parameter list:

    • X: vector, list or data frame
    • FUN: the function to call
    • FUN.VALUE: a template for the return value of FUN (type and length); its names become the row names of the result
    • ...: further optional arguments
    • USE.NAMES: if X is a character vector, TRUE uses the strings as the names of the result, FALSE sets no names

For example, compute cumulative sums over the columns of a data frame and set a row name for each row.

# Create the data set
> x <- data.frame(cbind(x1 = 3, x2 = c(2:1, 4:5)))

# Set the row names: the 4 rows are a, b, c, d
> vapply(x, cumsum, FUN.VALUE = c('a' = 0, 'b' = 0, 'c' = 0, 'd' = 0))
  x1 x2
a  3  2
b  6  3
c  9  7
d 12 12

# Without it, the default index values are used
> a <- sapply(x, cumsum); a
     x1 x2
[1,]  3  2
[2,]  6  3
[3,]  9  7
[4,] 12 12

# Set the row names by hand
> row.names(a) <- c('a', 'b', 'c', 'd')
> a
  x1 x2
a  3  2
b  6  3
c  9  7
d 12 12

With vapply the row names of the return value can be set directly, which saves a line of code and makes the code read more smoothly. Of course, if you would rather not remember one more function, you can simply ignore it; sapply alone is enough for this case.
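
Beyond naming rows, FUN.VALUE also acts as a check on what FUN returns. The following is a minimal sketch of that behavior; the names values, totals and outcome are made up for illustration.

```r
values <- list(a = 1:3, b = 4:6)

# Each result must be a single number; names come from the input list
totals <- vapply(values, mean, FUN.VALUE = numeric(1))
print(totals)

# A function that returns text violates the numeric(1) template,
# so vapply stops with an error instead of returning something unexpected
outcome <- tryCatch(
  vapply(values, function(v) paste(v, collapse = "-"), FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
print(outcome)
```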

6. mapply function

mapply is also a variant of sapply, a multivariate sapply, with some changes in how the parameters are defined. The first parameter is the function FUN, and the second parameter '...' accepts several data objects whose elements are passed to FUN as arguments in parallel.

function definition:

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,USE.NAMES = TRUE)

Parameter list:

    • FUN: the function to call
    • ...: the data arguments, any number of them
    • MoreArgs: a list of further arguments passed unchanged to every call of FUN
    • SIMPLIFY: whether to simplify the result; when set to 'array', the results are combined into an array
    • USE.NAMES: if the first data argument is a character vector, TRUE uses the strings as the names of the result, FALSE sets no names

For example, compare 3 vectors and take the largest value at each index position.

> set.seed(1)

# Define 3 vectors
> x <- 1:10
> y <- 5:-4
> z <- round(runif(10, -5, 5))

# Take the largest value at each index position
> mapply(max, x, y, z)
 [1]  5  4  3  4  5  6  7  8  9 10

As another example, generate 4 sets of normally distributed data whose means and standard deviations are both c(1,10,100,1000).

> set.seed(1)

# Each data set has length 4
> n <- rep(4, 4)

# m is the mean, v is the standard deviation (the third argument of rnorm is sd)
> m <- v <- c(1, 10, 100, 1000)

# Generate 4 sets of data, one per column
> mapply(rnorm, n, m, v)
          [,1]      [,2]      [,3]       [,4]
[1,] 0.3735462 13.295078 157.57814   378.7594
[2,] 1.1836433  1.795316  69.46116 -1214.6999
[3,] 0.1643714 14.874291 251.17812  2124.9309
[4,] 2.5952808 17.383247 138.98432   955.0664

Because mapply accepts several data arguments, we do not need to merge the data into a data.frame first; the result is computed directly in one step.
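
The MoreArgs parameter is not used in the examples above. As a minimal sketch of how it might be used, the following passes one constant argument to every call; the function and the names base, exponent and scale are made up for illustration.

```r
# Two vectors that are walked through in parallel
base <- c(2, 3, 4)
exponent <- c(3, 2, 1)

# 'scale' is the same for every call, so it goes in MoreArgs
scaled_power <- mapply(
  function(b, e, scale) scale * b^e,
  base, exponent,
  MoreArgs = list(scale = 10)
)
print(scaled_power)
```

Arguments listed in '...' are split element by element, while arguments in MoreArgs are handed to each call whole.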

7. tapply function

tapply performs grouped loop calculations: the INDEX parameter splits X into groups, which is equivalent to a group by operation.

function definition:

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

Parameter list:

    • X: vector
    • INDEX: the index (factor) used for grouping
    • FUN: the function to call
    • ...: further optional arguments, passed on to FUN
    • simplify: whether to simplify the result to an array when FUN always returns a single value

For example, compute the mean petal length for each species in the iris data set.

# Group by species, iris$Species
> tapply(iris$Petal.Length, iris$Species, mean)
    setosa versicolor  virginica 
     1.462      4.260      

Next, take vectors x and y, use the vector t as the grouping index, and compute group sums.

> set.seed(1)

# Define the vectors x and y
> x <- y <- 1:10; x; y
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1  2  3  4  5  6  7  8  9 10

# Set up the grouping index t
> t <- round(runif(10, 1, 100) %% 2); t
 [1] 1 2 2 1 1 2 1 0 1 1

# Sum x by group
> tapply(x, t, sum)
 0  1  2  

tapply takes only one data vector, but further arguments can be passed to FUN through '...'. Suppose we also want to include the y vector in the sum, and pass y as the 4th argument of tapply.

> tapply(x, t, sum, y)
 0  1  

The result does not meet our expectations: it is not the sum of x and y within each group of t. The 4th argument y is not split by group and passed in piece by piece; instead the complete vector is passed to sum on every call, contributing sum(y) = 55 each time. So for t=0, x=8 plus y=55 gives a final result of 63. When passing extra arguments through '...', be sure to check how they are handed on, so that this kind of calculation error does not occur.
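
One way to get the intended grouped sum of x and y is to add the vectors element by element before grouping. The sketch below contrasts the two calls. The grouping index is copied from the output shown above, and is named g here (a made-up name) rather than t, to avoid masking the base function t().

```r
x <- y <- 1:10
# Grouping index copied from the output shown above
g <- c(1, 2, 2, 1, 1, 2, 1, 0, 1, 1)

# Passing y through '...' adds sum(y) = 55 to every group
wrong <- tapply(x, g, sum, y)
print(wrong)

# Adding the vectors element by element first gives the intended group sums
right <- tapply(x + y, g, sum)
print(right)
```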

8. rapply function

rapply is a recursive version of lapply. It only processes list data, recursively visiting every element of the list; if an element is itself a list, the traversal continues into its child elements.

function definition:

rapply(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...)

Parameter list:

    • object: a list
    • f: the function to call
    • classes: the classes to match; "ANY" matches all classes
    • deflt: the default value for elements whose class does not match
    • how: one of 3 modes. With "replace", the result of calling f replaces the original list element. With "list", a new list is created in which matching elements hold the result of f and non-matching elements are set to deflt. With "unlist", the result is built as for "list" and then unlist(recursive = TRUE) is applied once
    • ...: further optional arguments

For example, filter the data of a list and sort all numeric data in ascending order.

> x = list(a = 12, b = 4:1, c = c('b', 'a'))
> y = pi
> z = data.frame(a = rnorm(10), b = 10:1)
> a <- list(x = x, y = y, z = z)

# Sort, replacing the values in the original list
> rapply(a, sort, classes = 'numeric', how = 'replace')
$x
$x$a
[1] 12

$x$b
[1] 4 3 2 1

$x$c
[1] "b" "a"


$y
[1] 3.141593

$z
$z$a
 [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884  0.1836433  0.3295078
 [7]  0.4874291  0.5757814  0.7383247  1.5952808

$z$b
 [1] 10  9  8  7  6  5  4  3  2  1

> class(a$z$b)
[1] "integer"

The result shows that only $z$a was visibly sorted. Checking the type of $z$b shows that it is integer, not numeric, so it was left unsorted (the same applies to $x$b).

Next, operate on the character data: append the string '++++' to every character value, and set non-character data to NA.

> rapply(a, function(x) paste(x, '++++'), classes = "character", deflt = NA, how = "list")
$x
$x$a
[1] NA

$x$b
[1] NA

$x$c
[1] "b ++++" "a ++++"


$y
[1] NA

$z
$z$a
[1] NA

$z$b
[1] NA

Only $x$c is a character vector, so only it gets the new string appended. With rapply, then, it is easy to filter the data in a list by type.
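
The third mode, how = "unlist", is not shown above. As a minimal sketch of it, the following doubles the integer leaves of a small nested list and flattens the result; the list nested and the name doubled are made up for illustration.

```r
# Hypothetical nested list mixing integer and character leaves
nested <- list(a = 1:3, b = list(c = 4:6, d = "text"))

# Double every integer leaf; with how = "unlist" and the default deflt = NULL,
# non-matching leaves are dropped and the result is flattened to a vector
doubled <- rapply(nested, function(v) v * 2L, classes = "integer", how = "unlist")
print(doubled)
```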

9. eapply function

eapply iterates over all the variables in an environment. If we make a habit of storing our own variables in a dedicated environment according to some convention, this function becomes very convenient. Many people may not be familiar with working with environments; in that case, please refer to the articles "Uncover the mystery of environment space in R language" and "Decrypt the environment space of R language function".

function definition:

eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE)

Parameter list:

    • env: the environment
    • FUN: the function to call
    • ...: further optional arguments
    • all.names: whether to include all variables; by default (FALSE), variables whose names begin with a dot are skipped
    • USE.NAMES: whether the resulting list carries the variable names

Below we define an environment and then loop over its variables.

# Define an environment
> env <- new.env()

# Store 3 variables in this environment
> env$a <- 1:10
> env$beta <- exp(-3:3)
> env$logic <- c(TRUE, FALSE, FALSE, TRUE)
> env

# List the variables in env
> ls(env)
[1] "a"     "beta"  "logic"

# Show the structure of the variables in env
> ls.str(env)
a :  int [1:10] 1 2 3 4 5 6 7 8 9 10
beta :  num [1:7] 0.0498 0.1353 0.3679 1 2.7183 ...
logic :  logi [1:4] TRUE FALSE FALSE TRUE

Compute the mean of every variable in the env environment.

> eapply(env, mean)
$logic
[1] 0.5

$beta
[1] 4.535125

$a
[1] 5.5

Compute the memory size of every variable in the current environment.

# List the variables in the current environment
> ls()
 [1] "a"     "df"     "env"    "x"     "y"    "z"    "X"

# Show the memory used by each variable
> eapply(environment(), object.size)
$a
2056 bytes

$df
1576 bytes

$x
656 bytes

$y
48 bytes

$z
952 bytes

$X
1088 bytes

$env
56 bytes

eapply is rarely needed in everyday use, but for R package development a command of environments is essential. Especially when R is used as an industrial-strength tool, variables need to be controlled and managed precisely.
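
The effect of all.names can be seen in a small sketch like the one below. The environment name store and its variables are made up for illustration. Because eapply does not guarantee the order of its result, the list is sorted by name before printing.

```r
# A dedicated environment (the name 'store' is made up for this sketch)
store <- new.env()
assign("a", 1:10, envir = store)
assign("flags", c(TRUE, FALSE, TRUE), envir = store)
assign(".hidden", "not visited by default", envir = store)

# eapply returns a named list; the order of its elements is not guaranteed,
# so sort by name before printing
lengths_by_name <- eapply(store, length)
lengths_by_name <- lengths_by_name[order(names(lengths_by_name))]
print(lengths_by_name)

# all.names = TRUE also visits variables whose names start with a dot
print(length(eapply(store, length, all.names = TRUE)))
```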

This article has given a comprehensive introduction to the apply family of functions for looping over data in R, which can handle nearly every looping situation. The apply section also compared the performance of 3 ways of processing the data: R's built-in vectorized computation beats the apply loop, which in turn is much faster than the for loop. We should therefore make good use of the apply functions when developing with R.

Forget the programmer's way of thinking and switch to thinking in terms of data, and things may suddenly become clear.

The apply function is often used to compute the mean, sum and similar values over the rows or columns of a matrix. Its usage is:
apply(x, code for row or column, function). See the example:
> b
      First Second
One       1      2
Two       3      4
Three     5      6
> apply(b, 1, sum)   # the first argument is the matrix, the second argument 1 means calculate per row, the third is the function applied to each row; here, the sum of each row
  One   Two Three 
    3     7    11 
> apply(b, 2, sum)   # the sum of each column
 First Second 
     9     12 

> d <- array(1:24, dim = c(2, 3, 4))
> d
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24

> apply(d, 3, sum)   # sums over the third dimension: each slice along it is a matrix, and all elements of that matrix are added up
[1]  21  57  93 129
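
MARGIN can also name more than one dimension. As a small sketch on the same array, the following keeps the first two dimensions and sums across the third; the name cell_sums is made up for illustration.

```r
d <- array(1:24, dim = c(2, 3, 4))

# Keep dimensions 1 and 2, and sum across the third dimension
cell_sums <- apply(d, c(1, 2), sum)
print(cell_sums)
```

The result is a 2 x 3 matrix in which each cell is the sum of the 4 values at that row and column position.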
This article is from: NPC Economic Forum, R Language Forum Edition, detailed source reference: http://bbs.pinggu.org/forum.php?mod=viewthread&tid=4200726&page=1
