The Art of R Programming (2): Data Structures in R

This article corresponds to the following chapters of "The Art of R Programming": Chapter 2, Vectors; Chapter 3, Matrices and Arrays; Chapter 4, Lists; Chapter 5, Data Frames; Chapter 6, Factors and Tables.


The most basic data type in R is the vector; single values (scalars) and matrices are special cases of vectors.

Declaration: R does not require variables to be declared. However, reading or writing an element of a vector is itself a function call, so if R does not already know that the object is a vector, the function has nothing to operate on. The following code does not work when y has not been created beforehand:

y[1] <- 5
y[2] <- 12
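A minimal sketch of the usual fix, assuming the vector is meant to hold two numeric values: create the vector first, then assign its elements.

```r
# Create y first so that R knows it is a vector (two numeric slots)
y <- vector(mode = "numeric", length = 2)
y[1] <- 5
y[2] <- 12
print(y)  # [1]  5 12
```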


Recycling:

When an operation on two vectors requires them to have the same length, R automatically recycles: it repeats the shorter vector until it matches the length of the longer one.
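The recycling rule can be illustrated with a short sketch; the vectors below are made-up values chosen only for illustration.

```r
x <- c(1, 2, 4)
y <- c(6, 0, 9, 20, 22)

# x is shorter, so it is recycled to c(1, 2, 4, 1, 2); R warns here because
# 5 is not a multiple of 3, so the warning is suppressed for the demo
z <- suppressWarnings(x + y)
print(z)  # 7 2 13 21 24

# When the longer length is a multiple of the shorter one, there is no warning
w <- c(1, 2) + c(10, 20, 30, 40)
print(w)  # 11 22 31 42
```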

It is important to note that a matrix is actually a long vector. The vector (1, 2, 3, 4, 5, 6) in the form of a 3 x 2 matrix is:

[1 4
 2 5
 3 6]

Common Vector Operations:

Vector arithmetic and logical operations: R is a functional language and every operator is a function, so +, -, * and / are all applied element by element. Pay special attention to the difference between matrix multiplication in linear algebra and element-by-element multiplication.

Vector indexing: the format is vector1[vector2], and the result consists of those elements of vector1 whose indices appear in vector2. Indices may be repeated. A negative subscript excludes the corresponding element.
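A short sketch of these indexing forms, using an arbitrary example vector:

```r
y <- c(1.2, 3.9, 0.4, 0.12)

a <- y[c(1, 3)]     # elements 1 and 3: 1.2 0.4
b <- y[c(1, 1, 3)]  # repeated indices are allowed: 1.2 1.2 0.4
d <- y[-1]          # exclude element 1: 3.9 0.4 0.12
e <- y[-c(1, 2)]    # exclude elements 1 and 2: 0.4 0.12
```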

Creating a vector with the : operator: this generates a vector of consecutive numbers within a specified range. Note that : has higher precedence than the binary arithmetic operators +, -, * and /; enter ?Syntax in the console to view the full precedence table.
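The precedence issue is easiest to see with a small example (the value of i is arbitrary):

```r
i <- 2
a <- 1:i - 1    # parsed as (1:i) - 1, giving 0 1
b <- 1:(i - 1)  # parentheses force 1:1, giving 1
```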

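Creating a vector with seq(): generates an arithmetic sequence. seq(x) is also safer than 1:length(x) in a loop header when x may be empty:

x <- c()
# Compare the following two loops
for (i in 1:length(x)) print(i)  # i takes the values 1 and 0, clearly not what we want
for (i in seq(x)) print(i)       # seq(x) is empty, so the loop body never runs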


Repeating vector constants with rep(): calling rep(x, times) creates a vector of times * length(x) elements consisting of x repeated times times. Calling rep(x, each = n) creates a vector of n * length(x) elements in which each element of x is repeated n times in turn.

> rep(c(5, 12, 13), 3)
[1]  5 12 13  5 12 13  5 12 13
> rep(c(5, 12, 13), each = 2)
[1]  5  5 12 12 13 13


Using all() and any():

These two functions report whether all of their arguments, or at least one of them, are TRUE.
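For instance, with an arbitrary vector of the integers 1 to 10:

```r
x <- 1:10
any(x > 8)   # TRUE: 9 and 10 exceed 8
any(x > 88)  # FALSE
all(x > 88)  # FALSE
all(x > 0)   # TRUE
```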

Vectorized operations:

Vector in, vector out: many functions and operators are vectorized, and using this flexibly improves code efficiency. Because R has no true scalars (a scalar is actually a vector of length 1), there is a safety pitfall: a user-defined function that expects a scalar argument will still return a value when given a longer vector, with no warning. Take this into account when designing a function and check whether the input is valid.

f <- function(x, c) {
  # Reject a non-scalar c instead of silently recycling it
  if (length(c) != 1) stop("vector c not allowed")
  return((x + c)^2)
}


Vector in, matrix out: when a function returns a vector for a single input value, applying it to a vector of inputs should logically produce a matrix. Calling the function directly yields a one-dimensional vector, which then has to be rearranged with matrix(). The alternative is sapply() (short for simplify apply): the call sapply(x, f) applies f() to each element of x and converts the result to a matrix.
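Both approaches can be sketched as follows. The function z12() is a hypothetical example that returns a vector of length 2 for each scalar input.

```r
# Hypothetical function: returns c(z, z^2) for a scalar z
z12 <- function(z) c(z, z^2)

x <- 1:4
flat <- z12(x)                # called directly: a plain vector of length 8
m1 <- matrix(flat, ncol = 2)  # rearranged by hand into a 4 x 2 matrix
m2 <- sapply(x, z12)          # one column per element of x: a 2 x 4 matrix
```

Note that sapply() places each result in a column, so m2 is the transpose of m1.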

NA and NULL values:

NA denotes a value that exists but is unknown; NULL denotes a value that does not exist. NULL is a special R object with no mode.
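The difference shows up clearly in a statistical function such as mean(); the numbers below are arbitrary.

```r
x <- c(88, NA, 12, 168, 13)
mean(x)                # NA: one unknown value makes the mean unknown
mean(x, na.rm = TRUE)  # 70.25: the NA is removed before averaging

y <- c(88, NULL, 12, 168, 13)
length(y)              # 4: NULL contributes nothing to the vector
mean(y)                # 70.25
```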

Filtering:

Generating a filtering index: write a condition that evaluates to a Boolean vector and use it as the index.

Filtering with subset(): the difference from filtering with a Boolean index lies in the handling of NA values. Ordinary filtering keeps NA values, whereas subset() removes them.

The selection function which(): similar to subset(), but it returns the positions (that is, the index numbers) of the elements that satisfy the condition.
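The three filtering approaches above can be compared side by side on a small made-up vector containing an NA:

```r
x <- c(6, 1:3, NA, 12)

idx <- x > 5            # TRUE FALSE FALSE FALSE NA TRUE
a <- x[idx]             # 6 NA 12: the NA is kept
b <- subset(x, x > 5)   # 6 12: the NA is dropped
w <- which(x > 5)       # 1 6: positions, not values
```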

The vectorized ifelse() function:

Call form: ifelse(b, u, v), where b is a Boolean vector and u and v are vectors. The return value is a vector whose i-th element is u[i] if b[i] is TRUE and v[i] if b[i] is FALSE.

ifelse() can be used to recode vectors; for more than two codes, consider nesting:

# Nested ifelse(): recode the values M, F, I in g as 1, 2, 3
g <- c("M", "F", "F", "I", "M", "M", "F")
ifelse(g == "M", 1, ifelse(g == "F", 2, 3))


Testing vector equality:

Consider the following code:

x <- 1:2
y <- c(1, 2)
x == y           # returns TRUE TRUE, because "==" is a vectorized function
all(x == y)      # returns TRUE, because every element of the comparison is TRUE
identical(x, y)  # returns FALSE, because identical() tests whether two objects are exactly the same
typeof(x)        # "integer"
typeof(y)        # "double"


Names of vector elements:

The names() function assigns or queries the names of vector elements:

x <- c(1, 2, 4)
# Assign names
names(x) <- c("a", "b", "ab")
# Query names
names(x)  # returns "a" "b" "ab"


Names can be used to index elements in a vector

Some things to note about the c() function:

If the arguments passed to c() have different types, they are all coerced to a single common type, the one that preserves as much of their information as possible.

The c() function has a flattening effect on vectors:

c(5, 2, c(1.5, 6))  # [1] 5.0 2.0 1.5 6.0
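The type coercion performed by c() can be checked with typeof(); the argument values below are arbitrary.

```r
t1 <- typeof(c(5, 2L))                     # "double": the integer is promoted
t2 <- typeof(c(5, 2, "abc"))               # "character": numbers become strings
t3 <- typeof(c(5, 2, list(a = 1, b = 4)))  # "list": everything becomes a list component
```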



Special cases of vectors: matrices and arrays

A matrix is a special vector with two additional attributes: the number of rows and the number of columns. An array is a more general matrix; a higher-dimensional array has further dimensions beyond rows and columns.

Creating a matrix:

Consider the following code:

> y <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> m <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
> m
     [,1] [,2]
[1,]    1    2
[2,]    3    4


Note that although matrix m was filled by row (byrow only affects the order in which the data are entered), R still stores the matrix internally in column-major order.

General matrix operations:

Linear algebra operations: note that matrix multiplication uses "%*%".

Matrix indexing: similar to vector indexing; submatrices can be extracted, assigned to, and deleted.

Matrix filtering: similar to vector filtering, using Boolean values computed from a condition; take care to avoid accidental dimension reduction.
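A brief sketch of these operations on a small example matrix (values chosen arbitrarily):

```r
y <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

elementwise <- y * y         # element-by-element product
product <- y %*% y           # linear-algebra matrix product

second_col <- y[, 2]         # second column, reduced to the vector 3 4
big <- y[y[, 2] >= 4, ]      # rows whose second column is >= 4; only one
                             # row matches, so the result drops to a vector
```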

Calling a function on the rows and columns of a matrix:

Using the apply() function: the general call format is:

apply(m, dimcode, f, fargs)


Here m is a matrix; dimcode is the dimension number (1 applies the function to rows, 2 to columns); f is the function to apply; and fargs is an optional set of arguments for f. If the function called by apply() returns a vector of k elements, the result of apply() has k rows by default; use the transpose function t() if a different orientation is needed.

Note that apply() does not necessarily make a program run faster. Its advantages are that it makes the code compact, easier to read and modify, and avoids bugs that can creep in when writing loops.
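A small sketch of apply() on rows and columns, including the k-rows behavior described above. The matrix and the anonymous functions are illustrative only.

```r
z <- matrix(1:6, nrow = 3)   # columns are (1, 2, 3) and (4, 5, 6)

col_means <- apply(z, 2, mean)   # 2 5
row_sums  <- apply(z, 1, sum)    # 5 7 9

# The function returns 2 values per row, so the result has 2 rows
# (one column per input row); t() restores one row per input row
halved <- apply(z, 1, function(r) r / 2)
dim(halved)      # 2 3
dim(t(halved))   # 3 2

# Extra arguments for the function follow it in the call
scaled <- apply(z, 1, function(r, d) r / d, 2)  # same result as halved
```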

Adding or removing rows or columns of a matrix:

To delete rows or columns, reassign the matrix using a negative index (as with vectors); assigning NULL to a row or column, which works for list components, raises an error for a matrix. To add rows or columns, use rbind() or cbind().

Avoid calling rbind() or cbind() repeatedly inside a loop: each call creates a new matrix, which slows the program down. A better approach is to allocate a large matrix before the loop starts and assign to it row by row or column by column inside the loop, avoiding time-consuming memory allocation during the loop.
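The two loop patterns, plus deletion by negative index, might look like this. This is a minimal sketch; the sizes and values are arbitrary.

```r
n <- 5

# Slow pattern: every rbind() call allocates a new, larger matrix
grown <- NULL
for (i in 1:n) grown <- rbind(grown, c(i, i^2))

# Preferred pattern: allocate once, then fill row by row
filled <- matrix(0, nrow = n, ncol = 2)
for (i in 1:n) filled[i, ] <- c(i, i^2)

# Deleting a row means reassigning with a negative index
smaller <- filled[-2, ]
```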

Differences between vectors and matrices:

A matrix is also a vector, so length() can be used to find its length. On the other hand, from an object-oriented point of view, a matrix class does exist. The dim() function accesses the dimension attribute of a matrix (the numbers of rows and columns), and nrow() and ncol() return the number of rows and columns individually (they are simple wrappers around dim()). These functions are typically used when writing general-purpose library functions that take matrix arguments: the numbers of rows and columns can be obtained from the matrix itself, without extra parameters.
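These functions can be tried on any small matrix, for example:

```r
z <- matrix(1:8, nrow = 4)
length(z)   # 8: a matrix is still a vector
class(z)    # "matrix" "array" in R >= 4.0, "matrix" in older versions
dim(z)      # 4 2
nrow(z)     # 4
ncol(z)     # 2
```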

Avoiding accidental dimension reduction:

There are two ways. When extracting a submatrix by indexing, set the argument drop = FALSE (note that "[" is itself a function and drop is one of its arguments). If the submatrix has already been extracted and reduced to a vector, use as.matrix() to convert it back to a matrix object.

z <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 4)
# Indexing with drop = FALSE to prevent dimension reduction
r <- z[2, , drop = FALSE]
# Using the as.matrix() function
u <- z[2, ]
v <- as.matrix(u)  # note: v is a 2 x 1 column matrix


Naming the rows and columns of a matrix:

Use the rownames() and colnames() functions.

Higher-dimensional arrays:

Take a simple three-dimensional array as an example:

# First create two matrices to serve as the first and second layers of the array
# (the data values are garbled in the source; the values below are illustrative)
firsttest <- matrix(c(46, 21, 50, 30, 25, 50), nrow = 3)
secondtest <- matrix(c(46, 41, 50, 43, 35, 50), nrow = 3)
# Build a three-dimensional array; the three numbers in dim are the numbers of rows, columns, and layers
tests <- array(data = c(firsttest, secondtest), dim = c(3, 2, 2))



The foundation of data frames and object-oriented programming: lists

A list in R is similar to a dictionary in Python, a hash in Perl, or a struct in C.

Creating a list:

# Create a simple list
j <- list(name = "Joe", salary = 55000, union = TRUE)
# Tags may be abbreviated where no ambiguity is possible
j$sal
# A list is actually a vector, so it can be created with vector()
z <- vector(mode = "list")
z[["abc"]] <- 3


General list operations:

List indexing: note the following code:

# Three ways to extract a list component
j1 <- j$salary
j2 <- j[["salary"]]
j3 <- j[[2]]
# All three have the same effect: they extract the second component of list j,
# and the return value has the type of the component itself
# Extracting a sublist
j4 <- j["salary"]
j5 <- j[2]
# These two have the same effect: they extract a sublist of list j,
# and the return value is a list


Adding or removing list elements:

Adding list elements: components can be added directly by index, in any of the five ways shown in the code above; you can add a single component, or add several components at once as a sublist.

Deleting list elements: assign NULL to the component to be deleted. Note that after an intermediate component is deleted, the indices of all subsequent components decrease by 1.
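A short sketch of adding and deleting components; the list contents are made up for illustration.

```r
z <- list(a = "abc", b = 12)
z$c <- "sailing"          # add a component by name
z[[4]] <- 28              # add a component by position
z[5:6] <- c(FALSE, TRUE)  # add several components at once
length(z)                 # 6

z$b <- NULL               # delete component b
length(z)                 # 5
z[[3]]                    # 28: formerly the fourth component
```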

Getting the length of a list:

The length() function returns the number of components in a list, because a list is a vector.

Accessing list tags and values:

The names() function returns the tags of the list components.

The unlist() function returns the values of a list as a vector, whose type is the one that best preserves the common characteristics of all elements. In general, the order of precedence of the types is NULL < raw < logical < integer < double < complex < character < list < expression (pairlists are treated as ordinary lists).
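For example, with a small made-up list whose components have mixed types:

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)
names(j)          # "name" "salary" "union"
u <- unlist(j)    # a named character vector: "Joe" "55000" "TRUE"
typeof(u)         # "character"

n <- unlist(list(a = 5, b = 12L, c = TRUE))
typeof(n)         # "double": the logical and the integer are promoted
```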

Using the apply family of functions on lists:

Using the lapply() and sapply() functions:

lapply() (for list apply) works like apply() for matrices: it calls the given function on each component of a list (or of a vector coerced to a list) and returns another list.

In some cases the list returned by lapply() can be simplified to a vector or matrix. In that case use sapply() (for simplified [l]apply).
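The difference in return type can be seen in a two-line sketch (input values are arbitrary):

```r
l <- lapply(list(1:3, 25:29), median)   # a list with components 2 and 27
s <- sapply(list(1:3, 25:29), median)   # simplified to the vector 2 27
```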

Recursive lists:

Lists are recursive: the components of a list can themselves be lists.

The concatenation function c() has an optional argument recursive, which determines whether the lists are flattened when concatenated, that is, whether the elements of all components are extracted into a single vector.

> c(list(a = 1, b = 2, c = list(d = 5, e = 9)))
$a
[1] 1

$b
[1] 2

$c
$c$d
[1] 5

$c$e
[1] 9

> c(list(a = 1, b = 2, c = list(d = 5, e = 9)), recursive = TRUE)
  a   b c.d c.e 
  1   2   5   9 



Data frame

Intuitively, a data frame resembles a matrix, with the two dimensions of rows and columns. It differs from a matrix in that each column may have a different mode. Technically, a data frame is a list whose components are vectors of equal length.

Creating a data frame:

Note the use of the argument stringsAsFactors = FALSE in data.frame().
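A minimal sketch of creating a data frame; the column names and values are invented for illustration. From R 4.0.0 onward, stringsAsFactors = FALSE is already the default, so passing it explicitly mainly matters for older R versions.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)
d <- data.frame(kids, ages, stringsAsFactors = FALSE)
str(d)           # kids is a character column, ages is numeric
class(d$kids)    # "character" rather than "factor"
```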

Accessing a data frame: three ways:

# Access components list-style
d[[1]]
d$kids
# Access by column, matrix-style
d[, 1]


The str() function displays the internal structure of a data frame. Note that all three access methods above return the same result: one column of the data frame.

In general, indexing by name is safer, but matrix notation is often used when writing R packages.

Other matrix-like operations:

Extracting a sub-data frame: a data frame can be viewed in terms of rows and columns, so a sub-data frame can be extracted by row or by column. As with matrices, set drop = FALSE to avoid accidental dimension reduction.

a <- examquiz[2:5, 2]
b <- examquiz[2:5, 2, drop = FALSE]
class(a)  # "numeric"
class(b)  # "data.frame"


Using rbind() and cbind(): when adding a new row with rbind(), the row can be given as a data frame or a list, and it must have the same number of columns as the data frame. New columns can also be added through the list nature of a data frame; note that if the new column is shorter than the data frame, it is recycled automatically (its length must divide the number of rows).

Handling missing values: many statistical functions return NA when the data contain missing values, unless na.rm = TRUE is set explicitly. The subset() function is convenient for conditional filtering because it drops rows for which the condition is NA, so no na.rm argument is needed. Alternatively, to simply delete observations with missing values, use complete.cases() as the filtering condition to keep only the complete observations.
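These points can be sketched together on a small made-up data frame; the column names, values, and the flag column are hypothetical.

```r
x <- c(2, NA, 4)
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 3

d <- data.frame(kids = c("Jack", NA, "Jillian", "John"),
                states = c("CA", "MA", "MA", NA),
                stringsAsFactors = FALSE)
complete.cases(d)              # TRUE FALSE TRUE FALSE
d2 <- d[complete.cases(d), ]   # keeps rows 1 and 3

# Add a column through the list interface; the single value is recycled
d2$flag <- 0

# Add a row: it must supply one value per column
d3 <- rbind(d2, data.frame(kids = "Laura", states = "NY", flag = 1,
                           stringsAsFactors = FALSE))
```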

Using apply(): if all columns of the data frame have the same data type, apply() can be used on the data frame, which is then treated as a matrix.

Merging data frames:

The merge() function combines two tables according to the values of a common variable.

Note that when choosing the matching variable, beware of duplicate values within it: they can produce incorrect results (the match becomes one-to-many).
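A sketch of merge() on two small made-up data frames, including the duplicate-key pitfall just mentioned:

```r
d1 <- data.frame(kids = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ages = c(10, 7, 12),
                 kids = c("Jill", "Lillian", "Jack"),
                 stringsAsFactors = FALSE)

# Joins on the common column "kids"; only Jack and Jill appear in both
m <- merge(d1, d2)

# A duplicated key value produces one output row per matching pair
d3 <- rbind(d2, data.frame(ages = 9, kids = "Jill", stringsAsFactors = FALSE))
m2 <- merge(d1, d3)   # Jill now appears twice
```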

Applying functions to a data frame:

Applying lapply() and sapply() to a data frame: a data frame is a special case of a list, with the columns as the list components. Calling lapply() on a data frame with a function f() applies f() to each column and places the return values in a list.
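For instance, on a small made-up data frame:

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# Each column is sorted independently and returned as a list component
sorted <- lapply(d, sort)

# sapply() simplifies the per-column results to a named vector
means <- sapply(d[, "ages", drop = FALSE], mean)   # ages = 11
```

Because each column is processed on its own, sorting this way loses the correspondence between the rows.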


Factors and tables

The design of factors derives from nominal variables, also called categorical variables, in statistics: their values are not numbers but correspond to categories.

"Tables" in this chapter is a general term covering frequency tables and contingency tables; some common operations on them are explored.

Factors and levels:

In R, a factor can be viewed simply as a vector with additional information attached. This additional information is a record of the distinct values in the vector, called levels.

The length of a factor is defined as the length of the data, not the number of levels.

If you anticipate other levels appearing in the future, declare them in advance; otherwise inserting data with a new level is not possible.
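The behavior of levels can be sketched as follows; the data values and the extra level 88 are arbitrary.

```r
x <- c(5, 12, 13, 12)
xf <- factor(x)
levels(xf)    # "5" "12" "13"
length(xf)    # 4: the length of the data, not the number of levels

# Anticipate a future level 88 by declaring it up front
xff <- factor(x, levels = c(5, 12, 13, 88))
xff[2] <- 88  # allowed: 88 is a declared level

# Assigning an undeclared level produces NA (with a warning, suppressed here)
suppressWarnings(xf[2] <- 28)
is.na(xf[2])  # TRUE
```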

Common functions for factors:

tapply() function: call form tapply(x, f, g), where x is a vector, f is a factor or a list of factors, and g is a function. tapply() (temporarily) splits x into groups, each group corresponding to one factor level (or to one combination of levels in the case of several factors), and then applies g() to each of the resulting subvectors of x.

> # tapply() example
> ages <- c(25, 26, 55, 37, 21, 42)
> affils <- c("R", "D", "D", "R", "U", "D")
> tapply(ages, affils, mean)
 D  R  U 
41 31 21 


The above is the simple form, grouping by a single factor. To group by a combination of two or more factors, simply replace f with a list of those factors:

# Suppose data frame d has three columns: income, gender, over25.
# Apply mean() to income, grouped by the latter two columns
tapply(d$income, list(d$gender, d$over25), mean)


split() function: splits a vector into groups. This is equivalent to the first step of tapply(), without the subsequent application of a function. Basic call form: split(x, f), where x can be a vector or a data frame (unlike tapply(), which does not accept a data frame) and f is a factor or a list of factors. The return value is a list.
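A sketch of split() on illustrative data, followed by the "apply" step done by hand to show the relationship to tapply():

```r
ages <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

groups <- split(ages, affils)
# groups$D is 26 55 42, groups$R is 25 37, groups$U is 21

group_means <- sapply(groups, mean)   # D 41, R 31, U 21
```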

by() function: works like tapply(): the data are first grouped and then the function is called on each group. tapply() requires the input data to be a vector, whereas by() accepts a data frame or a matrix.

# With Gender as the grouping factor, regress column 2 on column 3 within each group
aba <- read.csv("abalone.data", header = TRUE)
by(aba, aba$Gender, function(m) lm(m[, 2] ~ m[, 3]))


Table operations: tables (frequency tables and contingency tables) are usually created with the table() function.

Computing marginal values: the addmargins() function.

Getting dimension names and level values: the dimnames() function.

Tables can also be expressed as data frames, using the as.data.frame() function.
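The table functions listed above can be tried on a small hypothetical data set (the votes data frame and its columns are invented for illustration):

```r
votes <- data.frame(
  candidate = c("A", "B", "A", "A", "B"),
  again     = c("Y", "N", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

tab <- table(votes)              # 2 x 2 contingency table
with_totals <- addmargins(tab)   # adds a "Sum" row and column
dimnames(tab)                    # $candidate: "A" "B"; $again: "N" "Y"
as_df <- as.data.frame(tab)      # one row per cell, with a Freq column
```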

Other functions related to factors and tables:

aggregate() function: calls tapply() once for each variable in a group.

cut() function: a common way to generate factors, especially for use in tables. Call form: cut(x, b, labels = FALSE), where x is the input vector and the vector b defines a set of intervals; the return value is a vector giving, for each element of x, the number of the interval it falls into. Put simply, it recodes the data. By default the intervals defined by b are left-open and right-closed.
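A short sketch of cut() with arbitrary data and break points:

```r
x <- c(0.5, 1.0, 1.5, 2.7)
b <- c(0, 1, 2, 3)

codes <- cut(x, b, labels = FALSE)  # 1 1 2 3: interval numbers
f <- cut(x, b)                      # a factor with levels "(0,1]" "(1,2]" "(2,3]"
```

The value 1.0 lands in the first interval because the intervals are closed on the right.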
