R data types
Data storage types
Basic types
The most basic types store a single value. They mainly include numeric, integer, complex, character, and logical.
Numeric
Numeric, or "double", is R's default way of storing numbers and is equivalent to double in C. Note that "numeric" is sometimes used as a collective term for both integer and double. Variables such as .Machine$double.eps give the limits of double storage in the current environment.
Integer stores whole numbers and is equivalent to int in C. By default R saves a number as numeric whether or not it has a decimal part, so as.integer is needed to force a number to be stored as an integer. .Machine$integer.max gives the largest integer that can be stored, which is always 2^31 - 1 = 2147483647.
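As a small illustration of the numeric and integer types, the following sketch checks how R stores a literal number and what as.integer and .Machine report. The variable names are only illustrative, and the L suffix is an alternative to as.integer that the text above does not mention.

```r
# A literal number is stored as double, even when it has no decimal part.
x <- 5
typeof(x)             # "double"

# as.integer (or the L suffix) forces integer storage.
y <- as.integer(x)
z <- 5L
typeof(y)             # "integer"
typeof(z)             # "integer"

# as.integer truncates toward zero rather than rounding.
as.integer(3.9)       # 3

# Storage limits in the current environment.
.Machine$integer.max  # 2147483647
.Machine$double.eps   # smallest positive number eps such that 1 + eps != 1
```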
Complex is the storage form of complex numbers.
Character
Character is the type that stores text. R makes no distinction between a single character and a string; both are stored as character.
Logical
Logical is the type that stores Boolean values. It has only two values, TRUE (T) and FALSE (F).
Time
The date-time types are designed specifically for storing time. POSIXct stores a time as a single number, the number of seconds since January 1, 1970. POSIXlt stores a list with fields such as the day of the month and the time of day. unclass converts either class to its underlying base type. Related functions include as.POSIXct, as.POSIXlt, strptime, strftime, ISOdate, and ISOdatetime; the chron package can also handle time.
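The difference between the two classes can be seen in a short sketch. The timestamps are arbitrary examples, and the time zone is fixed to UTC here (an assumption made so that the results do not depend on the local settings).

```r
# POSIXct: a single number of seconds since 1970-01-01 00:00:00 UTC.
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
unclass(ct)           # 86400, with a "tzone" attribute

# POSIXlt: a list of fields describing the same instant.
lt <- as.POSIXlt(ct)
lt$mday               # day of the month: 2
lt$hour               # 0
lt$year + 1900        # years are stored as an offset from 1900: 1970

# strptime parses text into POSIXlt; strftime formats a time as text.
parsed <- strptime("2020-03-15 08:30:00",
                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
strftime(parsed, format = "%d/%m/%Y", tz = "UTC")  # "15/03/2020"
```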
Data
Data is rarely a single value, and R provides several structures for storing multiple values.
Vector
A vector is similar to a one-dimensional array and stores elements of a single basic type; if any element is a character, all values are converted to character. Note that a vector can have length 1, so is.vector returns TRUE for a single value: every variable holding a single basic value is itself a vector. For a vector of any length, functions such as class or mode report the type of its elements; for example, they return "character" if the elements are character.
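These points can be checked directly. The sketch below uses made-up values and only base R functions.

```r
# A single value is a vector of length 1.
single <- 3.14
is.vector(single)     # TRUE
length(single)        # 1

# One character element converts every element to character.
mixed <- c(1, 2, "three")
mixed                 # "1" "2" "three"
class(mixed)          # "character"

# class and mode describe the element type, whatever the length.
class(1:10)           # "integer"
mode(1:10)            # "numeric"
```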
The idea behind this design is that R tries to ignore the difference between singular and plural. Not distinguishing sharply between a single value and a vector is characteristic of the language: R is meant for processing large amounts of data, so its constructs and logic follow that requirement. For example, adding 1 to an integer vector whose length is greater than 1 gives a new vector in which 1 has been added to each element of the original.

    x <- 1:6
    x + 1
    # [1] 2 3 4 5 6 7

If two vectors of equal length are added, the corresponding elements are added together.

    x <- 1:6
    y <- 1:6
    x + y
    # [1]  2  4  6  8 10 12

If the two vectors have different lengths, the shorter vector is recycled until it is as long as the longer one. This works cleanly only when the longer length is a multiple of the shorter; otherwise R still returns a result but issues a warning. Adding a single value to a vector is simply the special case where the shorter vector has length 1, which is another way R blurs the difference between a single value and a vector.

    x <- 1:6
    z <- 1:2
    x + z
    # [1] 2 4 4 6 6 8

Factor
A factor is a way to save memory. If a series of values contains many duplicates, a factor stores only one copy of each distinct original value, and the values themselves are stored as integer codes, which saves space.
The distinct original values are called levels, and you can set their order yourself. The levels function returns all possible levels of a factor, and nlevels returns the number of levels.
To convert a vector into a factor, use as.factor. Sometimes numbers are converted to a factor first and, after some processing, need to be converted back to numbers. as.numeric cannot be applied directly here, because it returns the factor's internal integer codes rather than the original values. Use as.character or levels to recover the original values as text first, and then convert them to numbers.

    myfactor <- factor(c(10, 20, 20, 50, 20, 10), levels = c(10, 20, 50), ordered = TRUE)
    as.numeric(myfactor)                    # internal codes: 1 2 2 3 2 1
    as.numeric(levels(myfactor)[myfactor])  # original values: 10 20 20 50 20 10
    as.numeric(as.character(myfactor))      # original values: 10 20 20 50 20 10

Sometimes a factor is needed as a parameter or as test data, and the gl function can generate one. The name gl can be read as an abbreviation of "generate levels". Its main parameters are: n, the number of levels; k, the number of repetitions of each level; length, the length of the result, which can be omitted when the first two parameters are given; labels, the values of the levels; and ordered, a Boolean value that sets whether the levels are ordered.
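A brief sketch of these parameters follows. The label names "Ctrl" and "Treat" are just example values.

```r
# Two levels, each repeated three times; length defaults to n * k = 6.
basic <- gl(2, 3)
basic                   # 1 1 1 2 2 2

# With length = 12 the same pattern is repeated to fill 12 values.
longer <- gl(2, 3, length = 12)
as.integer(longer)      # 1 1 1 2 2 2 1 1 1 2 2 2

# labels sets the level values; ordered = TRUE makes an ordered factor.
treatment <- gl(2, 3, labels = c("Ctrl", "Treat"), ordered = TRUE)
levels(treatment)       # "Ctrl" "Treat"
nlevels(treatment)      # 2
is.ordered(treatment)   # TRUE
```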
Note that when combining several factors with the c function, it is safest to convert each factor back to its original values first and then call c. Otherwise, in versions of R before 4.1.0, c treats the factors as the integer codes stored in memory, and the original meaning is lost.
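One way to combine factors without depending on how c handles them is to go through as.character. The sketch below uses two made-up factors, f1 and f2, and relies on the default alphabetical ordering of levels.

```r
f1 <- factor(c("low", "high"))
f2 <- factor(c("mid", "low"))

# The stored integer codes only have meaning relative to each factor's levels.
as.integer(f1)    # 2 1 (levels are "high", "low")
as.integer(f2)    # 2 1 (levels are "low", "mid")

# Convert to the original values first, then combine and rebuild the factor.
combined <- factor(c(as.character(f1), as.character(f2)))
as.character(combined)  # "low" "high" "mid" "low"
levels(combined)        # "high" "low" "mid"
```

The two factors hold identical codes for different values, which is why combining the raw codes would lose the original meaning.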
Given a vector of continuous values, a histogram can be drawn directly with the corresponding plotting function. If no plot is needed and you only want to know how the values are distributed across the range, use the cut function. cut divides the range into intervals and converts the original vector into a factor whose levels are those intervals, so you can see which interval each value falls into. The resulting factor has the same length as the original vector, and you can set its levels yourself. The table function then counts the number of values in each interval.

    aaa <- c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)
    cut(aaa, 3)
    #  [1] (0.994,3] (0.994,3] (0.994,3] (3,5]     (3,5]     (0.994,3] (0.994,3]
    #  [8] (3,5]     (3,5]     (5,7.01]  (5,7.01]
    # Levels: (0.994,3] (3,5] (5,7.01]
    cut(aaa, 3, dig.lab = 4, ordered_result = TRUE)
    #  [1] (0.994,3] (0.994,3] (0.994,3] (3,5]     (3,5]     (0.994,3] (0.994,3]
    #  [8] (3,5]     (3,5]     (5,7.006] (5,7.006]
    # Levels: (0.994,3] < (3,5] < (5,7.006]

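The counting step with table can be sketched as follows, reusing the same example vector. The explicit break points in the second call are arbitrary values chosen for illustration.

```r
aaa <- c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)

# Three equal-width intervals; the factor is as long as the input.
intervals <- cut(aaa, 3)
length(intervals) == length(aaa)   # TRUE

# Count how many values fall into each interval.
counts <- table(intervals)
counts                             # 5 in (0.994,3], 4 in (3,5], 2 in (5,7.01]

# Break points can also be given explicitly instead of a number of intervals.
own <- cut(aaa, breaks = c(0, 2, 4, 8))
table(own)                         # 3 in (0,2], 4 in (2,4], 4 in (4,8]
```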
Sometimes you need to know the combinations of two factors. The interaction function gives the combinations of the levels of several factors. Not every combination necessarily has data; setting drop = TRUE discards the levels without data and keeps only those that actually occur.

    a <- gl(2, 4, 8)
    b <- gl(2, 2, 8, labels = c("Ctrl", "Treat"))
    interaction(a, b, drop = TRUE, sep = ".")
    # [1] 1.Ctrl  1.Ctrl  1.Treat 1.Treat 2.Ctrl  2.Ctrl  2.Treat 2.Treat
    # Levels: 1.Ctrl 2.Ctrl 1.Treat 2.Treat

Matrix
A matrix is a two-dimensional array in which all elements have the same type. Related functions include as.matrix and is.matrix.
Elements are taken from a matrix by subscript. In general the form is [row, col], where both row and col can be vectors: either a vector of the row or column numbers to extract, or a vector of Boolean values. If the square brackets contain no comma and an element is taken with [num], the matrix is treated as a one-dimensional vector, unrolled in column-major order, and the value at that position is returned. For a 2-by-2 matrix, [3] therefore returns the value at [1,2].
When a single row or column is taken from a matrix, the dimensions of the result are dropped and a plain vector is returned. Adding drop = FALSE inside the brackets, as in [row, col, drop = FALSE], keeps the result as a matrix.
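The subscript forms can be tried on a small matrix. This is a minimal sketch with an arbitrary 2-by-3 matrix named m.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

m[2, 3]                # 6
m[1, c(1, 3)]          # 1 5
m[c(TRUE, FALSE), ]    # logical row selection, first row: 1 3 5
m[3]                   # 3, the same element as m[1, 2]

# Taking one row drops the dimensions unless drop = FALSE is given.
row_vec <- m[1, ]
dim(row_vec)           # NULL
row_mat <- m[1, , drop = FALSE]
dim(row_mat)           # 1 3
```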
In memory, a matrix is a one-dimensional vector stored in column-major order. When building a matrix, it is therefore best to create one that is large enough and then fill it in, rather than creating a small matrix and extending it with rbind or cbind. Each time the matrix grows, R has to allocate new space, and if the added rows or columns do not follow the storage order (as with rbind), the existing elements have to be rearranged as well, which is very inefficient. So allocate a sufficiently large matrix up front, and if the full size turns out not to be needed, extract the used part once at the end.
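The two approaches can be compared in a short sketch. The sizes and values are arbitrary, and n_max is a hypothetical upper bound on the number of rows.

```r
n_max <- 10    # hypothetical upper bound on the number of rows
n_used <- 5    # rows actually filled in this example

# Preallocate once, then fill row by row.
result <- matrix(NA_real_, nrow = n_max, ncol = 2)
for (i in seq_len(n_used)) {
  result[i, ] <- c(i, i^2)
}
# Keep only the part that was used, in a single step at the end.
result <- result[seq_len(n_used), , drop = FALSE]

# For comparison: growing with rbind reallocates the matrix on every pass.
grown <- NULL
for (i in seq_len(n_used)) {
  grown <- rbind(grown, c(i, i^2))
}

all(result == grown)   # TRUE: same content, built differently
```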
Array
An array generalizes the matrix and can have any number of dimensions.
List
A list can combine variables of different types, and a list can also contain sub-lists.
Taking elements from a list requires some care. Single square brackets "[]" return a sub-list of the list; to get the content of an element itself, use double brackets "[[]]" or the dollar sign "$". An element can be referred to by name or by number.

    mylist <- list(one = "one", two = c(2, 2))
    mylist
    # $one
    # [1] "one"
    #
    # $two
    # [1] 2 2
    mylist["one"]
    # $one
    # [1] "one"
    mylist[["one"]]
    # [1] "one"
    mylist$one
    # [1] "one"

Because a list can hold objects of any type, it is convenient for gathering a wide variety of related data in one place. Loosely speaking, a list can be compared to a C++ class that is used only to store data. Related variables are often given similar names to set them apart from other variables, but once there are many of them this becomes awkward to look up. Instead, these related variables can be put in a list and accessed by subscript. If you forget a variable name, the names function or the str function shows the names of the elements in the list. This approach is ideal for saving the same or similar processing results for multiple datasets, and a for loop can do the saving.

    # rna.cuffnorm.result.dir is assumed to be defined earlier as the results directory.
    rna.gene.fpkm <- list()  # the empty list must be created in advance
    for (nam in dir(rna.cuffnorm.result.dir)) {
      rna.gene.fpkm[[nam]] <- read.table(  # create the list element and assign to it
        file.path("./", nam, "genes.fpkm_table"),
        header = TRUE, sep = "\t")
    }

The elements of a list are not stored contiguously in memory; the list only holds references to them, somewhat like a linked structure in C. Adding an element therefore does not require moving the existing element data, so growing a list is not as inefficient as growing a matrix, and using rbind on lists or the corresponding data frames is not as slow as it is for matrices. That is why it is practical to create an empty list (or data frame) and add elements to it incrementally in a for loop.
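The incremental pattern can be tried without any input files. In this sketch the dataset names and the small data frames are invented stand-ins for real results.

```r
datasets <- c("sampleA", "sampleB", "sampleC")   # hypothetical dataset names

# Start from an empty list and add one named element per dataset.
results <- list()
for (nam in datasets) {
  results[[nam]] <- data.frame(name = nam, n_chars = nchar(nam))
}

names(results)                # "sampleA" "sampleB" "sampleC"
str(results, max.level = 1)   # overview of the stored elements

# Combine the per-dataset results in a single step at the end.
combined <- do.call(rbind, results)
nrow(combined)                # 3
```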
Data frame
A data frame (data.frame) is a special kind of list. Like a matrix, it requires every column to have the same length; like a list, it allows each column to have a different type. A data frame looks much like an ordinary Excel table, and its column and row names correspond to the column and row names of the table. Because a data frame has the features of both a list and a matrix, elements can be taken in either way: use "$" to take a column as in a list, or use "[row, col]" to take a specific value as in a matrix.
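Both access styles are shown in the sketch below. The column names gene and fpkm and their values are made up for the example.

```r
df <- data.frame(gene = c("a", "b", "c"),
                 fpkm = c(1.5, 0, 3.2),
                 stringsAsFactors = FALSE)

# List-style access returns a whole column.
df$fpkm            # 1.5 0.0 3.2
df[["gene"]]       # "a" "b" "c"

# Matrix-style access returns a specific value or a subset of rows.
df[2, "fpkm"]      # 0
high <- df[df$fpkm > 1, ]
high$gene          # "a" "c"

# Underneath, a data frame is still a list.
is.list(df)        # TRUE
```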
Querying variable types
Common functions for querying the type of a variable are mode, storage.mode, class, and typeof. They differ in a few ways.
- storage.mode reports how the data is actually stored in memory.
- class reports the object-oriented class in R. For example, the storage mode of a data.frame is actually list, but it is wrapped as the data.frame class so that tabular data can be handled more conveniently.
- mode and typeof both give results very close to the actual type, but mode reports both "integer" and "double" as "numeric".
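The differences are easiest to see side by side. This sketch applies the four functions to a small data frame and to two numbers.

```r
df <- data.frame(a = 1:3)
class(df)           # "data.frame"
typeof(df)          # "list"
storage.mode(df)    # "list"
mode(df)            # "list"

# An integer and a double differ in typeof and storage.mode but not in mode.
typeof(1L)          # "integer"
storage.mode(1L)    # "integer"
mode(1L)            # "numeric"
typeof(1)           # "double"
mode(1)             # "numeric"
```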
To find out the limits of each basic type in the current environment, inspect the .Machine list, which stores the corresponding values.
NA
For various reasons data may contain missing values, which R represents as NA. If an undefined or infinite value results from a calculation, NaN or Inf is used instead. Handling such values is a necessary part of data analysis.
- is.na and is.nan test for missing values: is.na returns TRUE for both NA and NaN, while is.nan returns TRUE only for NaN. Neither treats Inf as missing; use is.infinite or is.finite for that.
- Functions such as mean, var, sum, min, and max have an na.rm parameter; when it is set to TRUE, NA values are removed before the calculation.
- Functions such as lm, glm, and gam have an na.action parameter, which accepts a function such as na.omit, na.fail, na.pass, or na.exclude.
- na.omit returns a data.frame that contains only complete rows, so any row with one or more NA is dropped. complete.cases returns a logical vector marking the complete rows, which can be used to select them.
- For functions such as read.table, na.strings specifies values or strings that should be read as NA.
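A few of these tools are sketched below on made-up data. The string "missing" is an arbitrary marker chosen for the na.strings example.

```r
x <- c(1, NA, NaN, Inf)
is.na(x)       # FALSE  TRUE  TRUE FALSE
is.nan(x)      # FALSE FALSE  TRUE FALSE

# na.rm removes NA before the calculation.
mean(c(1, NA, 3))                 # NA
mean(c(1, NA, 3), na.rm = TRUE)   # 2

# complete.cases marks complete rows; na.omit drops the incomplete ones.
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)                # TRUE FALSE FALSE
cleaned <- na.omit(df)
nrow(cleaned)                     # 1

# na.strings tells read.table which text should be read as NA.
txt <- "id,value\n1,10\n2,missing\n3,30"
parsed <- read.table(text = txt, header = TRUE, sep = ",",
                     na.strings = "missing")
parsed$value                      # 10 NA 30
```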