Data reshaping in R typically uses the reshape2 package, which converts between wide data and long data. Because reshape2 is not part of the default R installation, it must be installed and loaded before first use:
install.packages("reshape2")
library(reshape2)
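If the script may run more than once, it can be convenient to install the package only when it is missing. The following is a minimal sketch; the CRAN mirror URL is an assumption and can be replaced by any mirror you prefer.

```r
# Install reshape2 only when it is not already available.
# The repos URL is an assumed CRAN mirror, not something the article prescribes.
if (!requireNamespace("reshape2", quietly = TRUE)) {
  install.packages("reshape2", repos = "https://cloud.r-project.org")
}

# Attach the package so melt() and dcast() can be called directly.
library(reshape2)

# Confirm that the two functions used in this article are visible.
print(exists("melt"))
print(exists("dcast"))
```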
Reshaping works in two steps: first melt the data so that each row is a unique identifier-variable combination, then cast the melted data into whatever shape you want. While casting you can aggregate the data with any function, or simply convert the long format back to a wide format, which is similar to pivoting and unpivoting in Excel.
One, wide data
Create a sample data frame called mydata in wide format, also called wide data:
> ID <- c(1,1,2,2)
> Time <- c(1,2,1,2)
> X1 <- c(5,3,6,2)
> X2 <- c(6,5,1,4)
> mydata <- data.frame(ID,Time,X1,X2)
In the wide format, each combination of ID and Time is unique, and X1 and X2 hold the observed values for that row:
  ID Time X1 X2
1  1    1  5  6
2  1    2  3  5
3  2    1  6  1
4  2    2  2  4
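The claim that ID and Time together identify each row can be checked directly in base R. This is only an illustrative sketch; the name key_is_unique is introduced here and is not part of the original example.

```r
# Rebuild the sample wide data frame from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)

# anyDuplicated() returns 0 when no combination of the key columns repeats.
key_is_unique <- anyDuplicated(mydata[, c("ID", "Time")]) == 0
print(key_is_unique)
```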
Two, melting the data
Melting a dataset restructures it into a specific format: each measured value gets its own row, and each row carries the identifier variables needed to uniquely identify that measurement. We use the melt() function to melt a data frame:
melt(data, id.vars, measure.vars, variable.name = "variable", ..., na.rm = FALSE, value.name = "value", factorsAsStrings = TRUE)
Parameter notes:
- data: the data frame to melt
- id.vars: a vector of identifier variables that identify each observation
- measure.vars: a vector of measured variables
- variable.name: the name of the column that holds the original variable names
- value.name: the name of the column that holds the original values
For example, with ID and Time as identifier variables and X1 and X2 as measured variables:
md <- melt(mydata, id.vars=c("ID","Time"), measure.vars=c("X1","X2"))
After melting, the data is in the so-called long format, also known as long data:
  ID Time variable value
1  1    1       X1     5
2  1    2       X1     3
3  2    1       X1     6
4  2    2       X1     2
5  1    1       X2     6
6  1    2       X2     5
7  2    1       X2     1
8  2    2       X2     4
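The melt step can be reproduced end to end with the short script below. It is a sketch that rebuilds the sample data so it runs on its own, and it assumes reshape2 is already installed.

```r
library(reshape2)

# Rebuild the sample wide data frame from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)

# Melt: keep ID and Time as identifiers, stack X1 and X2 into rows.
md <- melt(mydata, id.vars = c("ID", "Time"), measure.vars = c("X1", "X2"))

# 4 wide rows times 2 measured variables gives 8 long rows.
print(dim(md))
print(md)
```

Each original row contributes one long row per measured variable, which is why the row count doubles here.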
Note: You must specify the identifier variables (ID and Time) needed to uniquely identify each measurement. The column that holds the names of the measured variables (X1 and X2) is created automatically. As the result shows, melt() creates two columns, variable and value, which are the default names. You can override them in the melt() call with the parameters variable.name = "new_variable_name" and value.name = "new_value_name". The result is stored under a separate name here so that md keeps the default column names used in the later examples:
md2 <- melt(mydata, id.vars=c("ID","Time"), measure.vars=c("X1","X2"), variable.name="measuredvariable", value.name="intvalue")
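When the melted columns are renamed, later calls have to refer to the new names. The sketch below shows one way to do that; md_named and wide_named are hypothetical names chosen for this illustration, and value.var is passed explicitly so dcast() does not have to guess the value column.

```r
library(reshape2)

# Rebuild the sample wide data frame from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)

# Melt with custom names for the variable and value columns.
md_named <- melt(
  mydata,
  id.vars       = c("ID", "Time"),
  measure.vars  = c("X1", "X2"),
  variable.name = "measuredvariable",
  value.name    = "intvalue"
)
print(names(md_named))

# Cast back to wide, naming the renamed columns explicitly.
wide_named <- dcast(md_named, ID + Time ~ measuredvariable, value.var = "intvalue")
print(wide_named)
```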
Three, reshaping the data
The dcast() function takes the melted data frame (the d stands for data frame) and reshapes it using a formula and, optionally, a function that aggregates the data.
dcast(data, formula, fun.aggregate = NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess_value(data))
Parameter notes:
- data: the melted data frame
- formula: specifies the layout of the output
- fun.aggregate: the aggregation function to apply when several values fall into the same output cell
- margins: equivalent to the row totals and column totals in a pivot table
- subset: keeps only the data that satisfies a condition, which is equivalent to the filter of an Excel pivot table. For example, subset = .(variable == "length"), where the .() function comes from the plyr package
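As an illustration of the subset parameter on the sample data, the sketch below keeps only the rows measured at Time 1 before casting. It assumes the plyr package is available (reshape2 depends on it) and calls its .() function with the plyr:: prefix; the name time1 is introduced only for this example.

```r
library(reshape2)

# Rebuild and melt the sample data from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)
md <- melt(mydata, id.vars = c("ID", "Time"), measure.vars = c("X1", "X2"))

# Filter the melted rows to Time == 1, then spread variable into columns.
time1 <- dcast(md, ID ~ variable, subset = plyr::.(Time == 1))
print(time1)
```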
The formula parameter has the form:
rowvar1 + rowvar2 + ... ~ colvar1 + colvar2 + ...
In this formula, rowvar lists the variables that are kept as they are and that together uniquely identify each row of the output, and colvar lists the variables whose values are spread out to form the output columns. Reshaping therefore means: for each combination of rowvar, expand colvar into columns and, when fun.aggregate is supplied, aggregate the values that fall into the same cell.
1, expanding colvar
Expanding colvar means turning the values of a column into column names; which column is expanded is determined by the formula parameter.
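To see how the formula controls which values become column names, the following sketch casts the same melted data with two different formulas. The names by_var_time and by_time are chosen for this illustration only.

```r
library(reshape2)

# Rebuild and melt the sample data from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)
md <- melt(mydata, id.vars = c("ID", "Time"), measure.vars = c("X1", "X2"))

# One row per ID; every variable/Time combination becomes its own column.
by_var_time <- dcast(md, ID ~ variable + Time)
print(by_var_time)

# One row per ID and variable; the values of Time become the columns.
by_time <- dcast(md, ID + variable ~ Time)
print(by_time)
```

In both cases every output cell receives exactly one value, so no aggregation function is needed.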
A special case of reshaping is the exact inverse of melting: it converts long data back to wide data, that is, it restores the melted data to its original format. For this operation the formula always has the same form: identifier variables ~ variable.
> dcast(md, ID+Time~variable)
  ID Time X1 X2
1  1    1  5  6
2  1    2  3  5
3  2    1  6  1
4  2    2  2  4
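A quick way to confirm that this cast really undoes the melt is to compare the result with the original data frame column by column. This is a sketch only; wide_again and same_values are names introduced here.

```r
library(reshape2)

# Rebuild and melt the sample data from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)
md <- melt(mydata, id.vars = c("ID", "Time"), measure.vars = c("X1", "X2"))

# Cast back to the wide layout.
wide_again <- dcast(md, ID + Time ~ variable)

# Compare every column of the round-trip result with the original.
same_values <- all(sapply(names(mydata), function(col) {
  all(wide_again[[col]] == mydata[[col]])
}))
print(same_values)
```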
2, aggregating the measured variables
Compute the mean of each measured variable by ID:
> dcast(md, ID~variable, mean)
  ID X1  X2
1  1  4 5.5
2  2  4 2.5
This operation is similar to a grouped aggregation: group by ID, then compute the aggregate value of X1 and of X2 separately.
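The comparison with a grouped aggregation can be made concrete by running the same computation with base R's aggregate() on the wide data. This is a sketch for comparison, not something the original workflow requires; by_id_cast and by_id_base are illustrative names.

```r
library(reshape2)

# Rebuild and melt the sample data from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)
md <- melt(mydata, id.vars = c("ID", "Time"), measure.vars = c("X1", "X2"))

# Mean of each measured variable per ID, via dcast on the long data.
by_id_cast <- dcast(md, ID ~ variable, fun.aggregate = mean)

# The same group-by computed with base R on the wide data.
by_id_base <- aggregate(cbind(X1, X2) ~ ID, data = mydata, FUN = mean)

print(by_id_cast)
print(by_id_base)
```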
3, adding totals
Compute the mean of X1 and X2 grouped by ID, and add margins to the reshaped result: a mean for each column across the IDs, and a mean for each row across X1 and X2.
> dcast(md, ID~variable, mean, margins=c("ID","variable"))
     ID X1  X2 (all)
1     1  4 5.5  4.75
2     2  4 2.5  3.25
3 (all)  4 4.0  4.00
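The calculation works as follows:
The column margin is the mean of each column across the IDs: for X2 it is (5.5+2.5)/2=4
The row margin is the mean of each row across the variables: for the first row it is (4+5.5)/2=4.75
Because every group in this example contains the same number of observations, these means of group means are equal to the means of the underlying values.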
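The margin values can be cross-checked against means taken directly from the melted data. The sketch below does that by position, so it does not depend on how the margin row and column are labeled; with_margins is a name introduced for this illustration. In this data set the mean of the group means and the mean of the underlying values coincide, so the check does not distinguish between the two ways of computing a margin.

```r
library(reshape2)

# Rebuild and melt the sample data from the text.
mydata <- data.frame(
  ID   = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1   = c(5, 3, 6, 2),
  X2   = c(6, 5, 1, 4)
)
md <- melt(mydata, id.vars = c("ID", "Time"), measure.vars = c("X1", "X2"))

# Means by ID with a margin row and a margin column.
with_margins <- dcast(md, ID ~ variable, mean, margins = c("ID", "variable"))
print(with_margins)

# Direct means from the long data, for comparison with the margins.
row_margin_id1   <- mean(md$value[md$ID == 1])          # all values for ID 1
col_margin_x2    <- mean(md$value[md$variable == "X2"]) # all X2 values
grand_mean       <- mean(md$value)                      # every value
print(c(row_margin_id1, col_margin_x2, grand_mean))
```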