Preface
When you draw a statistical graph, more than half of the time is spent shaping the data before the plotting command is ever called, because the data frame must be converted to the appropriate format before it can be passed to the plotting function.
This article gives some basic tips on data shaping in R. For more technical detail, see the "R Language Core Handbook".
Data frame shaping
1. Create a data frame: data.frame()
# Create vector p
p = c("A", "B", "C")
# Create vector q
q = 1:3
# Create a data frame with two columns, p and q
dat = data.frame(p, q)
The result is a data frame with three rows and two columns, p and q.
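2. View data frame information: str()
# Show the structure of the data frame dat
str(dat)
The output reports the number of observations and variables, followed by the type and first few values of each column.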
3. Add columns to a data frame
The basic form is: data frame$new column name = vector. The following code creates a column named newcol in the dat data frame and assigns the vector v to it:
dat$newcol = v
If the vector is shorter than the number of rows in the data frame, R recycles it until all rows are filled, provided the number of rows is a whole multiple of the vector length. Otherwise the assignment fails with an error.
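The recycling rule can be checked with a small sketch. The data frame and values here are hypothetical and made up purely for illustration:

```r
# Hypothetical data frame with four rows
dat = data.frame(p = c("A", "B", "C", "D"), q = 1:4)

# A length-2 vector divides four rows evenly, so it is recycled twice
dat$newcol = c(10, 20)
print(dat$newcol)

# A length-3 vector does not divide four rows evenly, so R raises an error
result = tryCatch({
  dat$badfit = 1:3
  "assigned"
}, error = function(e) "error")
print(result)
```

A single value is the most common case: it is simply repeated for every row.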
4. Remove columns from a data frame
You can assign NULL to a column. The following code removes the badcol column from the data frame:
dat$badcol = NULL
You can also use the subset function (described later) with a minus sign in front of the column to be dropped:
dat = subset(dat, select = -badcol)
5. Rename the columns of a data frame
You can assign a vector of column names through the names function:
names(dat) = c("name1", "name2", "name3")
If you want to rename one column by its current name, you can do this:
# Rename the column named ctrl to Cntrol
names(anthoming)[names(anthoming) == "ctrl"] = c("Cntrol")
6. Reorder the columns of a data frame
You can reorder by numeric position:
# Reorder columns by numeric position
dat = dat[c(1, 3, 2)]
You can also reorder by column name:
# Reorder columns by name
dat = dat[c("col1", "col3", "col2")]
7. Extract a subset of a data frame: subset()
The following R code takes the climate data frame, keeps the records whose Source is "Berkeley", and returns only the Year and Anomaly10y columns:
# subset(): the first argument is the data frame, the second is the row condition, and select picks the columns
subset(climate, Source == "Berkeley", select = c(Year, Anomaly10y))
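As a rough illustration of what subset() does, the sketch below runs the same call on a small stand-in data frame. The name climate_demo and all of its values are hypothetical. The bracket-indexing line shows an equivalent selection in base R.

```r
# Hypothetical stand-in for the climate data frame (values are made up)
climate_demo = data.frame(
  Source = c("Berkeley", "Berkeley", "CRUTEM3", "CRUTEM3"),
  Year = c(1900, 1901, 1900, 1901),
  Anomaly1y = c(NA, NA, -0.3, -0.2),
  Anomaly10y = c(-0.17, -0.16, -0.10, -0.09)
)

# Keep only the Berkeley rows and two of the columns
berkeley = subset(climate_demo, Source == "Berkeley", select = c(Year, Anomaly10y))
print(berkeley)

# The same selection written with bracket indexing
same = climate_demo[climate_demo$Source == "Berkeley", c("Year", "Anomaly10y")]
print(all.equal(berkeley, same))
```

Inside subset(), column names are written without quotes and without the data frame prefix, which keeps interactive code short.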
Factor level shaping
1. Change the order of factor levels based on data values: reorder()
The following example reorders the levels of the factor in the spray column by the mean of the count column within each level:
# reorder(): the first argument is the factor, the second is the data vector to sort by, and FUN is the summary function
iss$spray = reorder(iss$spray, iss$count, FUN = mean)
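The effect of reorder() is easiest to see by printing the levels before and after. This sketch assumes that iss is a copy of R's built-in InsectSprays data set, which has spray and count columns. The text above does not say where iss comes from, so treat that as an assumption.

```r
# Assumption: iss is a copy of the built-in InsectSprays data set
iss = InsectSprays

# Levels start in alphabetical order
print(levels(iss$spray))

# Reorder the levels by the mean count within each spray
iss$spray = reorder(iss$spray, iss$count, FUN = mean)
print(levels(iss$spray))

# Group means listed in the new level order are non-decreasing
group_means = tapply(iss$count, iss$spray, mean)
print(group_means)
```

Only the level order changes. The values stored in the column stay the same, which is why this is useful for controlling axis order in plots.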
2. Rename factor levels: revalue() / mapvalues() in the plyr package
Either of the following two approaches renames the levels "small", "medium", and "large" of factor f to "S", "M", and "L":
library(plyr)
# Method one
f = revalue(f, c(small = "S", medium = "M", large = "L"))
# Method two
f = mapvalues(f, c("small", "medium", "large"), c("S", "M", "L"))
3. Remove unused factor levels: droplevels()
The following R code drops the unused levels from factor f:
droplevels(f)
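Unused levels typically appear after filtering. A minimal sketch with a hypothetical factor:

```r
# Hypothetical factor with three levels
f = factor(c("small", "medium", "large", "small"))

# Filtering out "large" removes the elements but keeps the level
f_sub = f[f != "large"]
print(levels(f_sub))

# droplevels() returns a new factor without the unused level
f_clean = droplevels(f_sub)
print(levels(f_clean))
```

droplevels() does not modify its argument, so the result has to be assigned if it is needed later.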
Variable shaping
1. Replace values: match()
To replace some values with other specific values, you can use the match function. The following R code maps the values "ctrl", "trt1", and "trt2" (stored in oldvals) of the group column in data frame pg to "No", "Yes", and "Yes" respectively, and stores the result in a new treatment column:
# Old values
oldvals = c("ctrl", "trt1", "trt2")
# New values
newvals = factor(c("No", "Yes", "Yes"))
# Replace
pg$treatment = newvals[match(pg$group, oldvals)]
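The lookup works in two steps, which the following sketch separates. It assumes pg is a copy of R's built-in PlantGrowth data set, whose group column has the levels "ctrl", "trt1", and "trt2". The text above does not name the data source, so that is an assumption.

```r
# Assumption: pg is a copy of the built-in PlantGrowth data set
pg = PlantGrowth

oldvals = c("ctrl", "trt1", "trt2")
newvals = factor(c("No", "Yes", "Yes"))

# Step 1: match() gives, for each row, the position of its group in oldvals
idx = match(pg$group, oldvals)
print(head(idx))

# Step 2: indexing newvals by those positions yields the replacement values
pg$treatment = newvals[idx]
print(table(pg$group, pg$treatment))
```

Any value of pg$group that is missing from oldvals would produce NA in the new column, which makes unmapped values easy to spot.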
2. Transform data by group: ddply() in the plyr package
By passing transform as the function argument of ddply(), you can transform the data within each group. The following R code groups the cabbages data frame by the Cult column and creates a new column named DevWt, computed as each HeadWt value minus the mean HeadWt of its group:
library(MASS)  # provides the cabbages data set
library(plyr)
# ddply(): the first argument is the data frame, the second is the grouping variable, the third is the function to apply, and the rest define the new columns
cb = ddply(cabbages, "Cult", transform, DevWt = HeadWt - mean(HeadWt))
3. Summarize data by group: ddply() in the plyr package
By passing summarise as the function argument of ddply(), you can summarize the data within each group. The difference from the transformation described above is that a summary returns one record per group, whereas a transformation leaves the number of records unchanged and only adds or modifies columns. The following R code groups the cabbages data frame by the Cult and Date columns and creates a column named Weight that holds the mean HeadWt of each group:
# ddply(): the first argument is the data frame, the second is the grouping variables, the third is the function to apply, and the rest define the new columns
cb = ddply(cabbages, c("Cult", "Date"), summarise, Weight = mean(HeadWt))
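The difference in row counts between transforming and summarizing can be seen without plyr. The sketch below is not the ddply() approach itself. It is a base R stand-in that uses ave() for the transform-style result and aggregate() for the summary-style result, on a hypothetical data frame.

```r
# Hypothetical data: two groups with three measurements each
d = data.frame(
  grp = c("a", "a", "a", "b", "b", "b"),
  wt = c(1, 2, 3, 10, 20, 30)
)

# Transform-style: one value per original row (deviation from the group mean)
d$dev = d$wt - ave(d$wt, d$grp, FUN = mean)
print(d)

# Summary-style: one row per group
s = aggregate(wt ~ grp, data = d, FUN = mean)
print(s)
```

Here d keeps its six rows and gains a column, while s has only two rows, one per group.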
Long/wide data shaping
1. Wide data to long data: melt() in the reshape2 package
The anthoming dataset has three columns: angle, expt, and ctrl.
The expt and ctrl columns can be combined into a single column. The data frame after merging is called long data, and the data frame before merging is called wide data, which are fitting names.
The following R code uses the melt function to "lengthen" this data set:
library(reshape2)
# melt(): the first argument is the data frame, id.vars gives the identifier column, variable.name names the column that holds the former column names, and value.name names the column that holds the values
melt(anthoming, id.vars = "angle", variable.name = "condition", value.name = "count")
After lengthening, the data frame has three columns: angle, condition (either expt or ctrl), and count.
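To make the wide-to-long idea concrete, the sketch below builds the long form by hand in base R. It is an illustration of the result's shape and not how melt() is implemented. The data frame only mimics the layout of anthoming, and its values are made up.

```r
# Hypothetical wide data in the same layout as anthoming (values are made up)
wide = data.frame(angle = c(-20, -10, 0), expt = c(1, 7, 2), ctrl = c(0, 3, 3))

# Stack the two measurement columns and record which column each value came from
long = data.frame(
  angle = rep(wide$angle, times = 2),
  condition = rep(c("expt", "ctrl"), each = nrow(wide)),
  count = c(wide$expt, wide$ctrl)
)
print(long)
```

Three wide rows with two measurement columns become six long rows, one per angle and condition pair.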
2. Long data to wide data: dcast() in the reshape2 package
The plum dataset has four columns: length, time, survival, and count.
Using the length and time columns as identifier columns, the following R code widens the data frame:
# dcast(): the first argument is the data frame, the formula lists the identifier columns on the left and the column whose values become new column names on the right, and value.var names the column that holds the values
dcast(plum, length + time ~ survival, value.var = "count")
After widening, each combination of length and time occupies a single row, with one count column for each value of survival.
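The same long-to-wide idea can be sketched with base R's reshape() function, used here as a stand-in for dcast() so that no extra package is needed. The data frame only mimics the layout of plum, and its values are made up.

```r
# Hypothetical long data in the same layout as plum (values are made up)
long = data.frame(
  length = c("long", "long", "short", "short"),
  time = c("at_once", "at_once", "in_spring", "in_spring"),
  survival = c("dead", "alive", "dead", "alive"),
  count = c(80, 160, 200, 40)
)

# reshape(): idvar columns identify a row, timevar supplies the new column suffixes
wide = reshape(long, idvar = c("length", "time"), timevar = "survival",
               v.names = "count", direction = "wide")
print(wide)
```

reshape() names the new columns by joining the value column and the survival value with a dot (for example count.dead), whereas dcast() uses the survival values directly as column names.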
Summary
Before calling any plotting function, the data must be arranged in the form that function requires. This is what data shaping means. Some of the techniques in this article may be confusing at first, but don't worry: move on to the more interesting plotting chapters, and they will gradually make sense as you draw more plots.
R Language Data Visualization, Chapter 2: Data Shaping Techniques