The dplyr package is mainly used for cleaning and tidying data. Related Coursera course: Getting and Cleaning Data.
You can also load the swirl package and install its Getting and Cleaning Data course to follow along,
as follows:
library(swirl)
install_from_swirl("Getting and Cleaning Data")
swirl()
This article mainly follows the package's introductory vignette: Introduction to dplyr
1. Sample data
> library(nycflights13)
> dim(flights)
[1] 336776
> head(flights, 3)
Source: local data frame [3 x]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1 2013 1 1 517 2 830 UA N14228 1545 EWR IAH 227
2 2013 1 1 4 850 UA N24211 1714 LGA IAH
3 2013 1 1 542 2 923 AA N619AA 1141 JFK MIA 160
Variables not shown: distance (dbl), hour (dbl), minute (dbl)
2. Convert the large data set into a print-friendly tbl_df
> flights_df <- tbl_df(flights)
> flights_df
3. Filtering rows: filter()
> filter(flights_df, month == 1, day == 1)
Source: local data frame [842 x]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1 2013 1 1 517 2 830 UA N14228 1545 EWR IAH
2 2013 1 1 4 850 UA N24211 1714 LGA IAH 227
This keeps only the rows where month is 1 and day is 1.
The same result in base R:
flights_df[flights_df$month == 1 & flights_df$day == 1, ]
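As a small sketch of how the two forms compare, the example below uses a hypothetical toy data frame (`toy` is not part of the original article). The two approaches agree when the columns contain no missing values; when a comparison evaluates to NA, filter() drops the row while bracket indexing returns a row of NAs.

```r
library(dplyr)

# Hypothetical toy data standing in for flights_df; note the NA in month
toy <- data.frame(month = c(1, 1, 2, NA), day = c(1, 2, 1, 1))

# dplyr: comma-separated conditions are combined with AND
by_filter <- filter(toy, month == 1, day == 1)

# Base R: logical indexing with an explicit &
by_index <- toy[toy$month == 1 & toy$day == 1, ]

nrow(by_filter)  # 1
nrow(by_index)   # 2, because the NA condition produces an all-NA row
```

If the base R form is preferred, wrapping the condition in which() is one common way to discard the NA rows.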
4. Selecting rows by position: slice()
slice(flights_df, 1:10)
5. Sorting rows: arrange()
> arrange(flights_df, year, month, day)
Sorts flights_df in ascending order by year, month, and day.
Descending order:
> arrange(flights_df, year, desc(month), day)
The equivalent using order() from base R:
flights_df[order(flights_df$year, flights_df$month, flights_df$day), ]
flights_df[order(desc(flights_df$arr_delay)), ]
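To make the mixed ascending and descending sort concrete, here is a minimal sketch on a hypothetical four-row data frame (`toy` is an illustration, not data from the article). It sorts by year ascending, month descending, then day ascending, and compares the result with a base R order() call.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  year  = c(2013, 2013, 2013, 2013),
  month = c(1, 2, 2, 1),
  day   = c(2, 1, 3, 1)
)

# dplyr: desc() reverses the sort direction for one column
sorted <- arrange(toy, year, desc(month), day)
sorted$month  # 2 2 1 1
sorted$day    # 1 3 1 2

# Base R: negate a numeric column to sort it in descending order
sorted_base <- toy[order(toy$year, -toy$month, toy$day), ]
sorted_base$day  # 1 3 1 2
```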
6. Selecting columns: select()
Select the columns you want by name:
select(flights_df, year, month, day)
This selects three columns.
Use the : operator to select a range of columns:
select(flights_df, year:day)
Use - to exclude columns:
select(flights_df, -(year:day))
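The three selection styles can be checked quickly on a hypothetical toy data frame (the name `toy` and its values are made up for illustration). Looking at names() of each result shows which columns survive.

```r
library(dplyr)

# Hypothetical toy data with five columns
toy <- data.frame(
  year = 2013, month = 1:2, day = 3:4,
  dep_delay = c(2, 4), arr_delay = c(11, 20)
)

by_name  <- select(toy, year, month, day)  # explicit column names
by_range <- select(toy, year:day)          # every column from year through day
dropped  <- select(toy, -(year:day))       # everything except that range

names(by_name)   # "year" "month" "day"
names(by_range)  # "year" "month" "day"
names(dropped)   # "dep_delay" "arr_delay"
```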
7. Adding columns: mutate()
Create new columns computed from existing ones:
> mutate(flights_df,
+   gain = arr_delay - dep_delay,
+   speed = distance / air_time * 60)
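A worked example of the two formulas, using hypothetical values (the `toy` data frame is an assumption for illustration). With air_time in minutes, dividing distance by air_time and multiplying by 60 gives distance units per hour.

```r
library(dplyr)

# Hypothetical toy data: delays and air_time in minutes, distance in miles
toy <- data.frame(
  dep_delay = c(2, 4),
  arr_delay = c(11, 20),
  distance  = c(1000, 500),
  air_time  = c(120, 60)
)

out <- mutate(toy,
  gain  = arr_delay - dep_delay,     # extra delay accumulated in the air
  speed = distance / air_time * 60   # per-minute rate scaled to per-hour
)

out$gain   # 9 16
out$speed  # 500 500
```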
8. Summarizing: summarise()
> summarise(flights,
+   delay = mean(dep_delay, na.rm = TRUE))
This computes the mean of dep_delay, ignoring missing values.
9. Random sampling: sample_n() and sample_frac()
sample_n(flights_df, 10)
Randomly selects 10 rows.
sample_frac(flights_df, 0.01)
Randomly selects 1% of the rows.
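A brief sketch of both sampling functions on a hypothetical 200-row data frame (`toy` and the seed value are arbitrary choices for illustration). Setting a seed first makes the draw repeatable; by default both functions sample without replacement.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the draw is repeatable

# Hypothetical toy data with 200 rows
toy <- data.frame(id = 1:200)

fixed <- sample_n(toy, 10)       # a fixed number of rows
share <- sample_frac(toy, 0.01)  # a fraction of the rows

nrow(fixed)  # 10
nrow(share)  # 2
```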
10. Grouping: group_by()
by_tailnum <- group_by(flights, tailnum)
# Group by tailnum and assign the result to by_tailnum
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
# For each tailnum group in flights, compute the number of flights and the mean distance and arr_delay
delay <- filter(delay, count > , dist < )
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
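The grouped summary can be traced by hand on a hypothetical three-row data frame (`toy` and its values are made up for illustration). Each summary expression is evaluated once per group, and n() counts the rows in the current group.

```r
library(dplyr)

# Hypothetical toy data: two flights for plane "A", one for plane "B"
toy <- data.frame(
  tailnum   = c("A", "A", "B"),
  distance  = c(100, 300, 500),
  arr_delay = c(10, NA, 30)
)

res <- summarise(group_by(toy, tailnum),
  count = n(),                             # rows per group
  dist  = mean(distance, na.rm = TRUE),    # mean distance per group
  delay = mean(arr_delay, na.rm = TRUE)    # mean delay, skipping the NA
)

res$count  # 2 1
res$dist   # 200 500
res$delay  # 10 30
```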
Without chaining, each intermediate result has to be stored in a variable:
a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
a3 <- summarise(a2,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)
11. The pipe operator %>%
Start from the data set, then chain several operations on it:
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Each function's first argument, the data, is omitted because the pipe supplies it.
For more information on this package, refer to its reference manual (60 pages): dplyr
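As a minimal sketch of what the pipe does, `x %>% f(y)` is evaluated as `f(x, y)`. The example below runs the same computation in nested and piped form on a hypothetical toy data frame (`toy` and its values are assumptions for illustration).

```r
library(dplyr)

# Hypothetical toy data: two months with two departure delays each
toy <- data.frame(month = c(1, 1, 2, 2), dep_delay = c(10, 50, 5, 15))

# Nested form: read from the inside out
nested <- filter(
  summarise(group_by(toy, month), dep = mean(dep_delay, na.rm = TRUE)),
  dep > 20
)

# Piped form: read from top to bottom
piped <- toy %>%
  group_by(month) %>%
  summarise(dep = mean(dep_delay, na.rm = TRUE)) %>%
  filter(dep > 20)

nested$dep  # 30
piped$dep   # 30
```

Both forms return the single row for month 1; the piped version simply lists the steps in the order they run.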