Objective:
Learn the two-variable analysis workflow by exploring the data in the file pseudo_facebook.tsv
Knowledge Points:
1. ggplot syntax
2. How to make a scatter plot
3. How to optimize scatter plots
4. Conditional means
5. Correlation between variables
6. Subplots (multi-panel figures)
7. Smoothing
Brief introduction:
When exploring a single variable, a histogram shows how its values are distributed across the whole data set. When exploring two variables, a scatter plot is the more appropriate tool, because it shows the relationship between them.
Case Analysis:
1. Make a scatter plot based on age and number of friends
# Import the ggplot2 plotting package
library(ggplot2)
setwd('d:/udacity/Data Analysis Advanced/R')
# Load the data file (tab-separated)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
# Use qplot syntax to make a scatter plot
qplot(x = age, y = friend_count, data = pf)
# Make the same scatter plot with ggplot syntax, which is clearer
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()
Figure 2-1
2. Overplotting. Most of the points in Figure 2-1 overlap, which makes the relationship between age and number of friends hard to see, so use alpha and geom_jitter to adjust the plot
# geom_jitter spreads out coincident points
# alpha = 1/20 means 20 overlapping points are needed to produce one fully opaque point
# xlim(13, 90) limits the x-axis to values from 13 to 90
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)
Figure 2-2
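To see what jittering does to integer ages, here is a minimal sketch using base R's jitter function on made-up ages. The values and the amount of 0.4 are illustrative assumptions, not taken from the data set, and geom_jitter chooses its own default width. Each age receives a small random offset, so points that would sit exactly on top of one another spread into a narrow band.

```r
# Made-up integer ages: three users aged 20 and three aged 21
set.seed(42)
ages <- c(20, 20, 20, 21, 21, 21)

# Add uniform random noise between -0.4 and +0.4 to each age
jittered <- jitter(ages, amount = 0.4)
print(jittered)

# The offsets stay inside the requested band
offsets <- jittered - ages
print(range(offsets))
```

Combined with a low alpha, this makes dense regions appear darker while sparse regions stay faint.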
3. The coord_trans function applies a transformation to an axis, which can make the plot easier to read.
# Plot the number of friends on a square-root y-axis to make the plot easier to read
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")
Figure 2-3
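The effect of the square-root transform can be checked with plain arithmetic. In this small sketch the friend counts are made-up values: counts that grow quadratically become evenly spaced after the transform, which is why the crowded low end of the y-axis gets more room while large counts are compressed.

```r
# Made-up friend counts that grow quadratically
friend_counts <- c(0, 100, 400, 900, 1600)

# After the square-root transform they are evenly spaced
transformed <- sqrt(friend_counts)
print(transformed)
# 0 10 20 30 40
```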
4. Conditional means: group the data by a field, then compute summary statistics for each group and store them in a new data frame
# 1. Import the dplyr package
# 2. Use group_by to group by the age field
# 3. Use summarise to compute the mean, the median, and the group size
# 4. Use arrange to sort by age
library(dplyr)
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)
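To make the grouping step concrete, here is a minimal sketch of the same pipeline on a tiny hand-made data frame. The name toy and its values are hypothetical and stand in for pf. With only five rows the result can be checked by hand.

```r
library(dplyr)

# Hypothetical data: two users aged 13, three users aged 14
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 0, 5, 100)
)

# One row per age, with the mean, the median, and the group size
toy_by_age <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean   = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

print(toy_by_age)
# age 13: mean 20, median 20, n 2
# age 14: mean 35, median  5, n 3
```

For age 14 the single user with 100 friends pulls the mean up to 35 while the median stays at 5, which is why both statistics are worth computing.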
5. Overlay the summary statistics on the original data. The plot shows a trend: the number of friends increases from age 13 to 26, and from 26 onward it slowly declines
# 1. Make a scatter plot of age and friends, limiting the visible range of x and y
# 2. Add a line for the mean
# 3. Add a line for the 0.9 quantile
# 4. Add a line for the 0.5 quantile (the median)
# 5. Add a line for the 0.1 quantile
# Note: the fun argument was named fun.y before ggplot2 3.3.0
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/10, position = position_jitter(height = 0), color = 'orange') +
  coord_cartesian(xlim = c(13, 90), ylim = c(0, 1000)) +
  geom_line(stat = 'summary', fun = mean) +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = .9), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = .5), color = 'green') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = .1), color = 'blue', linetype = 2)
Figure 2-4
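Each summary line in the plot is computed separately for every age: the summary stat applies the given function to all friend_count values that share an x value. As a sketch of what the quantile lines represent, here is the same calculation in base R for a single hypothetical age group. The eleven values are made up for illustration.

```r
# Hypothetical friend counts for the users of one age
one_age_group <- 0:10

# The three quantiles drawn as lines in the plot
q <- quantile(one_age_group, probs = c(0.1, 0.5, 0.9))
print(q)
# The 10%, 50% and 90% quantiles are 1, 5 and 9
```

Reading the plot the same way, 90% of the users of a given age have a friend count at or below the upper dashed line, and 10% are at or below the lower one.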
6. Calculating correlations
# Use the cor.test function to compute the correlation; it can also be applied to a subset of the data
# Pearson's r measures the strength of the linear relationship between two variables; the closer its absolute value is to 1, the stronger the correlation
with(subset(pf, age <= 70), cor.test(age, friend_count, method = 'pearson'))
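As a small worked example of the same call pattern, the sketch below runs cor.test on five made-up points, first on all rows and then on a subset. The data frame toy and its values are hypothetical, chosen so that the coefficients are easy to verify by hand.

```r
# Hypothetical data with a positive but imperfect linear relationship
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 1, 4, 3, 5))

# Pearson correlation on all rows
full <- with(toy, cor.test(x, y, method = 'pearson'))
print(unname(full$estimate))
# 0.8

# Pearson correlation on a subset, mirroring the age <= 70 filter
part <- with(subset(toy, x <= 4), cor.test(x, y, method = 'pearson'))
print(unname(part$estimate))
# 0.6
```

The coefficient changes when rows are removed, so restricting the analysis to a subset is a modeling decision and should be reported alongside the result.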
7. Strongly correlated variables. A scatter plot of www_likes_received against likes_received shows how closely the two variables are related; here the correlation between them is very strong
# Use quantile to exclude some extreme values
# Apply the limits through xlim and ylim
# Also add a fitted line to show the overall trend
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = 'lm', color = 'red')
Figure 2-5
8. Make three line charts of the relationship between age and number of friends: the mean friend count by age measured in months, the mean friend count by age in years, and the mean friend count by age rounded to the nearest five years
In the resulting figure, p1 shows the most detail, p2 shows the mean number of friends for each age, and p3 shows the general trend of age and number of friends
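library(gridExtra)
# Age in fractional years, derived from the birth month
pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12
pf.fc_by_age_months <- pf %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months)
# p1: mean friend count by age in months, with a smoother
p1 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
             data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()
# p2: mean friend count by age in years, with a smoother
p2 <- ggplot(aes(x = age, y = friend_count_mean),
             data = subset(pf.fc_by_age, age < 71)) +
  geom_line() +
  geom_smooth()
# p3: mean friend count by age rounded to the nearest five years
# Note: the fun argument was named fun.y before ggplot2 3.3.0
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count),
             data = subset(pf, age < 71)) +
  geom_line(stat = 'summary', fun = mean)
grid.arrange(p1, p2, p3, ncol = 1)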
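The two x-axis transformations used above are simple arithmetic, sketched here on made-up values (the ages and birth months are hypothetical). The first turns an age in years into a fractional age using the birth month; the second rounds an age to the nearest multiple of five, which is what makes p3 so much coarser than p1.

```r
# Hypothetical users: same age in years, different birth months
age       <- c(36, 36, 38)
dob_month <- c(3, 12, 6)

# Fractional age: a March birthday adds 9/12 of a year, a December one adds 0
age_with_months <- age + (12 - dob_month) / 12
print(age_with_months)
# 36.75 36.00 38.50

# Five-year bins: 36 rounds down to 35, 38 rounds up to 40
age_bin <- round(age / 5) * 5
print(age_bin)
# 35 35 40
```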
Exercises:
R language-Explore two variables