R Language: Exploring Two Variables

Source: Internet
Author: User
Tags: square root, ggplot

Objective:

Learn the workflow for two-variable analysis by exploring the data in the file PSEUDO_FACEBOOK.TSV.

Knowledge Points:

1. ggplot syntax

2. How to make a scatter chart

3. How to optimize scatter plots

4. Conditional means

5. Correlation of variables

6. Subplots (several charts in one figure)

7. Smoothing

Brief introduction:

When exploring a single variable, a histogram shows how the values are distributed across the whole data set. When exploring two variables, a scatter plot is the more appropriate way to show the relationship between them.

Case Analysis:

1. Make a scatter plot based on age and number of friends

# Load the ggplot2 plotting package
library(ggplot2)
setwd('d:/udacity/Data Analysis Advanced/R')
# Load the data file (tab-separated)
pf <- read.csv('PSEUDO_FACEBOOK.TSV', sep = '\t')
# Make a scatter plot with the qplot syntax (qplot is deprecated in recent ggplot2 releases)
qplot(x = age, y = friend_count, data = pf)
# Make the same scatter plot with the ggplot syntax, which is clearer
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()

Figure 2-1

2. Overplotting: most of the points in Figure 2-1 overlap, which makes the relationship between age and number of friends hard to see, so use alpha and geom_jitter to adjust the plot

# geom_jitter spreads out coincident points
# alpha = 1/20 means it takes 20 overlapping points to produce one fully opaque point
# xlim(13, 90) restricts the x-axis to the range 13 to 90
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)

Figure 2-2
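
To see why jittering helps, note that age is recorded in whole years, so many points share exactly the same x value. The following is a minimal base-R sketch on invented ages (not the Facebook data). It uses jitter() from base R as a stand-in for what geom_jitter does to the x positions; the noise amount of 0.4 is an assumption chosen for illustration.

```r
# Toy ages in whole years; these values are invented for illustration.
age <- c(13, 13, 13, 14, 14, 15, 15, 15, 15, 16)

set.seed(1)  # make the random noise reproducible
# Add uniform noise in [-0.4, 0.4] so coincident values no longer overlap.
age_jittered <- jitter(age, amount = 0.4)

# Before jittering, the ten points occupy only 4 distinct x positions.
n_distinct_before <- length(unique(age))
# After jittering, each point gets its own x position.
n_distinct_after <- length(unique(age_jittered))

print(n_distinct_before)
print(n_distinct_after)
```

The jittered values stay within 0.4 of the true age, so each point remains visually closest to the age it belongs to.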

3. The coord_trans function applies a transformation to an axis, which can make the pattern in the data easier to see.

# Take the square root of the y-axis (number of friends) so the pattern is easier to see
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

Figure 2-3
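
The effect of a square-root axis can be checked numerically. This small sketch uses invented friend counts (perfect squares, for readability) to show that equal steps on the square-root scale correspond to increasingly large steps in the raw counts, which is why the crowded low values get more room on the transformed axis.

```r
# Invented friend counts, chosen as perfect squares for readability.
friend_count <- c(0, 100, 400, 900, 1600, 2500)

# Positions of these values along a square-root axis.
sqrt_position <- sqrt(friend_count)

# Gaps between neighbouring values on the raw scale and on the sqrt scale.
raw_gaps  <- diff(friend_count)
sqrt_gaps <- diff(sqrt_position)

print(data.frame(friend_count, sqrt_position))
print(raw_gaps)   # the raw gaps keep growing
print(sqrt_gaps)  # the gaps on the sqrt axis are all equal
```

In ggplot2, coord_trans applies the transformation after any statistics have been computed, whereas a scale such as scale_y_sqrt transforms the data first.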

4. Conditional means: group the data by a field, then compute summary statistics for each group and store them in a new data frame

# 1. Load the dplyr package
# 2. Use group_by to group by the age field
# 3. Use summarise to compute the mean, the median, and the group size
# 4. Use arrange to sort by age
library('dplyr')
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)
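
As a cross-check on what the grouped summary produces, here is a minimal base-R sketch of the same idea using tapply instead of dplyr. The data frame toy and its values are invented for illustration; only the column names follow the data set used in this article.

```r
# Invented mini data set: two users aged 13, three aged 14, one aged 15.
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14, 15),
  friend_count = c(10, 30, 5, 15, 40, 100)
)

# Mean, median, and group size of friend_count for each age.
# tapply returns the groups in ascending order of age.
by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)

print(by_age)
```

Each row of by_age describes one age group, which is the shape needed to draw one summary point per age.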

5. Overlay the summary statistics on the original data. The graph shows a trend: the number of friends increases from age 13 to 26, and from 26 onward it slowly declines

# 1. Make a scatter plot of age and friend count, limiting the x and y ranges
# 2. Add a line for the mean
# 3. Add a line for the 0.9 quantile
# 4. Add a line for the 0.5 quantile (the median)
# 5. Add a line for the 0.1 quantile
# Note: fun replaces fun.y, which is deprecated since ggplot2 3.3.0
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/10, position = position_jitter(height = 0), color = 'orange') +
  coord_cartesian(xlim = c(13, 90), ylim = c(0, 1000)) +
  geom_line(stat = 'summary', fun = mean) +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = .9), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = .5), color = 'green') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = .1), color = 'blue', linetype = 2)

Figure 2-4
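
This step limits the axes with coord_cartesian rather than xlim and ylim. The difference matters for summary lines: coord_cartesian only zooms the view, while axis limits set through xlim or ylim discard out-of-range observations before any statistic is computed. The base-R sketch below imitates both behaviours on invented friend counts for a single age.

```r
# Invented friend counts for a single age; one user has an extreme value.
friend_count <- c(50, 100, 150, 200, 4500)

# What a summary line sees when the view is only zoomed (coord_cartesian):
# every observation still contributes to the mean.
mean_all <- mean(friend_count)

# What it sees when values above 1000 are dropped first,
# as a y-axis limit of 0 to 1000 would do.
mean_trimmed <- mean(friend_count[friend_count <= 1000])

print(mean_all)      # 1000
print(mean_trimmed)  # 125
```

With zooming, the mean line reflects the full data even where individual points fall outside the visible range.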

6. Calculating correlations

# Use the cor.test function for the calculation; it can also be applied to a subset of the data
# Pearson's r measures the strength of the linear relationship between two variables; the closer its absolute value is to 1, the stronger the correlation
with(subset(pf, age <= 70), cor.test(age, friend_count, method = 'pearson'))
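
To make the meaning of the coefficient concrete, the sketch below computes Pearson's r by hand on a small invented sample and compares it with the values returned by cor and cor.test. The vectors x and y are made up for illustration.

```r
# Invented paired measurements.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r: the sum of products of deviations from the means,
# divided by the square root of the product of the sums of squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same value from the built-in functions.
r_cor  <- cor(x, y, method = "pearson")
r_test <- unname(cor.test(x, y, method = "pearson")$estimate)

print(r_manual)  # 0.8
print(r_cor)
print(r_test)
```

In addition to the estimate, cor.test reports a confidence interval and a p-value.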

7. Strongly correlated variables: a scatter plot of www_likes_received against likes_received lets us judge how closely the two variables are related, and the plot shows that they are very strongly correlated

# Use quantile to exclude some extreme values
# Filter via xlim and ylim
# Also add a fitted straight line to see the overall trend
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = 'lm', color = 'red')

Figure 2-5
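
The quantile-based limits can be illustrated without a plot. This sketch uses invented data with one extreme point, computes the 95th-percentile cut-offs, keeps only the points inside both limits, and fits a straight line to what remains. The vectors x and y and the exact relationship between them are assumptions made for illustration.

```r
# Invented data: y is exactly twice x, with one extreme point at the end.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
y <- 2 * x

# 95th-percentile cut-offs, the same quantity used for the axis limits.
x_max <- quantile(x, 0.95)
y_max <- quantile(y, 0.95)

# Keep only the points inside both limits.
keep <- x <= x_max & y <= y_max

# Least-squares line through the remaining points.
fit <- lm(y[keep] ~ x[keep])

print(sum(keep))  # 9 of the 10 points remain
print(coef(fit))
```

Only the extreme point is excluded here, so the fitted line describes the bulk of the data.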

8. Make three line charts of the relationship between age and number of friends: the mean friend count by age measured in months, the mean friend count by age in years, and the mean friend count by age in five-year bins

In the resulting figure, p1 shows the most detail, p2 shows the mean number of friends for each age in years, and p3 shows the general trend between age and number of friends

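# Load gridExtra, which provides grid.arrange
library(gridExtra)
# Age with a monthly fraction derived from the month of birth
pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12
pf.fc_by_age_months <- pf %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months)
# p1: mean friend count by age in months
p1 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
             data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()
# p2: mean friend count by age in years
p2 <- ggplot(aes(x = age, y = friend_count_mean),
             data = subset(pf.fc_by_age, age < 71)) +
  geom_line() +
  geom_smooth()
# p3: mean friend count by age rounded to the nearest multiple of 5
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count),
             data = subset(pf, age < 71)) +
  geom_line(stat = 'summary', fun = mean)
# Stack the three charts in one column
grid.arrange(p1, p2, p3, ncol = 1)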
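
The two age transformations used in this step can be worked through on a few invented users. The sketch below applies the monthly-fraction formula and the five-year binning expression from the code for this step; the ages and birth months are made up.

```r
# Invented users: age in whole years and month of birth (1 = January).
age       <- c(20, 20, 33, 36, 38)
dob_month <- c(12, 6, 3, 1, 9)

# Age with a monthly fraction: a December birthday adds 0,
# a January birthday adds 11/12 of a year.
age_with_months <- age + (12 - dob_month) / 12

# Five-year bins: each age is rounded to the nearest multiple of 5.
age_bin <- round(age / 5) * 5

print(data.frame(age, dob_month, age_with_months, age_bin))
```

Finer age steps give a noisier but more detailed line, while coarser bins average over more users and give a smoother one.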

Exercises:
