Once data quality has been assured, the distribution and contribution of the data can be analyzed by drawing charts and computing summary statistics. Distribution analysis reveals the distribution characteristics and distribution type of the data. For quantitative data, build a frequency distribution table and plot a frequency distribution histogram. For qualitative data, a pie chart or a bar chart shows the distribution. Pareto analysis builds on the frequency distribution histogram: it adds the cumulative frequency, so that the benefit produced by each share of the input can be read off.
The following example uses the Arthritis data set in the vcd package for data distribution analysis and Pareto analysis.
library(grid)
library(vcd)
head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  37   Marked
6 23   Treated Male  41   Marked
I. Distribution analysis of quantitative data
For quantitative data, build a frequency distribution table and plot a frequency distribution histogram. Choosing the number of groups and the group width is the main difficulty in frequency distribution analysis. The analysis generally follows these 5 steps:
- Compute the range: range = maximum - minimum
- Determine the group width and the number of groups: the group width is the length of each interval, and number of groups = range / group width
- Determine the group limits: the group limits are the endpoints of each interval, so this step fixes the start and end point of each group
- Build the frequency distribution table
- Plot the frequency distribution histogram
The main principles to follow when grouping are:
- The groups are mutually exclusive, so each value falls into exactly one group.
- The group width is the same for every group.
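Steps 1 to 4 above can be sketched in base R. This is a minimal illustration, not the article's own script: the `ages` vector is hypothetical data, the group width of 10 is an assumed choice, and starting the first group at the minimum value is only one of several reasonable ways to place the group limits.

```r
# Hypothetical ages, used only for illustration (not the Arthritis data)
ages <- c(23, 27, 31, 35, 38, 42, 44, 47, 51, 53, 58, 62, 66, 69, 74)

# Step 1: range = maximum - minimum
age_range <- max(ages) - min(ages)

# Step 2: choose a group width (assumed: 10) and derive the number of groups,
# rounding up so that the groups cover the whole range
group_width <- 10
n_groups <- ceiling(age_range / group_width)

# Step 3: group limits, starting at the minimum and stepping by the group width
breaks <- min(ages) + group_width * (0:n_groups)

# Step 4: frequency distribution table; intervals are closed on the left,
# and include.lowest = TRUE also closes the last interval on the right
groups <- cut(ages, breaks = breaks, right = FALSE, include.lowest = TRUE)
freq_table <- as.data.frame(table(AgeGroup = groups))
freq_table$FreqRate <- freq_table$Freq / sum(freq_table$Freq)
print(freq_table)
```

Rounding the number of groups up guarantees that the last group limit is at or above the maximum, so no observation is dropped by `cut()`.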
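Both grouping principles can be checked on a set of group limits. The sketch below uses hypothetical limits with a width of 10 and assumes half-open intervals of the form [a, b), which is one common way to make sure a boundary value such as 30 belongs to a single group.

```r
# Hypothetical group limits with a width of 10
breaks <- c(20, 30, 40, 50, 60, 70, 80)

# Equal width: all consecutive differences between limits are identical
widths <- diff(breaks)
equal_width <- length(unique(widths)) == 1

# Mutual exclusivity: with right = FALSE each interval is [a, b),
# so a value on a boundary is assigned to exactly one group
boundary_values <- c(29, 30, 31)
assigned <- cut(boundary_values, breaks = breaks, right = FALSE)
print(data.frame(value = boundary_values, group = assigned))
print(equal_width)
```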
(1) Building the frequency distribution table
The frequency is calculated by age, with one age group per 10 years, counting the number of people in each group. Since the Arthritis data set has no such categorical variable, custom intervals are needed, and the frequency distribution table is built on those intervals. The process of building a frequency distribution table is described in detail in the article "R in Practice, Part 9: Contingency Tables and Frequency Tables" and is not repeated here.
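library(grid)
library(vcd)

# Age groups of width 10; the last break only needs to exceed the maximum age
labels <- c("<30", "30-40", "40-50", "50-60", "60-70", ">=70")
breaks <- c(1, 30, 40, 50, 60, 70, 100)

# right = TRUE makes each interval closed on the right, e.g. (30, 40]
mytable <- cut(Arthritis$Age, breaks = breaks, labels = labels, right = TRUE)

# Frequency, cumulative frequency, frequency rate and cumulative frequency rate
df <- as.data.frame(table(Age = mytable))
df <- transform(df, cumfreq = cumsum(Freq), freqrate = prop.table(Freq))
df <- transform(df, cumfreqrate = cumsum(freqrate))

# Express the rates as percentages rounded to 2 decimal places
df <- transform(df, freqrate = round(freqrate * 100, 2), cumfreqrate = round(cumfreqrate * 100, 2))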
(2) Plotting the frequency distribution histogram
Draw the frequency distribution histogram with ggplot2:
library(ggplot2)

# One bar per age group, with the frequency rate as the bar height
ggplot(data = df, mapping = aes(x = factor(Age), y = freqrate, group = factor(Age))) +
  geom_bar(stat = "identity") +
  labs(title = "Age distribution", x = "Age range", y = "Freqrate") +
  theme_classic()
II. Distribution analysis of qualitative data
Qualitative variables are usually grouped by category, and the frequency or percentage of each group is then counted. Pie charts or bar charts can be used to describe the distribution of qualitative data:
- In a pie chart, each segment represents the percentage or frequency of one category. The pie is divided into segments according to the categories of the qualitative variable, and the size of each segment is proportional to the frequency of its category;
- In a bar chart, the height of each bar represents the percentage or frequency of one category, and the width of the bars carries no meaning.
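The counting step behind both chart types can be sketched in base R. The `improved` vector below is hypothetical data, and the data frame name `df_q` is made up for this illustration: `table()` gives the frequency of each category and `prop.table()` turns the frequencies into shares.

```r
# Hypothetical qualitative variable with three categories
improved <- factor(
  c("None", "Some", "Marked", "None", "None", "Marked", "Some", "None"),
  levels = c("None", "Some", "Marked")
)

# Frequency of each category
counts <- table(improved)

# Share of each category (the quantity a pie chart segment is proportional to)
shares <- prop.table(counts)

df_q <- data.frame(
  Improved = names(counts),
  Freq = as.vector(counts),
  Percent = round(100 * as.vector(shares), 1)
)
print(df_q)
```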
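Plot the pie chart and the bar chart from the frequency of the Improved variable:
# Frequency table of Improved, converted to a data frame with columns Improved and Freq
mytable <- table(Improved = Arthritis$Improved)
df <- as.data.frame(mytable)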
1. Draw a bar chart
Draw the bar chart with geom_bar():
ggplot(data = df, mapping = aes(x = Improved, y = Freq, fill = Improved)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
  labs(title = "Improved distribution", x = "Improved", y = "Freq") +
  # Print the frequency in the middle of each bar
  geom_text(stat = "identity", aes(y = Freq, label = Freq), size = 4, position = position_stack(vjust = 0.5)) +
  theme_classic()
2. Draw a pie chart
Use the geom_bar() and coord_polar() functions to draw a pie chart. Typically, a pie chart shows percentages, while a bar chart shows the actual value of each category:
# Theme without axes, grid lines or border
blank_theme <- theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    plot.title = element_text(size = 14, face = "bold")
  )

# A single stacked bar turned into a pie by switching to polar coordinates
ggplot(data = df, mapping = aes(x = "Improved", y = Freq, fill = Improved)) +
  geom_bar(stat = "identity", width = 0.5, position = "stack", size = 5) +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
  blank_theme +
  # Label each segment with its percentage
  geom_text(stat = "identity", aes(y = Freq, label = scales::percent(Freq / sum(Freq))), size = 4, position = position_stack(vjust = 0.5))
III. Pareto analysis
Pareto analysis is based on the 20/80 rule: 80% of the benefit often comes from 20% of the input, while the other 80% of the input produces only 20% of the benefit. This shows that the same investment yields different benefits depending on where it is made.
To draw a Pareto chart, arrange the categories by contribution from high to low and then draw the cumulative contribution curve. When the sample is large enough, the contribution usually shows a 20/80 distribution.
The script for a Pareto chart drawn with ggplot2 is as follows:
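The ordering and accumulation behind a Pareto chart can be sketched in base R before any plotting. The `contribution` values below are hypothetical, and the 80% threshold is only the conventional cut-off from the 20/80 rule.

```r
# Hypothetical contribution of five categories
contribution <- c(A = 5, B = 40, C = 10, D = 30, E = 15)

# Arrange the categories by contribution from high to low
sorted <- sort(contribution, decreasing = TRUE)

# Share of each category and cumulative share along the sorted order
share <- sorted / sum(sorted)
cum_share <- cumsum(share)

# First position at which the cumulative share reaches 80%
n_needed <- which(cum_share >= 0.8)[1]

pareto <- data.frame(
  category = names(sorted),
  share = as.vector(share),
  cum_share = as.vector(cum_share)
)
print(pareto)
print(n_needed)
```

In the chart, `share` corresponds to the bar heights and `cum_share` to the cumulative curve drawn on top of them.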
library(grid)
library(vcd)
library(ggplot2)
library(scales)

# Age groups of width 10; the last break only needs to exceed the maximum age
labels <- c("<30", "30-40", "40-50", "50-60", "60-70", ">=70")
breaks <- c(1, 30, 40, 50, 60, 70, 100)
mytable <- cut(Arthritis$Age, breaks = breaks, labels = labels, right = TRUE)

# Frequency rate per age group, sorted from the largest to the smallest frequency
df <- as.data.frame(table(Age = mytable), stringsAsFactors = FALSE)
df <- transform(df, freqrate = prop.table(Freq))
df <- df[order(df$Freq, decreasing = TRUE), ]
rownames(df) <- seq(nrow(df))

# Fix the factor levels so that ggplot2 keeps the sorted order on the x axis
df$Age <- factor(df$Age, levels = df$Age)

# Cumulative rate and its label; the first label is blank to avoid overlapping the bar label
df$cumrate <- cumsum(df$freqrate)
df$cumratelable <- as.character(percent(df$cumrate))
df$cumratelable[1] <- ""

ggplot(df, aes(x = Age, y = freqrate, fill = Age)) +
  geom_bar(stat = "identity", width = 0.7) +
  geom_text(stat = "identity", aes(label = percent(freqrate)), vjust = -0.5, color = "black", size = 3) +
  scale_y_continuous(name = "cum freq rate", limits = c(0, 1.1), labels = function(x) paste0(x * 100, "%")) +
  geom_point(aes(y = cumrate), show.legend = FALSE) +
  geom_text(stat = "identity", aes(label = cumratelable, y = cumrate), vjust = -0.5, size = 3) +
  geom_path(aes(y = cumrate, group = 1))