Data Analysis, Part 3: Data feature analysis (distribution analysis + Pareto analysis)

Source: Internet
Author: User
Tags: ggplot

Once data quality is assured, the distribution and contribution of the data can be analyzed by drawing charts and computing statistics. Distribution analysis reveals the distribution characteristics and distribution type of the data. For quantitative data, build a frequency distribution table and plot a frequency distribution histogram. For qualitative data, use a pie chart or a bar chart to show the distribution. Pareto analysis builds on the frequency distribution histogram: it plots the cumulative frequency to show how much of the benefit each input contributes.

The following example uses the Arthritis data set in the vcd package for distribution analysis and Pareto analysis.

    library(grid)
    library(vcd)
    head(Arthritis)
      ID Treatment  Sex Age Improved
    1 57   Treated Male  27     Some
    2 46   Treated Male  29     None
    3 77   Treated Male  30     None
    4 17   Treated Male  32   Marked
    5 36   Treated Male  37   Marked
    6 23   Treated Male  41   Marked

One, Distribution analysis of quantitative data

For quantitative data, build a frequency distribution table and plot a frequency distribution histogram. Choosing the number of groups and the group width is the main problem in frequency distribution analysis. The work generally follows these five steps:

    • Find the range: range = maximum - minimum
    • Determine the group width and the number of groups: the group width is the length of each interval, and number of groups = range / group width
    • Determine the group limits: the group limits are the endpoints of each interval, so this step fixes the starting and ending points of each group
    • Build the frequency distribution table
    • Plot the frequency distribution histogram
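
The five steps above can be sketched in base R. This is a minimal illustration, not the article's own script: the `ages` vector is hypothetical data, the group width of 10 is an assumed choice, and the names `age_range`, `group_width`, `n_groups` and `freq_table` are introduced here only for the example.

```r
# Hypothetical ages, used only for illustration (not the Arthritis data).
ages <- c(21, 27, 31, 35, 38, 42, 44, 47, 51, 53, 58, 62, 66, 69)

# Step 1: range = maximum - minimum.
age_range <- max(ages) - min(ages)

# Step 2: choose a group width (assumed here), then derive the group count.
group_width <- 10
n_groups <- ceiling(age_range / group_width)

# Step 3: group limits, starting at a round number at or below the minimum.
lower <- floor(min(ages) / group_width) * group_width
breaks <- seq(lower, by = group_width, length.out = n_groups + 1)
# Aligning to round numbers can leave the maximum uncovered; check it.
stopifnot(max(ages) < max(breaks))

# Step 4: frequency distribution table with left-closed groups [a, b).
groups <- cut(ages, breaks = breaks, right = FALSE)
freq_table <- as.data.frame(table(group = groups))
freq_table$rate <- freq_table$Freq / sum(freq_table$Freq)
freq_table$cum_rate <- cumsum(freq_table$rate)
print(freq_table)

# Step 5: histogram on the same limits (plot = FALSE only computes the
# counts; drop it to draw the chart).
h <- hist(ages, breaks = breaks, right = FALSE, plot = FALSE)
```

Reusing the same `breaks` for `cut()` and `hist()` keeps the table and the chart in agreement.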

The main principles to follow when grouping are:

    • The groups are mutually exclusive.
    • The group width is the same for every group.

(1) Building the frequency distribution table

The frequency is calculated by age: every 10 years forms one age group, and the number of people in each age group is counted. Since the Arthritis data set has no such categorical variable, custom intervals are needed, and the frequency distribution table is built on those intervals. The process of building a frequency distribution table is described in detail in the article "R in Practice, Part 9: List and Frequency Table" and is not repeated here.
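
Mutual exclusivity depends on which side of each interval is closed. The sketch below shows how the `right` argument of base R's `cut()` decides where a value lying exactly on a group limit goes; the values and limits are hypothetical and chosen only to sit on the boundaries.

```r
# Hypothetical values that sit exactly on the group limits.
x <- c(30, 40, 50)
breaks <- c(20, 30, 40, 50, 60)

# right = TRUE (the default): intervals are (a, b], so 30 falls in (20,30].
right_closed <- cut(x, breaks = breaks, right = TRUE)

# right = FALSE: intervals are [a, b), so 30 falls in [30,40).
left_closed <- cut(x, breaks = breaks, right = FALSE)

print(data.frame(x, right_closed, left_closed))

# Equal group width: all differences between consecutive limits match.
equal_width <- length(unique(diff(breaks))) == 1
```

Either convention keeps the groups mutually exclusive, as long as the labels describe the side that is actually closed.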

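    library(grid)
    library(vcd)

    # Age groups of width 10; the first and last labels follow the break pattern.
    labels <- c("<30", "30-40", "40-50", "50-60", "60-70", ">=70")
    # The upper limit was lost in the source; 100 is an assumed value.
    breaks <- c(1, 30, 40, 50, 60, 70, 100)
    # right = TRUE makes each interval closed on the right, e.g. (30, 40].
    mytable <- cut(Arthritis$Age, breaks = breaks, labels = labels, right = TRUE)

    # Frequency, cumulative frequency, and rates as percentages.
    df <- as.data.frame(table(Age = mytable))
    df <- transform(df, cumFreq = cumsum(Freq), FreqRate = prop.table(Freq))
    df <- transform(df, cumFreqRate = cumsum(FreqRate))
    df <- transform(df, FreqRate = round(FreqRate * 100, 2), cumFreqRate = round(cumFreqRate * 100, 2))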

(2) Plotting the frequency distribution histogram

Draw the frequency distribution histogram with ggplot2:

    library(ggplot2)
    ggplot(data = df, mapping = aes(x = factor(Age), y = FreqRate, group = factor(Age))) +
      geom_bar(stat = "identity") +
      labs(title = "Age Distribution", x = "Age Range", y = "Freq Rate") +
      theme_classic()

Two, Distribution analysis of qualitative data

Qualitative variables are usually grouped by category, and the frequency or percentage of each group is then counted. Pie charts or bar charts can describe the distribution of qualitative data:

    • Each pie slice represents the percentage or frequency of one type; the pie is divided according to the types of the qualitative variable, and the size of each slice is proportional to the frequency of its type;
    • The height of each bar represents the percentage or frequency of one type; the width of the bars has no meaning.

Plot the bar and pie charts from the frequencies of the Improved variable:
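
As a small sketch of the counting step behind both chart types, base R's `table()` gives the frequency of each category and `prop.table()` turns the counts into proportions. The `outcome` vector below is hypothetical data, and `summary_df` is a name introduced only for this example.

```r
# Hypothetical categorical responses, for illustration only.
outcome <- factor(
  c("None", "Some", "Marked", "None", "None", "Marked", "Some", "None"),
  levels = c("None", "Some", "Marked")
)

freq <- table(outcome)       # counts per category (bar heights)
share <- prop.table(freq)    # proportions per category (pie slice sizes)

summary_df <- data.frame(
  outcome = names(freq),
  freq = as.vector(freq),
  percent = round(100 * as.vector(share), 1)
)
print(summary_df)
```

The `freq` column is what a bar chart would plot, and the `percent` column is what a pie chart would label.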

    df <- as.data.frame(table(Improved = Arthritis$Improved))

1. Draw a bar chart

Draw a bar chart with geom_bar():

    ggplot(data = df, mapping = aes(x = Improved, y = Freq, fill = Improved)) +
      geom_bar(stat = "identity") +
      scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
      labs(title = "Improved distribution", x = "Improved", y = "Freq") +
      geom_text(stat = "identity", aes(y = Freq, label = Freq), size = 4, position = position_stack(vjust = 0.5)) +
      theme_classic()

2. Draw a pie chart

Use the geom_bar() and coord_polar() functions to draw a pie chart. Typically, a pie chart shows percentages, while a bar chart shows the actual value of each category:

    # Theme that hides axes, grid and border so only the pie remains.
    blank_theme <- theme_minimal() +
      theme(
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        # The title size was lost in the source; 14 is an assumed value.
        plot.title = element_text(size = 14, face = "bold")
      )

    # A single stacked bar turned into a pie by the polar coordinates.
    ggplot(data = df, mapping = aes(x = "Improved", y = Freq, fill = Improved)) +
      geom_bar(stat = "identity", width = 0.5, position = "stack") +
      coord_polar("y", start = 0) +
      scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
      blank_theme +
      geom_text(stat = "identity", aes(y = Freq, label = scales::percent(Freq / sum(Freq))), size = 4, position = position_stack(vjust = 0.5))

Three, Pareto analysis

Pareto analysis is based on the 20/80 rule: 80% of the benefits often come from 20% of the inputs, while the other 80% of the inputs produce only 20% of the benefits. The same investment made in different places therefore yields different benefits.

To draw a Pareto chart, arrange the items by contribution from high to low and draw the cumulative contribution curve. When the sample is large enough, the contributions usually follow a 20/80 distribution.

The script for drawing the Pareto chart with ggplot2 is as follows:

    library(grid)
    library(vcd)
    library(ggplot2)
    library(scales)

    labels <- c("<30", "30-40", "40-50", "50-60", "60-70", ">=70")
    # The upper limit was lost in the source; 100 is an assumed value.
    breaks <- c(1, 30, 40, 50, 60, 70, 100)
    mytable <- cut(Arthritis$Age, breaks = breaks, labels = labels, right = TRUE)

    df <- as.data.frame(table(Age = mytable), stringsAsFactors = FALSE)
    df <- transform(df, FreqRate = prop.table(Freq))

    # Sort by frequency, highest first, and fix that order on the x axis.
    df <- df[order(df$Freq, decreasing = TRUE), ]
    rownames(df) <- seq(nrow(df))
    df$Age <- factor(df$Age, levels = df$Age)

    # Cumulative rate and its labels; the first label is left blank.
    df$cumRate <- cumsum(df$FreqRate)
    df$cumRateLabel <- as.character(percent(df$cumRate))
    df$cumRateLabel[1] <- ""

    ggplot(df, aes(x = Age, y = FreqRate, fill = Age)) +
      geom_bar(stat = "identity", width = 0.7) +
      geom_text(stat = "identity", aes(label = percent(FreqRate)), vjust = -0.5, color = "black", size = 3) +
      scale_y_continuous(name = "cum Freq rate", limits = c(0, 1.1), labels = function(x) paste0(x * 100, "%")) +
      geom_point(aes(y = cumRate), show.legend = FALSE) +
      geom_text(stat = "identity", aes(label = cumRateLabel, y = cumRate), vjust = -0.5, size = 3) +
      geom_path(aes(y = cumRate, group = 1))
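
The ordering and accumulation described above can be sketched in base R before any chart is drawn. The `contribution` values are hypothetical, and the 80% threshold is taken from the 20/80 rule; names such as `pareto`, `n_top` and `top_items` are introduced only for this illustration.

```r
# Hypothetical contributions per item, for illustration only.
contribution <- c(A = 5, B = 40, C = 10, D = 30, E = 3, F = 12)

# Arrange by contribution from high to low.
sorted <- sort(contribution, decreasing = TRUE)

# Share of the total and cumulative share for each item.
share <- sorted / sum(sorted)
cum_share <- cumsum(share)

pareto <- data.frame(
  item = names(sorted),
  value = as.vector(sorted),
  share = round(as.vector(share), 3),
  cum_share = round(as.vector(cum_share), 3)
)
print(pareto)

# Smallest set of top items whose cumulative share reaches 80%.
n_top <- unname(which(cum_share >= 0.8)[1])
top_items <- names(sorted)[seq_len(n_top)]
print(top_items)
```

In this made-up example, half of the items account for just over 80% of the total, so real data should not be expected to split exactly 20/80.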
