"Data Analysis R Language Practice" study notes the descriptive analysis of the data in the fifth chapter (Part I)


5.1 Built-in distributions in R

The distribution is the core and most important way to describe sample data. R includes many commonly used statistical distributions and provides four types of functions for each: the probability density function (density), the cumulative distribution function (probability), the quantile function (quantile), and pseudo-random number generation (random). In R these four are denoted by the prefixes d, p, q, and r, followed by the English name or abbreviation of the distribution; for the normal distribution, for example, they are dnorm(), pnorm(), qnorm(), and rnorm().
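The prefix naming scheme can be sketched as follows, taking the normal distribution (suffix norm) as the example. The evaluation points and the seed are arbitrary choices made only for illustration.

```r
# The four prefixes applied to the normal distribution ("norm")
dens <- dnorm(0)          # density of N(0, 1) at x = 0
prob <- pnorm(1.96)       # P(X <= 1.96) for X ~ N(0, 1)
quan <- qnorm(0.975)      # quantile: the x with P(X <= x) = 0.975
set.seed(1)               # arbitrary seed, only for reproducibility
draws <- rnorm(5, mean = 10, sd = 2)  # five pseudo-random draws from N(10, 2^2)

# The p and q functions are inverses of each other
print(pnorm(qnorm(0.975)))
```

The same pattern applies to other distributions, for example dbinom(), pbinom(), qbinom(), and rbinom() for the binomial distribution.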

5.2 Analysis of central tendency

5.2.1 Measures of central tendency

The main indicators that describe the central tendency of a statistical distribution are the mean, the median, and the mode, also known as average indicators. The main functions of these indicators are the following:

They reflect the central tendency and general level of a variable's distribution across the units of the population;

They make it easy to compare the levels of similar phenomena across different units;

They make it convenient to compare the development trends or patterns of similar phenomena across different periods;

They can be used to analyze dependency relationships between phenomena.

5.2.2 R Language Implementation

The function summary() computes the five-number summary and the mean of a data set.

> summary(cars$speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    4.0    12.0    15.0    15.4    19.0    25.0
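summary() does not report the mode. A minimal sketch of the three measures of central tendency follows; base R has no function for the statistical mode, so stat_mode below is a hypothetical helper written only for illustration.

```r
# stat_mode is a hypothetical helper (not part of base R):
# it returns the most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

x <- cars$speed
print(mean(x))       # arithmetic mean
print(median(x))     # middle value of the sorted data
print(stat_mode(x))  # most frequent value(s); ties return several values
```

Returning every tied value, rather than only the first, is a design choice here: a sample can have more than one mode.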

5.3 Analysis of dispersion

5.3.1 Measures of dispersion

The degree of dispersion of the data is mainly measured by statistics such as the range, the interquartile range, the mean deviation, the variance, and the standard deviation. In practical analysis, the analysis of dispersion mainly serves the following purposes:

To measure how representative the average indicators are;

To reflect how balanced social and economic activities are;

To study how the distribution of values in the population deviates from normality;

To serve as a basic indicator for statistical analysis such as sampling inference.

5.3.2 R Language Implementation

The range can be computed with the function range(), which returns the minimum and the maximum; subtracting the first from the second gives the range:

> m = range(cars$speed)
> m[2] - m[1]
[1] 21

The interquartile range can also be computed by hand; a convenient way is to use the function fivenum() directly and subtract the second value from the fourth:

> q = fivenum(cars$speed)
> q[4] - q[2]
[1] 7

The variance and standard deviation functions in R are var() and sd(). R also has a special measure of dispersion, mad(), which computes the median absolute deviation, scaled by default for asymptotically normal consistency.
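A short sketch of these dispersion measures on the same data. IQR() is the base R function for the interquartile range; the constant = 1 call shows the unscaled median absolute deviation for comparison.

```r
x <- cars$speed

v <- var(x)        # sample variance (divides by n - 1)
s <- sd(x)         # sample standard deviation, sqrt(var(x))
iqr <- IQR(x)      # interquartile range: 3rd quartile minus 1st quartile

# mad() multiplies the raw median absolute deviation by 1.4826 by default,
# so that it estimates the standard deviation for normally distributed data
m_scaled <- mad(x)
m_raw <- mad(x, constant = 1)   # plain median of |x - median(x)|

print(c(var = v, sd = s, IQR = iqr, mad = m_scaled, mad_raw = m_raw))
```

Note that fivenum() returns hinges while IQR() is based on quantile(), so the two approaches can differ slightly on some samples.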

5.4 Analysis of data distribution

5.4.1 Measures of distribution shape

(1) Skewness

(2) Kurtosis

5.4.2 R Language Implementation

The package timeDate (also available by loading the fBasics package) provides the functions skewness() and kurtosis(), which compute the skewness and kurtosis coefficients directly.

> library(timeDate)
> skewness(cars$speed)
[1] -0.1105533
attr(,"method")
[1] "moment"
> kurtosis(cars$speed)
[1] -0.6730924
attr(,"method")
[1] "excess"
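For reference, one common moment-based definition of these two coefficients can be sketched in base R. The helper names central_moment, moment_skewness, and excess_kurtosis are hypothetical, and the sketch uses population moments (dividing by n); packages differ in their normalization, so the values need not match a given package's output exactly.

```r
# Hypothetical helpers using population (divide-by-n) central moments
central_moment <- function(x, k) mean((x - mean(x))^k)

moment_skewness <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

excess_kurtosis <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3   # 0 for a normal distribution
}

print(moment_skewness(cars$speed))
print(excess_kurtosis(cars$speed))
```

A negative skewness indicates a longer left tail, and a negative excess kurtosis indicates lighter tails than the normal distribution.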

5.5 Graphical analysis and R implementation

5.5.1 Histogram and density plot

> hist(cars$speed, breaks=50, prob=TRUE)  # breaks suggests the number of bins (and hence the bin width); prob=TRUE draws a density histogram
> lines(density(cars$speed), col='blue')  # overlay the kernel density estimate computed by density()

5.5.2 Q-Q plot

A Q-Q plot is used to check visually whether a set of data comes from a given distribution, or whether two sets of data come from the same distribution family. In teaching and in software, the Q-Q scatter plot is commonly used to check whether data come from a normal distribution. In that case the Q-Q plot is the normal quantile-quantile plot: the horizontal axis shows the theoretical quantiles and the vertical axis shows the sample values. If the sample data approximately follow a normal distribution, the points of the Q-Q plot should lie evenly around the line y = σx + μ, whose slope is the standard deviation σ of the normal distribution and whose intercept is the mean μ.

> qqnorm(cars$speed)
> qqline(cars$speed)
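How the points of a normal Q-Q plot are obtained can be sketched without drawing anything. This is an illustration under the assumption that the plotting positions come from ppoints(), which is what qqnorm() uses; the variable names are chosen here for illustration.

```r
x <- cars$speed
n <- length(x)

theoretical <- qnorm(ppoints(n))  # standard normal quantiles at n plotting positions
observed <- sort(x)               # ordered sample values

# For roughly normal data, observed is close to mean(x) + sd(x) * theoretical
reference <- mean(x) + sd(x) * theoretical
print(cor(theoretical, observed))  # close to 1 when the points are nearly linear
```

Note that qqline() by default draws the line through the first and third quartiles rather than using the sample mean and standard deviation, so it is usually close to, but not identical with, the line described above.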

5.5.3 Stem-and-leaf plot

Use the function stem() to draw stem-and-leaf plots in R:

stem(x, scale=1, width=80, atom=1e-08)

where x is the data vector, scale controls the length of the plot, width controls the width of the plot, and atom is the tolerance.

> set.seed(111)
> s = sample(cars$speed, 25)
> stem(s)
  The decimal point is 1 digit(s) to the right of the |
  0 | 44
  0 | 779
  1 | 011233344
  1 | 5557889
  2 | 0344

5.5.4 Box plot

> boxplot(cars$speed)
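The numbers behind a box plot can be inspected with boxplot.stats(), sketched here without drawing. The vector y is made-up data, used only to show how an extreme value is reported.

```r
b <- boxplot.stats(cars$speed)
print(b$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)    # points lying more than 1.5 box lengths beyond the hinges

# A small made-up vector with one extreme value, for illustration only
y <- c(1, 2, 3, 4, 100)
print(boxplot.stats(y)$out)
```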

5.5.5 Empirical distribution plot

The function ecdf() in R gives the empirical distribution function of a sample, which can be drawn with plot():

ecdf(x)

plot(x, ..., ylab="Fn(x)", verticals=FALSE, col.01line="gray70", pch=19)
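A brief sketch of how the object returned by ecdf() can be used. The evaluation point 15 is an arbitrary choice for illustration.

```r
Fn <- ecdf(cars$speed)   # Fn is itself a function

print(Fn(15))            # proportion of observations <= 15
print(Fn(c(4, 25)))      # evaluated at the sample minimum and maximum

# Equivalent by-hand computation of the empirical distribution at one point
by_hand <- mean(cars$speed <= 15)
print(by_hand)
```

Calling plot(Fn) then draws the step function, with the arguments listed above controlling the axis label, the vertical lines, and the point style.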

"Data Analysis R Language Practice" study notes the descriptive analysis of the data in the fifth chapter (Part I)
