1. Information visualization: histogram, probability density function and cumulative distribution function
Histograms display grouped numeric (quantitative) data. There are no gaps between the rectangles, and the values are laid out along a continuous numeric scale.
The area of each rectangle is proportional to the frequency. When the intervals have unequal widths, the width of each rectangle reflects the width of its interval, and the height reflects the frequency density of that interval (frequency divided by interval width).
Probability density function (PDF): the result of normalizing the histogram. Cumulative distribution function (CDF): the result of normalizing the cumulative frequency (drawn as a line chart).
2. Measures of central tendency: averages (mean μ, median, mode)
Average | Calculation method | When to use |
Mean μ | μ = ∑x/n = ∑fx/∑f | Used when the data is fairly symmetric and shows only one trend. The mean is sensitive to outliers (extreme values), but is more stable across samples. |
Median | Sort the values from smallest to largest. With n values, the middle position is (n+1)/2. If n is odd, the median is the value at that position. If n is even, (n+1)/2 falls halfway between two positions; add the two values on either side of it and divide by 2. | Used when the data is skewed by outliers. |
Mode | The value with the highest frequency. There may be more than one mode: if several values share the highest frequency, each of them is a mode. If the data appears to reflect several trends or several batches of data, give a mode for each batch. | Used with categorical data. Also used when the data can be divided into two or more groups. |
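The three averages in the table can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the sample data are made up for the example.

```python
from collections import Counter


def mean(values):
    # mu = sum(x) / n
    return sum(values) / len(values)


def mean_from_frequencies(freq):
    # mu = sum(f * x) / sum(f), where freq maps each value x to its frequency f
    return sum(x * f for x, f in freq.items()) / sum(freq.values())


def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2 (1-based)
        return ordered[mid]
    # Even n: average the two values on either side of position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    # Every value that shares the highest frequency is a mode
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical sample with one extreme value
data = [1, 2, 2, 3, 3, 4, 100]
print(mean(data))    # pulled upward by the outlier 100
print(median(data))  # 3
print(modes(data))   # [2, 3]
```

In this sample the single extreme value drags the mean far above most of the data, while the median stays at 3.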
3. Dispersion and variability
Dispersion: range, interquartile range, etc.
Range: the maximum value minus the minimum value of the data set. It describes only the width of the data, not how the data is distributed within that width.
Interquartile range: the upper quartile minus the lower quartile. It is less affected by outliers than the range.
Lower quartile: compute n/4. If the result is an integer, take the mean of the values at that position and the next position. If it is not an integer, round up and take the value at that position.
Upper quartile: compute 3n/4. If the result is an integer, take the mean of the values at that position and the next position. If it is not an integer, round up and take the value at that position.
Use a box plot (box-and-whisker diagram) to draw these ranges: it shows the range of the data, the interquartile range, and the median.
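The quartile rule described above can be written out directly. The sketch below follows that n/4 and 3n/4 rule with 1-based positions; other conventions exist, so library functions may give slightly different quartiles for the same data. The function names and the sample values are illustrative.

```python
def quartile(values, k):
    """Lower (k=1) or upper (k=3) quartile using the k*n/4 position rule."""
    if k not in (1, 3):
        raise ValueError("k must be 1 (lower) or 3 (upper)")
    ordered = sorted(values)
    n = len(ordered)
    numerator = k * n
    if numerator % 4 == 0:
        # Integer position: average this position and the next (1-based)
        pos = numerator // 4
        return (ordered[pos - 1] + ordered[pos]) / 2
    # Non-integer position: round up and take that position (1-based)
    pos = (numerator + 3) // 4
    return ordered[pos - 1]


def value_range(values):
    return max(values) - min(values)


def interquartile_range(values):
    return quartile(values, 3) - quartile(values, 1)


# Hypothetical sample: 11 values, the last one is extreme
data = [3, 3, 6, 7, 7, 10, 10, 10, 11, 13, 30]
print(value_range(data))          # 27
print(quartile(data, 1))          # 6
print(quartile(data, 3))          # 11
print(interquartile_range(data))  # 5
```

The extreme value 30 accounts for most of the range, but the interquartile range is unaffected by it.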
Variability: look at the distance between each value and the mean. The smaller these distances are, the more tightly the data clusters around the mean. Average distance: positive and negative distances cancel each other out, so the average is always 0. Variance: squares each distance so that positive and negative distances cannot cancel each other out.
Variance = ∑(x-μ)²/n = ∑(x-μ)(x-μ)/n = ∑x²/n - μ²
Standard deviation (σ) = √variance
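As a quick check of the two variance formulas, the sketch below computes the population variance both ways and takes the square root for the standard deviation. The function names and the sample data are made up for the example.

```python
import math


def variance(values):
    # Definition: the mean of the squared distances from the mean
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def variance_shortcut(values):
    # Equivalent form: sum(x^2) / n - mu^2
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


def std_dev(values):
    return math.sqrt(variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = sum(data) / len(data)

# Signed distances cancel out, which is why they are squared
print(sum(x - mu for x in data))  # 0.0
print(variance(data))             # 4.0
print(variance_shortcut(data))    # 4.0
print(std_dev(data))              # 2.0
```

Both functions divide by n, matching the formula above; the sample variance that divides by n-1 is a different quantity and is not shown here.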
Standard score: z = (x-μ)/σ. Standard scores make it possible to compare values from different data sets, collected under different conditions, by converting each data set to a common distribution with mean 0 and standard deviation 1 while leaving its basic shape unchanged.
Outlier detection: an outlier is defined as a value that lies more than three standard deviations from the mean (its standard score is not between -3 and 3).
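Standard scores and the three-standard-deviation rule can be sketched as follows. This assumes the population standard deviation (dividing by n) and a data set whose values are not all identical; the function names, the threshold parameter and the sample data are illustrative.

```python
import math


def z_scores(values):
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    # z = (x - mu) / sigma
    return [(x - mu) / sigma for x in values]


def outliers(values, threshold=3.0):
    # A value is an outlier when its z-score is not between -threshold and threshold
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sample: 19 values near 10 and one extreme value
data = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
        10, 10, 9, 11, 10, 10, 9, 11, 10, 40]

print(outliers(data))  # [40]

# After standardising, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding)
zs = z_scores(data)
z_mean = sum(zs) / len(zs)
z_std = math.sqrt(sum((z - z_mean) ** 2 for z in zs) / len(zs))
print(round(z_mean, 9), round(z_std, 9))
```

Note that with the population standard deviation, the largest possible z-score in a data set of n values is (n-1)/√n, so this rule can never flag a value when there are ten or fewer values.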
Data Analysis Overview 02: In-Depth Statistics - Basic Statistics 1