Statistical methods in machine learning
Statistics is a pillar of machine learning.
Raw observations are just data; on their own they are not information or knowledge. Data raises questions such as:
- What is the most common or expected observation?
- What are the limits of the observations?
- What does the data look like?
- Which variables are most relevant?
- What is the difference between two experiments?
- Are the differences real, or are they the result of noise in the data?
Mode, mean, and median
The mode (the most frequent value), the mean (the average), and the median (the middle value) are all measures of the center of the data.
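The following two formulas give the mean of a sample of size n and the mean of a population of size N, respectively:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$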
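To find the median, we first need to sort the data. Suppose we have n numbers in ascending order, x_1, x_2, x_3, ..., x_n. The median is then:
$$\text{median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$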
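The three measures of center can be sketched in a few lines of Python. This is only an illustration: the function name median_from_definition and the data values are made up for the example, and the standard library's statistics module is used for the mean and the mode.

```python
import statistics


def median_from_definition(values):
    """Median of a list: sort, then take the middle value (or the mean of the two middle values)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle element (x_{(n+1)/2} in 1-based indexing).
        return xs[mid]
    # Even n: the average of the two middle elements.
    return (xs[mid - 1] + xs[mid]) / 2


# Illustrative data, not taken from the article.
data = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(data)           # sum of values divided by their count
median_value = median_from_definition(data)  # middle of the sorted data
mode_value = statistics.mode(data)           # most frequent value

print(mean_value, median_value, mode_value)
```

For this data the mean is pulled upward by the value 10, while the median and the mode stay closer to where most of the values lie.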
Q1, Q3, IQR, variance, and standard deviation
Q1 (the first quartile), Q3 (the third quartile), and the IQR (the interquartile range, Q3 - Q1) are the quantities shown in a box plot.
An observation that is less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR is a possible outlier. In some cases, outliers are removed before statistics are computed.
One method for finding Q1 and Q3 is as follows.
First find Q2, the median of the dataset, which divides the dataset into two halves. Then:
- Find the median of the upper half; this is Q3.
- Find the median of the lower half; this is Q1.
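The median-of-halves procedure and the 1.5 × IQR rule can be sketched as follows. Two assumptions are worth flagging: when n is odd, this sketch leaves the median itself out of both halves (other conventions include it, and library functions often interpolate instead, so results can differ slightly), and the function names and data are hypothetical.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method (needs at least 2 values)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: for odd n the median element belongs to neither half.
    upper = xs[half + 1:] if n % 2 == 1 else xs[half:]
    return median(lower), median(xs), median(upper)


def possible_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Illustrative data, not taken from the article.
data = [1, 3, 4, 5, 7, 8, 9, 30]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)               # 3.5 6.0 8.5
print(possible_outliers(data))  # [30]
```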
The variance and the standard deviation measure how dispersed the data is. For a population of size N with mean μ, they are calculated as follows:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \qquad \sigma = \sqrt{\sigma^2}$$
But wouldn't the absolute deviation be simpler and clearer? It also measures the dispersion of the data. Why go to the trouble of squaring and then taking a square root to get the standard deviation? The reason is that the standard deviation has some very useful properties in statistical analysis.
In a normal distribution, about 68% of the data falls within 1 standard deviation of the mean, about 95% falls within 2 standard deviations of the mean, and so on. In fact, for any number of standard deviations we can work out what percentage of the data falls within that distance of the mean. This is why the standard deviation is so useful.
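These percentages can be checked numerically. For a normal distribution, the fraction of data within k standard deviations of the mean is erf(k / √2), where erf is the error function; the helper name fraction_within is made up for this sketch.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```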
If our dataset is the entire population, the formula for the standard deviation is the one given above. But if our dataset is only a sample of size n drawn from the population, with sample mean x̄, the formula is as follows:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
This is called the sample standard deviation. Intuitively, most of the population lies near the center, so the values in a sample also come mostly from the center. The sample is therefore less spread out than the population, and a standard deviation computed from it tends to be smaller than the true one. Dividing by n - 1 instead of n (known as Bessel's correction) compensates for this and brings the result closer to the true standard deviation. The sample standard deviation is an estimate of the population σ.
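The effect of dividing by n - 1 rather than n can be seen in a small sketch. The function names and the data are illustrative; the results are compared against statistics.pstdev and statistics.stdev from the Python standard library, which implement the population and sample versions respectively.

```python
import math
import statistics


def population_sd(values):
    """Standard deviation treating the values as the whole population (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def sample_sd(values):
    """Standard deviation treating the values as a sample (divide by n - 1, Bessel's correction)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


# Illustrative data, not taken from the article.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(sample))  # 2.0
print(sample_sd(sample))      # slightly larger, because of the n - 1 divisor
```

The corrected value is always the larger of the two, and the gap shrinks as the sample size grows.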
Z-score and normal distribution
A z-score expresses how many standard deviations an element lies from the mean. It is calculated as follows:
$$z = \frac{x - \mu}{\sigma}$$
- x: the value of the element
- μ: the mean
- σ: the standard deviation
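As a small sketch, the z-score formula translates directly into code. The function names and the numbers are made up for illustration, and standardize assumes the data is treated as a whole population.

```python
import statistics


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (positive) or below (negative) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


def standardize(values):
    """Convert every value to its z-score, using the population mean and standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [z_score(x, mu, sigma) for x in values]


# Illustrative numbers: a score of 85 when the mean is 70 and the standard deviation is 10.
print(z_score(85, 70, 10))  # 1.5

z_values = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_values)  # the standardized values have mean 0 and standard deviation 1
```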
When we standardize a normal distribution (that is, convert each value to its z-score), we get the standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1.
For a normal distribution, the probability that a randomly selected value is less than x equals the area under the curve from negative infinity to x.
Calculus can be used to find the area under the curve between any two points. We can also use a z-table to look up the area to the left of a given value. Before using a z-table, however, we have to standardize, that is, find the z-score that corresponds to the value x.
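In code, the z-table lookup can be replaced by the cumulative distribution function of the standard normal distribution. The sketch below uses statistics.NormalDist from the Python standard library (available since Python 3.8); the function names and the example numbers are hypothetical.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist()


def prob_less_than(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2): standardize x, then take the area to the left of z."""
    z = (x - mu) / sigma
    return STANDARD_NORMAL.cdf(z)


def prob_between(a, b, mu, sigma):
    """P(a < X < b): the area under the curve between a and b."""
    return prob_less_than(b, mu, sigma) - prob_less_than(a, mu, sigma)


# Illustrative numbers: a population with mean 70 and standard deviation 10.
print(prob_less_than(85, 70, 10))    # area to the left of z = 1.5
print(prob_between(60, 80, 70, 10))  # area within 1 standard deviation of the mean
```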
Central limit theorem
Suppose a sample contains many observations, each generated randomly and independently of the others, and we calculate the mean of this sample. If we repeat this for many such samples, the central limit theorem tells us that the resulting means are approximately normally distributed.
In probability theory, the central limit theorem states that, under certain conditions, the arithmetic mean of a large number of independent random variables is approximately normally distributed, no matter what the underlying population distribution is.
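A quick simulation can make the theorem more concrete. The sketch below assumes a strongly skewed population (exponential with mean 1), a sample size of 50, and 2000 repetitions; all of these choices, and the function name sample_means, are arbitrary. If the sample means are roughly normal, about 68% of them should fall within 1 standard deviation of their own mean.

```python
import random
import statistics


def sample_means(draw, sample_size, num_samples):
    """Draw num_samples samples of sample_size observations each and return their means."""
    return [
        statistics.fmean(draw() for _ in range(sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(0)  # fixed seed so that the run is repeatable

# Skewed, clearly non-normal population: exponential with mean 1 and standard deviation 1.
means = sample_means(lambda: rng.expovariate(1.0), sample_size=50, num_samples=2000)

center = statistics.fmean(means)
spread = statistics.stdev(means)
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means:  {center:.3f}")         # close to the population mean, 1
print(f"sd of sample means:    {spread:.3f}")         # close to 1 / sqrt(50)
print(f"fraction within 1 sd:  {within_one_sd:.3f}")  # close to 0.68 if roughly normal
```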
Sampling distribution
Wikipedia defines it as follows: in statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given statistic based on a random sample.
As an example, suppose we have a normally distributed population with a mean of μ and a variance of σ². We repeatedly draw samples of size n from this population and calculate the mean of each sample, which is called the sample mean.
Each sample has its own mean, and the distribution of these means is called the sampling distribution of the sample mean.
Since the population is normally distributed, this sampling distribution is also normal: it follows N(μ, σ²/n), where n is the sample size. According to the central limit theorem, even if the population distribution is not normal, the sampling distribution is usually close to a normal distribution when n is large.
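The N(μ, σ²/n) result can be checked with a short simulation. The parameter values below (μ = 10, σ = 2, n = 25, and 4000 repeated samples) are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

# Hypothetical population parameters and sample size.
mu, sigma, n = 10.0, 2.0, 25
num_samples = 4000

rng = random.Random(1)  # fixed seed so that the run is repeatable

# Draw many samples of size n from N(mu, sigma^2) and record each sample mean.
means = [
    statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(num_samples)
]

observed_mean = statistics.fmean(means)
observed_var = statistics.variance(means)
predicted_var = sigma ** 2 / n  # sigma^2 / n = 0.16 for these values

print(f"mean of sample means:     {observed_mean:.3f} (theory: {mu})")
print(f"variance of sample means: {observed_var:.3f} (theory: {predicted_var})")
```

Increasing n makes the sample means cluster more tightly around μ, since their variance shrinks in proportion to 1/n.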
Examples
Here are 10 examples of where statistical methods are used in an applied machine learning project.
- Problem framing: requires exploratory data analysis and data mining.
- Data understanding: requires summary statistics and data visualization.
- Data cleaning: requires anomaly detection, normalization, and so on.
- Data selection: requires data sampling and feature selection methods.
- Data preparation: requires data transformations, scaling, encoding, and so on.
- Model evaluation: requires experimental design and resampling methods.
- Model configuration: requires statistical hypothesis tests and estimation statistics.
- Model selection: requires statistical hypothesis tests and estimation statistics.
- Model presentation: requires estimation statistics, such as confidence intervals.
- Model predictions: requires estimation statistics, such as prediction intervals.
Statistical Methods for Machine Learning