Common mathematical formulas for data analysis (update ... )

Source: Internet
Author: User
Tags natural logarithm repetition

1. Variance: a measure of how far a batch of data deviates from its center, that is, how much the values fluctuate around the mean. The standard deviation is the square root of the variance.

Formula: s² = (1/n) [(x_1 - x̄)² + (x_2 - x̄)² + ... + (x_n - x̄)²], where x̄ is the mean of the n values.

Example: for the five numbers 1, 2, 3, 4, 5, the mean is 3.

The variance is:

1/5 [(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²] = 2
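The worked example above can be checked with a few lines of Python. This is only a minimal sketch: the function name population_variance is made up for illustration, and it divides by n (as in the example), not by n - 1.

```python
import math

def population_variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

data = [1, 2, 3, 4, 5]
variance = population_variance(data)
std_dev = math.sqrt(variance)  # standard deviation = square root of variance

print(variance)  # 2.0
print(std_dev)
```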

2. K-means: one of the simplest clustering algorithms, yet widely used. K-means is generally applied in the early stage of data analysis: choose an appropriate k, cluster the data, and then study the characteristics of the data within each cluster.
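To make the idea concrete, here is a small sketch of the usual assign-then-update loop (Lloyd's algorithm) for two-dimensional points. It assumes the starting centroids are supplied by the caller; the names kmeans and nearest and the sample points are made up for illustration, and a real analysis would normally use a library implementation.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, max_iter=100):
    """Cluster points around the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```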

3. Normal distribution: also known as the Gaussian distribution, it is a very important probability distribution in mathematics, physics and engineering, and has a major influence on many areas of statistics. If the random variable X follows a Gaussian distribution with mathematical expectation μ and variance σ^2, this is written X ~ N(μ, σ^2).
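The text does not give the density itself, so the following sketch uses the standard normal density formula, f(x) = exp(-(x - μ)^2 / (2σ^2)) / (σ √(2π)). The function name normal_pdf is made up for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density peaks at the mean and is symmetric around it.
print(normal_pdf(0.0))
print(normal_pdf(1.0), normal_pdf(-1.0))
```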

4. Summation: the symbol ∑ is read "sigma" and means "sum". In ∑_{k=i}^{n} a_k, i is the lower bound and n is the upper bound: k starts at i and runs up to n, and all the terms a_k are added together.
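In code, a sigma sum is just a loop over the index from the lower bound to the upper bound, both included. A minimal sketch, with the helper name sigma_sum made up for illustration:

```python
def sigma_sum(term, lower, upper):
    """Add term(k) for k = lower, lower + 1, ..., upper (bounds included)."""
    return sum(term(k) for k in range(lower, upper + 1))

print(sigma_sum(lambda k: k, 1, 100))     # 5050
print(sigma_sum(lambda k: k ** 2, 1, 3))  # 14
```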

5. Logarithmic function: if a^x = N (a > 0 and a ≠ 1), then the number x is called the logarithm of N to the base a, written x = log_a N and read "log base a of N", where a is called the base of the logarithm and N is called the argument (the "true number").

[Figure: graphs of the logarithmic function]

6. Natural logarithm: the logarithm to the base of the constant e is called the natural logarithm, written ln N (N > 0). The constant e is the limiting value reached when growth of 100% per unit of time is compounded continuously.

The base e of the natural logarithm is given by an important limit. We define: e = lim (1 + 1/n)^n as n tends to infinity.

e is an infinite non-repeating decimal whose value is approximately 2.718281828459...; it is a transcendental number.
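The limit that defines e, namely (1 + 1/n)^n as n grows, can be watched numerically. A quick sketch, with the helper name e_approx made up for illustration:

```python
import math

def e_approx(n):
    """Value of (1 + 1/n)^n, which approaches e as n grows."""
    return (1.0 + 1.0 / n) ** n

for n in (1, 10, 1000, 1_000_000):
    print(n, e_approx(n))

print(math.e)  # library value, for comparison
```

The approximation improves slowly: the error shrinks roughly in proportion to 1/n.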

7. Least squares

Residuals: let y_i be the i-th sample observation of the explained (dependent) variable, and let ŷ_i be the corresponding i-th sample estimate. The deviation between y_i and ŷ_i is called the residual of the i-th sample observation and is written e_i = y_i - ŷ_i.

Least squares criterion: the sum of the squared residuals of all sample observations is minimized.
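As an illustration of the criterion, the sketch below fits a straight line y = intercept + slope * x using the standard closed-form least squares solution for one explanatory variable. The function names and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Slope and intercept minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum over all samples of (observed - estimated) squared."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(sum_squared_residuals(xs, ys, slope, intercept))
```

Any other line through the same data gives a larger sum of squared residuals, which is what the criterion asks for.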

8. Derivative: the derivative is a local property of a function. The derivative of a function at a point describes the rate of change of the function near that point. If the function's argument and values are real numbers, the derivative at a point is the slope of the tangent to the function's curve at that point. In essence, the derivative is a local linear approximation of the function, obtained through the concept of a limit. In kinematics, for example, the derivative of an object's displacement with respect to time is its instantaneous velocity.
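The "rate of change near a point" can be approximated numerically with a central difference, (f(x + h) - f(x - h)) / (2h), for a small step h. This is a rough sketch: the function names, the step size and the free-fall example s(t) = 4.9 t^2 are assumptions made for illustration.

```python
def derivative(f, x, h=1e-5):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def displacement(t):
    """Displacement in metres after t seconds of free fall from rest."""
    return 4.9 * t ** 2

# Slope of the tangent to y = x^2 at x = 3 (exact value: 6).
print(derivative(lambda x: x ** 2, 3.0))

# Instantaneous velocity at t = 2 s (exact value: 19.6 m/s).
print(derivative(displacement, 2.0))
```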

9. Basic elementary functions: http://baike.baidu.com/view/363955.htm

10. Distribution functions and their graphs

Binomial distribution

Multinomial distribution

Normal distribution

Bernoulli trial

t-distribution

Uniform distribution

Poisson distribution

11. Poisson distribution: a discrete probability distribution that is common in statistics and probability theory.

The Poisson distribution is suitable for describing the number of random events that occur per unit of time, for example the number of customers arriving at a service facility in a given period, the number of calls received by a telephone exchange, the number of passengers waiting at a bus stop, the number of machine failures, or the number of natural disasters.
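The text does not state the probability formula, so this sketch uses the standard Poisson probability P(X = k) = λ^k e^(-λ) / k!, where λ is the average number of events per unit of time. The function name poisson_pmf and the rate of 3 calls per minute are made up for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical exchange receiving 3 calls per minute on average.
rate = 3.0
for k in range(6):
    print(k, poisson_pmf(k, rate))
```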

12. Bernoulli trials and the binomial distribution: a Bernoulli trial is a random trial carried out repeatedly and independently under the same conditions. Its defining feature is that each trial has only two possible outcomes: the event occurs or it does not occur.

Binomial distribution: in n independent repeated trials, let ξ denote the number of times event A occurs. If the probability that A occurs in each trial is p, and q = 1 - p, then the probability that A occurs exactly k times in the n trials is:

P(ξ = k) = C_n^k p^k q^(n-k), k = 0, 1, ..., n, where C_n^k is the number of combinations of k elements taken from n.

Then ξ is said to follow the binomial distribution, where p is called the probability of success.

Written as: ξ ~ B(n, p)

Expectation: Eξ = np

Variance: Dξ = npq
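The expectation np and variance npq can be verified directly from the binomial probabilities. A small sketch, assuming the standard binomial probability formula; the function name binomial_pmf and the values n = 10, p = 0.3 are made up for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(probs))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print(sum(probs))  # the probabilities add up to 1 (up to rounding)
print(mean)        # close to n * p = 3
print(variance)    # close to n * p * q = 2.1
```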

13. Permutation: the number of all arrangements of m (m ≤ n) elements taken from n different elements is called the number of permutations of m elements from n different elements, denoted A_n^m (or P_n^m, or nPm).

Formula: A_n^m = n(n-1)(n-2)...(n-m+1) = n! / (n-m)!

14. Combination: in general, taking any n (n ≤ m) elements from m different elements as a group, without regard to order, is called a combination of n elements taken from m different elements.

Formula: C_m^n = m! / (n! (m-n)!)

15. Factorial: the factorial of a positive integer is the product of all positive integers less than or equal to that number, and the factorial of 0 is defined as 1. The factorial of a natural number n is written n!.
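Permutations, combinations and factorials are closely related, which a short sketch can show. The two helper functions below are made up for illustration and are compared against Python's built-in math.perm and math.comb (available from Python 3.8).

```python
import math

def permutations_count(total, chosen):
    """Ordered selections of `chosen` items out of `total` distinct items."""
    return math.factorial(total) // math.factorial(total - chosen)

def combinations_count(total, chosen):
    """Unordered selections of `chosen` items out of `total` distinct items."""
    return math.factorial(total) // (
        math.factorial(chosen) * math.factorial(total - chosen)
    )

print(math.factorial(0), math.factorial(5))          # 1 120
print(permutations_count(5, 2), math.perm(5, 2))     # 20 20
print(combinations_count(5, 2), math.comb(5, 2))     # 10 10
```

Each combination of 2 items can be ordered in 2! ways, which is why 20 = 10 * 2.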

16. Neural network algorithm: an artificial neural network is a general-purpose classifier, aimed at the kinds of tasks that human beings can solve. Machine learning models essentially address only two problems: feature selection and function fitting.

17. Box plot (box-and-whisker plot): a statistical chart used to display how dispersed a set of data is.

(1) Box plots give a standard for identifying outliers: an outlier is defined as a value less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.

(2) They show the skewness and tail weight of the data.

(3) They make it possible to compare the shapes of several batches of data.
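The outlier rule in point (1) is easy to apply in code. The sketch below uses statistics.quantiles from the Python standard library with its default quartile method; other tools compute quartiles slightly differently, so the fences can shift a little. The function name iqr_outliers and the sample data are made up for illustration.

```python
import statistics

def iqr_outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(data))  # [100]
```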

18. Support vector machine (SVM), an informal explanation: support vector machines are used to solve classification problems.

First consider the simplest case: peas and rice can be separated quickly with a sieve, because the small grains fall through and the large ones are retained.

Expressed as a function: when the diameter d is greater than some threshold D, the particle is judged to be a pea; when it is less than D, it is a grain of rice.

d > D: pea

d < D: rice

On a number line, everything to the left of D is rice and everything to the right is peas. This is the one-dimensional case.

But real problems are not so simple, and size is not the only thing to consider. Given two varieties of a flower, how do we classify them?

Suppose two attributes are used to tell them apart: petal size and color. A single attribute, as with the rice, is no longer enough. So we use two values: size x and color y.

We plot all the data as points on the x-y plane. If these two attributes really do determine the variety, the data will cluster into two groups on this two-dimensional plane.

We just have to find a straight line that separates the two groups, and classification becomes easy: when a new data point arrives, we place it on the plane and see which side of the line it falls on; that side gives its class.

For example, take the line x+y-2=0. We substitute a data point (x, y): if x+y-2>0 it belongs to class A, and if x+y-2<0 it belongs to class B.

By analogy, classification can use three, four or n attributes, in which case the separator is no longer a straight line but a plane or a hyperplane.

A three-dimensional classification function: x+y+z-2=0, which is a separating plane.
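The "which side of the line" test described above takes only a few lines of code. This sketch hard-codes the example boundaries from the text as weights and a bias; it does not learn the boundary from data, which is what an actual SVM would do. The function name classify is made up for illustration.

```python
def classify(point, weights=(1.0, 1.0), bias=-2.0):
    """Class A if w . point + bias > 0, class B if it is < 0."""
    score = sum(w * v for w, v in zip(weights, point)) + bias
    if score > 0:
        return "A"
    if score < 0:
        return "B"
    return "on the boundary"

# Two attributes: the line x + y - 2 = 0.
print(classify((3, 1)))  # A
print(classify((0, 0)))  # B

# Three attributes: the plane x + y + z - 2 = 0.
print(classify((1, 1, 1), weights=(1.0, 1.0, 1.0)))  # A
```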

Sometimes the boundary between the classes is not a straight line but a curve. We can transform the data through some functions so that the problem is converted back into the kind of linear, multidimensional classification problem just described. This is the idea of the kernel function.

For example, suppose the classification boundary is the circle x^2+y^2-4=0. Let x^2=a and y^2=b, and it becomes the straight-line problem a+b-4=0.
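The circle example can be written out as an explicit transformation followed by a straight-line test. This is only a sketch of the substitution in the text; real kernel methods usually avoid computing the transformed coordinates explicitly. The function names are made up for illustration.

```python
def feature_map(x, y):
    """Transform (x, y) into the new coordinates (a, b) = (x^2, y^2)."""
    return x ** 2, y ** 2

def side_of_circle(x, y):
    """Apply the linear rule a + b - 4 to the transformed point."""
    a, b = feature_map(x, y)
    score = a + b - 4
    if score > 0:
        return "outside"
    if score < 0:
        return "inside"
    return "on the circle"

print(side_of_circle(3, 0))  # outside
print(side_of_circle(1, 1))  # inside
```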

This is the idea behind support vector machines. "Machine" here means "algorithm"; in the machine learning field the word "machine" is often used to refer to an algorithm.

"Support vectors" are certain points of the data set whose positions are special. Take the line x+y-2=0 mentioned earlier: the region above the line, x+y-2>0, is all class A, and the region below it, x+y-2<0, is all class B. When we look for this line, we generally look at the two clusters of data and at the points on their respective margins, that is, the points closest to the dividing line. The other points play no part in determining the final position of the line, so I call these marginal points "support points" (meaning the points that matter).

Mathematically there is no such term, but points in mathematics can be called vectors: a two-dimensional point (x, y) is a two-dimensional vector, and a three-dimensional point (x, y, z) is a three-dimensional vector. So a "support point" is called a "support vector", which sounds more professional and more impressive. And that is the support vector machine.
