A systematic analysis of the R language combined with probability and statistics: numerical characteristics

Source: Internet
Author: User

Suppose there is a person. How do we recognize that this person is who they are? We rely on their characteristics: we note their height, their appearance, their age, and by analyzing these characteristics we determine that this is the person we think it is, without ever mistaking them for someone else.

Data analysis works the same way. We extract the characteristics of the data and analyze them to determine what information the data present, and hence what makes this data set distinct, because the information it presents is its own and is not the same as that of any other data set.

So what are these characteristics? After the work of countless scientists, a few important kinds have been identified: numerical characteristics and distribution characteristics. The numerical characteristics cover central location and dispersion, while the distribution describes the data as a whole. This raises a question: what is the difference between numerical characteristics and distribution characteristics? What distinguishes them in nature?

One way to look at it: a numerical characteristic, whether a central location or a degree of dispersion, is a single number; on a coordinate axis it is just one point. A distribution characteristic is shown as a graph, which can be two-dimensional, three-dimensional, or even n-dimensional. So one can say that numerical characteristics and distribution characteristics present the information in the data in low dimensions and in high dimensions, respectively.

Finally, why do we analyze data at all? Just as we learn a person's characteristics in order to judge that person, we analyze the characteristics of data in order to carry out statistical inference; feature analysis serves that purpose.

With these points understood, we have a logical structure and a starting point: data -> feature analysis -> information presentation -> statistical inference. From here we can focus on the key step at the core of this chain, namely feature analysis.

The question now is: if we are to study central location, dispersion, and the overall distribution, what quantities describe each of them?

First, the central location.

The central location is described by the mean, the mode, the median, and percentiles.

1, mean value:

Simply put, this is the average.

The formula is defined as:
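The formula itself did not survive in this copy of the article. Assuming the usual arithmetic mean is meant, with a sample of n values written x_1, ..., x_n (these symbols are chosen here for illustration), it would read:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$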

In R, the function for the mean is mean(x), where x is the sample, which can be a vector. A word on vectors: linear algebra defines a vector as an ordered array of numbers, and that definition fits here very well. From the point of view of data analysis, a single data point needs no analysis; analysis is needed only when there are several data points, and putting them together forms a group of numbers. The mathematical representation of that group of numbers is a vector.
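As a minimal sketch of the call just described, assuming the sample is stored in a numeric vector built with c() (the values are chosen only for illustration), mean() can be checked against the sum divided by the number of values:

```r
# A sample stored as a numeric vector (illustrative values)
x <- c(2, 7, 8, 10)

# Arithmetic mean, computed by the built-in function and by hand
m <- mean(x)
m_by_hand <- sum(x) / length(x)

print(m)          # 6.75
print(m_by_hand)  # 6.75
```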

Why can a vector represent an ordered array of numbers? Place the starting point of the vector at the origin of the coordinate system; its end point then pins down exactly one point. How many numbers are needed to describe that point depends on the dimension. In one dimension, that is, on a number line, the point is a single number, a. In two-dimensional space the point is (a, b), in three-dimensional space (a, b, c), in four-dimensional space (a, b, c, d), and in the same way, in n-dimensional space its coordinates are (a, b, c, d, ...). Seen this way, a vector can represent a group of numbers however many values the group contains, which is why linear algebra defines a vector this way.

If you only read the linear algebra in a college textbook, you will find that it simply gives a new definition of a vector without saying why the definition is given. Is that still a textbook? Perhaps it is only a reference book. To me this shows that higher education in China is still done very poorly, and I cannot help wondering whether the authors really understand the material themselves. This is probably why Chinese students' mathematics weakens at the later stages: teachers who can give us good guidance are scarce.

In fact, I think the whole of linear algebra is essentially about ordered arrays and the relationships between one ordered array and another. People may argue over whether the subject is essentially about vectors or essentially about matrices, and over which of the two is the more fundamental form, but in the end they are just two faces of an ordered array ...

Enough of that; back to the point ...

To take the mean over a matrix, use apply(): with a margin of 1 the mean is taken over each row, and with a margin of 2 it is taken over each column.
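A short sketch of the row and column case, assuming the article has apply() with its MARGIN argument in mind (the matrix values are made up for illustration):

```r
# A 2 x 3 matrix, filled column by column (illustrative values)
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_means <- apply(m, 1, mean)  # MARGIN = 1: one mean per row
col_means <- apply(m, 2, mean)  # MARGIN = 2: one mean per column

print(row_means)  # 3 4
print(col_means)  # 1.5 3.5 5.5
```

For this particular task, the built-in rowMeans() and colMeans() return the same values.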

2, mode

The mode is the value that occurs most often in the data.
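The article names no R function for the mode. R's built-in mode() reports an object's storage mode rather than the most frequent value, so one possible sketch counts occurrences with table(). The helper name stat_mode is hypothetical and it assumes numeric input:

```r
# Hypothetical helper: returns the most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)                                # frequency of each distinct value
  winners <- names(counts)[counts == max(counts)]   # every value tied for the top count
  as.numeric(winners)                               # table() stores values as names
}

x <- c(3, 5, 5, 7, 5, 3)
print(stat_mode(x))  # 5
```

When several values tie for the highest count, this sketch returns all of them.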

3, median

The median is the value in the middle of a set of numbers; the data must be sorted before the middle value can be found.

The sort function is: sort()

The formula for the median:
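The formula is missing from this copy. Under the usual definition, and writing x_(1) <= x_(2) <= ... <= x_(n) for the sorted sample (this notation and the symbol m are assumptions introduced here), the median would be:

$$
m =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)}, & n \text{ odd} \\[4pt]
\dfrac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right), & n \text{ even}
\end{cases}
$$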

The corresponding function is median().
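A small sketch of sorting and taking the median with the built-in sort() and median(), covering an even and an odd number of values (the data are illustrative):

```r
x <- c(10, 2, 8, 7)        # even number of values
sorted_x <- sort(x)        # 2 7 8 10
med_x <- median(x)         # average of the two middle values, 7 and 8
print(med_x)               # 7.5

y <- c(10, 2, 8, 7, 5)     # odd number of values
sorted_y <- sort(y)        # 2 5 7 8 10
med_y <- median(y)         # the single middle value
print(med_y)               # 7
```

median() sorts internally, so calling sort() first is only needed to inspect the ordered data.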

4, percentile

A percentile works as follows. Suppose the sample has 20 values and we divide it into 100 equal parts, so that each part is 20/100 of the sample. For the 25th percentile we compute 20*25/100 = 20*25% = 20/4 = 5, so the value we are looking for is the fifth number in the sorted sample.

The corresponding function in R is quantile().
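A sketch of the 20-value example using the built-in quantile(). The sample 1:20 is an assumption made so that position and value coincide. Note that quantile() offers several definitions through its type argument: type = 1 picks an actual data point and reproduces the counting argument above, while the default (type = 7) interpolates between neighbours.

```r
x <- 1:20  # 20 sorted values, so the fifth value is 5

# type = 1: inverse of the empirical distribution function, returns a data point
q_type1 <- quantile(x, probs = 0.25, type = 1, names = FALSE)
print(q_type1)    # 5

# Default type = 7: linear interpolation between the 5th and 6th values
q_default <- quantile(x, probs = 0.25, names = FALSE)
print(q_default)  # 5.75
```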

Second, the degree of dispersion.

Only two measures are covered here: the range and the variance.

The range is the difference between the maximum and the minimum value. Intuitively, most people can see that the gap between the largest and smallest values is easy to find and does say something about how spread out a group of data is.

However, consider the following two data sets:

2, 7, 8, 10
2, 5, 6, 10

Both have the same range, 10 - 2 = 8, so the range is clearly not enough to judge their dispersion, and we have to find another way.
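The point about the two data sets can be checked with a short sketch. The built-in range() returns the minimum and maximum, and diff() turns that pair into their difference (the variable names are chosen here for illustration):

```r
a <- c(2, 7, 8, 10)
b <- c(2, 5, 6, 10)

# range() gives c(min, max); diff() gives max - min
range_a <- diff(range(a))
range_b <- diff(range(b))

print(range_a)  # 8
print(range_b)  # 8
```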

One could try combining pairwise differences, as in |2-10+7-8| and |2-10+5-6|. Note, however, that both expressions evaluate to 9, so they do not actually separate the two sets. The calculation is also inconvenient, because the data must be split into several parts, which adds complexity. Is there a simpler way that does the job?

Observe that 5-6 is equal to (5-5.5)+(5.5-6), so an average is already at work here.

So we take each number minus the mean. Because the differences can be positive or negative, and because absolute values are inconvenient to compute with, we square them and use the sum of squared differences to express the dispersion of a group. But is the sum of squares enough? What if the groups contain different numbers of data points? Then the sums are not comparable, so we take the average of the squared differences, and now groups can be compared. This gives the variance, whose formula is:
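The formula is missing from this copy. Following the description above (the average of the squared differences from the mean), and with symbols chosen here for illustration, it would be:

$$
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
$$

A minimal sketch applying this to the two data sets from earlier. The helper name pop_var is hypothetical. Note that R's built-in var() divides by n - 1 rather than n, so it returns the sample variance, which differs from the formula above.

```r
# Hypothetical helper: mean of the squared deviations (divides by n)
pop_var <- function(x) mean((x - mean(x))^2)

a <- c(2, 7, 8, 10)
b <- c(2, 5, 6, 10)

print(pop_var(a))  # 8.6875
print(pop_var(b))  # 8.1875

# Built-in var() divides by n - 1 (sample variance)
print(var(a))      # 11.58333
print(var(b))      # 10.91667
```

Under either definition the second data set comes out slightly less dispersed than the first.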

Finally, a word on the measures of central location described above. When the data contain obvious outliers, the mean no longer shows the central location of the data well, and we should rely on the mode and the median instead. Percentiles, for their part, show where any single value stands relative to the whole. Take Xiao Ming's score of 50 and its position among the class results: if 50 points sits at the 75th percentile, his result is in the upper part of the class.

In the next article, we will talk about distributions and the plotting system.
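Both remarks can be illustrated with a short sketch. All scores below are made up for illustration; the first part shows how one outlier moves the mean but barely moves the median, and the second computes the share of class results at or below a score of 50.

```r
# Made-up scores, then the same scores with one obvious outlier appended
scores <- c(50, 52, 55, 58, 60)
with_outlier <- c(scores, 500)

print(mean(scores))          # 55
print(median(scores))        # 55
print(mean(with_outlier))    # 129.1667
print(median(with_outlier))  # 56.5

# Position of a score of 50 in a made-up class of 8 results:
# the share of results less than or equal to 50
class_scores <- c(20, 30, 35, 40, 45, 50, 70, 90)
share_at_or_below <- mean(class_scores <= 50)
print(share_at_or_below)     # 0.75
```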
