Author: Vamei. Source: http://www.cnblogs.com/vamei. Reprints are welcome; please keep this statement. Thank you!
There are many ways to study data, such as using statistical methods to calculate the mean and standard deviation, or fitting the data with a model. Data is usually massive, and it is difficult for the human brain to grasp the information in it directly. The ultimate goal of data research is to reduce the amount of information in massive data, present that information objectively, and finally organize it into simple knowledge that the human brain can grasp.
Data Visualization
Graphics are a direct way to visualize data. However, it is not easy to draw a large amount of data in a single chart. Early surveying and weather data had to be drawn by hand, which took a long time. As computer graphics developed, manual drawing was completely replaced by automated programs. The core of the problem has therefore shifted to how to present data so that the information in it comes through naturally. Data visualization studies how to use graphics to reveal the information hidden in data and to explore the patterns the data contains. It is an interdisciplinary field that spans computer science, statistics, and psychology, and it has flourished with the rise of data mining and big data.
The following video comes from Hans Rosling, a Swedish physician and statistician. In this BBC video, Hans Rosling uses a wide range of visualization techniques to show how population and income around the world have evolved over the past two hundred years. I added Chinese and English subtitles to the video to make it easier to watch. Please forgive any errors.
Http://v.youku.com/v_show/id_XNTA3NDk0MTk2.html
Information dimensions of data
The chart made by Hans Rosling is worth studying. The basic information it displays has two dimensions:
1) X axis: per capita income
2) Y axis: average life expectancy
These two axes carry the most basic information the author wants to express. Each point in the figure represents a country, and its x-y position gives that country's per capita income and average life expectancy. The life expectancy scale increases linearly (25, 50, 75), but the income scale increases exponentially (400 yuan, 4,000 yuan, 40,000 yuan). The income scale deserves attention, because otherwise it easily gives a wrong impression. For example, suppose three countries A, B, and C have average incomes of 400 yuan, 4,000 yuan, and 40,000 yuan respectively. On the chart, A and B are one tick apart, and so are B and C, but the income gap between C and B is actually 10 times the income gap between A and B!
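The arithmetic behind this comparison can be checked with a small sketch. It assumes the three incomes are the tick values 400, 4,000, and 40,000 yuan and that the axis is a base-10 logarithmic axis; the country names and variable names are only illustrative.

```python
import math

# Per capita incomes of three hypothetical countries, in yuan.
incomes = {"A": 400, "B": 4000, "C": 40000}

# Position of each country on a base-10 logarithmic axis, measured in ticks.
log_pos = {name: math.log10(value) for name, value in incomes.items()}

# Distance between neighbouring countries as drawn on the chart.
chart_gap_ab = log_pos["B"] - log_pos["A"]
chart_gap_bc = log_pos["C"] - log_pos["B"]

# Actual income gaps between the same countries.
real_gap_ab = incomes["B"] - incomes["A"]
real_gap_bc = incomes["C"] - incomes["B"]

print(f"chart gaps: {chart_gap_ab:.2f} vs {chart_gap_bc:.2f} ticks")
print(f"real gaps:  {real_gap_ab} vs {real_gap_bc} yuan")
print(f"ratio of real gaps: {real_gap_bc / real_gap_ab:.0f}x")
```

Both chart gaps come out as one tick, while the real gaps are 3,600 yuan and 36,000 yuan.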
In addition, there are two dimensions of auxiliary information:
3) Circle size: national population
4) Circle color: the region the country belongs to
A plane naturally provides two dimensions (for example, X and Y). To add information in other dimensions, we need other, independent ways of representing it. Data points can vary in size and color. Here, Hans Rosling uses these two visual features to represent two independent dimensions (national population and region).
The year shown in Hans Rosling's video also changes visibly. This is 5) the time dimension. Using animation to record how information changes over time is a common technique in data visualization. However, animation should be used with caution, because it leaves the audience relatively little time to think deeply. It is therefore necessary to pause at appropriate moments during the animation to show some typical situations.
Finally, the data has one well-hidden information dimension: from time to time, Hans Rosling points out which country a circle represents. This is 6) the country name. In other words, the country name is also implicit information that can be retrieved at any time.
Why?
Carried along by Hans Rosling's impassioned speech, we are led to the conclusion that the gaps in income and life expectancy around the world are shrinking. As a whole, the world has become richer and healthier.
The data seems to support this. Or does it?
Take income: the chart is used to show that the gap between countries is shrinking. However, as mentioned above, each tick on the income scale is 10 times the previous one (such a scale is called a logarithmic scale). Therefore, the richer a country is, the harder it is for its income growth to show up on the scale. For example, an increase of 3,600 yuan moves a country with an original income of 400 yuan into the middle of the chart, while a country with an original income of 40,000 yuan barely moves. If the X axis were changed to a linear scale, the gap in per capita income between countries would look far larger than the impression this figure gives.
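As a rough check of this example, the sketch below computes how far a point moves along a base-10 logarithmic axis when income rises by the same 3,600 yuan. The function name `shift_in_ticks` is made up for illustration.

```python
import math

def shift_in_ticks(income, increase):
    """Distance a point moves on a base-10 log axis when income rises."""
    return math.log10(income + increase) - math.log10(income)

# The same absolute increase applied to a poor and a rich country.
poor_shift = shift_in_ticks(400, 3600)     # 400 -> 4,000 yuan
rich_shift = shift_in_ticks(40000, 3600)   # 40,000 -> 43,600 yuan

print(f"poor country moves {poor_shift:.3f} ticks")
print(f"rich country moves {rich_shift:.3f} ticks")
```

The poor country moves a full tick, while the rich country moves less than a twentieth of a tick, which is why its growth is almost invisible on the chart.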
(The conclusion that income has grown overall is also not very reliable unless inflation is taken into account.)
According to the Y axis, the health of the whole world has improved. Even so, we should be careful. For example, the two images below draw the same data (the S&P 500 index). The only difference is the range of the Y axis scale.
Does the second image seem to fluctuate more? Yet the two images share the same data! The range of the scale clearly affects how people perceive the data. A narrow scale range can make people feel that the data changes a lot, even though the data is the same.
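One way to put a number on this effect is to ask what fraction of the axis height the data's swing occupies. The sketch below uses made-up index values, not real S&P 500 data, and the helper `visual_fraction` is hypothetical.

```python
def visual_fraction(values, axis_min, axis_max):
    """Fraction of the axis height covered by the data's swing."""
    return (max(values) - min(values)) / (axis_max - axis_min)

# Made-up index values for illustration only.
index = [1400, 1420, 1390, 1450, 1430]

# Same data, two different Y axis ranges.
wide = visual_fraction(index, 0, 1500)       # axis from 0 to 1500
narrow = visual_fraction(index, 1380, 1460)  # axis hugging the data

print(f"wide range:   swing fills {wide:.0%} of the axis")
print(f"narrow range: swing fills {narrow:.0%} of the axis")
```

The swing of 60 points fills 4% of the wide axis but 75% of the narrow one, so the same data looks flat in one chart and volatile in the other.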
Therefore, a chart consists of two parts: the data and the plotting method. A chart is not the same thing as the data. The way it is plotted can affect people's subjective understanding. A good data chart should reflect the data as objectively as possible.
(Of course, someone familiar with the principles of data visualization may also use these techniques to exaggerate on purpose. This is often seen in posters.)
Elements of data plotting
Now let's switch places. Suppose we have a batch of data. How should we go about presenting it? This is not easy to answer, for two reasons:
1) Data contains a large number of information dimensions. We can only present some of them, not all.
2) The information in data can be presented in many different ways.
We first need to decide which dimensions of information to draw. For example, the video above displays six information dimensions. In the S&P 500 plot, we present only two dimensions, time and index. If a chart has few information dimensions, it is easier to understand. If it has many, it is more complex, but it more readily shows the relationships among multiple variables.
Each information dimension requires a coordinate to represent the value of the data in that dimension. In the Hans Rosling plot, the six coordinates are the horizontal X axis, the vertical Y axis, the circle color, the circle size, the time corresponding to the animation frame, and the country name indicated by text. These six coordinates are independent of each other, so each can show the value of its dimension without interfering with the others. Now compare the bar chart and pie chart below. Both show two-dimensional information. The bar chart uses x-y coordinates. The pie chart uses text and sector angle as its coordinates.
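To make the pie chart's angle coordinate concrete, here is a minimal sketch that converts values into sector angles, each value taking a share of 360 degrees proportional to its share of the total. The labels and numbers are invented, and `pie_angles` is a hypothetical helper rather than part of any plotting library.

```python
def pie_angles(values):
    """Map each labeled value to its sector angle in degrees."""
    total = sum(values.values())
    return {label: value * 360 / total for label, value in values.items()}

# Invented example data: three labeled quantities.
shares = {"A": 30, "B": 50, "C": 20}
angles = pie_angles(shares)

for label, angle in angles.items():
    print(f"{label}: {angle:.0f} degrees")
```

A bar chart would draw the same three values as bar heights along the Y axis, with the labels placed along the X axis.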
Each coordinate must have a scale. Readers rely on the scale to read off the value of the data. A scale can increase uniformly or non-uniformly (for example, a logarithmic scale). The choice of scale depends on the characteristics of the data. If the values of different samples differ greatly in a certain dimension, a logarithmic scale is appropriate. For example, the XKCD 1162 image below shows the negative effect of not using a logarithmic scale.
Log scale (XKCD 1162)
In addition, the scale needs a range. As mentioned for the S&P 500 plot, a large scale range reduces the visual fluctuation. A common choice of range runs from the minimum to the maximum value of the data in that dimension. However, in some cases the maximum and minimum may be unreliable values caused by faulty conditions, so the range from the mean minus the standard deviation to the mean plus the standard deviation is used instead.
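The two ways of choosing a range can be sketched as follows. The sample values, including the suspicious reading of 500, are invented, and the function names and the multiplier `k` are assumptions made for illustration.

```python
import statistics

def minmax_range(values):
    """Axis range from the smallest to the largest value."""
    return (min(values), max(values))

def mean_std_range(values, k=1.0):
    """Axis range of the mean plus or minus k standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return (mean - k * std, mean + k * std)

# Invented measurements; the last value stands in for a faulty reading.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]

lo_mm, hi_mm = minmax_range(data)
lo_ms, hi_ms = mean_std_range(data)

print(f"min-max range:  {lo_mm} to {hi_mm}")
print(f"mean +/- std:   {lo_ms:.1f} to {hi_ms:.1f}")
```

With this data the mean-based range stops well short of the faulty reading, although a single extreme value also inflates the standard deviation, so the resulting range is still fairly wide.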
After the dimensions and scales are chosen, you need to label each coordinate axis with what the dimension is and what its unit is, and mark the tick values on the axis. Only then is the information about the data in that dimension complete. A chart that has axes and ticks but no labels is a failed chart, because readers cannot work out what the data actually is.
(In the Hans Rosling plot, the information in two dimensions is incomplete: total population and country name.)
After completing the steps above, we need to further describe the data source. We can do this by adding explanatory text (for example, a title).
Summary
Data visualization is very interesting. It uses technical means to make dull data lively and appealing. With the arrival of the big data era and the growth of network communication, data visualization is becoming a skill worth mastering. In this article I used a great video to illustrate some key points of data visualization, especially information dimensions. In the future I will introduce some commonly used plotting tools to turn theory into practice.