Today I will talk about data visualization in Python. If you want to use Python for data analysis, you should start with exploratory data analysis at the beginning of the project, so that you get a basic understanding of the data. One of the most intuitive ways to do this is data visualization, which makes the data clear at a glance and easier to interpret. After the analysis produces a result, we also need visualization to present that final result.
What are the visual views?
According to the relationship between the data, we can divide visual views into 4 categories: comparison, connection, composition and distribution. Let me briefly introduce the characteristics of these four relationships:
Comparison: compares categories of data with each other, or shows their trend over time, such as a line chart;
Connection: shows the relationship between two or more variables, such as a scatter plot;
Composition: shows the percentage of the whole that each part accounts for, or how that percentage changes over time, such as a pie chart;
Distribution: focuses on the distribution of a single variable or of multiple variables, such as a histogram.
Similarly, according to the number of variables, we can divide visual views into univariate analysis and multivariate analysis.
Univariate analysis focuses on only one variable at a time. For example, we look only at the variable "height" to see how the height values are distributed, while temporarily ignoring the other variables.
Multivariate analysis lets you view the relationship between two or more variables in one graph. For example, "height" and "age" are two measurements of the same person, so you can see each person's "height" and "age" in the same picture and analyze whether there is some connection between the two variables.
Visual views come in many categories and varieties. Today I mainly introduce 10 commonly used views: scatter plots, line charts, histograms, bar charts, box plots, pie charts, heat maps, radar (spider) charts, bivariate distributions and paired relationships.
Scatter plot
A scatter plot displays the values of two variables in two-dimensional coordinates, which makes it very suitable for showing the relationship between two variables. Besides the two-dimensional scatter plot, there is also a three-dimensional scatter plot.
In Matplotlib, we often use the pyplot toolkit, which includes many drawing functions and resembles MATLAB's plotting framework. You need to import it before use:
import matplotlib.pyplot as plt
After importing the toolkit, use the plt.scatter(x, y, marker=None) function to draw a scatter plot. x and y are the coordinates, and marker is the symbol drawn at each point, for example "x", ">" or "o". Different markers give different symbol styles; you can try them yourself.
In addition to Matplotlib, you can also use Seaborn to draw scatter plots. Seaborn also has to be imported before use:
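As a minimal sketch of the plt.scatter call described above: the data here are random numbers made up for illustration, and the variable names are my own rather than anything from a real data set.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data: 200 random points, for illustration only
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
y = rng.normal(size=200)

# marker="x" draws a cross at each point; try "o" or ">" to compare styles
points = plt.scatter(x, y, marker="x")
plt.xlabel("x")
plt.ylabel("y")
```

To see the figure in a window, remove the matplotlib.use("Agg") line and call plt.show() at the end.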
import seaborn as sns
After importing Seaborn, you can use its functions. To make a scatter plot, you can use the sns.jointplot(x, y, data=None, kind='scatter') function directly. Here x and y are column names in data, and data is the data we pass in, generally a DataFrame. We set kind to 'scatter', which means a scatter plot. kind can also take other values; as I will discuss for later views, different values of kind draw the view in different ways. Note that recent Seaborn versions require x and y to be passed as keyword arguments.
Line chart
Line charts can be used to represent trends in data over time.
In Matplotlib, we can use the plt.plot() function directly. plt.plot() connects the points in the order they are given, so we need to sort the data by X value in advance; otherwise the line will not be drawn in order of increasing X.
In Seaborn, we use the sns.lineplot(x, y, data=None) function. Here x and y are column names in data, and data is the data we pass in, generally a DataFrame.
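The sorting step can be sketched like this. The years and values are made up for illustration and deliberately given out of order; np.argsort is one way to do the sort, not the only one.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up yearly values, deliberately out of order
x = np.array([2019, 2016, 2018, 2017, 2020])
y = np.array([40, 12, 31, 20, 55])

# argsort gives the indices that put x in ascending order;
# applying the same indices to y keeps every (x, y) pair together
order = np.argsort(x)
x_sorted, y_sorted = x[order], y[order]

(line,) = plt.plot(x_sorted, y_sorted)
```

Plotting x and y without the sort would make the line jump back and forth between years.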
Histogram
The histogram is a fairly common view. It divides the horizontal axis into a certain number of intervals, called "bins", and draws a rectangle over each bin whose height (the y value) is the number of data points that fall into that bin. This visualizes the distribution of the data set.
In Matplotlib, we use the plt.hist(x, bins=10) function, where x is a one-dimensional array and bins is the number of bins in the histogram; the default is 10.
In Seaborn, we use the sns.distplot(x, bins=10, kde=True) function. x is a one-dimensional array, bins is the number of bins in the histogram, and kde controls whether the kernel density estimate is displayed; it defaults to True, and we can set it to False to hide it. Kernel density estimation is a method that estimates the probability density with a kernel function. Note that distplot has been deprecated since Seaborn 0.11; for histograms its replacement is sns.histplot(x, bins=10, kde=True).
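A minimal sketch of plt.hist, using 1000 random values made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 1000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
x = rng.normal(size=1000)

# hist returns the count in each bin, the bin edges, and the drawn rectangles
counts, edges, patches = plt.hist(x, bins=10)
```

With 10 bins there are 11 edges, and the counts add up to the number of data points.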
Author: Wang Xin xyx
Link: https://www.jianshu.com/p/1b4f351013d3
Source: Jianshu
The copyright belongs to the author. For commercial reproduction, please contact the author for authorization, and for non-commercial reproduction, please indicate the source.
Heat map
A heat map is a matrix representation in which the values of the matrix elements are represented by colors, with different colors standing for different magnitudes. The color tells you intuitively how large the value at a given position is, and you can also compare it with the colors at other positions in the data set.
A heat map is a very intuitive method of multivariate analysis.
We generally use the sns.heatmap(data) function in Seaborn, where data is the data to be drawn.
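As an illustration of sns.heatmap, here is a sketch with a small matrix of random integers. The row and column labels are hypothetical and the values carry no real meaning.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up 4 x 6 matrix; the labels are placeholders for illustration
rng = np.random.default_rng(seed=0)
data = pd.DataFrame(
    rng.integers(0, 100, size=(4, 6)),
    index=["row1", "row2", "row3", "row4"],
    columns=["a", "b", "c", "d", "e", "f"],
)

# Each cell is colored by its value; a color bar is added by default
ax = sns.heatmap(data)
```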
Paired relationship
If you want to explore the pairwise relationships among several variables in a data set, you can use the sns.pairplot() function directly. It shows the relationship between each pair of variables in the DataFrame, and on the diagonal it shows the distribution of each single variable. It is a commonly used function in exploratory analysis and quickly helps us understand the relationships between pairs of variables.
The pairplot function is as convenient to use as the describe() function on a DataFrame.
Here we use the iris data set that comes with Seaborn. The iris flowers in it belong to three varieties: Setosa, Versicolour, and Virginica. There are 50 samples of each variety, and each sample has 4 attributes: sepal length, sepal width, petal length and petal width. The data set is typically used to predict which of the three varieties an iris flower belongs to.