This article introduces Matplotlib, a powerful Python visualization library. Three short functions, each only about 10 lines long, are enough to draw scatter plots, line charts, and histograms. It couldn't be easier!
Data visualization is a major part of a data scientist's work. In the early stages of a project, exploratory data analysis (EDA) is usually performed to gain an understanding of the data. Especially for large, high-dimensional data sets, visualization makes the relationships in the data clearer and easier to understand.
At the end of a project, it is just as important to present the final results in a clear, concise, and eye-catching way, because the audience is often non-technical customers who will find a good chart much easier to follow.
Matplotlib is a very popular Python library that makes data visualization easy. However, setting up the data, parameters, and figures from scratch for every new project is tedious. In this article, we focus on three data visualization methods (scatter plots, line charts, and histograms) and use Python's Matplotlib library to implement a quick, simple function for each.
Scatter plot
A scatter plot is well suited to showing the relationship between two variables, because the raw distribution of the data is directly visible in the graph. You can also compare different groups of data by giving each group its own color.
Now for the code. First import Matplotlib's pyplot module under the alias plt, then call plt.subplots() to create a new figure and axes. Pass the x-axis and y-axis data in as the arrays x_data and y_data, and hand them, together with the other parameters, to ax.scatter() to draw the scatter plot. We can also set the size, color, and alpha transparency of the points, and optionally switch the y-axis to a logarithmic scale. Finally, set the title and axis labels. This one function handles the drawing from end to end!
import matplotlib.pyplot as plt
import numpy as np

def scatterplot(x_data, y_data, x_label="", y_label="", title="", color="r", yscale_log=False):
    # Create the plot object
    _, ax = plt.subplots()
    # Plot the data, set the size (s), color and transparency (alpha)
    # of the points
    ax.scatter(x_data, y_data, s=10, color=color, alpha=0.75)
    # Optionally switch the y-axis to a logarithmic scale
    if yscale_log:
        ax.set_yscale('log')
    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
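As a rough illustration of the color grouping and log scale described above, the sketch below draws two groups of points on the same axes. The helper name grouped_scatter, the group labels, and the randomly generated data are all assumptions made for this example; they are not part of the original function.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def grouped_scatter(groups, x_label="", y_label="", title="", yscale_log=False):
    """Hypothetical helper: draw one scatter series per (label, x, y, color) tuple."""
    _, ax = plt.subplots()
    for label, x_data, y_data, color in groups:
        # Same point size and transparency as the scatterplot function above
        ax.scatter(x_data, y_data, s=10, color=color, alpha=0.75, label=label)
    if yscale_log:
        ax.set_yscale("log")
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend()  # the legend maps each color back to its group
    return ax


# Synthetic data, generated only for illustration (all y values are positive,
# which a log-scaled axis requires)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=100)
group_a = ("group A", x, x ** 2 + rng.uniform(0.5, 5, size=100), "r")
group_b = ("group B", x, 10 * x + rng.uniform(0.5, 5, size=100), "b")

ax = grouped_scatter([group_a, group_b], x_label="x", y_label="y",
                     title="Two groups, log-scaled y-axis", yscale_log=True)
```

To look at the result, call plt.show() in an interactive session (after removing the Agg line) or save it with plt.savefig().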
Line chart
When one variable changes strongly as another changes (that is, the two have a high covariance), a line chart is the best way to see the relationship clearly. For example, the figure below shows that the percentage of women earning bachelor's degrees in different majors has changed considerably over time.
If you used a scatter plot here, the data points would cluster together and look cluttered, making it hard to see what the data means. A line chart is a better fit because it captures the overall way the two variables (the proportion of women and time) vary together. As with scatter plots, different colors can be used to distinguish multiple groups of data.
The code is similar to a scatter plot, with only minor parameter changes.
def lineplot(x_data, y_data, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()
    # Plot the data as a line, set the linewidth (lw), color and
    # transparency (alpha) of the line
    ax.plot(x_data, y_data, lw=2, color='#539caf', alpha=1)
    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
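The idea of grouping several data sets by color could look roughly like the sketch below, which draws one line per series on shared axes. The helper name multi_lineplot, the series labels, the second color, and the made-up linear trends are assumptions for illustration only; they are not the data behind the figure discussed above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def multi_lineplot(x_data, series, x_label="", y_label="", title=""):
    """Hypothetical helper: draw one line per (label, y_data, color) tuple."""
    _, ax = plt.subplots()
    for label, y_data, color in series:
        # Same line width and opacity as the lineplot function above
        ax.plot(x_data, y_data, lw=2, color=color, alpha=1, label=label)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend()
    return ax


# Synthetic trends, invented purely for this example
x = np.arange(0, 40)
rising = 20 + 0.8 * x
falling = 60 - 0.3 * x

ax = multi_lineplot(
    x,
    [("series A", rising, "#539caf"), ("series B", falling, "#7663b0")],
    x_label="time step",
    y_label="value",
    title="Two series on one line chart",
)
```

Because both series share the same x-axis, any divergence or crossover between them is visible at a glance, which is harder to read from two overlapping clouds of scatter points.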
Histogram
A histogram is well suited to viewing (or discovering) how data is distributed. The figure below is a histogram of how IQ is distributed across a population. The concentration around the center and the position of the median are clearly visible, and the data follows a roughly normal distribution. A histogram (rather than a scatter plot) clearly shows the relative differences between the frequencies of the different bins. Moreover, binning (discretizing the data) helps reveal the big-picture distribution. Plotting every un-binned data point would introduce a lot of noise and make the true distribution hard to see.
Below is the code for creating a histogram with Matplotlib. There are two parameters to note. The first is n_bins, which controls how finely the data is discretized, that is, how many bins the histogram has. More bins provide more detail but may introduce noise that obscures the overall distribution; fewer bins give a bird's-eye view of the data, so you can grasp the overall picture without unnecessary detail. The second is cumulative, a Boolean that controls whether the histogram is cumulative, that is, whether its shape approximates the probability density function (PDF) or the cumulative distribution function (CDF).
def histogram(data, n_bins, cumulative=False, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()
    # ax.hist() takes the number of bins through its `bins` argument
    ax.hist(data, bins=n_bins, cumulative=cumulative, color='#539caf')
    # Label the axes and provide a title
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
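To make the effect of the two parameters more concrete, here is a minimal sketch that draws the same synthetic sample with few bins, with many bins, and as a normalized cumulative histogram. The sample (1,000 normally distributed values), the bin counts, and the use of density=True are assumptions chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, roughly bell-shaped sample, generated only for illustration
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=1000)

fig, (ax_coarse, ax_fine, ax_cum) = plt.subplots(1, 3, figsize=(12, 3))

# Few bins: a smooth, big-picture view of the distribution
coarse_counts, _, _ = ax_coarse.hist(data, bins=5, color="#539caf")
ax_coarse.set_title("5 bins")

# Many bins: more detail, but also more visible noise
fine_counts, _, _ = ax_fine.hist(data, bins=50, color="#539caf")
ax_fine.set_title("50 bins")

# cumulative=True with density=True gives a running total that ends at 1,
# which has the shape of an empirical CDF
cum_values, _, _ = ax_cum.hist(data, bins=50, cumulative=True, density=True,
                               color="#539caf")
ax_cum.set_title("50 bins, cumulative")
```

Without density=True the bar heights are raw counts, so a cumulative histogram ends at the sample size rather than at 1.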