Data analysis usually begins with visualization.
Data visualization presents analysis results and ideas graphically, making abstract properties of the data, such as the nature or magnitude of the measured quantities, easier to grasp. The matplotlib library used in this chapter is a Python plotting library built on NumPy. It provides both an object-oriented API and a procedural, MATLAB-like API, and the two can be used side by side.
1. Getting started with matplotlib drawing
Code:
import matplotlib.pyplot as plt
import numpy as np
x=np.linspace(0,20) #linspace() returns evenly spaced x values from 0 to 20 (50 points by default)
plt.plot(x,.5+x)
plt.plot(x,1+2*x,'--')
plt.show()
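The introduction mentions that matplotlib also has an object-oriented API. As a rough sketch, the same two lines can be drawn through explicit Figure and Axes objects. The names fig, ax, solid and dashed, and the choice of the Agg backend, are assumptions made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 20)                  # 50 evenly spaced points by default
fig, ax = plt.subplots()                # one Figure containing one Axes
solid, = ax.plot(x, .5 + x)             # plot() returns a list of Line2D objects
dashed, = ax.plot(x, 1 + 2 * x, '--')   # same data as the pyplot version, dashed style
# plt.show()  # uncomment (and remove the Agg line) to open a window
```

Here the Axes object owns the lines, so later changes (labels, limits, styles) are made by calling methods on ax rather than on the pyplot module.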
2. Logarithmic graph
A logarithmic graph is a graph drawn on logarithmic axes. On a logarithmic scale, equal distances along the axis correspond to equal ratios (orders of magnitude) rather than equal differences, which is very different from a linear scale. There are two kinds of logarithmic graph. A log-log (double logarithmic) graph uses logarithmic scales on both axes; the corresponding matplotlib function is matplotlib.pyplot.loglog(). A semi-logarithmic graph uses a linear scale on one axis and a logarithmic scale on the other; the corresponding matplotlib functions are semilogx() and semilogy(). On a log-log graph a power law appears as a straight line; on a semi-logarithmic graph a straight line represents an exponential law.
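The straight-line claims can be checked numerically without plotting anything. The sketch below uses two made-up curves (the constants 3, 2 and 0.5 are hypothetical): taking logarithms of both variables turns the power law into a line, and taking the logarithm of y alone does the same for the exponential.

```python
import numpy as np

x = np.linspace(1, 10, 50)        # strictly positive, so log(x) is defined
power = 3 * x ** 2                # hypothetical power law y = a * x**k
expo = 3 * np.exp(0.5 * x)        # hypothetical exponential y = a * exp(b * x)

# Power law: log(y) = log(a) + k*log(x), a line in log-log coordinates.
k, log_a = np.polyfit(np.log(x), np.log(power), deg=1)
# Exponential: log(y) = log(a) + b*x, a line in semi-log coordinates.
b, log_a2 = np.polyfit(x, np.log(expo), deg=1)
print(k, b)
```

The fitted slopes recover the exponent k and the rate b up to floating-point error, which is why loglog() and semilogy() are the natural tools for these two kinds of data.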
Moore's Law roughly states that the number of transistors on integrated circuits doubles every two years. The page https://en.wikipedia.org/wiki/Transistor_count#Microprocessors has a table recording the number of transistors on microprocessors in different years. We saved these data as a CSV file named transcount.csv, which contains only the transistor counts and the years.
Code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df=pd.read_csv(r'H:\Python\data\transcount.csv') #Raw string, so the backslashes in the Windows path are not treated as escapes
df=df.groupby('year').mean() #Group by year and average the counts within each year
years=df.index.values #All years
counts=df['trans_count'].values
poly=np.polyfit(years,np.log(counts),deg=1) #Fit a straight line to log(counts)
print("poly:",poly)
plt.semilogy(years,counts,'o')
plt.semilogy(years,np.exp(np.polyval(poly,years))) #polyval evaluates the fitted polynomial
plt.show()
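The slope returned by polyfit can be turned into a doubling time, which is how the fit relates back to Moore's Law. The sketch below is only an illustration: it uses synthetic counts that double exactly every two years (the starting value 2300 and the year range are made up) instead of transcount.csv.

```python
import numpy as np

years = np.arange(1971, 2012)
# Synthetic data, not the Wikipedia table: the count doubles every two years.
counts = 2300.0 * 2 ** ((years - 1971) / 2.0)

# Same fit as above: a straight line through log(counts) against the year.
slope, intercept = np.polyfit(years, np.log(counts), deg=1)
# log(counts) grows by `slope` per year, so counts double every log(2)/slope years.
doubling_time = np.log(2) / slope
print(doubling_time)
```

With real data the points scatter around the line, so the doubling time obtained this way is an estimate rather than an exact value.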
3. Scatter chart
A scatter plot shows the relationship between two variables in a Cartesian coordinate system: the position of each data point is given by the values of the two variables. A bubble chart extends the scatter plot. Each data point is drawn as a bubble, hence the name, and the value of a third variable determines the relative size of the bubble.
The page https://en.wikipedia.org/wiki/Transistor_count#GPU has a table of GPU transistor counts. We create a new file, gpu_transcount.csv, containing the years and transistor counts from that table, and draw a scatter plot with the scatter() function provided by the matplotlib API.
Code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df=pd.read_csv(r'H:\Python\data\transcount.csv')
df=df.groupby('year').mean()
gpu=pd.read_csv(r'H:\Python\data\gpu_transcount.csv')
gpu=gpu.groupby('year').mean()
df=pd.merge(df,gpu,how='outer',left_index=True,right_index=True) #Outer join on the year index
df=df.replace(np.nan,0) #Years missing from one table get a count of 0
print(df)
years=df.index.values
counts=df['trans_count'].values
gpu_counts=df['gpu_counts'].values
cnt_log=np.log(counts)
plt.scatter(years,cnt_log,c=200*years,s=20+200*gpu_counts/gpu_counts.max(),alpha=0.5) #c sets the marker colors, s the marker sizes (scalar or array)
plt.show()
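The merge step is easier to follow on a tiny example. The sketch below uses two hypothetical miniature tables (the years and counts are invented) indexed by year, like the grouped frames above, and shows where the missing values come from before they are replaced by 0.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature tables, indexed by year like the grouped frames above.
cpu = pd.DataFrame({'trans_count': [1000.0, 4000.0]}, index=[1999, 2000])
gpu = pd.DataFrame({'gpu_counts': [500.0, 900.0]}, index=[2000, 2001])

# An outer join keeps every year found in either table;
# a year missing from one table gets NaN in that table's column.
merged = pd.merge(cpu, gpu, how='outer', left_index=True, right_index=True)
filled = merged.replace(np.nan, 0)   # same replacement as in the code above
print(filled)
```

One consequence worth keeping in mind: np.log(0) is -inf, so any year whose trans_count was filled with 0 gives an infinite value in cnt_log.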
4. Legends and annotations
Legends and annotations are indispensable for an eye-catching plot. A data graph is normally accompanied by the following auxiliary information.
A legend describing each data series in the figure. The legend() function provided by matplotlib attaches the corresponding label to each data series.
Annotations on the key points in the figure. You can use the annotate() function provided by matplotlib.
Labels for the horizontal and vertical axes, which can be drawn with xlabel() and ylabel().
A descriptive title, usually provided by matplotlib's title() function.
A grid, which makes it easier to locate data points. The grid() function turns the grid on or off.
Code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df=pd.read_csv(r'H:\Python\data\transcount.csv')
df=df.groupby('year').mean()
gpu=pd.read_csv(r'H:\Python\data\gpu_transcount.csv')
gpu=gpu.groupby('year').mean()
df=pd.merge(df,gpu,how='outer',left_index=True,right_index=True)
df=df.replace(np.nan,0)
years=df.index.values
counts=df['trans_count'].values
gpu_counts=df['gpu_counts'].values
poly=np.polyfit(years,np.log(counts),deg=1)
plt.plot(years,np.polyval(poly,years),label='Fit') #The fit is drawn in log space, matching the log-scaled scatter below
gpu_start=gpu.index.values.min() #First year with GPU data
y_ann=np.log(df.at[gpu_start,'trans_count'])
ann_str="First GPU\n %d"%gpu_start
plt.annotate(ann_str,xy=(gpu_start,y_ann),arrowprops=dict(arrowstyle="->"),xytext=(-30,+70),textcoords='offset points')
cnt_log=np.log(counts)
plt.scatter(years,cnt_log,c=200*years,s=20+200*gpu_counts/gpu_counts.max(),alpha=0.5,label="Scatter") #c sets the marker colors, s the marker sizes (scalar or array)
plt.legend(loc="upper left")
plt.grid()
plt.xlabel("Year")
plt.ylabel("Log Transistor Counts",fontsize=16)
plt.title("Moore's Law & Transistor Counts")
plt.show()
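The listing above needs both CSV files. For readers who do not have them, here is a self-contained sketch of the same auxiliary elements on synthetic data, written with the object-oriented counterparts of the pyplot calls (set_xlabel() for xlabel(), and so on). The data, the annotated year 1999 and the text strings are all invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(1971, 2012)
# Synthetic straight line in log space (a count that doubles every two years).
log_counts = np.log(2300.0) + np.log(2) / 2 * (years - 1971)

fig, ax = plt.subplots()
ax.plot(years, log_counts, label='Fit')          # the label is picked up by legend()
y_ann = log_counts[years == 1999][0]             # height of the annotated point
ann = ax.annotate("Example year\n %d" % 1999, xy=(1999, y_ann),
                  arrowprops=dict(arrowstyle="->"),
                  xytext=(-30, +70), textcoords='offset points')
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("Year")
ax.set_ylabel("Log Transistor Counts", fontsize=16)
ax.set_title("Synthetic example")
# plt.show()  # uncomment (and remove the Agg line) to open a window
```

The xytext offset is given in points relative to xy, so the label stays at the same distance from the arrow tip regardless of the axis limits.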
5. Three-dimensional diagram
Axes3D is a class provided by matplotlib for drawing three-dimensional graphs. Looking at how this class works is a good way to understand the object-oriented matplotlib API. matplotlib's Figure class is the top-level container that holds all the elements of a plot.
Code:
from mpl_toolkits.mplot3d import Axes3D #Registers the 3D projection on older matplotlib versions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df=pd.read_csv(r'H:\Python\data\transcount.csv')
df=df.groupby('year').mean()
gpu=pd.read_csv(r'H:\Python\data\gpu_transcount.csv')
gpu=gpu.groupby('year').mean()
df=pd.merge(df,gpu,how='outer',left_index=True,right_index=True)
df=df.replace(np.nan,0)
fig=plt.figure()
ax=fig.add_subplot(111,projection='3d') #Axes3D(fig) no longer attaches the axes to the figure in recent matplotlib versions
X=df.index.values
Y=np.log(df['trans_count'].values)
X,Y=np.meshgrid(X,Y)
Z=np.log(df['gpu_counts'].values)
Z=np.broadcast_to(Z,X.shape) #plot_surface needs a 2D Z; repeat the row to match the grid
ax.plot_surface(X,Y,Z)
ax.set_xlabel('Year')
ax.set_ylabel('Log CPU transistor counts')
ax.set_zlabel('Log GPU transistor counts')
ax.set_title("Moore's Law & Transistor counts")
plt.show()
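To see how the pieces fit together without the CSV files, the following sketch draws a surface from a made-up function (X**2 + Y**2 is an arbitrary choice). It shows the shapes produced by meshgrid() and that the 3D axes end up inside the Figure container.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so the sketch runs without a display
from mpl_toolkits.mplot3d import Axes3D  # only needed to register '3d' on older matplotlib versions
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2, 2, 30)
y = np.linspace(-1, 1, 20)
X, Y = np.meshgrid(x, y)      # both arrays have shape (len(y), len(x)) = (20, 30)
Z = X ** 2 + Y ** 2           # one height per grid point, same shape as X and Y

fig = plt.figure()                           # top-level container
ax = fig.add_subplot(111, projection='3d')   # 3D axes created by, and stored in, the figure
ax.plot_surface(X, Y, Z)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
# plt.show()  # uncomment (and remove the Agg line) to open a window
```

Because the axes are created through the figure, fig.axes lists them, which is the container relationship described at the start of this section.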