"Python for Data Analysis" reading notes, Chapter 8: Plotting and Visualization

Source: Internet
Author: User


Python has many visualization tools; this book mainly covers matplotlib. Matplotlib is a desktop plotting package for creating publication-quality charts (mainly 2D). Its goal is to provide a MATLAB-style plotting interface for Python. Most of the figures in the book were generated with it. Besides displaying figures on screen, you can save them as PDF, SVG, JPG, PNG, GIF, and other formats.

1. Getting Started with Matplotlib API

In IPython, an open figure window can be closed with close().

Figure and subplot

Matplotlib plots live inside a Figure object. A new figure is created with plt.figure.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Numeric arguments marked "assumed" were lost from the original listing.
    # plt.plot(np.arange(10))  # 10 assumed
    fig = plt.figure()
    # plt.show()
    # plt.figure has several options; the most important is figsize, which
    # fixes the size and aspect ratio of the figure when it is saved to disk.
    # plt.gcf() returns a reference to the current figure.
    ax1 = fig.add_subplot(2, 2, 1)
    ax2 = fig.add_subplot(2, 2, 2)
    ax3 = fig.add_subplot(2, 2, 3)
    plt.plot(np.random.randn(50).cumsum(), 'k--')  # 50 assumed
    # fig.add_subplot returns an AxesSubplot object, so calls like these work:
    _ = ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)  # 100 assumed
    ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))  # factor 3 assumed
    plt.show()

Because creating a figure with several subplots is such a common task, there is a more convenient way, plt.subplots, which creates a new figure and returns a NumPy array containing the created subplot objects:

    fig, axes = plt.subplots(2, 3)
    # print(fig)
    print(axes[0][0])
    # axes[0][0].hist(np.random.randn(100), bins=20, color='k', alpha=0.3)  # 100 assumed
    plt.show()

This is very practical because the axes array can be indexed like a two-dimensional array, for example axes[0, 1]. You can also use sharex and sharey to make the subplots share the same x-axis or y-axis. This is useful when comparing data on the same scale; otherwise matplotlib scales the limits of each chart independently.

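[Figure omitted: the grid of subplots produced by plt.subplots.]

The options of pyplot.subplots are nrows, ncols, sharex, sharey, subplot_kw, and **fig_kw.

Additional keyword arguments (**fig_kw) are passed on to the call that creates the figure, for example figsize; see the documentation for more.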
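As a minimal sketch of how the array returned by plt.subplots can be used: the Agg backend, the 2x3 grid, the random data, and the figsize value are choices made here for illustration, not taken from the notes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# figsize is not a subplots option itself; it travels through **fig_kw
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(9, 6))

# axes is a 2-D NumPy array of subplot objects, indexed [row, column]
rng = np.random.default_rng(0)
for row in range(2):
    for col in range(3):
        axes[row, col].hist(rng.standard_normal(100), bins=20, color="k", alpha=0.3)

grid_shape = axes.shape
# with sharex=True every subplot reports the same x limits
shared_xlims = {ax.get_xlim() for ax in axes.flat}
figure_size = tuple(fig.get_size_inches())
```

With sharex and sharey turned off, each subplot would instead pick its own limits from its own data.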

Adjusting the spacing around subplots

By default, matplotlib leaves a margin around the outside of the subplots and some space between them. The spacing is relative to the height and width of the figure and adjusts automatically. It can be changed with the Figure's subplots_adjust method, which is also available as a top-level function (plt.subplots_adjust).

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Signature:
    # subplots_adjust(left=None, bottom=None, right=None, top=None,
    #                 wspace=None, hspace=None)
    # wspace and hspace control the percentage of the figure width and height
    # used as spacing between subplots. An example:
    fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
    for i in range(2):
        for j in range(2):
            # sample size lost in the original listing; 500 assumed
            ax[i, j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
    plt.subplots_adjust(wspace=0.5, hspace=0.5)
    plt.show()
    # matplotlib does not check whether the labels overlap (this is indeed the case).

The next listing covers colors, markers, and line styles:

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(2, 2)
    # '#CECECE' is a light gray, close to white
    ax[0, 0].plot(np.arange(10), linestyle='--', color='#CECECE')  # 10 assumed
    # Markers can be added to a line to emphasize the actual data points.
    # Because matplotlib draws a continuous line, it may be hard to see where
    # the real points are. The marker can go in the format string, but the
    # marker type and line style must come after the color.
    ax[0, 1].plot(np.random.randn(30).cumsum(), 'ko--')
    ax[1, 0].plot(np.random.randn(30).cumsum(), color='k', linestyle='--', marker='o')  # 30 assumed
    # In a line plot, the values between actual data points are linearly
    # interpolated by default; the drawstyle option changes this.
    data = np.random.randn(30).cumsum()  # 30 assumed
    ax[1, 1].plot(data, 'ko--')
    ax[1, 1].plot(data, 'k--', drawstyle='steps-post')
    plt.show()

Note that the drawstyle option above specifies how consecutive points are connected, that is, the interpolation method.
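A small sketch of the two ideas above, the format string and drawstyle, under the assumption of a headless Agg backend and a made-up four-point series. It only inspects the resulting line objects rather than showing the figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

data = np.array([0.0, 1.0, 0.5, 2.0])  # made-up values for illustration

fig, ax = plt.subplots()
# 'ko--' means: color k (black), marker o, line style --
(linear_line,) = ax.plot(data, "ko--", label="default")
# same data, but held constant after each point until the next one
(step_line,) = ax.plot(data, "k-", drawstyle="steps-post", label="steps-post")
ax.legend(loc="best")

parsed_format = (linear_line.get_color(), linear_line.get_marker(), linear_line.get_linestyle())
linear_style = linear_line.get_drawstyle()
step_style = step_line.get_drawstyle()
```

The format string is only shorthand; the same line can be described with the explicit color, marker, and linestyle keywords.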

Setting the title, axis labels, ticks, and tick labels

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy.random as npr

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    # sample size lost in the original listing; 1000 assumed (matches the ticks)
    ax.plot(npr.randn(1000).cumsum())
    # The simplest way to change the x ticks is set_xticks and set_xticklabels.
    # The former tells matplotlib where to place the ticks within the data
    # range; by default those positions are also used as the tick labels.
    # set_xticklabels lets you use any other values as labels.
    ticks = ax.set_xticks([0, 250, 500, 700, 900, 1000])
    # rotation sets the rotation angle of the labels
    labels = ax.set_xticklabels(['a', 'b', 'c', 'd', 'e', 'f'],
                                rotation=30, fontsize='small')
    # Give the x axis a name
    ax.set_xlabel('Stages')
    plt.show()
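The tick and label setters can be checked without opening a window. This is a sketch only: the Agg backend, the straight-line data, the label strings, and the title text are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(1000))

# positions first, then one label per position
ax.set_xticks([0, 250, 500, 750, 1000])
ax.set_xticklabels(["one", "two", "three", "four", "five"], rotation=30, fontsize="small")
ax.set_title("A hypothetical title")
ax.set_xlabel("Stages")

tick_positions = [float(t) for t in ax.get_xticks()]
x_label = ax.get_xlabel()
title = ax.get_title()
```

The y axis has the matching methods set_yticks, set_yticklabels, and set_ylabel.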


    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy.random as npr
    from datetime import datetime

    # Add a legend: pass label when plotting, then call ax.legend
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    # sample sizes lost in the original listing; 1000 assumed
    ax.plot(npr.randn(1000).cumsum(), 'k', label='one')
    ax.plot(npr.randn(1000).cumsum(), 'k--', label='two')
    ax.plot(npr.randn(1000), 'k.', label='three')
    ax.legend(loc='best')
    plt.show()
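A minimal sketch of how the label arguments end up in the legend, assuming a headless Agg backend and made-up random data:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots()
ax.plot(rng.standard_normal(100).cumsum(), "k", label="one")
ax.plot(rng.standard_normal(100).cumsum(), "k--", label="two")
ax.plot(rng.standard_normal(100).cumsum(), "k.", label="three")

# loc='best' lets matplotlib choose the corner that hides the least data
legend = ax.legend(loc="best")
legend_labels = [text.get_text() for text in legend.get_texts()]
```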

Annotations and drawings

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy.random as npr
    from datetime import datetime

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    data = pd.read_csv('e:\\spx.csv', index_col=0, parse_dates=True)
    spx = data['SPX']
    spx.plot(ax=ax, style='k')
    crisis_data = [
        (datetime(2007, 10, 11), 'Peak of bull market'),
        (datetime(2008, 3, 12), 'Bear Stearns fails'),
        (datetime(2008, 9, 15), 'Lehman bankruptcy'),
    ]
    for date, label in crisis_data:
        # the two vertical offsets were lost in the original listing;
        # 50 and 200 are assumed
        ax.annotate(label, xy=(date, spx.asof(date) + 50),
                    xytext=(date, spx.asof(date) + 200),
                    arrowprops=dict(facecolor='black'),
                    horizontalalignment='left', verticalalignment='top')
    ax.set_xlim(['1/1/2007', '1/1/2011'])
    ax.set_ylim([600, 1800])
    ax.set_title('Important dates in 2008-2009 financial crisis')
    plt.show()
    # See the documentation for more annotation examples.

Drawing shapes takes a little more work. Matplotlib provides objects for common shapes, called patches, such as Rectangle and Circle; the full set lives in matplotlib.patches. To draw a shape, create the patch object shp and add it to a subplot with ax.add_patch(shp):

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
    circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
    pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]], color='g', alpha=0.5)
    ax.add_patch(rect)
    ax.add_patch(circ)
    ax.add_patch(pgon)
    plt.show()
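The same shapes can be built directly from matplotlib.patches, which is where the full set of patch classes is defined. A sketch, assuming a headless Agg backend; the annotation text and its coordinates are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Polygon, Rectangle

fig, ax = plt.subplots()

# Rectangle takes (x, y) of the lower-left corner, then width and height
rect = Rectangle((0.2, 0.75), 0.4, 0.15, color="k", alpha=0.3)
# Circle takes the center and the radius
circ = Circle((0.7, 0.2), 0.15, color="b", alpha=0.3)
# Polygon takes a list of vertices
pgon = Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]], color="g", alpha=0.5)
for shape in (rect, circ, pgon):
    ax.add_patch(shape)

# an annotation with an arrow pointing at the circle's center
note = ax.annotate("circle center", xy=(0.7, 0.2), xytext=(0.3, 0.45),
                   arrowprops=dict(facecolor="black"))

patch_count = len(ax.patches)
rect_size = (rect.get_width(), rect.get_height())
```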

Save a chart to a file

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy.random as npr
    from datetime import datetime
    from io import BytesIO

    # Save a chart to a file.
    # savefig saves the figure; the file extension determines the format.
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
    circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
    pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]], color='g', alpha=0.5)
    ax.add_patch(rect)
    ax.add_patch(circ)
    ax.add_patch(pgon)
    # Note dpi (dots per inch) and bbox_inches (which can trim the
    # whitespace around the figure):
    # plt.savefig('pic.jpg', dpi=100, bbox_inches='tight')
    # The figure does not have to go to a file on disk; it can be written to
    # any file-like object. The original notes used StringIO (Python 2);
    # on Python 3 image data is binary, so BytesIO is used here.
    buffer = BytesIO()
    plt.savefig(buffer)
    plot_data = buffer.getvalue()
    # This is useful for serving dynamically generated images on the web.
    # plt.show()
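One way to confirm that an in-memory save really produced an image is to look at the first bytes of the buffer. A sketch assuming Python 3, a headless Agg backend, and PNG output requested explicitly:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
from io import BytesIO

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

buffer = BytesIO()
# format is given explicitly because there is no file extension to infer it from
fig.savefig(buffer, format="png", dpi=100, bbox_inches="tight")
plot_data = buffer.getvalue()

# every PNG file starts with this fixed 8-byte signature
is_png = plot_data[:8] == b"\x89PNG\r\n\x1a\n"
```

The bytes in plot_data could then be returned from a web handler with an image/png content type.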

Some options for savefig: fname, dpi, facecolor and edgecolor, format, and bbox_inches.

Matplotlib Configuration

Many matplotlib properties can be configured, such as figure size, subplot margins, color scheme, font size, grid style, and so on. There are two ways of doing this. The first is programmatic, from Python, using the rc method. For example:

    plt.rc('figure', figsize=(10, 10))

The first argument to rc is the component you want to customize, such as 'figure', 'axes', 'xtick', 'ytick', 'grid', 'legend', and so on. It is followed by a series of keyword arguments. The simplest approach is to write the options as a dictionary:

    # Note: newer matplotlib versions expect a numeric point size for
    # font.size, so 'small' may be rejected there; a number such as 10
    # is a safe substitute.
    font_options = {'family': 'monospace',
                    'weight': 'bold',
                    'size': 'small'}
    plt.rc('font', **font_options)

The second way is the matplotlibrc configuration file: the parameters set in it are applied every time matplotlib is loaded.
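The effect of rc can be read back from plt.rcParams, the dictionary that holds the current settings. A sketch, assuming a headless Agg backend and a numeric font size (10 is an arbitrary choice made here); the defaults are restored at the end so the change does not leak into other code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt

plt.rc("figure", figsize=(10, 10))

font_options = {"family": "monospace", "weight": "bold", "size": 10}
plt.rc("font", **font_options)

# rc('group', key=value) is shorthand for rcParams['group.key'] = value
figsize_setting = list(plt.rcParams["figure.figsize"])
font_family = list(plt.rcParams["font.family"])
font_weight = plt.rcParams["font.weight"]
font_size = plt.rcParams["font.size"]

plt.rcdefaults()  # restore matplotlib's built-in defaults
```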

2. Plotting functions in pandas

Matplotlib is a fairly low-level tool: a chart is assembled from its basic components, namely the data display (line chart, histogram, and so on), legend, title, tick labels, and annotations, so creating a chart typically involves several objects. pandas saves much of this trouble. It builds on the structure of DataFrame objects (row and column labels) to provide high-level plotting methods for standard charts. The author notes that the pandas online documentation is the best learning resource, since the book's coverage may become outdated.

Line chart

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas import Series, DataFrame

    s = Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
    # The index of the Series is passed to matplotlib and used for the x axis.
    # use_index=False disables this.
    s.plot(use_index=False)
    # The x ticks and limits can be adjusted with the xticks and xlim options,
    # the y axis with yticks and ylim.
    plt.show()
    # Most pandas plotting methods take an optional ax parameter, which can be
    # a subplot object. This gives more flexible control over where the plot
    # goes in a grid of subplots.
    # DataFrame's plot method draws each column as a line in a single subplot
    # and adds a legend automatically.
    df = DataFrame(np.random.randn(10, 4).cumsum(0),
                   columns=['A', 'B', 'C', 'D'],
                   index=np.arange(0, 100, 10))
    df.plot()
    plt.show()
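The ax parameter mentioned in the comments can be sketched as follows: a Series and a DataFrame are drawn into two chosen subplots of one figure. The Agg backend, the two-row layout, and the random data are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = np.arange(0, 100, 10)
s = pd.Series(rng.standard_normal(10).cumsum(), index=index)
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(axis=0),
                  columns=["A", "B", "C", "D"], index=index)

fig, axes = plt.subplots(2, 1)
s.plot(ax=axes[0], style="k--")  # the Series goes into the top subplot
df.plot(ax=axes[1])              # one line per column, legend added automatically

series_line_count = len(axes[0].get_lines())
frame_line_count = len(axes[1].get_lines())
legend_labels = [t.get_text() for t in axes[1].get_legend().get_texts()]
```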

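The parameters of Series.plot include label, ax, style, alpha, kind, logy, use_index, rot, xticks, yticks, xlim, ylim, and grid.

DataFrame.plot has some additional parameters for handling columns: subplots, sharex, sharey, figsize, title, and legend.

The following sections cover some specific chart types; they can be compared with plotting in the R language: http://www.cnblogs.com/batteryhp/p/4733474.html.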

Bar chart

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas import Series, DataFrame

    # Adding kind='bar' (vertical bars) or kind='barh' (horizontal bars) to
    # the line-plot code produces a bar chart.
    # The index of the Series or DataFrame is used as the x (bar) or
    # y (barh) ticks.
    fig, axes = plt.subplots(2, 1)
    data = Series(np.random.randn(16), index=list('abcdefghijklmnop'))
    data.plot(kind='barh', ax=axes[0], color='k', alpha=0.7)
    data.plot(kind='bar', ax=axes[1], color='k', alpha=0.7)
    # A DataFrame groups the values of each row together
    df = DataFrame(np.random.randn(6, 4),
                   index=['one', 'two', 'three', 'four', 'five', 'six'],
                   columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
    # Note that the name here is used as the legend title, because it is
    # the name of the columns.
    print(df)
    df.plot(kind='bar')
    plt.show()
    # stacked=True draws a stacked bar chart
    df.plot(kind='bar', stacked=True, alpha=0.5)
    plt.show()
    # value_counts can be used to show how often each value occurs in a
    # Series (confirmed by trying it)
    s = Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 5])
    s.value_counts().plot(kind='bar')
    plt.show()

Let's look at an example:

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas import Series, DataFrame

    # An example: a stacked bar chart showing, for each day, the share of
    # data points belonging to each party size.
    tips = pd.read_csv('e:\\tips.csv')
    # tips['size'] rather than tips.size: the attribute DataFrame.size is
    # the number of elements, not the column.
    party_counts = pd.crosstab(tips.day, tips['size'])
    print(party_counts)
    # .ix has been removed from pandas; .loc selects the same labels
    party_counts = party_counts.loc[:, 2:5]
    # Then normalize so that each row sums to 1
    party_pcts = party_counts.div(party_counts.sum(1).astype(float), axis=0)
    print(party_pcts)
    party_pcts.plot(kind='bar', stacked=True)
    plt.show()
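The row normalization is easy to check on a tiny table. In this sketch the six rows are made up and merely stand in for tips.csv; it also shows why the column has to be selected as tips['size'].

```python
import pandas as pd

# made-up stand-in for tips.csv with only the two columns used here
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "size": [2, 3, 2, 2, 4, 2],
})

# tips.size is the number of cells in the frame, not the 'size' column
cell_count = tips.size

party_counts = pd.crosstab(tips["day"], tips["size"])
# divide every row by its own total so that each row sums to 1
party_pcts = party_counts.div(party_counts.sum(axis=1).astype(float), axis=0)

row_sums = party_pcts.sum(axis=1).round(12).tolist()
friday_share_of_two = party_pcts.loc["Fri", 2]
```

Passing axis=0 to div is what aligns the row totals with the row index; without it pandas would try to match the totals against the column labels.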

Histograms and density plots

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas import Series, DataFrame

    # Draw a histogram of the tip percentage
    tips = pd.read_csv('e:\\tips.csv')
    tips['tip_pct'] = tips['tip'] / tips['total_bill']
    # bins specifies the number of groups (value lost in the original
    # listing; 50 assumed)
    tips['tip_pct'].hist(bins=50)
    plt.show()
    # Related to this is the density plot, which is produced by computing
    # "an estimate of a continuous probability distribution that might have
    # generated the observed data". The usual procedure is to approximate
    # this distribution as a mixture of kernels (simpler distributions such
    # as the normal). Such a density plot is called a KDE plot: kind='kde'.
    tips['tip_pct'].plot(kind='kde')
    plt.show()
    # Histograms and density plots often appear together.
    # The two sample sizes were lost in the original listing; 200 assumed.
    comp1 = np.random.normal(0, 1, size=200)
    comp2 = np.random.normal(10, 2, size=200)
    values = Series(np.concatenate([comp1, comp2]))
    print(values)
    # density=True replaces the normed=True option of older matplotlib
    values.hist(bins=100, alpha=0.3, color='k', density=True)
    values.plot(kind='kde', style='k--')
    plt.show()
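A normalized histogram can share an axis with a density curve because its bars are rescaled so that their total area is 1. A sketch of that property, assuming a headless Agg backend and made-up bimodal data; the KDE curve itself is left out here because it needs SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 2, 200)])

fig, ax = plt.subplots()
# density=True rescales bar heights so that sum(height * bin width) == 1
heights, edges, _ = ax.hist(values, bins=100, density=True, alpha=0.3, color="k")

bin_widths = np.diff(edges)
total_area = float(np.sum(heights * bin_widths))
```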

Scatter plots

A scatter plot is an effective way to examine the relationship between two one-dimensional data series. The scatter method in matplotlib is the main way to draw one.

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas import Series, DataFrame

    # Load the macrodata dataset, select a few columns, and compute the
    # log differences.
    macro = pd.read_csv('e:\\macrodata.csv')
    data = macro[['cpi', 'm1', 'tbilrate', 'unemp']]
    # diff computes the difference between adjacent values: within each
    # column, each value minus the one before it.
    trans_data = np.log(data).diff().dropna()
    # print(np.log(data).head())
    # print(np.log(data).diff().head())
    print(trans_data.head())
    plt.scatter(trans_data['m1'], trans_data['unemp'])
    plt.title('Changes in log %s vs. log %s' % ('m1', 'unemp'))
    plt.show()
    # A scatter plot matrix is very informative; pandas provides the
    # scatter_matrix function to create one (pd.plotting.scatter_matrix in
    # current pandas). The diagonal parameter chooses what is drawn on the
    # diagonal, where a variable plotted against itself would only give a
    # straight line: diagonal='kde' draws a density plot, diagonal='hist'
    # a histogram.
    pd.plotting.scatter_matrix(trans_data, diagonal='kde', color='k', alpha=0.3)
    pd.plotting.scatter_matrix(trans_data, diagonal='hist', color='k', alpha=0.3)
    plt.show()
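The log-difference step can be followed on a three-row table. The numbers below are made up, and the column names only mimic two of the macrodata columns.

```python
import numpy as np
import pandas as pd

# made-up values: m1 grows by 10% each period, unemp stays flat and then drops
data = pd.DataFrame({"m1": [100.0, 110.0, 121.0],
                     "unemp": [5.0, 5.0, 4.0]})

# log first, then row-to-row difference; the first row has no predecessor,
# so diff() leaves NaN there and dropna() removes it
trans_data = np.log(data).diff().dropna()

row_count = len(trans_data)
m1_changes = trans_data["m1"].to_numpy()
first_unemp_change = trans_data["unemp"].iloc[0]
```

A log difference is the log of the ratio between consecutive values, which is why a steady 10% growth shows up as the constant log(1.1).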

3. Plotting maps: visualizing Haiti earthquake crisis data

This is an example.

    # -*- encoding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas import Series, DataFrame
    from mpl_toolkits.basemap import Basemap

    # The following example is more comprehensive
    data = pd.read_csv('e:\\haiti.csv')
    # print(data)
    # Date, latitude, and longitude:
    # print(data[['INCIDENT DATE', 'LATITUDE', 'LONGITUDE']][:10])
    # print(data['CATEGORY'][:6])  # these are the message types
    # The data is likely to contain outliers and missing values; take a look:
    # print(data.describe())
    # Removing bad locations and rows without category information is
    # "a simple matter". The two latitude bounds were lost in the original
    # listing; 18 and 20 are assumed.
    data = data[(data.LATITUDE > 18) & (data.LATITUDE < 20) &
                (data.LONGITUDE > -75) & (data.LONGITUDE < -70) &
                data.CATEGORY.notnull()]

    # We want to analyze or plot the data by category, but each CATEGORY
    # field may contain several categories. In addition, each category has
    # not only a code but also an English (and possibly French) name, so the
    # data has to be normalized. Below are three functions: one to split a
    # field into a list of categories, one to get the list of all categories,
    # and one to split a category into its code and English name.
    # strip removes whitespace such as '\n'; note the author's use of
    # generator expressions.
    def to_cat_list(catstr):
        stripped = (x.strip() for x in catstr.split(','))
        return [x for x in stripped if x]

    def get_all_categories(cat_series):
        cat_sets = (set(to_cat_list(x)) for x in cat_series)
        return sorted(set.union(*cat_sets))

    def get_english(cat):
        code, names = cat.split('.')
        if '|' in names:
            names = names.split(' | ')[1]
        return code, names.strip()

    # A quick test:
    # print(get_english('2. Urgences logistiques | Vital Lines'))
    # Next, build a dictionary mapping codes to names, because the codes
    # are what the analysis will use. First collect all categories:
    all_cats = get_all_categories(data.CATEGORY)
    # print(data.CATEGORY[:10])
    # print(all_cats)
    # Build the dictionary from a generator expression
    english_mapping = dict(get_english(x) for x in all_cats)
    # print(english_mapping['2a'])
    # print(english_mapping['6c'])

    # There are many ways to select records by category. One is to add
    # indicator (dummy) columns, one per category. To do that, first extract
    # the unique category codes and build a DataFrame of zeros (columns are
    # the category codes, index is the same as that of data).
    def get_code(seq):
        return [x.split('.')[0] for x in seq if x]

    # Extract all the keys
    all_codes = get_code(all_cats)
    # print(all_codes)
    code_index = pd.Index(np.unique(all_codes))
    # print(code_index)
    dummy_frame = DataFrame(np.zeros((len(data), len(code_index))),
                            index=data.index, columns=code_index)
    # print(len(data))
    # print(dummy_frame.iloc[:, :6])
    # Set the appropriate entries of each row to 1, then join with data
    # (.loc replaces the removed .ix indexer):
    for row, cat in zip(data.index, data.CATEGORY):
        codes = get_code(to_cat_list(cat))
        dummy_frame.loc[row, codes] = 1
    # Add a prefix and merge
    data = data.join(dummy_frame.add_prefix('category_'))
    # print(data)

    # Now for the plotting: we want to draw the data on a map of Haiti.
    # The basemap toolkit is a matplotlib add-on that makes it possible to
    # draw 2D data on a map from Python. It provides a number of different
    # map projections and a way to convert longitude and latitude into
    # coordinates of a two-dimensional matplotlib plot.
    # "After some trial and error", the author wrote the following function
    # to draw a simple black-and-white map.
    def basic_haiti_map(ax=None, lllat=17.25, urlat=20.25, lllon=-75, urlon=-71):
        # Create a Basemap instance with a polar stereographic projection
        m = Basemap(ax=ax, projection='stere',
                    lon_0=(urlon + lllon) / 2,
                    lat_0=(urlat + lllat) / 2,
                    llcrnrlat=lllat, urcrnrlat=urlat,
                    llcrnrlon=lllon, urcrnrlon=urlon,
                    resolution='f')

Because installing GEOS on Windows did not succeed, the rest of this part will be written after it has been installed on Ubuntu.
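The category handling does not depend on basemap, so it can be tried on its own. In this sketch the two CATEGORY strings are made up to imitate the format of haiti.csv, and the helper functions are restated so the block runs by itself.

```python
import numpy as np
import pandas as pd

def to_cat_list(catstr):
    # split one CATEGORY field on commas, dropping empty pieces
    stripped = (x.strip() for x in catstr.split(","))
    return [x for x in stripped if x]

def get_english(cat):
    # '2. French name | English name' -> ('2', 'English name')
    code, names = cat.split(".")
    if "|" in names:
        names = names.split(" | ")[1]
    return code, names.strip()

def get_code(seq):
    return [x.split(".")[0] for x in seq if x]

# made-up sample in the style of the CATEGORY column
categories = pd.Series(["1. Urgences | Emergency, 3. Public Health, ",
                        "2. Urgences logistiques | Vital Lines"])

all_cats = sorted(set.union(*(set(to_cat_list(x)) for x in categories)))
english_mapping = dict(get_english(x) for x in all_cats)

# one indicator column per category code, one row per record
code_index = pd.Index(np.unique(get_code(all_cats)))
dummy_frame = pd.DataFrame(np.zeros((len(categories), len(code_index))),
                           index=categories.index, columns=code_index)
for row, cat in zip(categories.index, categories):
    dummy_frame.loc[row, get_code(to_cat_list(cat))] = 1

indicator_rows = dummy_frame.values.tolist()
```

Selecting all records of one category then becomes a boolean filter on the matching indicator column.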

4. The Python visualization tool ecosystem

This section describes several other plotting tools.


Chaco

Features: static plots plus interactive graphics, well suited to expressing the internal relationships of data with complex visualizations. It has good support for interaction, which makes it a good choice for building interactive GUI applications.


mayavi

This is a 3D graphics toolkit based on the open source C++ graphics library VTK. It can be integrated into IPython for interactive use.

Other libraries

Other libraries and applications include PyQwt, Veusz, gnuplot-py, biggles, and so on. The larger libraries are moving toward web-based technologies and away from desktop graphics.

The future of graphical tools

Web-based technology (such as JavaScript) is the inevitable direction of development. Many such tools already exist, Highcharts among them.
