A Basic Environment for Python Data Analysis and Visualization

Last updated: 2015-05-17. Source: Internet

Uploaded by: User


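First, set up the basic environment; this assumes a working Python installation. Then install the common core libraries: numpy and scipy for numerical computing, pandas for data analysis, and matplotlib/Bokeh/Seaborn for data visualization. Add data-acquisition libraries as needed, such as Tushare (http://pythonhosted.org/tushare/) and Quandl (https://www.quandl.com/). Many free datasets are also available online for analysis (http://www.kdnuggets.com/datasets/index.html). It is also worth installing IPython, which is far more capable than the default Python shell.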
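Before going further, it can help to confirm which of these libraries the active interpreter can actually import. The following is a minimal sketch and not part of the original setup: the helper name `installed_versions` and the list of import names are illustrative assumptions, and it reports `None` for anything that fails to import.

```python
import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # Import names are assumptions; adjust the list to the libraries you use.
    wanted = ["numpy", "scipy", "pandas", "matplotlib", "bokeh", "seaborn", "IPython"]
    for name, version in sorted(installed_versions(wanted).items()):
        print("%-12s %s" % (name, version if version else "not installed"))
```

Running it in the interpreter you plan to use shows at a glance which packages still need to be installed.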

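More conveniently, you can use a Python distribution such as Anaconda, which bundles nearly 200 popular packages. Download the installer for your platform from http://continuum.io/downloads and run it. If even that is too much trouble, use Python Quant Platform. Once Anaconda is installed, starting IPython should show the relevant version information: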

$ ipython
Python 2.7.9 |Anaconda 2.2.0 (64-bit)| (default, Apr 14 2015, 12:54:25)
Type "copyright", "credits" or "license" for more information.

IPython 3.0.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
...

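Anaconda ships with conda, an open-source package manager. Use conda info/list/search to view environment information and installed packages, and conda install/update/remove to install, update, or remove packages. For example: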

$ conda install Quandl
$ conda install bokeh
$ conda update pandas

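If you need other libraries, such as Python libraries hosted on GitHub, you can use the standard Python installation methods, such as pip install and python setup.py install. Note, however, that the Anaconda installation path is independent of the system's original Python environment. To install a package where the Anaconda Python will find it, you therefore need to specify the target explicitly. First look up the per-user package path of the active Python: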

$ python -m site --user-site

Then point the installation at that location, for example:

$ python setup.py install --prefix=~/.local
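Because the system Python and the Anaconda Python live side by side, it is easy to install into one and run the other. As a quick check, the sketch below prints the paths that matter for the interpreter that is currently running. The helper name `describe_environment` is an illustrative assumption; `site.getusersitepackages()` returns the same path as `python -m site --user-site`.

```python
import site
import sys


def describe_environment():
    """Collect the paths that decide where packages are installed and found."""
    return {
        "executable": sys.executable,             # interpreter currently running
        "prefix": sys.prefix,                     # root of that installation
        "user_site": site.getusersitepackages(),  # per-user site-packages directory
    }


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("%-10s %s" % (key, value))
```

If `executable` does not point into the Anaconda directory, the shell is picking up a different Python than the one the packages were installed for.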

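To avoid typing the same setup commands every time you start working, create an IPython profile and put the settings there. Whenever IPython starts with that profile, the environment is set up automatically. Create a profile named work: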

$ ipython profile create work

Then open the configuration file ~/.ipython/profile_work/ipython_config.py and edit it as needed, for example to load commonly used packages automatically:

c.InteractiveShellApp.pylab = 'auto'
...
c.TerminalIPythonApp.exec_lines = [
    'import numpy as np',
    'import pandas as pd'
    ...
]

If you work in this profile most of the time, add the following line to ~/.bashrc:

alias ipython='ipython --profile=work'

From then on, simply typing ipython is enough. Inside the IPython shell, run a Python script with %run test.py.

Below are a few very trivial examples using financial data:

1. Moving averages and candlestick chart for SPY

from __future__ import print_function, division

import numpy as np
import pandas as pd
import datetime as dt
import pandas.io.data as web
import matplotlib.finance as mpf
import matplotlib.dates as mdates
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager

starttime = dt.date(2015,1,1)
endtime = dt.date.today()
ticker = 'SPY'

fh = mpf.fetch_historical_yahoo(ticker, starttime, endtime)
r = mlab.csv2rec(fh)
fh.close()
r.sort()
df = pd.DataFrame.from_records(r)
quotes = mpf.quotes_historical_yahoo_ohlc(ticker, starttime, endtime)

fig, (ax1, ax2) = plt.subplots(2, sharex=True)

tdf = df.set_index('date')
cdf = tdf['close']
cdf.plot(label = "close price", ax=ax1)
pd.rolling_mean(cdf, window=30, min_periods=1).plot(label = "30-day moving averages", ax=ax1)
pd.rolling_mean(cdf, window=10, min_periods=1).plot(label = "10-day moving averages", ax=ax1)
ax1.set_xlabel(r'Date')
ax1.set_ylabel(r'Price')
ax1.grid(True)
props = font_manager.FontProperties(size=10)
leg = ax1.legend(loc='lower right', shadow=True, fancybox=True, prop=props)
leg.get_frame().set_alpha(0.5)
ax1.set_title('%s Daily' % ticker, fontsize=14)

mpf.candlestick_ohlc(ax2, quotes, width=0.6)
ax2.set_ylabel(r'Price')

for ax in ax1, ax2:
    fmt = mdates.DateFormatter('%m/%d/%Y')
    ax.xaxis.set_major_formatter(fmt)
    ax.grid(True)
    ax.xaxis_date()
    ax.autoscale()

fig.autofmt_xdate()
fig.tight_layout()
plt.setp(plt.gca().get_xticklabels(), rotation=30)
plt.show()
fig.savefig('SPY.png')
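The moving averages in this example call pd.rolling_mean with min_periods=1, which averages whatever observations are available until the window fills up. The following plain-Python sketch shows that calculation for illustration only: the function name and sample prices are made up, and it ignores missing values, which pandas handles separately.

```python
def rolling_mean(values, window, min_periods=1):
    """Trailing moving average; None where fewer than min_periods points exist."""
    result = []
    for i in range(len(values)):
        # Trailing window ending at position i, truncated at the start of the data.
        chunk = values[max(0, i - window + 1):i + 1]
        if len(chunk) < min_periods:
            result.append(None)
        else:
            result.append(sum(chunk) / float(len(chunk)))
    return result


if __name__ == "__main__":
    prices = [1, 2, 3, 4, 5]  # made-up sample values, not market data
    print(rolling_mean(prices, window=3))                 # [1.0, 1.5, 2.0, 3.0, 4.0]
    print(rolling_mean(prices, window=3, min_periods=3))  # [None, None, 2.0, 3.0, 4.0]
```

With min_periods=1 the curve starts at the first data point instead of leaving the first window-1 positions empty, which is why both averages in the chart begin at the left edge.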

2. Linear regression between New York Mercantile Exchange (NYMEX) crude oil futures prices and gold prices over the past ten years

from __future__ import print_function, division

import numpy as np
import pandas as pd
import datetime as dt
import Quandl
import seaborn as sns

sns.set(style="darkgrid")

token = "???" # Notice: You can get the token by signing up on Quandl (https://www.quandl.com/)
starttime = "2005-01-01"
endtime = "2015-01-01"
interval = "monthly"

gold = Quandl.get("BUNDESBANK/BBK01_WT5511", authtoken=token, trim_start=starttime, trim_end=endtime, collapse=interval)
nymex_oil_future = Quandl.get("OFDP/FUTURE_CL1", authtoken=token, trim_start=starttime, trim_end=endtime, collapse=interval)
brent_oil_future = Quandl.get("CHRIS/ICE_B1", authtoken=token, trim_start=starttime, trim_end=endtime, collapse=interval)

#dat = nymex_oil_future.join(brent_oil_future, lsuffix='_a', rsuffix='_b', how='inner')
#g = sns.jointplot("Settle_a", "Settle_b", data=dat, kind="reg")
dat = gold.join(nymex_oil_future, lsuffix='_a', rsuffix='_b', how='inner')
g = sns.jointplot("Value", "Settle", data=dat, kind="reg")
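Passing kind="reg" asks seaborn to draw a fitted regression line through the scatter of the two series. As a rough sketch of the underlying calculation, the following fits an ordinary least-squares line and computes the Pearson correlation in plain Python. The function name and sample points are illustrative assumptions, not Quandl data, and seaborn's own fitting code is not reproduced here.

```python
import math


def linear_fit(xs, ys):
    """Return (slope, intercept, pearson_r) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Sums of squared deviations and of cross products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both series must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    # Made-up points lying exactly on y = 2x + 1.
    print(linear_fit([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0, 1.0)
```

A correlation close to 1 or -1 means the line explains most of the co-movement; values near 0 mean the fitted line says little about the relationship.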

3. Contribution of China's three industry sectors to GDP

from __future__ import print_function, division
from collections import OrderedDict

import numpy as np
import pandas as pd
import datetime as dt
import tushare as ts
from bokeh.charts import Bar, output_file, show
import bokeh.plotting as bp

df = ts.get_gdp_contrib()
df = df.drop(['industry', 'gdp_yoy'], axis=1)
df = df.set_index('year')
df = df.sort_index()

years = df.index.values.tolist()
pri = df['pi'].astype(float).values
sec = df['si'].astype(float).values
ter = df['ti'].astype(float).values
contrib = OrderedDict(Primary=pri, Secondary=sec, Tertiary=ter)
years = map(unicode, map(str, years))

output_file("stacked_bar.html")
bar = Bar(contrib, years, stacked=True, title="Contribution Rate for GDP",
          xlabel="Year", ylabel="Contribution Rate(%)")
show(bar)

4. Distribution of the major domestic indices (Shanghai index, Shenzhen index, and others)

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from __future__ import print_function, division
from collections import OrderedDict

import pandas as pd
import tushare as ts
from bokeh.charts import Histogram, output_file, show

sh = ts.get_hist_data('sh')
sz = ts.get_hist_data('sz')
zxb = ts.get_hist_data('zxb')
cyb = ts.get_hist_data('cyb')

df = pd.concat([sh['close'], sz['close'], zxb['close'], cyb['close']],
               axis=1, keys=['sh', 'sz', 'zxb', 'cyb'])

# Use the closing prices from index -700 onward for each series.
fst_idx = -700
distributions = OrderedDict(sh=list(sh['close'][fst_idx:]),
                            cyb=list(cyb['close'][fst_idx:]),
                            sz=list(sz['close'][fst_idx:]),
                            zxb=list(zxb['close'][fst_idx:]))
df = pd.DataFrame(distributions)

# Chinese display names for the legend.
col_mapping = {'sh': u'滬指',      # Shanghai index
               'zxb': u'中小板',   # SME Board
               'cyb': u'創業版',   # ChiNext
               'sz': u'深指'}      # Shenzhen index
df.rename(columns=col_mapping, inplace=True)

output_file("histograms.html")
hist = Histogram(df, bins=50, density=False, legend="top_right")
show(hist)
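The histogram in this example groups closing prices into 50 equal-width bins. Assuming density=False means raw counts per bin (as with numpy.histogram), the binning can be sketched in plain Python as follows. The function name and sample values are illustrative assumptions, not index data.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning [min(values), max(values)]."""
    if bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    lo, hi = min(values), max(values)
    counts = [0] * bins
    if hi == lo:
        # All values are identical, so they all fall into the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / float(bins)
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    closes = [0, 1, 2, 3, 4]  # made-up sample values
    # A negative start index keeps the most recent points, as fst_idx does above.
    print(histogram_counts(closes[-700:], bins=2))  # [2, 3]
```

Because each index has its own price range, plotting several of them on one axis mainly shows where each series spent most of its time rather than a like-for-like comparison.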

5. Correlation among several key metrics (P/E ratio, P/B ratio, and others) of listed companies in three selected industries

# -*- coding: utf-8 -*-
from __future__ import print_function, division
from __future__ import unicode_literals
from collections import OrderedDict

import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import tushare as ts
from bokeh.charts import Bar, output_file, show

cls = ts.get_industry_classified()
stk = ts.get_stock_basics()
cls = cls.set_index('code')
tcls = cls[['c_name']]
tstk = stk[['pe', 'pb', 'esp', 'bvps']]
df = tcls.join(tstk, how='inner')

clist = [df.ix[i]['c_name'] for i in xrange(3)]

def neq(a, b, eps=1e-6):
    return abs(a - b) > eps

tdf = df.loc[df['c_name'].isin(clist) & neq(df['pe'], 0.0) &
             neq(df['pb'], 0.0) & neq(df['esp'], 0.0) &
             neq(df['bvps'], 0.0)]

col_mapping = {'pe' : u'P/E',
               'pb' : u'P/BV',
               'esp' : u'EPS',
               'bvps' : u'BVPS'}
tdf.rename(columns=col_mapping, inplace=True)

sns.pairplot(tdf, hue='c_name', size=2.5)
