Python for Data Analysis, Chapter 10: Time Series

Source: Internet
Author: User
Tags: time zones, timedelta

Chapter 10 of the book "Python for Data Analysis" focuses on processing time series data.
Key points
1. datetime, Timestamp, and Period objects
2. Two special indexes for pandas Series and DataFrame objects: DatetimeIndex and PeriodIndex
3. Time zone representation and handling
4. The frequency concept for Timestamp and Period objects, and frequency conversion
5. Two kinds of frequency conversion: asfreq for a single time object, and resample for a time series indexed by time objects
6. Moving windows over time series (rolling)

# -*- coding: utf-8 -*-
# Chapter 10 of Python for Data Analysis
# Time series
import pandas as pd
import numpy as np
import time
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

start = time.time()
np.random.seed(10)

# 1. Date and time data types and tools
# datetime stores a point in time as year-month-day hour:minute:second
tnow = datetime.now()
print(tnow)
# Each component of a datetime is available as an attribute
print(tnow.year)
print(tnow.day)
# datetime.timedelta represents the difference between two datetime objects as days, seconds, and microseconds
print(tnow - timedelta(1, 10, 1000))  # 1 day, 10 seconds, and 1000 microseconds ago
print('\n')
# 1.1 Converting between strings and datetime
# (1) str and strftime convert a datetime object to a string
print(str(tnow))  # full date and time in the default format
print(tnow.strftime('%Y-%m-%d'))  # year-month-day format
print()
# (2) strptime converts a string to a datetime object
print(datetime.strptime('1999/1/2', '%Y/%m/%d'))
print(datetime.strptime('1999/1/2', '%Y/%d/%m'))
print()
# datetime.strptime is the most precise way to parse, but it requires an explicit format string.
# parser.parse from the third-party dateutil package infers the format on its own, but it may occasionally misread a string
from dateutil.parser import parse

print('Jan, 1999, 11:23, PM')
print(parse('Jan, 1999, 11:23, PM'))
print()
# The methods above parse one string at a time; pandas' to_datetime parses a whole collection at once
timestring = ['2012/1/1', '2008/8/8']
df = pd.to_datetime(timestring)
print(df)
# pd.to_datetime also handles missing values, converting them to NaT (Not a Time)
timestring = timestring + [None]
df = pd.to_datetime(timestring)
print(df)
print('------------------------------------------------↑, section1\n\n')
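A minimal sketch of the string conversions from section 1, using only the standard library. It is not part of the original script; the variable names (`as_ymd`, `as_ydm`, `delta`, `round_trip`) are made up for illustration. It shows that the format string, not the input, decides which field is the month, and that the positional arguments of `timedelta` are days, seconds, and microseconds.

```python
from datetime import datetime, timedelta

# The same string parsed with two different format strings.
as_ymd = datetime.strptime('1999/1/2', '%Y/%m/%d')  # read as year/month/day
as_ydm = datetime.strptime('1999/1/2', '%Y/%d/%m')  # read as year/day/month

# Positional arguments of timedelta: days, seconds, microseconds.
delta = timedelta(1, 10, 1000)

# strftime goes the other way: datetime to string.
round_trip = as_ymd.strftime('%Y-%m-%d')

print(as_ymd, as_ydm, delta, round_trip)
```

With `%Y/%m/%d` the string is read as January 2; with `%Y/%d/%m` it is read as February 1.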
# 2. Time series basics
# The most basic pandas time series is a Series indexed by timestamps
dates = [datetime.strptime('2000/1/' + str(i), '%Y/%m/%d') for i in range(1, 11)]
print(dates)
ts = pd.Series(np.random.randn(10), index=dates)
print(ts)
print(type(ts.index[1]))  # every entry of the time series index is a Timestamp object
print()
# 2.1 Indexing, selection, and subsetting (time series)
# A time series is still a Series, so the usual Series indexing and slicing methods apply
print(ts[2:3])
print()
# In addition, a time series can be indexed by a date string
print(ts['2000/01/05'])
# For a time series spanning a long period, slicing is more flexible
print('\n')
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
print(ts.describe())
# Slice by year
print(ts['2001'].describe())
# Slice by year and month
print(ts['2001/01'].describe())
# Slice by date range
print(ts['2002/05/01':'2002/05/06'])
print('\n')
# The indexing and slicing methods above also apply to DataFrame
# 2.2 Time series with duplicate indices
date = ['2001/02/01', '2001/02/01', '2001/02/02']
ts = pd.Series(range(3), index=pd.to_datetime(date))  # convert the strings so the index is a DatetimeIndex
print(ts)
print(ts['2001/02/01'])  # a duplicated index label returns a slice
print(ts['2001/02/02'])  # a unique index label returns a scalar
print()
# Duplicate indices can be eliminated by aggregating
print(ts.groupby(level=0).count())
print('---------------------------------↑, section2\n\n')
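The partial-string slicing and duplicate handling from section 2 can be sketched on deterministic data, so the sizes of the slices are easy to check. This is an illustration rather than the book's code: it assumes a 1000-day series starting on 2000-01-01, uses `.loc` for the string lookups, and aggregates duplicates with `mean` instead of `count`.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting on 2000-01-01.
ts = pd.Series(np.arange(1000), index=pd.date_range('2000-01-01', periods=1000))

year_2001 = ts.loc['2001']                    # every day of 2001
jan_2001 = ts.loc['2001-01']                  # every day of January 2001
window = ts.loc['2002-05-01':'2002-05-06']    # both endpoints are included

# A series whose index contains the same date twice.
dup = pd.Series([0, 1, 2],
                index=pd.to_datetime(['2001-02-01', '2001-02-01', '2001-02-02']))
deduped = dup.groupby(level=0).mean()         # one row per distinct date

print(len(year_2001), len(jan_2001), len(window))
print(deduped)
```

Note that string slices on a DatetimeIndex include the right endpoint, unlike positional slices.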
# 3. Date ranges, frequencies, and shifting
# See section 6 for details on resample
# 3.1 Generating date ranges
# pd.date_range generates a DatetimeIndex of a specified length
# (1) Given a start and an end time
index = pd.date_range('2000/1/1', '2000/2/1')
print(index)
print()
# (2) Given a start (or end) time plus a number of periods
index = pd.date_range(start='2000/1/1', periods=10)  # 10 days starting from 2000/1/1
print(index)
print()
index = pd.date_range(end='2000/1/28', periods=10)  # 10 days ending on 2000/1/28
print(index)
# The calls above use the default interval of 1 day; the freq keyword sets it explicitly
print()
print(pd.date_range('2000/1/1', '2000/12/28', freq='BM'))  # BM: the last business day of each month
# freq options: D (calendar day), B (business day), H (hourly), T or min (minutely), S (secondly) ... see pp. 314-315 of the book
# Note: pandas 2.2 and later deprecate some of these aliases (H becomes h, T becomes min, S becomes s, BM becomes BME)
# 3.2 Frequencies and date offsets
# The frequency codes above can be freely combined in a single string
print()
print(pd.date_range('2000/1/10', '2000/1/15', freq='6h30min10s'))  # intervals of 6 hours, 30 minutes, 10 seconds
# WOM (week of month) frequencies pick out a particular weekday of a particular week in each month
print()
print(pd.date_range('2000/1/10', '2001/1/10', freq='WOM-3FRI'))  # the third Friday of every month
# 3.3 Shifting (leading or lagging) data
print('\n')
ts = pd.DataFrame(
    {'A': range(6),
     'B': np.random.randn(6)}
)
ts.index = pd.date_range('2000/10/1', '2000/10/6')
print(ts)
print(ts.shift(2))  # the data moves 2 days forward in time
print(ts.shift(-2))  # the data moves 2 days backward in time
# The calls above pass only the number of periods, so the data moves and the index stays fixed
# If a frequency is also passed, the time index moves instead
print(ts.shift(1, freq='M'))  # the data stays put; every timestamp moves to the last day of the month
print(ts.shift(1, freq='3D'))  # the data stays put; every timestamp moves forward by 3 days
# A common use of shift is computing the percentage change of the data
print()
print(ts / ts.shift(1) - 1)
# Date offsets can also be applied directly to Timestamp and datetime objects
# Import the offset classes first
from pandas.tseries.offsets import Day, BMonthBegin

print(tnow + 3 * Day())  # 3 days forward in time
print(tnow + BMonthBegin())  # rolls forward to the next business-month start
print('------------------------------------↑, section 3\n\n')
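The difference between the two forms of `shift` described in section 3.3 can be checked on a tiny series. This is a sketch with made-up values (`ts`, `lagged`, `moved`, and `pct` are illustrative names): without `freq` the values slide along a fixed index, with `freq` the index slides and the values stay where they are.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, 2.0, 4.0, 8.0],
               index=pd.date_range('2000-10-01', periods=4, freq='D'))

lagged = ts.shift(1)            # values move down one row; first row becomes NaN
moved = ts.shift(1, freq='D')   # index moves forward one day; values untouched

# Percentage change computed by hand, as in the script above.
pct = ts / ts.shift(1) - 1

print(lagged)
print(moved)
print(pct)
```

Since each value here is double the previous one, every defined entry of `pct` is 1.0, which matches what `ts.pct_change()` returns.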

# 4. Time zone handling
# UTC is Coordinated Universal Time. A time zone is expressed as an offset from UTC
# 4.1 Localization and conversion
# First create a time series as before
ts = pd.Series(range(2))
ts.index = pd.date_range(tnow, periods=2, normalize=True)
print(ts)
# No time zone was specified at creation, so the index time zone defaults to None
print('\ntime zone is %s' % ts.index.tz)
# Attaching time zone information to a time series is called localization. There are two ways.
# (1) The tz_localize method
ts_china = ts.tz_localize('Asia/Shanghai')
print(ts_china)  # the index now shows local time together with its UTC offset
print(ts_china.index.tz)  # the time zone of the series is now Asia/Shanghai
# (2) Specify the time zone explicitly when creating the series
ts_china = pd.Series(range(2), index=pd.date_range(tnow, periods=2, normalize=True, tz='Asia/Shanghai'))
print(ts_china)
print(ts_china.index.tz)
# tz_convert converts between time zones
print(ts_china.tz_convert('UTC'))  # convert Shanghai time to UTC
print('\n')
# Timestamp and DatetimeIndex objects, and Series indexed by a DatetimeIndex, all support tz_localize and tz_convert
# 4.2 Working with time-zone-aware Timestamp objects
stamp = pd.Timestamp(str(tnow))
stamp_china = stamp.tz_localize('Asia/Shanghai')
print(stamp_china)
stamp_utc = stamp_china.tz_convert('UTC')
print(stamp_utc)
# A Timestamp stores a UTC timestamp value: its offset from the Unix epoch (January 1, 1970), in nanoseconds
print(stamp_utc.value)
print(stamp_china.value)  # both are the same instant expressed in different time zones, so their offsets from the Unix epoch are equal
print('\n')
# 4.3 Operations between different time zones
# When time series in different time zones are combined, the result is indexed in UTC
ts = pd.Series(range(2), index=pd.date_range(tnow, periods=2))
ts1 = ts.tz_localize('US/Eastern')
ts2 = ts.tz_localize('Asia/Shanghai')
print(ts1.index)
print(ts2.index)
print((ts1 + ts2).index)
print('----------------------------------↑, section4\n\n')
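To make the distinction between localizing and converting in section 4 concrete, here is a small sketch with a fixed timestamp instead of `tnow`. The names and the choice of 2000-01-01 08:00 are illustrative assumptions, and `America/New_York` stands in for the `US/Eastern` zone used above. `tz_localize` attaches a zone to a naive wall-clock time, while `tz_convert` re-expresses the same instant in another zone.

```python
import pandas as pd

naive = pd.Timestamp('2000-01-01 08:00')      # no time zone attached
shanghai = naive.tz_localize('Asia/Shanghai')  # 08:00 local time, UTC+8
utc = shanghai.tz_convert('UTC')               # same instant, shown in UTC

print(shanghai, utc)
print(shanghai.value == utc.value)             # nanoseconds since the Unix epoch

# Combining series localized to different zones yields a UTC index.
s = pd.Series([1, 2], index=pd.date_range('2000-01-01', periods=2, freq='D'))
combined = s.tz_localize('America/New_York') + s.tz_localize('Asia/Shanghai')
print(combined.index)
```

In this example the two localized series share no instants, so `combined` holds the union of both indexes (four UTC timestamps) and every value is NaN.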

# 5. Periods and period arithmetic
# A new object: Period
# A Timestamp denotes an instant; a Period, by contrast, denotes a span of time
# Create a Period object
p = pd.Period(2000, freq='A-DEC')  # an annual period whose year ends in December
print(p)
# Period objects support adding and subtracting integers to shift them
print(p - 2)  # 2 years before 2000
print(p + 1)  # 1 year after 2000
# Two periods of the same frequency can also be subtracted from each other
print(pd.Period(2005, freq='M') - pd.Period(2000, freq='M'))  # 5 years apart, 60 months in total
# period_range creates a range of periods
plist = pd.period_range('2000Q1', '2002Q1', freq='Q')  # quarterly, from 2000Q1 through 2002Q1
print(plist)
print()
# An index made of datetime objects is a DatetimeIndex; similarly, there is a PeriodIndex
ts = pd.Series(np.random.randn(len(plist)), index=plist)
print(ts)
print()
# A PeriodIndex can also be built from strings
string = ['2000Q1', '2000Q2', '2000Q3']  # the first 3 quarters of 2000
p = pd.PeriodIndex(string, freq='Q-DEC')
print(p)
print('\n')
# 5.1 Period frequency conversion
# Period and PeriodIndex objects can be converted to another frequency with asfreq
p = pd.Period('2000Q1', freq='Q-DEC')
print(p)  # the first quarter of 2000
print(p.asfreq('M'))  # quarter to month; defaults to the last month of the quarter
print(p.asfreq('M', how='start'))  # the how keyword picks the first month explicitly
print(p.asfreq('A'))  # quarter to year
print()
# Likewise, a PeriodIndex, or a time series indexed by one, can be converted the same way
print(ts.asfreq('B', how='start'))  # the quarterly PeriodIndex of ts becomes a business-day PeriodIndex holding the first business day of each quarter
print()
# 5.2 Quarterly period frequencies
# Q-MAY means the fiscal year ends in May, so the quarters are Jun-Aug, Sep-Nov, Dec-Feb, and Mar-May. The other Q-xxx codes follow the same pattern
# A single Period, or a collection of them such as a PeriodIndex, can be shifted arithmetically to pinpoint a moment and then converted to a Timestamp with to_timestamp
a = pd.Period('2000Q1', freq='Q')
print(a)
# Use period arithmetic to get 9:30 a.m. on the third-to-last business day of the first quarter of 2000
tstamp = ((a.asfreq('B', 'e') - 2).asfreq('H', 's') + 9).asfreq('T', 's') + 30
print(tstamp)  # still a Period object, but at minute frequency
print(tstamp.to_timestamp())  # Period to Timestamp
print()
# 5.3 Converting Timestamp to Period (and back)
# (1) to_period converts timestamps to periods
stamp = pd.date_range('2000/12/1', periods=2, freq='D')
print(stamp.to_period('M'))  # monthly period frequency
# (2) to_timestamp converts periods back to timestamps
print(stamp.to_period('M').to_timestamp())
print()
# 5.4 Creating a PeriodIndex from arrays
# Time data is often split into parts (such as year, month, and day) stored in separate columns of a table; PeriodIndex can merge such columns into a single index
df = pd.DataFrame()
df['year'] = [2000] * 4 + [2001] * 4
df['month'] = range(1, 9)
df['day'] = range(22, 30)
print(df)
tindex = pd.PeriodIndex(year=df['year'], month=df['month'], day=df['day'], freq='D')
print(tindex)
print('---------------------------------------↑, section 5\n\n')
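The `asfreq` behaviour and the fiscal-quarter convention from section 5 can be sketched with fixed periods. The names below are illustrative and not from the original script. The sketch shows that `how` selects which end of the coarser period is kept, and that under `Q-MAY` the first quarter of fiscal 2000 starts in June 1999.

```python
import pandas as pd

q1 = pd.Period('2000Q1', freq='Q-DEC')

last_month = q1.asfreq('M')                # default how='end': last month of the quarter
first_month = q1.asfreq('M', how='start')  # first month of the quarter
next_quarter = q1 + 1                      # integer arithmetic shifts by whole periods
quarter_start = q1.to_timestamp()          # Period to Timestamp (start of the period)

# Fiscal year ending in May: fiscal 2000 runs from June 1999 through May 2000.
fiscal_q1 = pd.Period('2000Q1', freq='Q-MAY')
fiscal_first_month = fiscal_q1.asfreq('M', how='start')

print(last_month, first_month, next_quarter, quarter_start, fiscal_first_month)
```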

# 6. Resampling and frequency conversion
# Resampling converts a time series from one frequency to another; "frequency" here means the freq keyword of the time series functions, not frequency in the everyday sense
# There are 3 kinds of resampling:
# 1. Upsampling: the frequency increases (the period gets shorter), e.g. Q to D (quarterly to daily)
# 2. Downsampling: the frequency decreases (the period gets longer), e.g. Q to A (quarterly to annual)
# 3. Resampling at the same period length, e.g. W-MON to W-WED (every Monday to every Wednesday)
# Resampling is done with the resample method
# 6.1 Downsampling
# First generate a minute-frequency series
ts = pd.Series(range(15), index=pd.date_range('2010/10/1', periods=15, freq='T'))
print(ts)
print()
# resample needs to know which side of each interval is closed (the other side is open), and which edge (left or right) is used to label the interval.
# In pandas 0.22.0 (and probably in later versions too), the defaults are somewhat involved and depend on freq, so it is safer to pass both keywords explicitly.
# The closed keyword sets which side is closed; the label keyword sets which edge names the interval
print(ts.resample('10min', closed='left', label='left').count())
print()
print(ts.resample('10min', closed='right', label='right').count())
# To show the intervals more clearly, the labels can be offset. Older pandas had a loffset keyword for this (removed in pandas 2.0); shifting the index of the result does the same thing.
print()
res = ts.resample('10min', closed='right', label='right').count()
res.index = res.index - pd.Timedelta(seconds=1)  # move each label back by 1 second
print(res)
# A special kind of downsampling is OHLC resampling, used for financial data: it computes the opening (open), closing (close), highest (high), and lowest (low) value of each interval
print()
print(ts.resample('5min', closed='right', label='right').ohlc())
# Downsampling can also be done with groupby; groupby and resample each have their own suitable scenarios
# Group by day of the week
ts = pd.Series(range(100), index=pd.date_range('2000/10/10', periods=100, freq='D'))
print(ts)
print()
print(ts.groupby(lambda t: t.weekday()).count())
# 6.2 Upsampling and interpolation
# Going from a coarse frequency to a fine one naturally introduces missing values, so upsampling needs interpolation
# First create a weekly time series
print()
ts = pd.Series([1, 2], index=pd.date_range('2000/1/21', periods=2, freq='W-FRI'))  # two Fridays starting from 2000/1/21
print(ts)
# Upsample ts to daily frequency
# Without interpolation the new values are missing; they can be filled forward (ffill) or backward (bfill)
print()
print(ts.resample('D').ffill())
print(ts.resample('D').bfill())
# resample can also change the frequency without upsampling or downsampling
# For example, resample the Friday-anchored weekly data above to Monday-anchored weekly data
print()
print(ts.resample('W-MON').ffill())
print(ts.resample('W-MON').bfill())
# 6.3 Resampling with periods
# The resampling above was done on Series with a Timestamp index; Series with a PeriodIndex can be resampled too
# (1) Downsampling periods works as with timestamps: just call resample directly
# First construct a time series with a PeriodIndex
print()
ts = pd.Series(range(10), index=pd.period_range('2010/1', periods=10, freq='M'))
print(ts)
# Then pass a longer-period freq to resample to downsample
ts = ts.resample('Q').sum()
print(ts)
print()
# (2) When upsampling periods, you must specify which end of each new interval holds the original value
# For example, when upsampling from annual to quarterly, the original value goes in either the first or the last quarter
# The convention keyword controls this: 'end' puts the value in the last sub-period, 'start' in the first. The default is 'start'
print(ts.resample('M', convention='end').ffill())
print()
print(ts.resample('M').ffill())
print('--------------------------↑, section 6\n\n')
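The effect of `closed` and `label` in section 6.1, and of `ffill`/`bfill` in section 6.2, is easiest to see by counting rows per bin. The sketch below mirrors the 15-minute series from the script but spells the minute frequency as `'min'`; the variable names are illustrative.

```python
import pandas as pd

# 15 one-minute observations: 00:00 through 00:14.
ts = pd.Series(range(15),
               index=pd.date_range('2010-10-01', periods=15, freq='min'))

# Left-closed bins: [00:00, 00:10) and [00:10, 00:20), labelled by the left edge.
left = ts.resample('10min', closed='left', label='left').count()

# Right-closed bins: (23:50, 00:00], (00:00, 00:10], (00:10, 00:20], labelled by the right edge.
right = ts.resample('10min', closed='right', label='right').count()

# Upsampling two weekly points to daily frequency.
weekly = pd.Series([1, 2],
                   index=pd.date_range('2000-01-21', periods=2, freq='W-FRI'))
daily_ffill = weekly.resample('D').ffill()  # carry the last known value forward
daily_bfill = weekly.resample('D').bfill()  # pull the next known value backward

print(left)
print(right)
print(daily_ffill)
print(daily_bfill)
```

With right-closed bins the 00:00 observation falls into the bin that ends at 00:00, which is why that variant produces three bins instead of two.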

# 7. Plotting time series
# pandas time series can be plotted directly with the plot() method, which is built on matplotlib.
# Load the data: prices of several U.S. stocks from 1990 to 2010
stk = pd.read_csv('./data_set/stock_px.csv', parse_dates=True, index_col=0)
stk = stk[['AAPL', 'MSFT', 'SPX']]  # select 3 of the columns
stk = stk.resample('B').ffill()  # resample to business-day frequency to get a regular series
print(stk.describe())
# Slices of the time series can be plotted directly with plot
fig, axes = plt.subplots(2, 2)  # Figure 1
stk['AAPL'].plot(ax=axes[0, 0])  # slice by ticker
stk.loc['2005'].plot(ax=axes[0, 1])  # slice by time (.loc replaces the removed .ix)
stk['AAPL'].loc['06/2006':'08/2008'].plot(ax=axes[1, 0])  # slice by both
# The data can also be resampled to quarterly frequency before plotting
stk['AAPL'].resample('Q-DEC').ffill().plot(ax=axes[1, 1])  # each point is the last day of the quarter; missing data is filled from the previous day
# plt.show()  # uncomment this line to display the figure
print('-------------------------------------↑, section 7\n\n')

# 8. Moving window functions
# A moving window function slides a sub-window along a long series and computes a statistic inside it
# Moving windows are widely used with financial data; a typical example is the n-day moving average
# 250-day moving average of Apple stock
fig2, axes2 = plt.subplots(2, 2)  # Figure 2
stk['AAPL'].plot(ax=axes2[0, 0])
stk['AAPL'].rolling(250).mean().plot(ax=axes2[0, 0])  # the notation recommended in newer pandas; differs from the book's examples
# When a window holds too few observations, rolling returns NaN instead of an average; the min_periods keyword sets the minimum required
stk['AAPL'].plot(ax=axes2[0, 1])
stk['AAPL'].rolling(250, min_periods=10).mean().plot(ax=axes2[0, 1])  # returns a moving average once there are at least 10 non-NA values
# The difference between panels (0,0) and (0,1) is that the 250-day average line appears sooner in panel (0,1)
# rolling can also emulate an expanding-window mean, where the window length varies and equals the length of the series so far
expanding_mean = lambda ts: ts.rolling(len(ts), min_periods=1).mean()
expanding_mean(stk['AAPL']).plot(ax=axes2[1, 0])  # expanding mean over the full length
stk['AAPL'].plot(ax=axes2[1, 0])
# 8.1 Exponentially weighted functions
# Moving windows are often combined with a decay factor that gives recent observations more weight, so the statistic reacts faster to changes in the data
# This is done with the ewm method
ma = stk['AAPL'].rolling(250, min_periods=50).mean()  # ordinary 250-day moving average
ewma = stk['AAPL'].ewm(span=250).mean()
ma.plot(ax=axes2[1, 1], style='--')
ewma.plot(ax=axes2[1, 1], style=':')  # the average with the decay factor turns sooner (reacts faster)
# 8.2 Binary moving window functions
# Some statistics operate on two series, such as the correlation coefficient
spx = stk['SPX']
aapl = stk['AAPL']
# Two ways to compute the percentage change of a price
spx_pctc = spx / spx.shift(1) - 1
aapl_pctc = stk['AAPL'].pct_change()
fig3, axes3 = plt.subplots(2, 2)  # Figure 3
# Compute the correlation of the two within a moving window
corr = spx_pctc.rolling(window=125, min_periods=100).corr(aapl_pctc)  # moving correlation over a 6-month window
corr.plot(ax=axes3[0, 0])
# Often one series serves as a benchmark, and we want the correlation between each of the other series and that benchmark.
# In that case, call rolling on the DataFrame (the other series) and pass the benchmark Series to corr
corr = stk[['AAPL', 'MSFT']].pct_change().rolling(window=125, min_periods=60).corr(spx_pctc)
corr.plot(ax=axes3[0, 1])
# 8.3 User-defined moving window functions
ten_mean = lambda ts: np.mean(sorted(ts, reverse=True)[:10])  # mean of the 10 largest values in the window
res = stk[['AAPL', 'MSFT']].rolling(window=125, min_periods=50).apply(ten_mean)
print(res)
res.plot(ax=axes3[1, 0])
plt.show()
# In fact, in newer versions of pandas, rolling is used in much the same way as groupby.
print('----------------------total time is %.5f s' % (time.time() - start))
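The window behaviour described in section 8 can be verified on a few numbers without the stock data. This sketch is an addition, not the book's code: the series `s` and `step` are made up, and `expanding()` is pandas' built-in expanding window, which the script above emulates with `rolling(len(ts), min_periods=1)`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

strict = s.rolling(3).mean()                   # NaN until 3 observations are available
lenient = s.rolling(3, min_periods=1).mean()   # averages whatever is in the window so far

# Expanding mean two ways: emulated with rolling, and with the built-in expanding().
via_rolling = s.rolling(len(s), min_periods=1).mean()
via_expanding = s.expanding().mean()

# A step change: the exponentially weighted mean reacts faster than the simple one.
step = pd.Series([0.0, 0.0, 0.0, 10.0, 10.0, 10.0])
ma = step.rolling(3, min_periods=1).mean()
ewma = step.ewm(span=3).mean()

print(strict.tolist(), lenient.tolist())
print(via_expanding.tolist())
print(ma.iloc[3], ewma.iloc[3])
```

Right after the jump, the exponentially weighted mean is already closer to the new level than the simple moving average, which is the faster reaction that the plot in section 8.1 is meant to show.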
