Managing time series#
Time series data (and time series data science problems) involve the observation of one or more variables over time. Time series data are highly prevalent in electronic health records held by health delivery organisations such as the UK NHS and they frequently make analysis more difficult! Here are a few example data science problems you might encounter which are going to require you to work with one or more time series.
Evaluating if a change to a service improved the performance of an emergency department.
Forecasting the weekly demand that a outpatient clinic will experience over the next 3 months
Predicting an child’s length of stay in critical care based on vital sign observations and interventions they underwent during their journey to hospital.
If not done correctly, dates and times can be painful to use in coding! The good news is the numpy
and pandas
make handling dates the easiest it has ever been.
In this section we will learn how:
numpy
andpandas
store date time data of different frequencies;to manipulate dates and times in
numpy
andpandas
to quickly read time series data into
pandas.DataFrame
to setup a
pandas.DataFrame
with aDataTimeIndex
import numpy as np
import pandas as pd
numpy.datatime64
#
The numpy
data type to manage datetimes is called datetime64
.
Static arrays of datetime64
#
To create an array containing date time data statically is very straightforward. We use the normal array syntax and include date in YYYY-MM-DD format. You also need to specify the dtype
as 'datetime64'
np.array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'],
dtype='datetime64')
array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'],
dtype='datetime64[D]')
Notice that the np.ndarray
has dtype=datetime64[D]
. The ‘D’ stands for the minimum unit of days.
Consider an alternative where we include hours. You need to include the letter ‘T’ (for timestamp) in string pass to the numpy.ndarray
np.array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'],
dtype='datetime64')
array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'],
dtype='datetime64[h]')
That time the dtype=datetime64[h]
where ‘h’ stands for hours. We can go further and try minutes.
np.array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15', '2019-07-14T00:05'],
dtype='datetime64')
array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15',
'2019-07-14T00:05'], dtype='datetime64[m]')
And now try seconds
np.array(['2019-07-11T00:13:59', '2019-07-12T00:15:30', '2019-07-13T00:15:20',
'2019-07-14T00:05:15'], dtype='datetime64')
array(['2019-07-11T00:13:59', '2019-07-12T00:15:30',
'2019-07-13T00:15:20', '2019-07-14T00:05:15'],
dtype='datetime64[s]')
and miliseconds
np.array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'],
dtype='datetime64')
array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'],
dtype='datetime64[ms]')
Quick creation of date arrays using np.arange
#
np.arange(start,stop,step)
(where stop is exclusive) is commonly used to produce an array of integers can be used to produce an array of evenly-spaced integers (particularly good for generating synthetic testing data).
np.arange
can also be used to generate a range of date time stamps. For example:
To experiment try changing the step argument to a different value
np.arange('2019-07-01', '2019-07-31', step=3, dtype='datetime64[D]')
array(['2019-07-01', '2019-07-04', '2019-07-07', '2019-07-10',
'2019-07-13', '2019-07-16', '2019-07-19', '2019-07-22',
'2019-07-25', '2019-07-28'], dtype='datetime64[D]')
Here is a bigger array:
foo = np.arange('2019-07-01', '2019-07-31', step=7, dtype='datetime64[m]')
foo.shape
(6172,)
To get all values in between two dates then omit the step argument. The below generates days between 1st and 10th August
np.arange('2019-07-01', '2019-07-10', dtype='datetime64[D]')
array(['2019-07-01', '2019-07-02', '2019-07-03', '2019-07-04',
'2019-07-05', '2019-07-06', '2019-07-07', '2019-07-08',
'2019-07-09'], dtype='datetime64[D]')
Date Time Index in pandas
#
pandas
uses a DateTimeIndex
that builds on numpy datetime64
data type. pandas
is definitely the easiest way to work with time series data in Python. One of the reasons for this is that pandas can detect and handle different formats of date strings in input files. Always watch out for problems with US -> UK dates and vice versa.
Static creation#
If you need to create some synthetic data for testing then you can use the pandas.date_range
function.
#note that by default pandas will assume the below is MM/DD/YYY
index = pd.date_range('1/3/2019', periods=7, freq='D')
index
DatetimeIndex(['2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06',
'2019-01-07', '2019-01-08', '2019-01-09'],
dtype='datetime64[ns]', freq='D')
A hourly date range
index = pd.date_range('1/1/2019', periods=7, freq='h')
index
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
'2019-01-01 02:00:00', '2019-01-01 03:00:00',
'2019-01-01 04:00:00', '2019-01-01 05:00:00',
'2019-01-01 06:00:00'],
dtype='datetime64[ns]', freq='H')
A ‘monthly start’ range.
index = pd.date_range('1/1/2019', periods=7, freq='MS')
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
'2019-05-01', '2019-06-01', '2019-07-01'],
dtype='datetime64[ns]', freq='MS')
Convert a list to datetime index
dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates)
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01'], dtype='datetime64[ns]', freq=None)
US to UK problems
dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates, dayfirst=True)
index
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)
Convert numpy array to datetime index
For data manipulation and analysis I often find myself moving between NumPy arrays and pandas DataFrames.
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[ns]', freq=None)
Note that in the example above the frequency is None. That’s annoying and there are some forecasting tools in Python that will insist on having a frequency. There are two ways to sort this out.
#pass in the frequency argument
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates, freq='D')
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[ns]', freq='D')
#set the frequency post-hoc
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index.freq = 'D'
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[ns]', freq='D')
Finding the min|max andin a date time index and accessing a TimeStamp
index.min()
Timestamp('2019-07-11 00:00:00', freq='D')
index.max()
Timestamp('2019-07-13 00:00:00', freq='D')
print(index.min().year)
print(index.min().month)
print(index.min().days_in_month)
2019
7
31
Importing data from a CSV file#
First create a synthetic data set and save to csv
LAMBDA = 30
PERIODS = 365 * 2
idx = pd.date_range('2018-01-01', periods=PERIODS, freq='D')
# representing a count variable of sales og widgets with mean LAMBDA.
sales = np.random.poisson(LAMBDA, size=PERIODS)
df = pd.DataFrame(sales, index=idx)
df.columns = ['sales']
df.index.name = 'date'
df.head()
sales | |
---|---|
date | |
2018-01-01 | 24 |
2018-01-02 | 33 |
2018-01-03 | 24 |
2018-01-04 | 21 |
2018-01-05 | 24 |
df.to_csv('data/example_data1.csv') # save to file
Now read in the data and let pandas know that the index is a date field using the parse_dates
argument.
df = pd.read_csv('data/example_data1.csv', index_col='date', parse_dates=True)
#you have to set this manually
df.index.freq = 'D'
df.index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25',
'2019-12-26', '2019-12-27', '2019-12-28', '2019-12-29',
'2019-12-30', '2019-12-31'],
dtype='datetime64[ns]', name='date', length=730, freq='D')
df.head(10)
sales | |
---|---|
date | |
2018-01-01 | 24 |
2018-01-02 | 33 |
2018-01-03 | 24 |
2018-01-04 | 21 |
2018-01-05 | 24 |
2018-01-06 | 34 |
2018-01-07 | 39 |
2018-01-08 | 25 |
2018-01-09 | 21 |
2018-01-10 | 26 |