Managing time series#

Time series data (and time series data science problems) involve the observation of one or more variables over time. Time series data are highly prevalent in electronic health records held by health delivery organisations such as the UK NHS and they frequently make analysis more difficult! Here are a few example data science problems you might encounter which are going to require you to work with one or more time series.

  • Evaluating if a change to a service improved the performance of an emergency department.

  • Forecasting the weekly demand that a outpatient clinic will experience over the next 3 months

  • Predicting an child’s length of stay in critical care based on vital sign observations and interventions they underwent during their journey to hospital.

If not done correctly, dates and times can be painful to use in coding! The good news is the numpy and pandas make handling dates the easiest it has ever been.

In this section we will learn how:

  • numpy and pandas store date time data of different frequencies;

  • to manipulate dates and times in numpy and pandas

  • to quickly read time series data into pandas.DataFrame

  • to setup a pandas.DataFrame with a DataTimeIndex

import numpy as np
import pandas as pd

numpy.datatime64#

The numpy data type to manage datetimes is called datetime64.

Static arrays of datetime64#

To create an array containing date time data statically is very straightforward. We use the normal array syntax and include date in YYYY-MM-DD format. You also need to specify the dtype as 'datetime64'

np.array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'], 
         dtype='datetime64')
array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'],
      dtype='datetime64[D]')

Notice that the np.ndarray has dtype=datetime64[D]. The ‘D’ stands for the minimum unit of days.

Consider an alternative where we include hours. You need to include the letter ‘T’ (for timestamp) in string pass to the numpy.ndarray

np.array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'], 
         dtype='datetime64')
array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'],
      dtype='datetime64[h]')

That time the dtype=datetime64[h] where ‘h’ stands for hours. We can go further and try minutes.

np.array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15', '2019-07-14T00:05'], 
         dtype='datetime64')
array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15',
       '2019-07-14T00:05'], dtype='datetime64[m]')

And now try seconds

np.array(['2019-07-11T00:13:59', '2019-07-12T00:15:30', '2019-07-13T00:15:20', 
          '2019-07-14T00:05:15'], dtype='datetime64')
array(['2019-07-11T00:13:59', '2019-07-12T00:15:30',
       '2019-07-13T00:15:20', '2019-07-14T00:05:15'],
      dtype='datetime64[s]')

and miliseconds

np.array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'], 
         dtype='datetime64')
array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'],
      dtype='datetime64[ms]')

Quick creation of date arrays using np.arange#

np.arange(start,stop,step) (where stop is exclusive) is commonly used to produce an array of integers can be used to produce an array of evenly-spaced integers (particularly good for generating synthetic testing data).

np.arange can also be used to generate a range of date time stamps. For example:

To experiment try changing the step argument to a different value

np.arange('2019-07-01', '2019-07-31', step=3, dtype='datetime64[D]')
array(['2019-07-01', '2019-07-04', '2019-07-07', '2019-07-10',
       '2019-07-13', '2019-07-16', '2019-07-19', '2019-07-22',
       '2019-07-25', '2019-07-28'], dtype='datetime64[D]')

Here is a bigger array:

foo = np.arange('2019-07-01', '2019-07-31', step=7, dtype='datetime64[m]')
foo.shape
(6172,)

To get all values in between two dates then omit the step argument. The below generates days between 1st and 10th August

np.arange('2019-07-01', '2019-07-10', dtype='datetime64[D]')
array(['2019-07-01', '2019-07-02', '2019-07-03', '2019-07-04',
       '2019-07-05', '2019-07-06', '2019-07-07', '2019-07-08',
       '2019-07-09'], dtype='datetime64[D]')

Date Time Index in pandas#

pandas uses a DateTimeIndex that builds on numpy datetime64 data type. pandas is definitely the easiest way to work with time series data in Python. One of the reasons for this is that pandas can detect and handle different formats of date strings in input files. Always watch out for problems with US -> UK dates and vice versa.

Static creation#

If you need to create some synthetic data for testing then you can use the pandas.date_range function.

#note that by default pandas will assume the below is MM/DD/YYY
index = pd.date_range('1/3/2019', periods=7, freq='D')
index
DatetimeIndex(['2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06',
               '2019-01-07', '2019-01-08', '2019-01-09'],
              dtype='datetime64[ns]', freq='D')
  • A hourly date range

index = pd.date_range('1/1/2019', periods=7, freq='h')
index
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
               '2019-01-01 02:00:00', '2019-01-01 03:00:00',
               '2019-01-01 04:00:00', '2019-01-01 05:00:00',
               '2019-01-01 06:00:00'],
              dtype='datetime64[ns]', freq='H')
  • A ‘monthly start’ range.

index = pd.date_range('1/1/2019', periods=7, freq='MS')
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
               '2019-05-01', '2019-06-01', '2019-07-01'],
              dtype='datetime64[ns]', freq='MS')

Convert a list to datetime index

dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates)
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01'], dtype='datetime64[ns]', freq=None)

US to UK problems

dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates, dayfirst=True)
index
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)

Convert numpy array to datetime index

For data manipulation and analysis I often find myself moving between NumPy arrays and pandas DataFrames.

arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[ns]', freq=None)

Note that in the example above the frequency is None. That’s annoying and there are some forecasting tools in Python that will insist on having a frequency. There are two ways to sort this out.

#pass in the frequency argument
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates, freq='D')
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[ns]', freq='D')
#set the frequency post-hoc
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index.freq = 'D'
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[ns]', freq='D')

Finding the min|max andin a date time index and accessing a TimeStamp

index.min()
Timestamp('2019-07-11 00:00:00', freq='D')
index.max()
Timestamp('2019-07-13 00:00:00', freq='D')
print(index.min().year)
print(index.min().month)
print(index.min().days_in_month)
2019
7
31

Importing data from a CSV file#

First create a synthetic data set and save to csv

LAMBDA = 30
PERIODS = 365 * 2

idx = pd.date_range('2018-01-01', periods=PERIODS, freq='D')

# representing a count variable of sales og widgets with mean LAMBDA.
sales = np.random.poisson(LAMBDA, size=PERIODS) 
df = pd.DataFrame(sales, index=idx)
df.columns = ['sales']
df.index.name = 'date'
df.head()
sales
date
2018-01-01 24
2018-01-02 33
2018-01-03 24
2018-01-04 21
2018-01-05 24
df.to_csv('data/example_data1.csv') # save to file

Now read in the data and let pandas know that the index is a date field using the parse_dates argument.

df = pd.read_csv('data/example_data1.csv', index_col='date', parse_dates=True)
#you have to set this manually
df.index.freq = 'D'
df.index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25',
               '2019-12-26', '2019-12-27', '2019-12-28', '2019-12-29',
               '2019-12-30', '2019-12-31'],
              dtype='datetime64[ns]', name='date', length=730, freq='D')
df.head(10)
sales
date
2018-01-01 24
2018-01-02 33
2018-01-03 24
2018-01-04 21
2018-01-05 24
2018-01-06 34
2018-01-07 39
2018-01-08 25
2018-01-09 21
2018-01-10 26