Managing time series#

Time series data (and time series data science problems) involve the observation of one or more variables over time. Time series data are highly prevalent in electronic health records held by health delivery organisations such as the UK NHS and they frequently make analysis more difficult! Here are a few example data science problems you might encounter which are going to require you to work with one or more time series.

  • Evaluating if a change to a service improved the performance of an emergency department.

  • Forecasting the weekly demand that a outpatient clinic will experience over the next 3 months

  • Predicting an child’s length of stay in critical care based on vital sign observations and interventions they underwent during their journey to hospital.

If not done correctly, dates and times can be painful to use in coding! The good news is the numpy and pandas make handling dates the easiest it has ever been.

In this section we will learn how:

  • numpy and pandas store date time data of different frequencies;

  • to manipulate dates and times in numpy and pandas

  • to quickly read time series data into pandas.DataFrame

  • to setup a pandas.DataFrame with a DataTimeIndex

import numpy as np
import pandas as pd

numpy.datatime64#

The numpy data type to manage datetimes is called datetime64.

Static arrays of datetime64#

To create an array containing date time data statically is very straightforward. We use the normal array syntax and include date in YYYY-MM-DD format. You also need to specify the dtype as 'datetime64'

np.array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'], 
         dtype='datetime64')
array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'],
      dtype='datetime64[D]')

Notice that the np.ndarray has dtype=datetime64[D]. The ‘D’ stands for the minimum unit of days.

Consider an alternative where we include hours. You need to include the letter ‘T’ (for timestamp) in string pass to the numpy.ndarray

np.array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'], 
         dtype='datetime64')
array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'],
      dtype='datetime64[h]')

That time the dtype=datetime64[h] where ‘h’ stands for hours. We can go further and try minutes.

np.array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15', '2019-07-14T00:05'], 
         dtype='datetime64')
array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15',
       '2019-07-14T00:05'], dtype='datetime64[m]')

And now try seconds

np.array(['2019-07-11T00:13:59', '2019-07-12T00:15:30', '2019-07-13T00:15:20', 
          '2019-07-14T00:05:15'], dtype='datetime64')
array(['2019-07-11T00:13:59', '2019-07-12T00:15:30',
       '2019-07-13T00:15:20', '2019-07-14T00:05:15'],
      dtype='datetime64[s]')

and miliseconds

np.array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'], 
         dtype='datetime64')
array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'],
      dtype='datetime64[ms]')

Quick creation of date arrays using np.arange#

np.arange(start,stop,step) (where stop is exclusive) is commonly used to produce an array of integers can be used to produce an array of evenly-spaced integers (particularly good for generating synthetic testing data).

np.arange can also be used to generate a range of date time stamps. For example:

To experiment try changing the step argument to a different value

np.arange('2019-07-01', '2019-07-31', step=3, dtype='datetime64[D]')
array(['2019-07-01', '2019-07-04', '2019-07-07', '2019-07-10',
       '2019-07-13', '2019-07-16', '2019-07-19', '2019-07-22',
       '2019-07-25', '2019-07-28'], dtype='datetime64[D]')

Here is a bigger array:

foo = np.arange('2019-07-01', '2019-07-31', step=7, dtype='datetime64[m]')
foo.shape
(6172,)

To get all values in between two dates then omit the step argument. The below generates days between 1st and 10th August

np.arange('2019-07-01', '2019-07-10', dtype='datetime64[D]')
array(['2019-07-01', '2019-07-02', '2019-07-03', '2019-07-04',
       '2019-07-05', '2019-07-06', '2019-07-07', '2019-07-08',
       '2019-07-09'], dtype='datetime64[D]')

Date Time Index in pandas#

pandas uses a DateTimeIndex that builds on numpy datetime64 data type. pandas is definitely the easiest way to work with time series data in Python. One of the reasons for this is that pandas can detect and handle different formats of date strings in input files. Always watch out for problems with US -> UK dates and vice versa.

Static creation#

If you need to create some synthetic data for testing then you can use the pandas.date_range function.

#note that by default pandas will assume the below is MM/DD/YYY
index = pd.date_range('1/3/2019', periods=7, freq='D')
index
DatetimeIndex(['2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06',
               '2019-01-07', '2019-01-08', '2019-01-09'],
              dtype='datetime64[ns]', freq='D')
  • A hourly date range

index = pd.date_range('1/1/2019', periods=7, freq='h')
index
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
               '2019-01-01 02:00:00', '2019-01-01 03:00:00',
               '2019-01-01 04:00:00', '2019-01-01 05:00:00',
               '2019-01-01 06:00:00'],
              dtype='datetime64[ns]', freq='h')
  • A ‘monthly start’ range.

index = pd.date_range('1/1/2019', periods=7, freq='MS')
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
               '2019-05-01', '2019-06-01', '2019-07-01'],
              dtype='datetime64[ns]', freq='MS')

Convert a list to datetime index

dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates)
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01'], dtype='datetime64[ns]', freq=None)

US to UK problems

dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates, dayfirst=True)
index
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)

Convert numpy array to datetime index

For data manipulation and analysis I often find myself moving between NumPy arrays and pandas DataFrames.

arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[s]', freq=None)

Note that in the example above the frequency is None. That’s annoying and there are some forecasting tools in Python that will insist on having a frequency. There are two ways to sort this out.

#pass in the frequency argument
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates, freq='D')
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[s]', freq='D')
#set the frequency post-hoc
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index.freq = 'D'
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[s]', freq='D')

Finding the min|max andin a date time index and accessing a TimeStamp

index.min()
Timestamp('2019-07-11 00:00:00')
index.max()
Timestamp('2019-07-13 00:00:00')
print(index.min().year)
print(index.min().month)
print(index.min().days_in_month)
2019
7
31

Importing data from a CSV file#

First create a synthetic data set and save to csv

LAMBDA = 30
PERIODS = 365 * 2

idx = pd.date_range('2018-01-01', periods=PERIODS, freq='D')

# representing a count variable of sales og widgets with mean LAMBDA.
sales = np.random.poisson(LAMBDA, size=PERIODS) 
df = pd.DataFrame(sales, index=idx)
df.columns = ['sales']
df.index.name = 'date'
df.head()
sales
date
2018-01-01 31
2018-01-02 29
2018-01-03 24
2018-01-04 33
2018-01-05 35
df.to_csv('data/example_data1.csv') # save to file
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[22], line 1
----> 1 df.to_csv('data/example_data1.csv') # save to file

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/util/_decorators.py:333, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    327 if len(args) > num_allow_args:
    328     warnings.warn(
    329         msg.format(arguments=_format_argument_list(allow_args)),
    330         FutureWarning,
    331         stacklevel=find_stack_level(),
    332     )
--> 333 return func(*args, **kwargs)

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/core/generic.py:3967, in NDFrame.to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
   3956 df = self if isinstance(self, ABCDataFrame) else self.to_frame()
   3958 formatter = DataFrameFormatter(
   3959     frame=df,
   3960     header=header,
   (...)
   3964     decimal=decimal,
   3965 )
-> 3967 return DataFrameRenderer(formatter).to_csv(
   3968     path_or_buf,
   3969     lineterminator=lineterminator,
   3970     sep=sep,
   3971     encoding=encoding,
   3972     errors=errors,
   3973     compression=compression,
   3974     quoting=quoting,
   3975     columns=columns,
   3976     index_label=index_label,
   3977     mode=mode,
   3978     chunksize=chunksize,
   3979     quotechar=quotechar,
   3980     date_format=date_format,
   3981     doublequote=doublequote,
   3982     escapechar=escapechar,
   3983     storage_options=storage_options,
   3984 )

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/formats/format.py:1014, in DataFrameRenderer.to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
    993     created_buffer = False
    995 csv_formatter = CSVFormatter(
    996     path_or_buf=path_or_buf,
    997     lineterminator=lineterminator,
   (...)
   1012     formatter=self.fmt,
   1013 )
-> 1014 csv_formatter.save()
   1016 if created_buffer:
   1017     assert isinstance(path_or_buf, StringIO)

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/formats/csvs.py:251, in CSVFormatter.save(self)
    247 """
    248 Create the writer & save.
    249 """
    250 # apply compression and byte/text conversion
--> 251 with get_handle(
    252     self.filepath_or_buffer,
    253     self.mode,
    254     encoding=self.encoding,
    255     errors=self.errors,
    256     compression=self.compression,
    257     storage_options=self.storage_options,
    258 ) as handles:
    259     # Note: self.encoding is irrelevant here
    260     self.writer = csvlib.writer(
    261         handles.handle,
    262         lineterminator=self.lineterminator,
   (...)
    267         quotechar=self.quotechar,
    268     )
    270     self._save()

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/common.py:749, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    747 # Only for write methods
    748 if "r" not in mode and is_path:
--> 749     check_parent_directory(str(handle))
    751 if compression:
    752     if compression != "zstd":
    753         # compression libraries do not like an explicit text-mode

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/common.py:616, in check_parent_directory(path)
    614 parent = Path(path).parent
    615 if not parent.is_dir():
--> 616     raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")

OSError: Cannot save file into a non-existent directory: 'data'

Now read in the data and let pandas know that the index is a date field using the parse_dates argument.

df = pd.read_csv('data/example_data1.csv', index_col='date', parse_dates=True)
#you have to set this manually
df.index.freq = 'D'
df.index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25',
               '2019-12-26', '2019-12-27', '2019-12-28', '2019-12-29',
               '2019-12-30', '2019-12-31'],
              dtype='datetime64[ns]', name='date', length=730, freq='D')
df.head(10)
sales
date
2018-01-01 24
2018-01-02 33
2018-01-03 24
2018-01-04 21
2018-01-05 24
2018-01-06 34
2018-01-07 39
2018-01-08 25
2018-01-09 21
2018-01-10 26