Managing time series#
Time series data (and time series data science problems) involve the observation of one or more variables over time. Time series data are highly prevalent in electronic health records held by health delivery organisations such as the UK NHS and they frequently make analysis more difficult! Here are a few example data science problems you might encounter which are going to require you to work with one or more time series.
Evaluating if a change to a service improved the performance of an emergency department.
Forecasting the weekly demand that a outpatient clinic will experience over the next 3 months
Predicting an child’s length of stay in critical care based on vital sign observations and interventions they underwent during their journey to hospital.
If not done correctly, dates and times can be painful to use in coding! The good news is the numpy
and pandas
make handling dates the easiest it has ever been.
In this section we will learn how:
numpy
andpandas
store date time data of different frequencies;to manipulate dates and times in
numpy
andpandas
to quickly read time series data into
pandas.DataFrame
to setup a
pandas.DataFrame
with aDataTimeIndex
import numpy as np
import pandas as pd
numpy.datatime64
#
The numpy
data type to manage datetimes is called datetime64
.
Static arrays of datetime64
#
To create an array containing date time data statically is very straightforward. We use the normal array syntax and include date in YYYY-MM-DD format. You also need to specify the dtype
as 'datetime64'
np.array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'],
dtype='datetime64')
array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'],
dtype='datetime64[D]')
Notice that the np.ndarray
has dtype=datetime64[D]
. The ‘D’ stands for the minimum unit of days.
Consider an alternative where we include hours. You need to include the letter ‘T’ (for timestamp) in string pass to the numpy.ndarray
np.array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'],
dtype='datetime64')
array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'],
dtype='datetime64[h]')
That time the dtype=datetime64[h]
where ‘h’ stands for hours. We can go further and try minutes.
np.array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15', '2019-07-14T00:05'],
dtype='datetime64')
array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15',
'2019-07-14T00:05'], dtype='datetime64[m]')
And now try seconds
np.array(['2019-07-11T00:13:59', '2019-07-12T00:15:30', '2019-07-13T00:15:20',
'2019-07-14T00:05:15'], dtype='datetime64')
array(['2019-07-11T00:13:59', '2019-07-12T00:15:30',
'2019-07-13T00:15:20', '2019-07-14T00:05:15'],
dtype='datetime64[s]')
and miliseconds
np.array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'],
dtype='datetime64')
array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'],
dtype='datetime64[ms]')
Quick creation of date arrays using np.arange
#
np.arange(start,stop,step)
(where stop is exclusive) is commonly used to produce an array of integers can be used to produce an array of evenly-spaced integers (particularly good for generating synthetic testing data).
np.arange
can also be used to generate a range of date time stamps. For example:
To experiment try changing the step argument to a different value
np.arange('2019-07-01', '2019-07-31', step=3, dtype='datetime64[D]')
array(['2019-07-01', '2019-07-04', '2019-07-07', '2019-07-10',
'2019-07-13', '2019-07-16', '2019-07-19', '2019-07-22',
'2019-07-25', '2019-07-28'], dtype='datetime64[D]')
Here is a bigger array:
foo = np.arange('2019-07-01', '2019-07-31', step=7, dtype='datetime64[m]')
foo.shape
(6172,)
To get all values in between two dates then omit the step argument. The below generates days between 1st and 10th August
np.arange('2019-07-01', '2019-07-10', dtype='datetime64[D]')
array(['2019-07-01', '2019-07-02', '2019-07-03', '2019-07-04',
'2019-07-05', '2019-07-06', '2019-07-07', '2019-07-08',
'2019-07-09'], dtype='datetime64[D]')
Date Time Index in pandas
#
pandas
uses a DateTimeIndex
that builds on numpy datetime64
data type. pandas
is definitely the easiest way to work with time series data in Python. One of the reasons for this is that pandas can detect and handle different formats of date strings in input files. Always watch out for problems with US -> UK dates and vice versa.
Static creation#
If you need to create some synthetic data for testing then you can use the pandas.date_range
function.
#note that by default pandas will assume the below is MM/DD/YYY
index = pd.date_range('1/3/2019', periods=7, freq='D')
index
DatetimeIndex(['2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06',
'2019-01-07', '2019-01-08', '2019-01-09'],
dtype='datetime64[ns]', freq='D')
A hourly date range
index = pd.date_range('1/1/2019', periods=7, freq='h')
index
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
'2019-01-01 02:00:00', '2019-01-01 03:00:00',
'2019-01-01 04:00:00', '2019-01-01 05:00:00',
'2019-01-01 06:00:00'],
dtype='datetime64[ns]', freq='h')
A ‘monthly start’ range.
index = pd.date_range('1/1/2019', periods=7, freq='MS')
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
'2019-05-01', '2019-06-01', '2019-07-01'],
dtype='datetime64[ns]', freq='MS')
Convert a list to datetime index
dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates)
index
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01'], dtype='datetime64[ns]', freq=None)
US to UK problems
dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates, dayfirst=True)
index
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)
Convert numpy array to datetime index
For data manipulation and analysis I often find myself moving between NumPy arrays and pandas DataFrames.
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[s]', freq=None)
Note that in the example above the frequency is None. That’s annoying and there are some forecasting tools in Python that will insist on having a frequency. There are two ways to sort this out.
#pass in the frequency argument
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates, freq='D')
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[s]', freq='D')
#set the frequency post-hoc
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index.freq = 'D'
index
DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64[s]', freq='D')
Finding the min|max andin a date time index and accessing a TimeStamp
index.min()
Timestamp('2019-07-11 00:00:00')
index.max()
Timestamp('2019-07-13 00:00:00')
print(index.min().year)
print(index.min().month)
print(index.min().days_in_month)
2019
7
31
Importing data from a CSV file#
First create a synthetic data set and save to csv
LAMBDA = 30
PERIODS = 365 * 2
idx = pd.date_range('2018-01-01', periods=PERIODS, freq='D')
# representing a count variable of sales og widgets with mean LAMBDA.
sales = np.random.poisson(LAMBDA, size=PERIODS)
df = pd.DataFrame(sales, index=idx)
df.columns = ['sales']
df.index.name = 'date'
df.head()
sales | |
---|---|
date | |
2018-01-01 | 31 |
2018-01-02 | 29 |
2018-01-03 | 24 |
2018-01-04 | 33 |
2018-01-05 | 35 |
df.to_csv('data/example_data1.csv') # save to file
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[22], line 1
----> 1 df.to_csv('data/example_data1.csv') # save to file
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/util/_decorators.py:333, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
327 if len(args) > num_allow_args:
328 warnings.warn(
329 msg.format(arguments=_format_argument_list(allow_args)),
330 FutureWarning,
331 stacklevel=find_stack_level(),
332 )
--> 333 return func(*args, **kwargs)
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/core/generic.py:3967, in NDFrame.to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
3956 df = self if isinstance(self, ABCDataFrame) else self.to_frame()
3958 formatter = DataFrameFormatter(
3959 frame=df,
3960 header=header,
(...)
3964 decimal=decimal,
3965 )
-> 3967 return DataFrameRenderer(formatter).to_csv(
3968 path_or_buf,
3969 lineterminator=lineterminator,
3970 sep=sep,
3971 encoding=encoding,
3972 errors=errors,
3973 compression=compression,
3974 quoting=quoting,
3975 columns=columns,
3976 index_label=index_label,
3977 mode=mode,
3978 chunksize=chunksize,
3979 quotechar=quotechar,
3980 date_format=date_format,
3981 doublequote=doublequote,
3982 escapechar=escapechar,
3983 storage_options=storage_options,
3984 )
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/formats/format.py:1014, in DataFrameRenderer.to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
993 created_buffer = False
995 csv_formatter = CSVFormatter(
996 path_or_buf=path_or_buf,
997 lineterminator=lineterminator,
(...)
1012 formatter=self.fmt,
1013 )
-> 1014 csv_formatter.save()
1016 if created_buffer:
1017 assert isinstance(path_or_buf, StringIO)
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/formats/csvs.py:251, in CSVFormatter.save(self)
247 """
248 Create the writer & save.
249 """
250 # apply compression and byte/text conversion
--> 251 with get_handle(
252 self.filepath_or_buffer,
253 self.mode,
254 encoding=self.encoding,
255 errors=self.errors,
256 compression=self.compression,
257 storage_options=self.storage_options,
258 ) as handles:
259 # Note: self.encoding is irrelevant here
260 self.writer = csvlib.writer(
261 handles.handle,
262 lineterminator=self.lineterminator,
(...)
267 quotechar=self.quotechar,
268 )
270 self._save()
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/common.py:749, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
747 # Only for write methods
748 if "r" not in mode and is_path:
--> 749 check_parent_directory(str(handle))
751 if compression:
752 if compression != "zstd":
753 # compression libraries do not like an explicit text-mode
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/pandas/io/common.py:616, in check_parent_directory(path)
614 parent = Path(path).parent
615 if not parent.is_dir():
--> 616 raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: 'data'
Now read in the data and let pandas know that the index is a date field using the parse_dates
argument.
df = pd.read_csv('data/example_data1.csv', index_col='date', parse_dates=True)
#you have to set this manually
df.index.freq = 'D'
df.index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25',
'2019-12-26', '2019-12-27', '2019-12-28', '2019-12-29',
'2019-12-30', '2019-12-31'],
dtype='datetime64[ns]', name='date', length=730, freq='D')
df.head(10)
sales | |
---|---|
date | |
2018-01-01 | 24 |
2018-01-02 | 33 |
2018-01-03 | 24 |
2018-01-04 | 21 |
2018-01-05 | 24 |
2018-01-06 | 34 |
2018-01-07 | 39 |
2018-01-08 | 25 |
2018-01-09 | 21 |
2018-01-10 | 26 |