ED data wrangling#

Emergency departments around the world must deal with highly attendance numbers on a daily basis. The following exercises work with multiple ED datasets. You will wrangle the dataset into a useful format using pandas and then visualise the time series using matplotlib.

The data sets used in these exercises are synthetic, but have been generated to reflect real emergency department demand in the United Kingdom.

Imports#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Datasets#

The dataset syn_ts_ed_long.csv contains data from 4 emergency departments in 2014. The data are stored in long (sometimes called tidy) format. You are provided with three columns: date (non unique date time formatted), hosp (int 1-4) and attends (int, daily number of attends at hosp \(i\))

The dataset syn_ts_wide.csv contains the same data in wide format. Each row now represents a unique date and each hospital ED has its own column.

Exercise 1#

Task 1:

  • Read the two datasets into a pandas dataframe and inspect the columns and data so that you understand the dataset description above.

Hints:

  • The URL’s for the datasets are provided below.

# your code here ...

LONG_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/syn_ts_ed_long.csv'
WIDE_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/syn_ts_ed_wide.csv'

Exercise 2:#

Assume you have only been provided with syn_ts_ed_long.csv.

Task:

  • Convert the data into wide format.

  • the output of your code should a pd.Dataframe equivalent to syn_ts_ed_wide.csv

  • Make a decision about the appropraite data types for each of the series. For example, by default the attendance column is an int64. Is this sensible? What other type of integer could the hospital columns be stored as?

Advanced Task:

  • Your data wrangling code should make use of chained commands in pandas.

Hints

  • There are various ways to complete this task. You may want to make use of pivot_table.

  • One complication with a pivot is that you end up with a MultiIndex column for the hospital and number of attends. This is not always particularly clear for labelling. An option is to remove the Mulit-index during wrangling. You could explore the of transposing the pd.Dataframe using .T and the .reset_index() to drop the index.

  • You may want to build up your code command by command to help debug as you go along.

  • Don’t forget about data types.

# your code here ...

Exercise 3:#

Now assume that you have been provided with the data in syn_ts_ed_wide.csv

Task:

  • Convert the dataset from wide format to long (tidy) format.

Advanced task

  • Your data wrangling code should make use of chained commands in pandas.

Hints:

  • Investigate the pandas function wide_to_long() or the function melt()

# your code here...

Exercise 4#

We will now move onto visualising the dataset using matplotlib

Task:

  • Using the wide format data, create a line plot of the data for the ED located at hospital 1.

  • Label the y axis ‘Attendances’

  • Label x axis ‘Date’

  • Use a fontsize of 12

  • Provide a background grid for the plot.

  • Save the plot as a .png file with dpi of 300.

Hints

  • Feel free to adapt the plot to improve its appearance using whatever matplotlib options you prefer.

# your code here ...

Exercise 5#

Task:

  • Create a grid of subplots with 1 column and 4 rows. Each subplot should display one of the hospital ED’s.

  • Label each subplot with the appropraite hospital.

  • Provide an overall figure y axis label of ‘ED Attendances’

  • Give the figure and appropriate sizing

Hints:

  • There are several ways to create a grid of subplots. The easiest for this problem is to use the factory function plt.subplots(). Refer back to the matplotlib sections in the book for help.

  • If you are using matplotlib version 3.4 or above you can use fig.supylabel() and figsupxlabel() to set an overall axis label.

# your solution here ...