Case study: Combining multiple sources

Case study: Combining multiple sources#

This case study will work with data taken from NHS England’s open diagnostic imaging dataset from 2019/20.

We will use three datasets

Diagnostic imaging referrals by provider
Diagnostic imaging waiting times by provider
Diagnostic imaging report times by provider.

We are only interested in the annual figures and will combine these into a single dataset. The formatting in the datasets also has a few minor (and annoying) issues, including how missing data is stored, that we will need to sort out before we can combine. The good news is that pandas makes this relatively painless.

After the preprocessing we will create a subset of the data for the South West of England and save it to file.

We will use method chaining to create and preprocess the various DataFrames we will use.

One area we can exploit here is that the datasets are ordered the same. So the same provider and diagnostic imaging type appears in the same row across the datasets. So the task is one of preprocessing and then concatenation by row.

Imports#

import pandas as pd
import numpy as np

Helper functions#

As method chaining will be used and also that we need to limit the data extracted from each dataframe two helper functions will be created. These accept a DataFrame as a parameter and returns a subset of columns.

In method chaining we will use these with the method .pipe(). Don’t worry all should become clear when you see the code.

def extract_index_plus_annual_column(df):
    return df[['region', 'org_code', 'provider', 'imaging_type', 'n_referrals']]

def extract_annual_column(df):
    return df[[' Year ']]

Data URLS#

NREFS_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
            + '/hpdm139-datasets/main/di_counts.csv'
TEST_WAIT_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
                + '/hpdm139-datasets/main/di_rq_to_test.csv'
REPT_WAIT_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
                + '/hpdm139-datasets/main/di_test_to_report.csv'

A look at the individual datasets#

n_refs = pd.read_csv(NREFS_URL)
n_refs.head()

	Region	Org Code	Provider name	Modality	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Jan	Feb	Mar	Year
0	Y56	NT9	Alliance Medical	Computerized Axial Tomography	495	440	495	445	400	400	475	425	390	450	420	310	5,145
1	Y56	NT9	Alliance Medical	Diagnostic Ultrasonography	155	170	120	225	265	220	215	255	230	260	240	135	2,495
2	Y56	NT9	Alliance Medical	Magnetic Resonance Imaging	1,480	1,480	1,365	1,520	1,530	1,485	1,645	1,575	1,515	1,870	1,755	1,515	18,740
3	Y56	NT9	Alliance Medical	Nuclear Medicine Procedure	.	.	.	.	.	.	.	.	.	*	.	.	*
4	Y56	NT9	Alliance Medical	Plain Radiography	65	105	60	75	85	135	105	65	75	100	85	45	990

Download and pre-processing code.#

Number of referrals.#

# organisation info and num referrals
n_refs = (pd.read_csv(NREFS_URL)
            .rename(columns={' Year ': 'n_referrals',
                             'Region': 'region',
                             'Org Code': 'org_code',
                             'Provider name': 'provider',
                             'Modality': 'imaging_type'})
            # strip out white space
            .assign(org_code=lambda x: x['org_code'].str.strip(),
                    imaging_type=lambda x: x['imaging_type'].str.strip(),
                    provider=lambda x: x['provider'].str.strip(),
                    region=lambda x: x['region'].str.strip())
            .pipe(extract_index_plus_annual_column)
            .rename(columns={' Year ': 'n_referrals'})
            .replace(['*', ' * '], np.nan)
            .assign(n_referrals=lambda x: x['n_referrals'].str.strip())
            .assign(n_referrals=lambda x: x['n_referrals'].str.replace(',', ''))
            # deal with NaN -> to float and then nullable int (or keep float)
            # NaN is a float.  Nullable int use pd.Int64Dtype()
            .astype({'n_referrals': np.float})
            .astype({'n_referrals': pd.Int32Dtype()})
         )

n_refs.head(2)

	region	org_code	provider	imaging_type	n_referrals
0	Y56	NT9	Alliance Medical	Computerized Axial Tomography	5145
1	Y56	NT9	Alliance Medical	Diagnostic Ultrasonography	2495

Request to diagnostic imaging.#

# days from request to imaging
rtt = (pd.read_csv(TEST_WAIT_URL)
         .pipe(extract_annual_column)
         .rename(columns={' Year ': 'mdn_days_rtt'})
         .replace(['*', ' * ', '.'], np.nan)
         .astype({'mdn_days_rtt': np.float})
        )

rtt.head(2)

	mdn_days_rtt
0	1.0
1	10.0

Imaging to reporting of results#

# days from imaging to report
ttr = (pd.read_csv(REPT_WAIT_URL)
         .pipe(extract_annual_column)
         .rename(columns={' Year ': 'mdn_days_ttr'})
         .replace(['*', ' * ', '.'], np.nan)
         .astype({'mdn_days_ttr': 'float'})
      )

ttr.head(2)

	mdn_days_ttr
0	5.0
1	0.0

Combine the three sources#

# combined dataset
imaging_df = pd.concat([counts, rtt, ttr], axis=1)
imaging_df.head()

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt	mdn_days_ttr
0	Y56	NT9	Alliance Medical	Computerized Axial Tomography	5145	1.0	5.0
1	Y56	NT9	Alliance Medical	Diagnostic Ultrasonography	2495	10.0	0.0
2	Y56	NT9	Alliance Medical	Magnetic Resonance Imaging	18740	6.0	3.0
3	Y56	NT9	Alliance Medical	Nuclear Medicine Procedure	<NA>	NaN	NaN
4	Y56	NT9	Alliance Medical	Plain Radiography	990	50.0	NaN

imaging_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325 entries, 0 to 1324
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   region        1325 non-null   object 
 1   org_code      1325 non-null   object 
 2   provider      1325 non-null   object 
 3   imaging_type  1325 non-null   object 
 4   n_referrals   1303 non-null   Int32  
 5   mdn_days_rtt  1273 non-null   float64
 6   mdn_days_ttr  1281 non-null   float64
dtypes: Int32(1), float64(2), object(4)
memory usage: 68.7+ KB

imaging_df.shape

(1325, 7)

Create South West dataset#

south_west = ['RB2', 'RK9', 'RA9', 'REF', 'RH8']
sw_imaging = imaging_df[imaging_df['org_code'].isin(south_west)]
sw_imaging.head(3)

	region	org_code	provider	imaging_type	n_referrals	mdn_days_rtt
275	Y58	REF	Royal Cornwall Hospitals NHS Trust	Computerized Axial Tomography	46160	3.0
276	Y58	REF	Royal Cornwall Hospitals NHS Trust	Diagnostic Ultrasonography	72985	14.0
277	Y58	REF	Royal Cornwall Hospitals NHS Trust	Fluoroscopy	12320	0.0

sw_imaging.shape

(31, 7)

sw_imaging.to_csv('sw_imaging.csv', index=False)