Case study: Combining multiple sources#

This case study will work with data taken from NHS England’s open diagnostic imaging dataset from 2019/20.

We will use three datasets

  • Diagnostic imaging referrals by provider

  • Diagnostic imaging waiting times by provider

  • Diagnostic imaging report times by provider.

We are only interested in the annual figures and will combine these into a single dataset. The formatting in the datasets also has a few minor (and annoying) issues, including how missing data is stored, that we will need to sort out before we can combine. The good news is that pandas makes this relatively painless.

After the preprocessing we will create a subset of the data for the South West of England and save it to file.

We will use method chaining to create and preprocess the various DataFrames we will use.

One area we can exploit here is that the datasets are ordered the same. So the same provider and diagnostic imaging type appears in the same row across the datasets. So the task is one of preprocessing and then concatenation by row.

Imports#

import pandas as pd
import numpy as np

Helper functions#

As method chaining will be used and also that we need to limit the data extracted from each dataframe two helper functions will be created. These accept a DataFrame as a parameter and returns a subset of columns.

In method chaining we will use these with the method .pipe(). Don’t worry all should become clear when you see the code.

def extract_index_plus_annual_column(df):
    return df[['region', 'org_code', 'provider', 'imaging_type', 'n_referrals']]

def extract_annual_column(df):
    return df[[' Year ']]

Data URLS#

NREFS_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
            + '/hpdm139-datasets/main/di_counts.csv'
TEST_WAIT_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
                + '/hpdm139-datasets/main/di_rq_to_test.csv'
REPT_WAIT_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
                + '/hpdm139-datasets/main/di_test_to_report.csv'

A look at the individual datasets#

n_refs = pd.read_csv(NREFS_URL)
n_refs.head()
Region Org Code Provider name Modality Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Year
0 Y56 NT9 Alliance Medical Computerized Axial Tomography 495 440 495 445 400 400 475 425 390 450 420 310 5,145
1 Y56 NT9 Alliance Medical Diagnostic Ultrasonography 155 170 120 225 265 220 215 255 230 260 240 135 2,495
2 Y56 NT9 Alliance Medical Magnetic Resonance Imaging 1,480 1,480 1,365 1,520 1,530 1,485 1,645 1,575 1,515 1,870 1,755 1,515 18,740
3 Y56 NT9 Alliance Medical Nuclear Medicine Procedure . . . . . . . . . * . . *
4 Y56 NT9 Alliance Medical Plain Radiography 65 105 60 75 85 135 105 65 75 100 85 45 990

Download and pre-processing code.#

Number of referrals.#

# organisation info and num referrals
n_refs = (pd.read_csv(NREFS_URL)
            .rename(columns={' Year ': 'n_referrals',
                             'Region': 'region',
                             'Org Code': 'org_code',
                             'Provider name': 'provider',
                             'Modality': 'imaging_type'})
            # strip out white space
            .assign(org_code=lambda x: x['org_code'].str.strip(),
                    imaging_type=lambda x: x['imaging_type'].str.strip(),
                    provider=lambda x: x['provider'].str.strip(),
                    region=lambda x: x['region'].str.strip())
            .pipe(extract_index_plus_annual_column)
            .rename(columns={' Year ': 'n_referrals'})
            .replace(['*', ' * '], np.nan)
            .assign(n_referrals=lambda x: x['n_referrals'].str.strip())
            .assign(n_referrals=lambda x: x['n_referrals'].str.replace(',', ''))
            # deal with NaN -> to float and then nullable int (or keep float)
            # NaN is a float.  Nullable int use pd.Int64Dtype()
            .astype({'n_referrals': np.float})
            .astype({'n_referrals': pd.Int32Dtype()})
         )
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 20
      1 # organisation info and num referrals
      2 n_refs = (pd.read_csv(NREFS_URL)
      3             .rename(columns={' Year ': 'n_referrals',
      4                              'Region': 'region',
      5                              'Org Code': 'org_code',
      6                              'Provider name': 'provider',
      7                              'Modality': 'imaging_type'})
      8             # strip out white space
      9             .assign(org_code=lambda x: x['org_code'].str.strip(),
     10                     imaging_type=lambda x: x['imaging_type'].str.strip(),
     11                     provider=lambda x: x['provider'].str.strip(),
     12                     region=lambda x: x['region'].str.strip())
     13             .pipe(extract_index_plus_annual_column)
     14             .rename(columns={' Year ': 'n_referrals'})
     15             .replace(['*', ' * '], np.nan)
     16             .assign(n_referrals=lambda x: x['n_referrals'].str.strip())
     17             .assign(n_referrals=lambda x: x['n_referrals'].str.replace(',', ''))
     18             # deal with NaN -> to float and then nullable int (or keep float)
     19             # NaN is a float.  Nullable int use pd.Int64Dtype()
---> 20             .astype({'n_referrals': np.float})
     21             .astype({'n_referrals': pd.Int32Dtype()})
     22          )

File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/numpy/__init__.py:394, in __getattr__(attr)
    389     warnings.warn(
    390         f"In the future `np.{attr}` will be defined as the "
    391         "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    393 if attr in __former_attrs__:
--> 394     raise AttributeError(__former_attrs__[attr])
    396 if attr in __expired_attributes__:
    397     raise AttributeError(
    398         f"`np.{attr}` was removed in the NumPy 2.0 release. "
    399         f"{__expired_attributes__[attr]}"
    400     )

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
n_refs.head(2)
region org_code provider imaging_type n_referrals
0 Y56 NT9 Alliance Medical Computerized Axial Tomography 5145
1 Y56 NT9 Alliance Medical Diagnostic Ultrasonography 2495

Request to diagnostic imaging.#

# days from request to imaging
rtt = (pd.read_csv(TEST_WAIT_URL)
         .pipe(extract_annual_column)
         .rename(columns={' Year ': 'mdn_days_rtt'})
         .replace(['*', ' * ', '.'], np.nan)
         .astype({'mdn_days_rtt': np.float})
        )
rtt.head(2)
mdn_days_rtt
0 1.0
1 10.0

Imaging to reporting of results#

# days from imaging to report
ttr = (pd.read_csv(REPT_WAIT_URL)
         .pipe(extract_annual_column)
         .rename(columns={' Year ': 'mdn_days_ttr'})
         .replace(['*', ' * ', '.'], np.nan)
         .astype({'mdn_days_ttr': 'float'})
      )
ttr.head(2)
mdn_days_ttr
0 5.0
1 0.0

Combine the three sources#

# combined dataset
imaging_df = pd.concat([counts, rtt, ttr], axis=1)
imaging_df.head()
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
0 Y56 NT9 Alliance Medical Computerized Axial Tomography 5145 1.0 5.0
1 Y56 NT9 Alliance Medical Diagnostic Ultrasonography 2495 10.0 0.0
2 Y56 NT9 Alliance Medical Magnetic Resonance Imaging 18740 6.0 3.0
3 Y56 NT9 Alliance Medical Nuclear Medicine Procedure <NA> NaN NaN
4 Y56 NT9 Alliance Medical Plain Radiography 990 50.0 NaN
imaging_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325 entries, 0 to 1324
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   region        1325 non-null   object 
 1   org_code      1325 non-null   object 
 2   provider      1325 non-null   object 
 3   imaging_type  1325 non-null   object 
 4   n_referrals   1303 non-null   Int32  
 5   mdn_days_rtt  1273 non-null   float64
 6   mdn_days_ttr  1281 non-null   float64
dtypes: Int32(1), float64(2), object(4)
memory usage: 68.7+ KB
imaging_df.shape
(1325, 7)

Create South West dataset#

south_west = ['RB2', 'RK9', 'RA9', 'REF', 'RH8']
sw_imaging = imaging_df[imaging_df['org_code'].isin(south_west)]
sw_imaging.head(3)
region org_code provider imaging_type n_referrals mdn_days_rtt mdn_days_ttr
275 Y58 REF Royal Cornwall Hospitals NHS Trust Computerized Axial Tomography 46160 3.0 0.0
276 Y58 REF Royal Cornwall Hospitals NHS Trust Diagnostic Ultrasonography 72985 14.0 0.0
277 Y58 REF Royal Cornwall Hospitals NHS Trust Fluoroscopy 12320 0.0 0.0
sw_imaging.shape
(31, 7)
sw_imaging.to_csv('sw_imaging.csv', index=False)