Case study: Combining multiple sources#
This case study will work with data taken from NHS England’s open diagnostic imaging dataset from 2019/20.
We will use three datasets
Diagnostic imaging referrals by provider
Diagnostic imaging waiting times by provider
Diagnostic imaging report times by provider.
We are only interested in the annual figures and will combine these into a single dataset. The formatting in the datasets also has a few minor (and annoying) issues, including how missing data is stored, that we will need to sort out before we can combine. The good news is that pandas
makes this relatively painless.
After the preprocessing we will create a subset of the data for the South West of England and save it to file.
We will use method chaining to create and preprocess the various DataFrames
we will use.
One area we can exploit here is that the datasets are ordered the same. So the same provider and diagnostic imaging type appears in the same row across the datasets. So the task is one of preprocessing and then concatenation by row.
Imports#
import pandas as pd
import numpy as np
Helper functions#
As method chaining will be used and also that we need to limit the data extracted from each dataframe two helper functions will be created. These accept a DataFrame
as a parameter and returns a subset of columns.
In method chaining we will use these with the method
.pipe()
. Don’t worry all should become clear when you see the code.
def extract_index_plus_annual_column(df):
return df[['region', 'org_code', 'provider', 'imaging_type', 'n_referrals']]
def extract_annual_column(df):
return df[[' Year ']]
Data URLS#
NREFS_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
+ '/hpdm139-datasets/main/di_counts.csv'
TEST_WAIT_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
+ '/hpdm139-datasets/main/di_rq_to_test.csv'
REPT_WAIT_URL = 'https://raw.githubusercontent.com/health-data-science-OR' \
+ '/hpdm139-datasets/main/di_test_to_report.csv'
A look at the individual datasets#
n_refs = pd.read_csv(NREFS_URL)
n_refs.head()
Region | Org Code | Provider name | Modality | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Jan | Feb | Mar | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Y56 | NT9 | Alliance Medical | Computerized Axial Tomography | 495 | 440 | 495 | 445 | 400 | 400 | 475 | 425 | 390 | 450 | 420 | 310 | 5,145 |
1 | Y56 | NT9 | Alliance Medical | Diagnostic Ultrasonography | 155 | 170 | 120 | 225 | 265 | 220 | 215 | 255 | 230 | 260 | 240 | 135 | 2,495 |
2 | Y56 | NT9 | Alliance Medical | Magnetic Resonance Imaging | 1,480 | 1,480 | 1,365 | 1,520 | 1,530 | 1,485 | 1,645 | 1,575 | 1,515 | 1,870 | 1,755 | 1,515 | 18,740 |
3 | Y56 | NT9 | Alliance Medical | Nuclear Medicine Procedure | . | . | . | . | . | . | . | . | . | * | . | . | * |
4 | Y56 | NT9 | Alliance Medical | Plain Radiography | 65 | 105 | 60 | 75 | 85 | 135 | 105 | 65 | 75 | 100 | 85 | 45 | 990 |
Download and pre-processing code.#
Number of referrals.#
# organisation info and num referrals
n_refs = (pd.read_csv(NREFS_URL)
.rename(columns={' Year ': 'n_referrals',
'Region': 'region',
'Org Code': 'org_code',
'Provider name': 'provider',
'Modality': 'imaging_type'})
# strip out white space
.assign(org_code=lambda x: x['org_code'].str.strip(),
imaging_type=lambda x: x['imaging_type'].str.strip(),
provider=lambda x: x['provider'].str.strip(),
region=lambda x: x['region'].str.strip())
.pipe(extract_index_plus_annual_column)
.rename(columns={' Year ': 'n_referrals'})
.replace(['*', ' * '], np.nan)
.assign(n_referrals=lambda x: x['n_referrals'].str.strip())
.assign(n_referrals=lambda x: x['n_referrals'].str.replace(',', ''))
# deal with NaN -> to float and then nullable int (or keep float)
# NaN is a float. Nullable int use pd.Int64Dtype()
.astype({'n_referrals': np.float})
.astype({'n_referrals': pd.Int32Dtype()})
)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[5], line 20
1 # organisation info and num referrals
2 n_refs = (pd.read_csv(NREFS_URL)
3 .rename(columns={' Year ': 'n_referrals',
4 'Region': 'region',
5 'Org Code': 'org_code',
6 'Provider name': 'provider',
7 'Modality': 'imaging_type'})
8 # strip out white space
9 .assign(org_code=lambda x: x['org_code'].str.strip(),
10 imaging_type=lambda x: x['imaging_type'].str.strip(),
11 provider=lambda x: x['provider'].str.strip(),
12 region=lambda x: x['region'].str.strip())
13 .pipe(extract_index_plus_annual_column)
14 .rename(columns={' Year ': 'n_referrals'})
15 .replace(['*', ' * '], np.nan)
16 .assign(n_referrals=lambda x: x['n_referrals'].str.strip())
17 .assign(n_referrals=lambda x: x['n_referrals'].str.replace(',', ''))
18 # deal with NaN -> to float and then nullable int (or keep float)
19 # NaN is a float. Nullable int use pd.Int64Dtype()
---> 20 .astype({'n_referrals': np.float})
21 .astype({'n_referrals': pd.Int32Dtype()})
22 )
File ~/miniforge3/envs/hds_code/lib/python3.11/site-packages/numpy/__init__.py:394, in __getattr__(attr)
389 warnings.warn(
390 f"In the future `np.{attr}` will be defined as the "
391 "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
393 if attr in __former_attrs__:
--> 394 raise AttributeError(__former_attrs__[attr])
396 if attr in __expired_attributes__:
397 raise AttributeError(
398 f"`np.{attr}` was removed in the NumPy 2.0 release. "
399 f"{__expired_attributes__[attr]}"
400 )
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
n_refs.head(2)
region | org_code | provider | imaging_type | n_referrals | |
---|---|---|---|---|---|
0 | Y56 | NT9 | Alliance Medical | Computerized Axial Tomography | 5145 |
1 | Y56 | NT9 | Alliance Medical | Diagnostic Ultrasonography | 2495 |
Request to diagnostic imaging.#
# days from request to imaging
rtt = (pd.read_csv(TEST_WAIT_URL)
.pipe(extract_annual_column)
.rename(columns={' Year ': 'mdn_days_rtt'})
.replace(['*', ' * ', '.'], np.nan)
.astype({'mdn_days_rtt': np.float})
)
rtt.head(2)
mdn_days_rtt | |
---|---|
0 | 1.0 |
1 | 10.0 |
Imaging to reporting of results#
# days from imaging to report
ttr = (pd.read_csv(REPT_WAIT_URL)
.pipe(extract_annual_column)
.rename(columns={' Year ': 'mdn_days_ttr'})
.replace(['*', ' * ', '.'], np.nan)
.astype({'mdn_days_ttr': 'float'})
)
ttr.head(2)
mdn_days_ttr | |
---|---|
0 | 5.0 |
1 | 0.0 |
Combine the three sources#
# combined dataset
imaging_df = pd.concat([counts, rtt, ttr], axis=1)
imaging_df.head()
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
0 | Y56 | NT9 | Alliance Medical | Computerized Axial Tomography | 5145 | 1.0 | 5.0 |
1 | Y56 | NT9 | Alliance Medical | Diagnostic Ultrasonography | 2495 | 10.0 | 0.0 |
2 | Y56 | NT9 | Alliance Medical | Magnetic Resonance Imaging | 18740 | 6.0 | 3.0 |
3 | Y56 | NT9 | Alliance Medical | Nuclear Medicine Procedure | <NA> | NaN | NaN |
4 | Y56 | NT9 | Alliance Medical | Plain Radiography | 990 | 50.0 | NaN |
imaging_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325 entries, 0 to 1324
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region 1325 non-null object
1 org_code 1325 non-null object
2 provider 1325 non-null object
3 imaging_type 1325 non-null object
4 n_referrals 1303 non-null Int32
5 mdn_days_rtt 1273 non-null float64
6 mdn_days_ttr 1281 non-null float64
dtypes: Int32(1), float64(2), object(4)
memory usage: 68.7+ KB
imaging_df.shape
(1325, 7)
Create South West dataset#
south_west = ['RB2', 'RK9', 'RA9', 'REF', 'RH8']
sw_imaging = imaging_df[imaging_df['org_code'].isin(south_west)]
sw_imaging.head(3)
region | org_code | provider | imaging_type | n_referrals | mdn_days_rtt | mdn_days_ttr | |
---|---|---|---|---|---|---|---|
275 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Computerized Axial Tomography | 46160 | 3.0 | 0.0 |
276 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Diagnostic Ultrasonography | 72985 | 14.0 | 0.0 |
277 | Y58 | REF | Royal Cornwall Hospitals NHS Trust | Fluoroscopy | 12320 | 0.0 | 0.0 |
sw_imaging.shape
(31, 7)
sw_imaging.to_csv('sw_imaging.csv', index=False)