Stroke data wrangling

Thrombolysis is a clot-busting treatment for people suffering from an ischemic stroke, where a blood clot prevents blood flow to part of the brain, causing brain tissue to die. In England, national data on thrombolysis treatment for strokes at individual hospitals is collected and stored centrally. This exercise makes use of a synthetic dataset based on the real data used in:

Allen M, Pearn K, Monks T, Bray BD, Everson R, Salmon A, James M, Stein K (2019). Can clinical audits be enhanced by pathway simulation and machine learning? an example from the acute stroke pathway. BMJ Open, 9(9).

The data you are presented with in a data science or machine learning study nearly always requires a preprocessing step. This may include wrangling the data into a format suitable for machine learning, understanding (and perhaps imputing) missing values and cleaning/creation of features. In this exercise you will need to wrangle the stroke thrombolysis dataset.

Exercise 1: Read and initial look

The dataset is held in synth_lysis.csv.

Task:

  • Read the stroke thrombolysis dataset into a pandas.DataFrame

  • Use appropriate pandas and DataFrame methods and functions to gain an overview of the dataset and the features it contains.

Hints:

  • You might look at:

    • Size of the dataset, feature (field/variable) naming, data types, missing data etc.

import pandas as pd
DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
            + 'hpdm139-datasets/main/synth_lysis.csv'

# your code here ...
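As a rough illustration of the kind of overview the hints describe, the sketch below reads a tiny made-up CSV and inspects its size, column names, data types and missing values. The toy column names and values are invented stand-ins, not the contents of synth_lysis.csv; for the real data you would pass DATA_URL to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical toy CSV standing in for synth_lysis.csv. The column names
# and values are invented for illustration only.
toy_csv = io.StringIO(
    "hospital,Male,Age,Clinic Time,label\n"
    "Hosp_1,Y,65,10,N\n"
    "Hosp_2,N,,25,Y\n"
    "Hosp_1,Y,80,,N\n"
)

toy = pd.read_csv(toy_csv)

n_rows, n_cols = toy.shape          # size of the dataset
column_names = list(toy.columns)    # feature naming
column_types = toy.dtypes           # data type of each column
missing_counts = toy.isna().sum()   # number of missing values per column

toy.info()                          # prints a compact summary of the above
print(toy.head())                   # first few rows
print(missing_counts)
```

DataFrame.describe() is another option if you also want summary statistics for the numeric columns.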

Exercise 2: Clean up the feature names

The naming of features in this dataset is inconsistent. There is mixed capitalisation and use of spaces in variable names. One feature is called label, which is not a particularly informative name; it indicates whether or not a patient received thrombolysis.

Task:

  • convert all feature names to lower case

  • remove all spaces from names

  • rename label to treated.

Hints:

  • Assuming your DataFrame is called df you can get and set the column names using df.columns

# your code here...
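One possible approach, sketched on a made-up DataFrame, is to use the vectorised string methods available on df.columns and then DataFrame.rename. The toy column names below are assumptions for illustration and may not match the real file.

```python
import pandas as pd

# Hypothetical toy data with inconsistent column names (invented).
toy = pd.DataFrame(
    {"hospital": ["Hosp_1"], "Male": ["Y"], "Clinic Time": [10], "label": ["N"]}
)

# Lower-case every name, then strip out spaces.
toy.columns = toy.columns.str.lower().str.replace(" ", "", regex=False)

# Give the label column a more meaningful name.
toy = toy.rename(columns={"label": "treated"})

print(list(toy.columns))
```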

Exercise 3: Create a pre-processing function

It is useful to cleanly organise your data wrangling code. Let’s create a skeleton pre-processing function now, before we get into any detailed wrangling.

There are a number of ways we might do this, from functions and classes to specialist libraries such as pyjanitor. Here we will prefer our own simple functions.

Task:

  • Create a function wrangle_lysis

  • The function should accept a str parameter specifying the data url or directory path and filename of our thrombolysis data set.

  • For now set the function up to read in the data (from exercise 1) and clean the feature headers (exercise 2).

  • The function should return a pd.DataFrame containing the stroke thrombolysis data.

Hints:

  • Get into the habit of writing a simple docstring for your functions.

This function will come in handy for the later exercises where you may make mistakes and muck up your nicely cleaned datasets! You can just reload and process them with one command after this exercise.

# your code here ...
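A minimal skeleton might look like the sketch below. It is one possible layout rather than a model answer, and it is demonstrated on a small temporary file with invented column names so that it runs without network access; with the real data you would call it with DATA_URL.

```python
import os
import tempfile
import pandas as pd


def wrangle_lysis(path):
    """
    Read the stroke thrombolysis data and tidy the feature names.

    Params:
    ------
    path: str
        URL or directory path and filename of the CSV file.

    Returns:
    -------
    pd.DataFrame
    """
    df = pd.read_csv(path)
    # Lower-case feature names and remove spaces.
    df.columns = df.columns.str.lower().str.replace(" ", "", regex=False)
    # Rename the label column to something more meaningful.
    return df.rename(columns={"label": "treated"})


# Demonstration on a hypothetical toy file (contents invented).
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_lysis.csv")
    with open(toy_path, "w") as f:
        f.write("hospital,Male,Clinic Time,label\nHosp_1,Y,10,N\n")
    toy = wrangle_lysis(toy_path)

print(list(toy.columns))
```

Further wrangling steps from the later exercises can then be added to the body of the function one at a time.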

Exercise 4: Explore categorical features

A number of features are categorical. For example, male contains two values Y (the patient is male) and N (the patient is not male).

Task:

  • List the categories contained in the following fields:

fields = ['hospital', 'male', 'severity', 'stroke_type', 'treated']
# your code here ...
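As a hedged sketch of how the categories might be listed, the example below uses Series.unique and Series.value_counts on a made-up DataFrame. Only two of the fields are included and the values are invented; passing dropna=False makes any missing values visible in the counts.

```python
import pandas as pd

# Hypothetical toy data (invented values); None marks a missing entry.
toy = pd.DataFrame(
    {"male": ["Y", "N", "Y"], "stroke_type": ["clot", None, "bleed"]}
)

toy_fields = ["male", "stroke_type"]

# Unique values in each field (missing values are included).
categories = {col: toy[col].unique().tolist() for col in toy_fields}

# Frequency of each category, keeping missing values in the count.
counts = toy["stroke_type"].value_counts(dropna=False)

for col in toy_fields:
    print(col, categories[col])
print(counts)
```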

Exercise 5: Encode categorical fields with 2 levels

In exercise 4, you should find that the male and treated columns have two levels (yes and no). If we take male as an example we can encode it as a single feature as follows:

\[ x_i = \begin{cases} 1 & \text{if the } i\text{th person is male} \\ 0 & \text{if the } i\text{th person is female} \end{cases} \]

Note: we will deal with stroke_type, which has two levels and missing data, in exercise 6.

Task:

  • Encode male and treated to be binary 0/1 fields.

  • Update wrangle_lysis to include this code.

Hints:

  • Try the pd.get_dummies function.

  • Remember that you only need one variable when a categorical field has two values. You can use drop_first=True to keep only one variable (just make sure it's the right one!).

# your code here ...
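The sketch below shows one way the encoding could work on invented data, assuming the two levels are coded as Y and N as described for male in exercise 4. pd.get_dummies orders the categories alphabetically, so drop_first=True drops the N column and keeps the Y column, which matches the 0/1 definition above. The dtype=int argument is used because recent pandas versions otherwise return boolean columns.

```python
import pandas as pd

# Hypothetical toy data (invented values).
toy = pd.DataFrame({"male": ["Y", "N", "Y"], "treated": ["N", "N", "Y"]})

# Encode both fields; drop_first=True removes the 'N' dummy for each,
# leaving male_Y and treated_Y as single 0/1 columns.
encoded = pd.get_dummies(
    toy, columns=["male", "treated"], drop_first=True, dtype=int
)

# Restore the original feature names.
encoded = encoded.rename(columns={"male_Y": "male", "treated_Y": "treated"})

print(encoded)
```

It is worth checking a few rows against the original values to confirm that 1 really does correspond to Y.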

Exercise 6: Encoding fields with > 2 categories

The process to encode features with more than two categories is almost identical to that used in exercise 5. For example, the hospital field contains 8 unique values. There are now two options:

  1. encode as 7 dummy variables, where all 0s indicate hospital 1.

  2. use a one-hot encoding and include 8 variables. The additional degree of freedom allows you to encode missing data as all zeros.

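Note that some methods, such as ordinary least squares regression, require you to take approach 1: with all 8 one-hot columns included, the columns always sum to one and are perfectly collinear with the intercept. More flexible machine learning approaches can handle approach 2. Here we will make use of approach 2.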
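To make the difference between the two options concrete, the sketch below encodes a made-up hospital column both ways. The toy column has only 3 invented hospital names rather than 8, so option 1 gives 2 columns and option 2 gives 3.

```python
import pandas as pd

# Hypothetical toy column with 3 hospitals (names invented).
toy = pd.DataFrame({"hospital": ["Hosp_1", "Hosp_2", "Hosp_3", "Hosp_1"]})

# Option 1: k - 1 dummy variables; the first category is the all-zero baseline.
dummies = pd.get_dummies(toy["hospital"], drop_first=True, dtype=int)

# Option 2: one-hot encoding with k variables, one per category.
one_hot = pd.get_dummies(toy["hospital"], drop_first=False, dtype=int)

print(dummies)
print(one_hot)
```

In the first table the rows for Hosp_1 are all zeros; in the second every observed hospital has its own column.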

Task:

  • Use a one-hot encoding on the hospital column.

  • Use a one-hot encoding on the stroke_type column. You should prefix the new encoded columns with stroke_type_, i.e. you will have two columns: stroke_type_clot and stroke_type_bleed.

Hints:

  • One hot encoding is just the same as calling pd.get_dummies, but we set drop_first=False.

# your code here ...
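As an illustrative sketch on invented data, the example below one-hot encodes a stroke_type column that contains a missing value. The prefix argument of pd.get_dummies controls the names of the new columns, and because dummy_na defaults to False, a row with a missing value is encoded as all zeros.

```python
import pandas as pd

# Hypothetical toy data (invented values); None marks a missing entry.
toy = pd.DataFrame(
    {"age": [65, 72, 80], "stroke_type": ["clot", None, "bleed"]}
)

# One-hot encode with an explicit prefix for the new column names.
stroke_dummies = pd.get_dummies(
    toy["stroke_type"], prefix="stroke_type", drop_first=False, dtype=int
)

# Swap the original column for its encoded version.
encoded = pd.concat([toy.drop(columns="stroke_type"), stroke_dummies], axis=1)

print(encoded)
```

The same call with prefix="hospital" could be applied to the hospital column, and both steps added to wrangle_lysis.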