{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "36c8e24b-064a-41b3-b3d5-1bef5e9e1ba0",
   "metadata": {},
   "source": [
    "# Stroke data wrangling\n",
    "\n",
    "Thrombolysis is a clot busting treatment for people suffering from a ischemic stroke: where a blood clot is preventing blood flow to the brain causing it to die. In England, national data on thrombolysis treatment at individual hospitals strokes is collected and stored centrally.   This exercise will make use of synthetic dataset based on the real data used in \n",
    "\n",
    "**Allen M, Pearn K, Monks T, Bray BD, Everson R, Salmon A, James M, Stein K (2019). *Can clinical audits be enhanced by pathway simulation and machine learning? an example from the acute stroke pathway*. BMJ Open, 9(9).**\n",
    "\n",
    "The data you are presented with in a data science or machine learning study nearly always requires a preprocessing step.  This may include wrangling the data into a format suitable for machine learning, understanding (and perhaps imputing) missing values and cleaning/creation of features.  In this exercise you will need to wrangle the stroke thrombolysis dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0af0601e-891a-49da-8e54-e3eccfd87e0c",
   "metadata": {},
   "source": [
    "## Exercise 1: Read and initial look\n",
    "\n",
    "The dataset is held in `synth_lysis.csv`.\n",
    "\n",
    "**Task**:\n",
    "* Read the stroke thrombolysis dataset into a `pandas.DataFrame`\n",
    "* Use appropriate `pandas` and `DataFrame` methods and functions to gain an overview of the dataset and the features it contains.\n",
    "\n",
    "**Hints**:\n",
    "* You might look at:\n",
    "    * Size of the dataset, feature (field/variable) naming, data types, missing data etc."
   ]
  },
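  {
   "cell_type": "markdown",
   "id": "c3a9f1d2-7b64-4e0a-8f25-6d1e9b0a4c11",
   "metadata": {},
   "source": [
    "As a rough illustration of the hints above, the sketch below applies a few standard `pandas` overview tools to a small made-up table. The `toy` data frame and its column names are invented for this example and are not part of the stroke dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data (not the stroke dataset)\n",
    "toy = pd.DataFrame({'Ward Name': ['A', 'B', 'C'],\n",
    "                    'beds': [12, 8, 20],\n",
    "                    'notes': ['ok', np.nan, np.nan]})\n",
    "\n",
    "n_rows, n_cols = toy.shape        # size of the dataset\n",
    "names = list(toy.columns)         # feature naming\n",
    "dtypes = toy.dtypes               # data type of each column\n",
    "n_missing = toy.isna().sum()      # missing values per column\n",
    "\n",
    "print(n_rows, n_cols)\n",
    "print(names)\n",
    "print(dtypes)\n",
    "print(n_missing)\n",
    "```\n",
    "\n",
    "`DataFrame.info()` reports most of this in a single call, while `DataFrame.head()` shows the first few rows."
   ]
  },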
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f4cf4aaf-af48-4201-9bed-15b8db2de05c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "DATA_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \\\n",
    "            + 'hpdm139-datasets/main/synth_lysis.csv'\n",
    "\n",
    "# your code here ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "74dc4182-bd26-433e-aa31-5df9f15a486e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 2000 entries, 0 to 1999\n",
      "Data columns (total 8 columns):\n",
      " #   Column                Non-Null Count  Dtype \n",
      "---  ------                --------------  ----- \n",
      " 0   hospital              2000 non-null   object\n",
      " 1   Male                  2000 non-null   object\n",
      " 2   Age                   2000 non-null   int64 \n",
      " 3   severity              1753 non-null   object\n",
      " 4   stroke type           1801 non-null   object\n",
      " 5   Comorbidity           915 non-null    object\n",
      " 6   S2RankinBeforeStroke  2000 non-null   int64 \n",
      " 7   label                 2000 non-null   object\n",
      "dtypes: int64(2), object(6)\n",
      "memory usage: 125.1+ KB\n"
     ]
    }
   ],
   "source": [
    "# example solution\n",
    "lysis = pd.read_csv(DATA_URL)\n",
    "\n",
    "# take a look at basic info - size of dataset, features, datatypes, missing data\n",
    "lysis.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f5747d53-8067-4514-81f0-e08a4441c3c2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hospital</th>\n",
       "      <th>Male</th>\n",
       "      <th>Age</th>\n",
       "      <th>severity</th>\n",
       "      <th>stroke type</th>\n",
       "      <th>Comorbidity</th>\n",
       "      <th>S2RankinBeforeStroke</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hosp_7</td>\n",
       "      <td>Y</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>N</td>\n",
       "      <td>99</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>N</td>\n",
       "      <td>49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hosp_1</td>\n",
       "      <td>N</td>\n",
       "      <td>77</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>N</td>\n",
       "      <td>86</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  hospital Male  Age            severity stroke type   Comorbidity  \\\n",
       "0   Hosp_7    Y   65               Minor        clot           NaN   \n",
       "1   Hosp_8    N   99  Moderate to severe        clot           NaN   \n",
       "2   Hosp_8    N   49                 NaN        clot           NaN   \n",
       "3   Hosp_1    N   77            Moderate        clot  Hypertension   \n",
       "4   Hosp_8    N   86               Minor        clot  Hypertension   \n",
       "\n",
       "   S2RankinBeforeStroke label  \n",
       "0                     0     N  \n",
       "1                     3     N  \n",
       "2                     0     N  \n",
       "3                     0     Y  \n",
       "4                     0     N  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lysis.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41a87e5f-df8a-48a6-92be-39cc11e90643",
   "metadata": {},
   "source": [
    "## Exercise 2: Clean up the feature names\n",
    "\n",
    "The naming of features in this dataset is inconsistent.  There is mixed capitalisation and use of spaces in variable names. A feature is called `label` which is not particularly useful.  This is the label indicating if a patient recieved thrombolysis or not.  \n",
    "\n",
    "**Task**:\n",
    "* convert all feature names to lower case \n",
    "* remove all spaces from names\n",
    "* rename `label` to `treated`.\n",
    "\n",
    "**Hints**:\n",
    "* Assuming your `DataFrame` is called `df` you can get and set the column names using `df.columns`"
   ]
  },
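  {
   "cell_type": "markdown",
   "id": "e5b0a7c4-2d19-4f63-b8a1-0c7d3e9f5a22",
   "metadata": {},
   "source": [
    "The hint about `df.columns` can be sketched on a small made-up table. The `toy` data frame and its column names are assumptions for illustration only, and the transformation shown (upper-casing) is deliberately different from the one the task asks for.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data (not the stroke dataset)\n",
    "toy = pd.DataFrame({'first': [1, 2], 'second': [3, 4]})\n",
    "\n",
    "# get: df.columns holds the current names\n",
    "current = list(toy.columns)\n",
    "\n",
    "# set: assign a list of the same length to replace every name at once\n",
    "toy.columns = [name.upper() for name in current]\n",
    "\n",
    "# rename a single column by name, leaving the others untouched\n",
    "toy = toy.rename(columns={'SECOND': 'OTHER'})\n",
    "\n",
    "print(list(toy.columns))\n",
    "```\n",
    "\n",
    "Renaming by name with `rename` does not depend on the position of the column, which may be safer than indexing into the list of names if the column order ever changes."
   ]
  },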
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "44442dfd-94f3-4139-9c1c-475c117982b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code here..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f97c9752-704e-4668-8748-e657f7ae06bb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hospital</th>\n",
       "      <th>male</th>\n",
       "      <th>age</th>\n",
       "      <th>severity</th>\n",
       "      <th>stroke_type</th>\n",
       "      <th>comorbidity</th>\n",
       "      <th>s2rankinbeforestroke</th>\n",
       "      <th>treated</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hosp_7</td>\n",
       "      <td>Y</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>N</td>\n",
       "      <td>99</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>N</td>\n",
       "      <td>49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hosp_1</td>\n",
       "      <td>N</td>\n",
       "      <td>77</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>N</td>\n",
       "      <td>86</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  hospital male  age            severity stroke_type   comorbidity  \\\n",
       "0   Hosp_7    Y   65               Minor        clot           NaN   \n",
       "1   Hosp_8    N   99  Moderate to severe        clot           NaN   \n",
       "2   Hosp_8    N   49                 NaN        clot           NaN   \n",
       "3   Hosp_1    N   77            Moderate        clot  Hypertension   \n",
       "4   Hosp_8    N   86               Minor        clot  Hypertension   \n",
       "\n",
       "   s2rankinbeforestroke treated  \n",
       "0                     0       N  \n",
       "1                     3       N  \n",
       "2                     0       N  \n",
       "3                     0       Y  \n",
       "4                     0       N  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# example solution\n",
    "feature_names = list(lysis.columns)\n",
    "feature_names = [s.lower().replace(' ', '_') for s in feature_names]\n",
    "feature_names[-1] = 'treated'\n",
    "lysis.columns = feature_names\n",
    "lysis.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ff3bae5-26fd-43f4-8ecc-06b2fb0fbfa0",
   "metadata": {},
   "source": [
    "## Exercise 3: Create a pre-processing function\n",
    "\n",
    "It is useful to cleanly organise your data wrangling code.  Let's create a skeleton of one now before we get into any detailed wrangling.\n",
    "\n",
    "There are a number of ways we might do this from functions, classes to specialist libraries such as [`pyjanitor`](https://pyjanitor.readthedocs.io/index.html#).  Here we will prefer our own simple functions. \n",
    "\n",
    "**Task**:\n",
    "* Create a function `wrangle_lysis`\n",
    "* The function should accept a `str` parameter specifying the data url or directory path and filename of our thrombolysis data set.\n",
    "* For now set the function up to read in the data (from exercise 1) and clean the feature headers (exercise 2). \n",
    "* The function should return a `pd.DataFrame` containing the stroke thrombolysis data.\n",
    "\n",
    "**Hints**:\n",
    "* Get into the habit of writing a simple docstring for your functions.\n",
    "\n",
    "> This function will come in handy for the later exercises where you may make mistakes and muck up your nicely cleaned datasets!  You can just reload and process them with one command after this exercise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "74f726fd-bef5-4f9c-a3b8-f48c4cd4222c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code here ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5666a510-d581-400d-969c-a8cac134dd84",
   "metadata": {},
   "outputs": [],
   "source": [
    "# example solution\n",
    "\n",
    "def wrangle_lysis(path):\n",
    "    '''\n",
    "    Preprocess and clean the stroke thrombolysis dataset.\n",
    "    \n",
    "    Params:\n",
    "    -------\n",
    "    path: str\n",
    "        URL or directory path and filename \n",
    "        \n",
    "    Returns:\n",
    "    --------\n",
    "    pd.DataFrame\n",
    "        Preprocessed stroke thrombolysis data\n",
    "    '''\n",
    "    lysis = pd.read_csv(path)\n",
    "    lysis.columns = clean_feature_names(list(lysis.columns))\n",
    "    return lysis\n",
    "\n",
    "def clean_feature_names(current_feature_names):\n",
    "    '''\n",
    "    Clean the stroke lysis feature names.\n",
    "    1. All lower case\n",
    "    2. Replace spaces with '_'\n",
    "    3. Rename 'label' column to 'treated'\n",
    "    \n",
    "    Params:\n",
    "    ------\n",
    "    current_feature_names: list\n",
    "        List of the feature names \n",
    "        \n",
    "    Returns:\n",
    "    -------\n",
    "    list\n",
    "        A modified list of feature names \n",
    "    '''\n",
    "    feature_names = [s.lower().replace(' ', '_') for s in current_feature_names]\n",
    "    feature_names[-1] = 'treated'\n",
    "    return feature_names\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b281e175-5448-497e-803c-c383e9fd3a35",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 2000 entries, 0 to 1999\n",
      "Data columns (total 8 columns):\n",
      " #   Column                Non-Null Count  Dtype \n",
      "---  ------                --------------  ----- \n",
      " 0   hospital              2000 non-null   object\n",
      " 1   male                  2000 non-null   object\n",
      " 2   age                   2000 non-null   int64 \n",
      " 3   severity              1753 non-null   object\n",
      " 4   stroke_type           1801 non-null   object\n",
      " 5   comorbidity           915 non-null    object\n",
      " 6   s2rankinbeforestroke  2000 non-null   int64 \n",
      " 7   treated               2000 non-null   object\n",
      "dtypes: int64(2), object(6)\n",
      "memory usage: 125.1+ KB\n"
     ]
    }
   ],
   "source": [
    "DATA_URL = 'data/synth_lysis.csv'\n",
    "lysis = wrangle_lysis(DATA_URL)\n",
    "lysis.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5022c1c-9900-4bfe-8341-62bfeb3dc501",
   "metadata": {},
   "source": [
    "## Exercise 4: Explore categorical features\n",
    "\n",
    "A number of features are categorical.  For example, `male` contains two values `Y` (the patient is male) and `N` (the patient is not male).\n",
    "\n",
    "**Task**:\n",
    "* List the categories contained in the following fields:\n",
    "\n",
    "```python\n",
    "fields = ['hospital', 'male', 'severity', 'stroke_type', 'treated']\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a8b1e7e0-db8f-4939-b41f-326fcf83fc7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code here ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7fc8b863-6f00-45a1-8f44-d97220aed03a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Hosp_7', 'Hosp_8', 'Hosp_1', 'Hosp_6', 'Hosp_2', 'Hosp_4',\n",
       "       'Hosp_3', 'Hosp_5'], dtype=object)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# example solution\n",
    "# one option is to list the unique fields individually e.g.\n",
    "# note you could also sort the array - but watch out for NaN's\n",
    "import numpy as np\n",
    "\n",
    "lysis['hospital'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8eb677b9-617c-463e-b201-54883cdae4b2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hospital: ['Hosp_7' 'Hosp_8' 'Hosp_1' 'Hosp_6' 'Hosp_2' 'Hosp_4' 'Hosp_3' 'Hosp_5']\n",
      "male: ['Y' 'N']\n",
      "severity: ['Minor' 'Moderate to severe' nan 'Moderate' 'Severe' 'No stroke symtpoms']\n",
      "stroke_type: ['clot' 'bleed' nan]\n",
      "treated: ['N' 'Y']\n"
     ]
    }
   ],
   "source": [
    "# option 2 - do this in a loop\n",
    "categorical_features = ['hospital', 'male', 'severity', 'stroke_type', 'treated']\n",
    "\n",
    "for feature in categorical_features:\n",
    "    print(f'{feature}: {lysis[feature].unique()}')"
   ]
  },
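  {
   "cell_type": "markdown",
   "id": "9f4c2b8e-1a73-4d56-a0e9-7b3f6c1d8e33",
   "metadata": {},
   "source": [
    "The example solution notes that sorting the unique values needs care when `NaN` is present. A minimal sketch of one way to handle this is shown below, using a made-up `toy` series rather than the stroke data; the values are assumptions chosen only to include a missing entry.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data (not the stroke dataset)\n",
    "toy = pd.Series(['Minor', np.nan, 'Severe', 'Minor', 'Moderate'],\n",
    "                name='severity')\n",
    "\n",
    "# sorted() raises a TypeError when str values are mixed with NaN (a float),\n",
    "# so drop the missing values before sorting\n",
    "levels = sorted(toy.dropna().unique())\n",
    "\n",
    "# count each level; dropna=False keeps a separate row for missing values\n",
    "counts = toy.value_counts(dropna=False)\n",
    "\n",
    "print(levels)\n",
    "print(counts)\n",
    "```\n",
    "\n",
    "`value_counts` also shows how many rows fall into each category, which can be more informative than the list of levels alone."
   ]
  },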
  {
   "cell_type": "markdown",
   "id": "eac4129c-0b46-45a8-818c-18d28c95ff64",
   "metadata": {},
   "source": [
    "## Exercise 5: Encode categorical fields with 2 levels \n",
    "\n",
    "In exercise 4, you should find that the `male` and `treated` columns have two levels (yes and no). If we take `male` as an example we can encode it as a single feature as follows:\n",
    "\n",
    "$$ x_i = \\Bigg\\{ \\begin{matrix} 1 & \\mbox{if }i \\mbox{th person is male}\n",
    "\\\\  0 & \\mbox{if }i \\mbox{th person is female}\\end{matrix} $$\n",
    "\n",
    "> Note: we will deal with `stroke_type` which has two levels and missing data in exercise 6.\n",
    "\n",
    "**Task**:\n",
    "* Encode `male` and `treated` to be binary 0/1 fields.\n",
    "* Update `wrangle_lysis` to include this code.\n",
    "\n",
    "**Hints**\n",
    "* [Try the `pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function. \n",
    "* Remember that you only need one variable when a categorical field has two values. You can use the `drop_first=True` to keep only one variable (just make sure its the right one!).   "
   ]
  },
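  {
   "cell_type": "markdown",
   "id": "b7e2c1a4-5d3f-4e8a-9c6b-2f1a0d9e8c7b",
   "metadata": {},
   "source": [
    "To see which level `drop_first=True` keeps, the sketch below encodes a made-up two-level column. The `toy` data frame and the `smoker` column are invented for illustration. Passing `dtype=int` is a choice made here to request 0/1 values explicitly, since the default output type of `pd.get_dummies` differs between `pandas` versions.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data (not the stroke dataset)\n",
    "toy = pd.DataFrame({'smoker': ['Y', 'N', 'N', 'Y']})\n",
    "\n",
    "# the dummy columns are created in sorted order ('N' then 'Y');\n",
    "# drop_first=True removes 'N' and leaves only the 'Y' indicator\n",
    "dummies = pd.get_dummies(toy['smoker'], drop_first=True, dtype=int)\n",
    "\n",
    "# keep the encoded values alongside the original for a visual check\n",
    "toy['smoker_encoded'] = dummies['Y']\n",
    "\n",
    "print(toy)\n",
    "```\n",
    "\n",
    "Comparing the encoded column with the original is a quick way to confirm that the intended level was kept."
   ]
  },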
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "261cd1ba-6518-4a35-ae7a-8e73a1f74a80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code here ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "722ebcb9-8f2b-4867-9354-ad77be1747da",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hospital</th>\n",
       "      <th>_male</th>\n",
       "      <th>male</th>\n",
       "      <th>age</th>\n",
       "      <th>severity</th>\n",
       "      <th>stroke_type</th>\n",
       "      <th>comorbidity</th>\n",
       "      <th>s2rankinbeforestroke</th>\n",
       "      <th>treated</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hosp_7</td>\n",
       "      <td>1</td>\n",
       "      <td>Y</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>99</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hosp_1</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>77</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>86</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  hospital  _male male  age            severity stroke_type   comorbidity  \\\n",
       "0   Hosp_7      1    Y   65               Minor        clot           NaN   \n",
       "1   Hosp_8      0    N   99  Moderate to severe        clot           NaN   \n",
       "2   Hosp_8      0    N   49                 NaN        clot           NaN   \n",
       "3   Hosp_1      0    N   77            Moderate        clot  Hypertension   \n",
       "4   Hosp_8      0    N   86               Minor        clot  Hypertension   \n",
       "\n",
       "   s2rankinbeforestroke treated  \n",
       "0                     0       N  \n",
       "1                     3       N  \n",
       "2                     0       N  \n",
       "3                     0       Y  \n",
       "4                     0       N  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# example solution\n",
    "# drop_first=True keeps the Y column\n",
    "male = pd.get_dummies(lysis['male'], drop_first=True)\n",
    "male.columns = ['_male']\n",
    "\n",
    "# we will insert to just double check\n",
    "lysis.insert(1,'_male', male)\n",
    "\n",
    "# check - looks okay 1's and 0's match with Y and N.\n",
    "lysis.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3989261e-1565-4642-ab33-e669085a7b4c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hospital</th>\n",
       "      <th>_male</th>\n",
       "      <th>male</th>\n",
       "      <th>age</th>\n",
       "      <th>severity</th>\n",
       "      <th>stroke_type</th>\n",
       "      <th>comorbidity</th>\n",
       "      <th>s2rankinbeforestroke</th>\n",
       "      <th>_treated</th>\n",
       "      <th>treated</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hosp_7</td>\n",
       "      <td>1</td>\n",
       "      <td>Y</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>99</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hosp_1</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>77</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "      <td>86</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  hospital  _male male  age            severity stroke_type   comorbidity  \\\n",
       "0   Hosp_7      1    Y   65               Minor        clot           NaN   \n",
       "1   Hosp_8      0    N   99  Moderate to severe        clot           NaN   \n",
       "2   Hosp_8      0    N   49                 NaN        clot           NaN   \n",
       "3   Hosp_1      0    N   77            Moderate        clot  Hypertension   \n",
       "4   Hosp_8      0    N   86               Minor        clot  Hypertension   \n",
       "\n",
       "   s2rankinbeforestroke  _treated treated  \n",
       "0                     0         0       N  \n",
       "1                     3         0       N  \n",
       "2                     0         0       N  \n",
       "3                     0         1       Y  \n",
       "4                     0         0       N  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "treated = pd.get_dummies(lysis['treated'], drop_first=True)\n",
    "lysis.insert(len(lysis.columns)-1,'_treated', treated)\n",
    "\n",
    "# check - looks okay 1's and 0's match with Y and N.\n",
    "lysis.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0f0ea2e8-c4b6-4a88-a031-54a0fa29bbf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# update preprocessing function\n",
    "\n",
    "# example solution\n",
    "\n",
    "def wrangle_lysis(path):\n",
    "    '''\n",
    "    Preprocess and clean the stroke thrombolysis dataset.\n",
    "    \n",
    "    Params:\n",
    "    -------\n",
    "    path: str\n",
    "        URL or directory path and filename \n",
    "        \n",
    "    Returns:\n",
    "    --------\n",
    "    pd.DataFrame\n",
    "        Preprocessed stroke thrombolysis data\n",
    "    '''\n",
    "    lysis = pd.read_csv(path)\n",
    "    lysis.columns = clean_feature_names(list(lysis.columns))\n",
    "    encode_features(lysis)\n",
    "    return lysis\n",
    "\n",
    "\n",
    "def encode_features(df):\n",
    "    '''\n",
    "    Encode the features in the dataset\n",
    "    \n",
    "    Params:\n",
    "    ------\n",
    "    df: pd.DataFrame\n",
    "        lysis dataframe\n",
    "    '''\n",
    "    df['male'] = pd.get_dummies(lysis['male'], drop_first=True)\n",
    "    df['treated'] = pd.get_dummies(lysis['treated'], drop_first=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "d09e3210-6586-452f-8aa6-9ad8d2e4f2e1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hospital</th>\n",
       "      <th>male</th>\n",
       "      <th>age</th>\n",
       "      <th>severity</th>\n",
       "      <th>stroke_type</th>\n",
       "      <th>comorbidity</th>\n",
       "      <th>s2rankinbeforestroke</th>\n",
       "      <th>treated</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hosp_7</td>\n",
       "      <td>1</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>99</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>clot</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hosp_1</td>\n",
       "      <td>0</td>\n",
       "      <td>77</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hosp_8</td>\n",
       "      <td>0</td>\n",
       "      <td>86</td>\n",
       "      <td>Minor</td>\n",
       "      <td>clot</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  hospital  male  age            severity stroke_type   comorbidity  \\\n",
       "0   Hosp_7     1   65               Minor        clot           NaN   \n",
       "1   Hosp_8     0   99  Moderate to severe        clot           NaN   \n",
       "2   Hosp_8     0   49                 NaN        clot           NaN   \n",
       "3   Hosp_1     0   77            Moderate        clot  Hypertension   \n",
       "4   Hosp_8     0   86               Minor        clot  Hypertension   \n",
       "\n",
       "   s2rankinbeforestroke  treated  \n",
       "0                     0        0  \n",
       "1                     3        0  \n",
       "2                     0        0  \n",
       "3                     0        1  \n",
       "4                     0        0  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "DATA_URL = 'data/synth_lysis.csv'\n",
    "lysis = wrangle_lysis(DATA_URL)\n",
    "lysis.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c65b93e7-3664-4466-8c2d-b18e4e95246c",
   "metadata": {},
   "source": [
    "## Exercise 6: Encoding fields with > 2 categories\n",
    "\n",
    "The process to encode features with more than category is almost identical to that used in exercise 6.  For example the `hospital` field contains 8 unique values.  There are now two options.  \n",
    "\n",
    "1. encode as 7 dummy variables where all 0's indicates hospital 1.\n",
    "2. use a one-hot encoding and include 8 variables.  The additional degree of freedom allows you to encode missing data as all zeros. \n",
    "\n",
    "Note that some methods such as ordinary least squares regression require you to take approach one.  More flexible machine learning approaches can handle approach 2.  Here we will make use of approach 2.\n",
    "\n",
    "**Task**:\n",
    "* Use a one-hot encoding on the `hospital` column.\n",
    "* Use a one-hot encoding on the `stroke_type` column. You should prefix the new encoded columns with `stroke_type_`. I.e. you will have two columns `stroke_type_clot` and `stroke_type_bleed`.\n",
    "\n",
    "**Hints**:\n",
    "* One hot encoding is just the same as calling `pd.get_dummies`, but we set `drop_first=False`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "2e27fc42-c09c-4353-9eaa-55e85919fca0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code here ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b59645a3-2aaa-404d-938a-9540f2aa5b93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# example solution\n",
    "\n",
    "def wrangle_lysis(path):\n",
    "    '''\n",
    "    Preprocess and clean the stroke thrombolysis dataset.\n",
    "    \n",
    "    Params:\n",
    "    -------\n",
    "    path: str\n",
    "        URL or directory path and filename \n",
    "        \n",
    "    Returns:\n",
    "    --------\n",
    "    pd.DataFrame\n",
    "        Preprocessed stroke thrombolysis data\n",
    "    '''\n",
    "    lysis = pd.read_csv(path)\n",
    "    lysis.columns = clean_feature_names(list(lysis.columns))\n",
    "    \n",
    "    ## MODIFICATION ###########################################\n",
    "    # Function uses p.d concat to create new dataframe that must be returned\n",
    "    lysis = encode_features(lysis)\n",
    "    ###########################################################\n",
    "    \n",
    "    return lysis\n",
    "\n",
    "\n",
    "def encode_features(df):\n",
    "    '''\n",
    "    Modified function to encode the features in the dataset\n",
    "    \n",
    "    Params:\n",
    "    ------\n",
    "    df: pd.DataFrame\n",
    "        lysis dataframe\n",
    "        \n",
    "    Returns:\n",
    "    -------\n",
    "    pd.DataFrame\n",
    "    '''\n",
    "    df['male'] = pd.get_dummies(df['male'], drop_first=True)\n",
    "    df['treated'] = pd.get_dummies(df['treated'], drop_first=True)\n",
    "    \n",
    "    ###### MODIFICATION ###############################################\n",
    "    # Hospital and stroke type encoding.\n",
    "    # Note that the function must now return a dataframe \n",
    "        \n",
    "    # encode hospitals\n",
    "    hospitals = pd.get_dummies(df['hospital'], drop_first=False)\n",
    "\n",
    "    # concat the DataFrame's\n",
    "    df_encoded = pd.concat([hospitals, df], axis=1)\n",
    "\n",
    "    # drop the old 'hospital' feature\n",
    "    df_encoded.drop(['hospital'], inplace=True, axis=1)\n",
    "    \n",
    "    # encode stroke type. add stroke_type_ prefix to each new feature\n",
    "    stroke_type = pd.get_dummies(df_encoded['stroke_type'], drop_first=False, \n",
    "                                 dummy_na=False, prefix=\"stroke_type_\")\n",
    "\n",
    "    # update data frame - dropping original stroke_type column via slicing\n",
    "    INSERT_INDEX = 11\n",
    "    return pd.concat([df_encoded[df_encoded.columns[:INSERT_INDEX]], \n",
    "                      stroke_type, \n",
    "                      df_encoded[df_encoded.columns[INSERT_INDEX+1:]]], \n",
    "                     axis=1)\n",
    "    \n",
    "    #######################################################################"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "50558a12-3d9c-45c5-a486-82ff5b523d80",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Hosp_1</th>\n",
       "      <th>Hosp_2</th>\n",
       "      <th>Hosp_3</th>\n",
       "      <th>Hosp_4</th>\n",
       "      <th>Hosp_5</th>\n",
       "      <th>Hosp_6</th>\n",
       "      <th>Hosp_7</th>\n",
       "      <th>Hosp_8</th>\n",
       "      <th>male</th>\n",
       "      <th>age</th>\n",
       "      <th>severity</th>\n",
       "      <th>stroke_type__bleed</th>\n",
       "      <th>stroke_type__clot</th>\n",
       "      <th>comorbidity</th>\n",
       "      <th>s2rankinbeforestroke</th>\n",
       "      <th>treated</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>99</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>77</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>86</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>79</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>Severe</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>65</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>72</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Hypertension;Atrial Fib</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>84</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Atrial Fib</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>81</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>72</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Diabetes;TIA</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>64</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>80</td>\n",
       "      <td>Severe</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>76</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>73</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>72</td>\n",
       "      <td>Minor</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Hypertension;Diabetes</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>94</td>\n",
       "      <td>Moderate</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>67</td>\n",
       "      <td>Moderate to severe</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Hypertension</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Hosp_1  Hosp_2  Hosp_3  Hosp_4  Hosp_5  Hosp_6  Hosp_7  Hosp_8  male  age  \\\n",
       "0        0       0       0       0       0       0       1       0     1   65   \n",
       "1        0       0       0       0       0       0       0       1     0   99   \n",
       "2        0       0       0       0       0       0       0       1     0   49   \n",
       "3        1       0       0       0       0       0       0       0     0   77   \n",
       "4        0       0       0       0       0       0       0       1     0   86   \n",
       "5        0       0       0       0       0       0       0       1     0   79   \n",
       "6        0       0       0       0       0       1       0       0     0   47   \n",
       "7        0       0       0       0       0       0       0       1     0   65   \n",
       "8        0       0       0       0       0       0       0       1     0   72   \n",
       "9        0       0       0       0       0       0       0       1     0   84   \n",
       "10       0       0       0       0       0       0       0       1     0   81   \n",
       "11       1       0       0       0       0       0       0       0     0   72   \n",
       "12       0       0       0       0       0       0       0       1     0   40   \n",
       "13       0       0       0       0       0       0       0       1     1   64   \n",
       "14       0       0       0       0       0       0       0       1     0   80   \n",
       "15       0       1       0       0       0       0       0       0     0   76   \n",
       "16       0       0       0       0       0       0       0       1     0   73   \n",
       "17       0       0       0       0       0       0       0       1     1   72   \n",
       "18       0       0       0       0       0       0       0       1     0   94   \n",
       "19       0       0       0       0       0       0       0       1     1   67   \n",
       "\n",
       "              severity  stroke_type__bleed  stroke_type__clot  \\\n",
       "0                Minor                   0                  1   \n",
       "1   Moderate to severe                   0                  1   \n",
       "2                  NaN                   0                  1   \n",
       "3             Moderate                   0                  1   \n",
       "4                Minor                   0                  1   \n",
       "5                  NaN                   0                  1   \n",
       "6               Severe                   1                  0   \n",
       "7                Minor                   0                  1   \n",
       "8             Moderate                   0                  0   \n",
       "9             Moderate                   0                  1   \n",
       "10            Moderate                   0                  1   \n",
       "11            Moderate                   0                  1   \n",
       "12               Minor                   0                  1   \n",
       "13               Minor                   0                  1   \n",
       "14              Severe                   0                  1   \n",
       "15                 NaN                   0                  1   \n",
       "16               Minor                   0                  1   \n",
       "17               Minor                   0                  1   \n",
       "18            Moderate                   0                  1   \n",
       "19  Moderate to severe                   0                  1   \n",
       "\n",
       "                comorbidity  s2rankinbeforestroke  treated  \n",
       "0                       NaN                     0        0  \n",
       "1                       NaN                     3        0  \n",
       "2                       NaN                     0        0  \n",
       "3              Hypertension                     0        1  \n",
       "4              Hypertension                     0        0  \n",
       "5              Hypertension                     0        1  \n",
       "6                       NaN                     2        0  \n",
       "7              Hypertension                     0        0  \n",
       "8   Hypertension;Atrial Fib                     0        0  \n",
       "9                Atrial Fib                     0        1  \n",
       "10                      NaN                     0        1  \n",
       "11             Diabetes;TIA                     2        1  \n",
       "12                      NaN                     0        1  \n",
       "13                      NaN                     0        0  \n",
       "14                      NaN                     0        1  \n",
       "15                      NaN                     0        0  \n",
       "16                      NaN                     0        0  \n",
       "17    Hypertension;Diabetes                     0        0  \n",
       "18                      NaN                     1        0  \n",
       "19             Hypertension                     0        1  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lysis = wrangle_lysis(DATA_URL)\n",
    "lysis.head(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}