import numpy as np

Statistical procedures#

A substantial proportion of real-world applications in computational modelling require statistical procedures. numpy provides a wide variety of efficient statistical functions that you can apply to an array. This section explores the simple and commonly used functions that you can embed into your models and use for practical statistical analysis.

Tip: We will explore statistical programming for health data science in much more detail in Part 2 using `pandas` and other important libraries. It is still well worth learning `numpy`'s capabilities, however, as converting from a `np.ndarray` to a `pandas.DataFrame` during a computational procedure can be expensive.
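
If you do need to convert at the end of a procedure, it is a single call in each direction. A minimal sketch (assuming pandas is installed; the column name is illustrative):

import pandas as pd

results = np.array([2.1, 3.5, 4.0])

# wrap the ndarray in a DataFrame with an illustrative column name
df = pd.DataFrame(results, columns=['attends'])

# ...and back again via .to_numpy()
results_again = df.to_numpy()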

Simple data analysis example#

ED attendance data#

We will first use data held in `minor_illness_ed_attends.csv`. This is a synthetic time series dataset reporting the number of patients registered at a GP surgery who attend an ED (emergency department) each week. The data are standardised to a rate per 10,000 registered patients.

Loading the dataset#

Let’s first open the data and then construct some summary statistics.

file_name = 'data/minor_illness_ed_attends.csv'
ed_data = np.loadtxt(file_name, skiprows=1, delimiter=',')
print(ed_data.shape)
(74,)

Here’s a peek at the first 5 elements in ed_data.

ed_data[:5]
array([2.11927795, 3.49057545, 3.98922908, 2.36860477, 3.24124863])

Calculate summary statistics#

numpy makes it easy to calculate the mean, standard deviation, and other summary statistics of an ndarray. These typically have intuitive names which you can often guess. For example:

print(ed_data.mean())
print(ed_data.std())
2.919482262743243
0.7060975938853894

But it is always worth reading the docs. For example, to calculate an unbiased estimate of the sample standard deviation of a dataset of length \(n\) we should divide by \(n - 1\). In numpy we do this using the ddof (delta degrees of freedom) parameter of std.
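
To see the difference, compare the two on a tiny made-up array where the arithmetic is easy to check by hand:

data = np.array([1.0, 2.0, 3.0])
print(data.std())        # population formula: divides by n
print(data.std(ddof=1))  # sample formula: divides by n - 1
0.816496580927726
1.0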

Here we will create a class to act as a convenient container for a dataset. For convenience, we will override the __repr__ method so that we can easily print a summary of the dataset to the screen when calling print (or when inspecting the object directly).

The class will make use of the following numpy statistical procedures (a short standalone illustration follows the list):

  • min() and max() to return the minimum and maximum values in an array, respectively

  • np.percentile() to calculate a percentile - here the median and upper / lower quartiles.

  • np.histogram() that bins data points and reports the frequencies.
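
If any of these functions are new to you, here is a small standalone illustration on a made-up array:

sample = np.array([1.0, 2.0, 3.0, 4.0])
print(sample.min(), sample.max())
print(np.percentile(sample, 50))
freq, bins = np.histogram(sample, bins=2)
print(freq)
print(bins)
1.0 4.0
2.5
[2 2]
[1.  2.5 4. ]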

class AttendanceSummary:
    '''
    Simple container class that holds summary statistics
    (mean, stdev, quartiles and a frequency histogram) of ED data.
    '''
    def __init__(self, data, name=None, decimal_places=2):
        """
        Constructor method.
        
        Parameters:
        ----------
        data: numpy.ndarray 
            Vector containing data to analyse.
        name: str, optional (default=None)
            Name of the dataset.
        decimal_places: int, optional (default=2)
            Precision used when displaying results.
        """
        self.name = name
        self.dp = decimal_places
        self.n = len(data)
        self.mean = data.mean()
        self.std = data.std(ddof=1)
        self.min_attends = data.min()
        self.max_attends = data.max() 
        
        # percentiles: note that this is a np. call
        self.lower_quartile = np.percentile(data, 25)
        self.median = np.percentile(data, 50)
        self.upper_quartile = np.percentile(data, 75)
        self.iqr = self.upper_quartile - self.lower_quartile
        
        # frequency histogram (automatically binned)
        self.freq, self.bins = np.histogram(data, density=False)
        
    def __repr__(self):
        to_print = f'\nDataset: {self.name}' \
             + f'\nMean:\t{self.mean:.{self.dp}f}' \
             + f'\nStdev:\t{self.std:.{self.dp}f}' \
             + f'\nMin:\t{self.min_attends:.{self.dp}f}' \
             + f'\nMax:\t{self.max_attends:.{self.dp}f}' \
             + f'\nMedian:\t{self.median:.{self.dp}f}' \
             + f'\nIQR:\t{self.iqr:.{self.dp}f}' \
             + f'\nn:\t{self.n}'
                    
        return to_print
    
    def frequency_histogram(self):
        print('x\tf(x)')
        # bins contains one more element than freq (the bin edges); zip
        # stops at the shorter sequence, so each frequency is paired
        # with its left bin edge.
        for f, b in zip(self.freq, self.bins):
            print(f'{b:.{self.dp}f}\t{f}')
x = AttendanceSummary(ed_data, name="ED")
x
Dataset: ED
Mean:	2.92
Stdev:	0.71
Min:	1.62
Max:	5.11
Median:	2.87
IQR:	1.09
n:	74
x.frequency_histogram()
x	f(x)
1.62	5
1.97	9
2.32	14
2.67	16
3.02	11
3.37	6
3.71	10
4.06	2
4.41	0
4.76	1

It’s not always necessary to use a class as I have here, but it is useful if, for example, you want to compare samples or, in the case of a time series, different periods.

In our ED data let’s assume an intervention was put in place in week 10: GP surgeries were opened to patients over weekends. Let’s summarise the results using descriptive statistics.

week = 10  # the week the intervention begins
before = AttendanceSummary(ed_data[:week], 'before')
after = AttendanceSummary(ed_data[week:], 'after')
print(before)
print(after)
Dataset: before
Mean:	3.12
Stdev:	0.59
Min:	2.12
Max:	3.99
Median:	3.18
IQR:	0.81
n:	10

Dataset: after
Mean:	2.89
Stdev:	0.73
Min:	1.62
Max:	5.11
Median:	2.87
IQR:	0.90
n:	64

Statistical procedures and n-dimensional arrays#

Generating statistics for matrices and n-dimensional arrays works in a similar manner to vectors. We can also go a bit further and calculate statistics across axes in the array; for example, rows and columns in a matrix. To do this you need to specify an integer axis in the call to the statistical function, e.g. mean(axis=0)

Example 1: mean of matrix columns and rows#

Given a matrix, calculate the mean of its columns and rows. We will first generate these manually using slices so that you can see the axis-based approach is working correctly.

matrix = np.array([[7.7, 4.3, 8.5],
                   [6.9, 0.9, 9.7],
                   [7.6, 7.8, 1.2]])

To calculate the mean of the first column we need to slice all rows and the column at index 0. As a reminder we would do the following:

Try changing the value of 0 to 1 or 2 to get the other columns.

matrix[:,0]
array([7.7, 6.9, 7.6])

This is a vector, and hence calculating the mean works as we have already seen.

Here I also call .round(1) to limit the displayed result to 1 decimal place. You could also use a format string.

matrix[:,0].mean().round(1)
7.4
print(matrix[:,1].mean().round(1))
print(matrix[:,2].mean().round(1))
4.3
6.5

This is a common operation, so we can instead specify an axis. In a matrix we use axis 0 for column statistics and axis 1 for row statistics.

matrix.mean(axis=0).round(1)
array([7.4, 4.3, 6.5])
matrix.mean(axis=1).round(1)
array([6.8, 5.8, 5.5])

Example 2: Statistical operations on a larger dataset#

Let’s open a more realistic tabular dataset of the kind you would use in a real health data science study. For this we’ll download the Wisconsin breast cancer dataset from GitHub, create a np.ndarray, and summarise the features.

To download the file we will use the requests library:

import requests
# status code indicating download was successful
RESPONSE_SUCCESS = 200
FILE_NAME = 'data/wisconsin.csv'

# note: github raw url needed for download
DATA_URL = "https://raw.githubusercontent.com/health-data-science-OR/" \
            + "hpdm139-datasets/main/wisconsin.csv"

# download the file...(only needs to be done once!).
print('downloading =>', end=' ')
response = requests.get(DATA_URL)

if response.status_code == RESPONSE_SUCCESS:
    print('successful')
    
    # response.content is in bytes, so write it
    # to file in binary mode to read into numpy later.
    with open(FILE_NAME, 'wb') as f:
        f.write(response.content)
else:
    print('file not found.')
downloading => successful
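
The download only needs to happen once, so on re-runs you could guard against repeating it. Here is a minimal sketch using pathlib from the standard library:

from pathlib import Path

# only download if a previous run has not already saved the file
if not Path(FILE_NAME).exists():
    response = requests.get(DATA_URL)
    if response.status_code == RESPONSE_SUCCESS:
        with open(FILE_NAME, 'wb') as f:
            f.write(response.content)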

Before we attempt to load the data into a numpy array, let’s do some due diligence and check for headers and any non-numeric column data types. We need to do this to avoid any issues when loading the data. We will go a bit old school here and use the csv library to help us with the inspection.

This isn’t a big dataset, so we could just inspect it in a CSV viewer in JupyterLab (or in a free and open spreadsheet package like LibreOffice Calc). But I’d encourage you to keep your complete workflow documented in health data science. In some cases the dataset will also be far too big for memory, and standard viewers won’t be able to cope.

import csv

with open(FILE_NAME, 'r') as f:
    reader = csv.reader(f)
    header_row = next(reader)
    field_count = len(header_row)
    # peek at the header_row
    print(header_row[:5])
    # record names
    header_names = list(header_row)
    print(f'No. fields: {field_count}')
    
    # peek at first data row
    first_data_row = next(reader)
    
    non_numeric_fields = []
    # catch any fields that contain non-numeric data
    for field_index in range(len(first_data_row)):
        try:
            float(first_data_row[field_index])
        except ValueError:
            non_numeric_fields.append(field_index)
    
    # print out non-numeric fields
    print(f'Fields with non-numeric data: {non_numeric_fields}')
['', 'id', 'diagnosis', 'radius_mean', 'texture_mean']
No. fields: 33
Fields with non-numeric data: [2]

We can conclude that the first row in the file is indeed a header row - we’ll skip that on the array import. We will also drop the unnamed index field and the ‘id’ field. The field at index 2, diagnosis, contains non-numeric data. This is actually the target field in the dataset (‘M’ for malignant and ‘B’ for benign). We will deal with the target separately here.
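
As an aside, if you did want the target on its own you could read just that column as strings. A sketch (the result is not used in the remainder of this section):

# read only the diagnosis field (index 2) as strings
diagnosis = np.genfromtxt(FILE_NAME, skip_header=1, usecols=[2],
                          delimiter=',', dtype=str)
print(diagnosis.shape)
(569,)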

The first thing we will do is create an array of the field indexes containing the numeric data to read in, using boolean indexing.

field_indexes = np.arange(field_count)
# keep only the indexes not flagged as non-numeric...
fields_included = field_indexes[~np.isin(field_indexes, non_numeric_fields)]
# ...and drop the first two fields (the unnamed index and 'id')
fields_included = fields_included[2:]
header_names = np.array(header_names)[fields_included]
print(fields_included)
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 27 28 29 30 31 32]
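
If the combination of np.isin and the ~ (logical not) operator is unfamiliar, this tiny isolated example shows the idea:

arr = np.arange(5)
mask = np.isin(arr, [1, 3])
print(mask)
print(arr[~mask])
[False  True False  True False]
[0 2 4]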

Then it’s just a case of loading the correct rows and columns into an array. Here we will use np.genfromtxt

wisconsin = np.genfromtxt(FILE_NAME, skip_header=1, usecols=fields_included, delimiter=',')
print(wisconsin.shape)
(569, 30)
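
Note that np.genfromtxt substitutes nan for any value it cannot parse, so it is worth a quick sanity check that nothing failed silently:

# True would indicate at least one value failed to parse
print(np.isnan(wisconsin).any())
False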

We now have a clean dataset that we can summarise. All fields are quantitative, so this is straightforward. The class FeatureSummary below helps us visualise the results for each feature (column). Call the .show() method to see a summary.

class FeatureSummary:
    '''
    Simple container class to hold and display a descriptive summary
    of features in a dataset
    '''
    def __init__(self, data, headers, name=None, decimal_places=2):
        """
        Contructor method.
        
        Parameters:
        ----------
        data: numpy.ndarray 
            Vector containing data to analyse.
        """
        self.name = name
        self.headers = headers
        # these help with the summary layout.
        self.max_header_padding_len = max(len(x) for x in headers)
        self.header_padding = ''.rjust(self.max_header_padding_len)
        print(self.header_padding)
        self.dp = decimal_places
        self.n = len(data)
        self.mean = data.mean(axis=0)
        self.std = data.std(ddof=1, axis=0)
        self.min = data.min(axis=0)
        self.max = data.max(axis=0) 
        self.range = self.max - self.min
        
        # percentiles: note that this is a np. call
        self.lower_quartile = np.percentile(data, 25, axis=0)
        self.median = np.percentile(data, 50, axis=0)
        self.upper_quartile = np.percentile(data, 75, axis=0)
        self.iqr = self.upper_quartile - self.lower_quartile
    
    def show(self):
        print(self.name)
        header_row = f"{self.header_padding}\tMean\tStd\tMedian\tIQR\tRange"
        print(header_row)
        line = ['-'] * int(len(header_row) * 1.3)
        print(''.join(line))
        for i in range(len(self.headers)):
            row_padding_len = self.max_header_padding_len - len(self.headers[i])
            row_padding = ''.rjust(row_padding_len)
            field = f'{self.headers[i]}{row_padding}\t{self.mean[i]:.{self.dp}f}' \
                    + f'\t{self.std[i]:.{self.dp}f}' \
                    + f'\t{self.median[i]:.{self.dp}f}' \
                    + f'\t{self.iqr[i]:.{self.dp}f}' \
                    + f'\t{self.range[i]:.{self.dp}f}'
            print(field) 
x = FeatureSummary(wisconsin, header_names, name='Wisconsin Breast Cancer Dataset.')
x.show()
Wisconsin Breast Cancer Dataset.
                       	Mean	Std	Median	IQR	Range
---------------------------------------------------------------
radius_mean            	14.13	3.52	13.37	4.08	21.13
texture_mean           	19.29	4.30	18.84	5.63	29.57
perimeter_mean         	91.97	24.30	86.24	28.93	144.71
area_mean              	654.89	351.91	551.10	362.40	2357.50
smoothness_mean        	0.10	0.01	0.10	0.02	0.11
compactness_mean       	0.10	0.05	0.09	0.07	0.33
concavity_mean         	0.09	0.08	0.06	0.10	0.43
concave points_mean    	0.05	0.04	0.03	0.05	0.20
symmetry_mean          	0.18	0.03	0.18	0.03	0.20
fractal_dimension_mean 	0.06	0.01	0.06	0.01	0.05
radius_se              	0.41	0.28	0.32	0.25	2.76
texture_se             	1.22	0.55	1.11	0.64	4.52
perimeter_se           	2.87	2.02	2.29	1.75	21.22
area_se                	40.34	45.49	24.53	27.34	535.40
smoothness_se          	0.01	0.00	0.01	0.00	0.03
compactness_se         	0.03	0.02	0.02	0.02	0.13
concavity_se           	0.03	0.03	0.03	0.03	0.40
concave points_se      	0.01	0.01	0.01	0.01	0.05
symmetry_se            	0.02	0.01	0.02	0.01	0.07
fractal_dimension_se   	0.00	0.00	0.00	0.00	0.03
radius_worst           	16.27	4.83	14.97	5.78	28.11
texture_worst          	25.68	6.15	25.41	8.64	37.52
perimeter_worst        	107.26	33.60	97.66	41.29	200.79
area_worst             	880.58	569.36	686.50	568.70	4068.80
smoothness_worst       	0.13	0.02	0.13	0.03	0.15
compactness_worst      	0.25	0.16	0.21	0.19	1.03
concavity_worst        	0.27	0.21	0.23	0.27	1.25
concave points_worst   	0.11	0.07	0.10	0.10	0.29
symmetry_worst         	0.29	0.06	0.28	0.07	0.51
fractal_dimension_worst	0.08	0.02	0.08	0.02	0.15
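
Finally, because header_names is itself a numpy array, you can combine it with boolean indexing to pull out a single feature for closer analysis. For example, to check radius_mean against the first row of the table above:

radius_mean = wisconsin[:, header_names == 'radius_mean'].flatten()
print(f'{radius_mean.mean():.2f}')
14.13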