Statistical procedures in numpy#

The following exercises test your new numpy skills on some basic statistical problems. In some cases there is more than one solution. Feel free to experiment with different methods.

Imports#

import numpy as np

# you will need requests and io to download the data and load into a numpy array
import requests
import io

Exercise 1#

In this exercise, you will use numpy to generate a random variable following a standard normal distribution. You will then use the statistical functions built into numpy to analyse the distribution of the variable.

Task:

  • Take 10,000 samples from the standard normal distribution

  • Create a function called basic_descriptives that returns the mean, stdev, and 1st/99th percentiles of a numpy array parameter.

  • Call your function and print out the results.

Hints:

  • You can assume the numpy array passed to the function contains float data, or you could check the dtype and raise an exception (a sketch of such a check follows the example solution).

# your code here...
# example solution

SAMPLE_SIZE = 10_000

# create a random number generator
rng = np.random.default_rng(42)

# generate numpy array of size SAMPLE_SIZE using standard normal
samples = rng.normal(size=SAMPLE_SIZE)
print(type(samples))
print(samples.shape)
<class 'numpy.ndarray'>
(10000,)
def basic_descriptives(data):
    """
    Returns mean, stdev, and 1st and 99th percentile of a 1D numpy.ndarray
    
    Assumes `data` is numpy array of floats.
    
    Parameters:
    ------------
    data: numpy.ndarray 
        numeric data to analyse
        
    Returns:
    --------
    (float, float, float, float)
    """
    mean = data.mean()
    std = data.std()
    per_1st = np.percentile(data, 1) 
    per_99th = np.percentile(data, 99)
    
    return mean, std, per_1st, per_99th


results = basic_descriptives(samples)
print(results)
(-0.01024987541401165, 1.006285768537041, -2.3738111979173713, 2.3558409670159173)
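
The hint above also mentions checking the input and raising an exception. A minimal sketch of that defensive variant (the function name and error messages are illustrative, not part of the original solution):

def basic_descriptives_checked(data):
    '''
    As basic_descriptives, but raises a TypeError if `data` is not
    a numpy array of floats.
    '''
    if not isinstance(data, np.ndarray):
        raise TypeError('data must be a numpy.ndarray')
    if not np.issubdtype(data.dtype, np.floating):
        raise TypeError('data must contain floats')
    return data.mean(), data.std(), np.percentile(data, 1), np.percentile(data, 99)


print(basic_descriptives_checked(samples))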

Exercise 2#

Reuse the data generated in exercise 1. You are going to analyse the tails of the distribution you generated.

Task:

  • Select only the samples that have a value greater than or equal to +1.96 or less than or equal to -1.96

  • Determine the proportion of data that falls into these tails.

  • Are these proportions what you expect for the standard normal distribution?

Hints:

  • You may want to create one or two general functions so that you can vary the cut-offs.

# your code here ...

Example solution:

It is very simple to work with numpy arrays containing numeric data. For example, to find all of the samples that are greater than or equal to +1.96 we use:

# result is an array of bools
result = samples >= 1.96

print(result.shape)
print(type(result))
print(result)
(10000,)
<class 'numpy.ndarray'>
[False False False ... False False False]

The code returns a new numpy.ndarray that contains boolean (True/False) values. The value at array index i is True if the corresponding value at index i in samples is >= 1.96 and False otherwise. If we had used a Python list we would have needed to loop through all of the list items and perform the check ourselves.
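
For contrast, here is a rough sketch of the same check with a plain Python list (purely illustrative; the exercise solution sticks to numpy):

# equivalent check using a plain Python list and a comprehension
as_list = samples.tolist()
result_list = [value >= 1.96 for value in as_list]
print(sum(result_list))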

Let’s create some generalised functions that return the probability that a value in our dataset is greater than or less than a user-specified value.

To do that we need to know that

False == 0
True == 1

Therefore we can take the sum of our boolean array to find out how many array elements are greater than or less than a user-specified value, i.e.

(samples >= 1.96).sum()
257
def prob_greater_than_or_equal_to(data, x):
    '''
    Return the proportion of the dataset that is greater than or equal to x
    
    Parameters:
    -----------
    data: numpy.ndarray 
        numeric data to analyse
    x: float
        Lower cut-off. 
        
    Returns:
    --------
    float
    '''
    return (data >= x).sum()/data.shape[0]


def prob_less_than_or_equal_to(data, x):
    '''
    Return the proportion of the dataset that is less than or equal to x
    
    Parameters:
    -----------
    data: numpy.ndarray 
        numeric data to analyse
    x: float
        Upper cut-off. 
        
    Returns:
    --------
    float
    '''
    return (data <= x).sum()/data.shape[0]

p1 = prob_greater_than_or_equal_to(samples, 1.96)
p2 = prob_less_than_or_equal_to(samples, -1.96)

print(p1, p2)
print(1 - (p1+p2))
0.0257 0.0257
0.9486

Our test of these functions shows us that around 95% of the data lie between -1.96 and +1.96 (which again is what we would expect for the standard normal).
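
As an optional cross-check (this uses scipy, which is not imported in the exercises above), the empirical tail proportion can be compared with the theoretical tail probability of the standard normal:

from scipy.stats import norm

# theoretical probability of a value <= -1.96 under the standard normal
print(norm.cdf(-1.96))
# the empirical proportion can also be computed as the mean of the boolean array
print((samples >= 1.96).mean())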

Exercise 3:#

Assume you have the data

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  • The 20% Winsorized mean of data is calculated on a modified data set where the bottom 10% of values are replaced by the 10th percentile and the top 10% by the 90th percentile.

  • In this case the 10th percentile = 2 and the 90th = 10 (a quick numpy check of this follows the snippet below). Therefore the winsorized dataset is:

win_data = np.array([2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])

win_mean = win_data.mean()
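
A quick check of these cut-offs and the winsorized mean with numpy (illustrative only):

example = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(np.percentile(example, 10), np.percentile(example, 90))   # 2.0 10.0
print(np.array([2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]).mean())     # 6.0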

Task:

  • Write a function winsorize(data, cut_off=0.1)

  • The function must modify data (of type np.ndarray) in place so that it is winsorized.

  • A cut_off = 0.1 specifies that the function uses the 10th and 90th percentiles as cut-offs

Hints:

  • There are multiple ways to solve this problem

    • You could use a for loop

    • You could create a function that is vectorized using np.vectorize()

    • You could use np.where() and fancy indexing

# your code here ...
# =============================================================================
# Exercise 3
# Solution 1: Using a `for` loop
# =============================================================================


def winsorize_loop(data, cut_off = 0.1):
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1-cut_off) * 100)
    
    for i in range(data.shape[0]):
        if data[i] > high:
            data[i] = high
        elif data[i] < low:
            data[i] = low

            
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

winsorize_loop(data)

print(data)
[ 2  2  3  4  5  6  7  8  9 10 10]
# =============================================================================
# Exercise 3
# Solution 2: Using a vectorized function
# =============================================================================

def winsorize_to_vectorize(to_test, low, high):

    if to_test > high:
        return high
    elif to_test < low:
        return low
    else:
        return to_test
    
def winsorize_limits(data, cut_off):
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1-cut_off) * 100)
    
    return low, high
    
    
v_winsorise = np.vectorize(winsorize_to_vectorize)    

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
low, high = winsorize_limits(data, 0.1)
w_data = v_winsorise(data, low, high)

print(w_data)
[ 2.  2.  3.  4.  5.  6.  7.  8.  9. 10. 10.]
# =============================================================================
# Exercise 3
# Solution 3: Using np.where()
# =============================================================================

def winsorize(data, cut_off = 0.1):
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1-cut_off) * 100)
       
    data[np.where(data < low)] = low
    data[np.where(data > high)] = high
    

#data = np.random.normal(10, 1, 100)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])


winsorize(data)

print(data)
[ 2  2  3  4  5  6  7  8  9 10 10]
# performance testing of solutions
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
%timeit winsorize_loop(data)
147 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# reset data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
%%timeit

low, high = winsorize_limits(data, 0.1)
w_data = v_winsorise(data, low, high)
151 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
%timeit winsorize(data)
131 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
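
A fourth approach, not listed in the hints, is np.clip, which applies both cut-offs in a single call. A sketch (unlike the solutions above, it returns a new array rather than modifying data in place):

def winsorize_clip(data, cut_off=0.1):
    '''
    Return a winsorized copy of `data` using np.clip.
    '''
    low = np.percentile(data, cut_off * 100)
    high = np.percentile(data, (1 - cut_off) * 100)
    # values below `low` are raised to `low`; values above `high` are lowered to `high`
    return np.clip(data, low, high)


data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(winsorize_clip(data))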

Exercise 4:#

Simple linear regression using data in numpy arrays

In this example we will load two numpy arrays from file and conduct a simple linear regression. The method of Ordinary Least Squares (OLS) is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon \)) to some data stored in numpy arrays.

We have two datasets:

  • breach.csv: monthly totals of people waiting FOUR or more hours in English emergency departments

  • dtocs.csv: monthly totals of the number of people waiting to be discharged from (i.e. leave) hospital and go home or transfer to another form of healthcare.

You are going to (VERY naively) assess the relationship between these two variables. For the purposes of this exercise we are going to ignore that these two datasets are time-series data.

The library you will use to conduct the linear regression is statsmodels.api. You will use the class OLS, which accepts the dependent (y) and independent (x) data as its first two arguments.

Once you have conducted the analysis you will print the results summary.

Task:

  • Import statsmodels.api

  • Load the dtoc and breaches datasets

  • Treating breaches as the dependent variable and dtocs as the independent variable perform the regression.

  • What is the \(R^2\) of your (naive) model?

Hints:

  • The columns of data in the two files have a header (the first line is a string). You will need to remove that before you can create a numpy array. Take a look at the options in np.genfromtxt()

  • Your model will need an intercept. This is not automatically added in statsmodels. To do this you need to call statsmodels.api.add_constant(X). There is a worked example in the statsmodels documentation.

# statsmodels contains the OLS class you will use
import statsmodels.api as sm

RESPONSE_SUCCESS = 200
DTOC_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
                + 'hpdm139-datasets/main/dtocs.csv'
BREACH_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
                + 'hpdm139-datasets/main/breach.csv'

def download_datasets():
    '''
    Downloads the dtoc and breach datasets and saves them locally as dtocs.csv and breach.csv
    
    '''
    response = requests.get(DTOC_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("dtocs.csv", "wb") as f:
            f.write(response.content)
            print('success: dtocs.csv downloaded.')
    else:
        print('connection error for dtocs')
    
    response = requests.get(BREACH_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("breach.csv", "wb") as f:
            f.write(response.content)
            print('success: breach.csv downloaded.')
    else:
        print('connection error for breach')
download_datasets()
success: dtocs.csv downloaded.
success: breach.csv downloaded.
# your code here ...
#OLS functionality is in statsmodels
import statsmodels.api as sm  

def load_dtoc_dataset():
    '''
    Loads the breach and dtoc data sets into memory
    Returns a tuple of numpy.ndarrays representing
    breach and dtoc dataset respectively.
    '''    
    # note we use skip_header because the dataset has column descriptors
    dtoc = np.genfromtxt('dtocs.csv', skip_header=1)  
    breach = np.genfromtxt('breach.csv', skip_header=1)
    return breach, dtoc
    
breach, dtoc = load_dtoc_dataset()

###### regression code ########

# add an intercept term to the model
dtoc = sm.add_constant(dtoc) 
model = sm.OLS(breach, dtoc)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.714
Model:                            OLS   Adj. R-squared:                  0.710
Method:                 Least Squares   F-statistic:                     194.6
Date:                Thu, 14 Oct 2021   Prob (F-statistic):           6.80e-23
Time:                        17:21:39   Log-Likelihood:                -945.02
No. Observations:                  80   AIC:                             1894.
Df Residuals:                      78   BIC:                             1899.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.633e+05      2e+04     -8.178      0.000   -2.03e+05   -1.24e+05
x1            57.7282      4.138     13.950      0.000      49.489      65.967
==============================================================================
Omnibus:                       11.472   Durbin-Watson:                   0.888
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               16.361
Skew:                           0.591   Prob(JB):                     0.000280
Kurtosis:                       4.874   Cond. No.                     2.61e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
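
The task asks for the \(R^2\) of the model. Besides reading it from the summary table above, the fitted results object exposes it directly (a small optional addition to the solution):

# R-squared is available as an attribute of the fitted results
print(round(results.rsquared, 3))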

Exercise 5:#

Preprocessing data to detrend a timeseries

In exercise 4, we conducted a simple linear regression to assess the relationship between two variables. Our initial analysis is problematic because both variables are time series and contain autocorrelation and trend. This means that we violate some of the assumptions of ordinary least squares regression and our estimates of the relationship are likely to be incorrect.

In practice, we would pre-process the data in the numpy arrays before conducting the regression. A simple way to do this is to take the first difference of the time series. If \(Y_t\) represents the value of the time series at time \(t\) then the first difference is equal to:

\[\Delta Y_t = Y_t - Y_{t-1}\]

Tip: If an array \(a\) has length \(n\) then the array of first differences of \(a\) has length \(n-1\)
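
For example, a quick illustration of the length reduction:

a = np.array([2, 5, 9, 14])
print(np.diff(a))                   # [3 4 5]
print(a.shape, np.diff(a).shape)    # (4,) (3,)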

Task:

  • Copy your solution code for exercise 4

  • Modify the code to take the first difference of the breach and dtoc arrays

  • Conduct the regression analysis using the first differences (detrended data)

  • Look at the new \(R^2\). Is there still strong evidence of a relationship?

Hints:

  • If you have an array data then you can take the first difference of the array using:

differenced = np.diff(data)
# Example solution: Use numpy built-in function np.diff()

def load_dtoc_dataset():
    '''
    Loads the breach and dtoc data sets into memory
    Returns a tuple of numpy.ndarrays representing
    breach and dtoc dataset respectively.
    '''    
    # note we use skip_header because the dataset has column descriptors
    dtoc = np.genfromtxt('dtocs.csv', skip_header=1)  
    breach = np.genfromtxt('breach.csv', skip_header=1)
    return breach, dtoc


def dtoc_regression(breach, dtoc):
    '''
    Regression analysis for dtocs
    and breaches
    
    Keyword arguments:
    breach - dependent variable
    dtoc - independent variable
    '''
    dtoc = sm.add_constant(dtoc) # an intercept term to the model
    model = sm.OLS(breach, dtoc)
    results = model.fit()
    print(results.summary())
    
breach, dtoc = load_dtoc_dataset()

# difference the data using np.diff 
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html
d_breach = np.diff(breach)
d_dtoc = np.diff(dtoc)

dtoc_regression(d_breach, d_dtoc)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.035
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     2.775
Date:                Thu, 14 Oct 2021   Prob (F-statistic):             0.0998
Time:                        17:21:39   Log-Likelihood:                -901.09
No. Observations:                  79   AIC:                             1806.
Df Residuals:                      77   BIC:                             1811.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2417.0761   2484.505      0.973      0.334   -2530.206    7364.358
x1           -13.3144      7.992     -1.666      0.100     -29.229       2.600
==============================================================================
Omnibus:                       16.746   Durbin-Watson:                   2.174
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               39.038
Skew:                          -0.640   Prob(JB):                     3.33e-09
Kurtosis:                       6.197   Cond. No.                         312.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.