Statistical procedures in numpy#

The following exercises test some of your new numpy skills on some basic statistical problems. In some cases there is more than one solution. Feel free to experiment with different methods.

Imports#

import numpy as np

# you will need requests and io to download the data 
# and load into a numpy array
import requests
import io

Exercise 1#

In this exercise, you will use numpy to generate a random variable following a standard normal distribution. You will then use the statistical functions built into numpy to analyse the distribution of the variable.

Task:

  • Take 10,000 samples from the standard normal distribution

  • Create a function called basic_descriptives that returns the mean, stdev, and 1st/99th percentiles of a numpy array parameter.

  • Call your function and print out the results.

Hints:

  • You can assume the numpy array passed to the function contains float data, or you could check the dtype and raise an exception.

# your code here...
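
One possible approach is sketched below; it is not the only solution. It assumes a fixed random seed for reproducibility and uses the np alias imported above. The names rng, samples and basic_descriptives are illustrative.

# a possible sketch: sample from N(0, 1) and summarise the distribution
# the seed (42) is arbitrary and used only for reproducibility
rng = np.random.default_rng(42)
samples = rng.standard_normal(10_000)

def basic_descriptives(data):
    '''Return the mean, stdev and 1st/99th percentiles of a float array.'''
    p1, p99 = np.percentile(data, [1, 99])
    return data.mean(), data.std(), p1, p99

mean, std, p1, p99 = basic_descriptives(samples)
print(f'mean={mean:.3f}, std={std:.3f}, p1={p1:.3f}, p99={p99:.3f}')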

Exercise 2#

Reuse the data generated in exercise 1. You are going to analyse the tails of the distribution you generated.

Task:

  • Select only the samples that have a value greater than or equal to +1.96 or less than or equal to -1.96

  • Determine the proportion of data that falls into these tails.

  • Are these proportions what you expect for the standard normal distribution?

Hints:

  • You may want to create one or two general functions that allow you to vary the cut-offs.

# your code here ...
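
A sketch of one possible solution is below. It assumes samples is the array created in exercise 1 and uses a boolean mask to select the tails; for the standard normal distribution roughly 5% of values lie beyond ±1.96.

# a possible sketch: proportion of values in the two tails beyond +/- cut_off
def tail_proportion(data, cut_off=1.96):
    '''Proportion of data with value >= +cut_off or <= -cut_off.'''
    in_tails = (data >= cut_off) | (data <= -cut_off)
    return in_tails.sum() / data.size

# expect a value close to 0.05 for the standard normal distribution
print(tail_proportion(samples))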

Exercise 3#

Assume you have the data

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

  • The 20% Winsorized mean of data is calculated on a modified data set where the top and bottom 10% of values are replaced by the 10th and 90th percentiles.

  • In this case the 10th percentile = 2 and the 90th = 10. Therefore the winsorized dataset is:

win_data = np.array([2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])

win_mean = win_data.mean()

Task:

  • Write a function winsorize(data, cut_off=0.1)

  • The function must modify data (type np.ndarray) so that it is winsorized.

  • A cut_off = 0.1 specifies that the function uses the 10th and 90th percentiles as cut-offs

Hints:

  • There are multiple ways to solve this problem

    • You could use a for loop

    • You could create a function that is vectorized using np.vectorize()

    • You could use np.where() and fancy indexing

# your code here ...
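
One way to implement winsorize is sketched below, using np.percentile and boolean indexing; this is only one of the approaches listed in the hints.

# a possible sketch: winsorize in place using percentile cut-offs
def winsorize(data, cut_off=0.1):
    '''Replace values below/above the cut_off percentiles of data (np.ndarray).'''
    low, high = np.percentile(data, [cut_off * 100, 100 - cut_off * 100])
    data[data < low] = low
    data[data > high] = high
    return data

example = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])
print(winsorize(example))         # [ 2.  2.  3.  4.  5.  6.  7.  8.  9. 10. 10.]
print(winsorize(example).mean())  # 20% winsorized mean = 6.0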

Exercise 4#

Simple linear regression using data in numpy arrays

In this example we will load two numpy arrays from file and conduct a simple linear regression. The method of Ordinary Least Squares is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon \) ) to some data stored in numpy arrays.

We have two datasets:

  • breach.csv: monthly totals of people waiting FOUR or more hours in English emergency departments

  • dtocs.csv: monthly totals of the number of people waiting to be discharged (leave) hospital and go home or transfer to another form of healthcare.

You are going to (VERY naively) assess the relationship between these two variables. For the purposes of this exercise we are going to ignore that these two datasets are time-series data.

The library you will use to conduct the linear regression is statsmodels.api. You will use the class OLS, which accepts the dependent variable (endog) and the independent variable (exog) as its first two arguments.

Once you have conducted the analysis you will print the results summary.

Task:

  • Import statsmodels.api

  • Load the dtocs and breach datasets

  • Treating breaches as the dependent variable and dtocs as the independent variable perform the regression.

  • What is the \(R^2\) of your (naive) model?

Hints:

  • The columns of data in the two files have a header (the first line is a string). You will need to remove that before you can create a numpy array. Take a look at the options in np.genfromtxt()

  • Your model will need an intercept. This is not automatically added in statsmodels. To do this you need to call statsmodels.api.add_constant(X). There is a worked example in the statsmodels documentation.

# statsmodels contains the OLS class you will use
import statsmodels.api as sm

RESPONSE_SUCCESS = 200
DTOC_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
                + 'hpdm139-datasets/main/dtocs.csv'
BREACH_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
                + 'hpdm139-datasets/main/breach.csv'

def download_datasets():
    '''
    Downloads the dtocs and breach datasets to your local machine.
    '''
    response = requests.get(DTOC_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("dtocs.csv", "wb") as f:
            f.write(response.content)
            print('success: dtocs.csv downloaded.')
    else:
        print('connection error for dtocs')
    
    response = requests.get(BREACH_URL)
    if response.status_code == RESPONSE_SUCCESS:
        # write the file
        with open("breach.csv", "wb") as f:
            f.write(response.content)
            print('success: breach.csv downloaded.')
    else:
        print('connection error for breach')

download_datasets()
success: dtocs.csv downloaded.
success: breach.csv downloaded.
# your code here ...
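
A sketch of the regression step is below. It assumes each file contains a single column of values with a one line header (hence skip_header=1) and that the files were downloaded by the cell above; variable names are illustrative.

# a possible sketch: load the data, add an intercept and fit OLS
# skip_header=1 drops the column name on the first line of each file
breach = np.genfromtxt('breach.csv', skip_header=1)
dtocs = np.genfromtxt('dtocs.csv', skip_header=1)

# breaches = dependent variable (y); dtocs = independent variable with an intercept
X = sm.add_constant(dtocs)
results = sm.OLS(breach, X).fit()
print(results.summary())
print(f'R-squared: {results.rsquared:.3f}')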

Exercise 5#

Preprocessing data to detrend a timeseries

In exercise 4, we conducted a simple linear regression to assess the relationship between two variables. Our initial analysis is problematic because both variables are time series and contain autocorrelation and trend. This means that we violate some of the assumptions of ordinary least squares regression and our estimates of the relationship are likely to be incorrect.

In practice, we would pre-process the data in the numpy arrays before conducting the regression. A simple way to do this is to take the first difference of the time series. If \(Y_t\) represents the value of the time series at time \(t\) then the first difference is equal to:

\[\Delta Y_t = Y_{t+1} - Y_t\]

Tip: If an array \(a\) has length \(n\) then the array of first differences of \(a\) has length \(n-1\)

Task:

  • Copy your solution code for exercise 4

  • Modify the code to take the first difference of the breach and dtoc arrays

  • Conduct the regression analysis using the first differences (detrended data)

  • Look at the new \(R^2\). Is there still strong evidence of a relationship?

Hints:

  • If you have an array data then you can take the first difference of the array using:

differenced = np.diff(data)
# your code here ...
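
A sketch of the modified analysis is below, assuming breach and dtocs are the arrays loaded in exercise 4.

# a possible sketch: difference both series, then repeat the regression
breach_diff = np.diff(breach)
dtocs_diff = np.diff(dtocs)

X_diff = sm.add_constant(dtocs_diff)
results_diff = sm.OLS(breach_diff, X_diff).fit()
print(results_diff.summary())
print(f'R-squared (differenced): {results_diff.rsquared:.3f}')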