# Statistical procedures in `numpy`

#

The following exercises test some of your new `numpy`

skills on some basic statistical problems. In some cases there is more than one solution. Feel free to experiment with different methods.

## Imports#

```
import numpy as np
# you will need requests and io to download the data
# and load into a numpy array
import requests
import io
```

## Exercise 1#

In this exercise, you will use `numpy`

to generate a random variable following a standard normal distribution. You will then use the statistical functions built into numpy to analyse the distribution of the variable.

**Task**:

Take 10,000 samples from the standard normal distribution

Create a function called

`basic_descriptives`

that returns the mean, stdev, and 1st/99th percentiles of a numpy array parameter.Call your function and printout the results.

**Hints**:

You can assume the numpy array passed to the function contains float data or you could check and raise an exception.

```
# your code here...
```

## Exercise 2#

Reuse the data generated in exercise 1. You are going to analyse the tails of the distribution you generated.

**Task:**

Select only the samples that have a value greater than or equal to +1.96 and less than or equal to -1.96

Determine the proportion of data that falls into these tails.

Are these proportions what you expect for the standard normal distribution?

**Hints**:

You may want to create one or two general functions so that you can use to vary the cut-offs.

```
# your code here ...
```

## Exercise 3:#

Assume you have the data

```
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

The 20% Winsorized mean of

`data`

is calculated on a modified data set where the top and bottom 10% of values are replaced by the 10th and 90th percentiles.In this case the 10th percentile = 2 and the 90th = 10. Therefore the winsorized dataset is:

```
win_data = [2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]
win_mean = win_data.mean()
```

**Task:**

Write a function

`winsorize(data, cut_off=0.1)`

The function must modify

`data`

(type`np.ndarray`

) so that is it is winsorized.A cut_off = 0.1 specifies that the function uses the 10th and 90th percentiles as cut-offs

**Hints:**

There are multiple ways to solve this problem

You could use a

`for`

loopYou could create a function that is vectorized using

`np.vectorize()`

You could use

`np.where()`

and fancy indexing

```
# your code here ...
```

### Exercise 4:#

**Simple linear regression using data in numpy arrays**

In this example we will load two `numpy arrays from file and conduct a simple linear regression. The method of Ordinary Least Squares is used to fit a linear model (think \(y = \beta_1 x + \beta_0 + \epsilon \) ) to some data stored in numpy arrays.

We have two datasets:

`breach.csv`

: monthly totals of people waiting FOUR or more hours in English emergency departments`dtocs.csv`

: monthly total of the number of people waiting to be discharged (leave) hospial and go home or transfer to another form of healthcare.

You are going to (VERY naively) assess the relationship between these two variables. For the purposes of this exercise we are going to ignore that these two datasets are time-series data.

The library you will use to conduct linear regression is `statsmodels.api`

You will use the class `OLS`

which accepts two keyword arguments: y and x

Once you have conducted the analysis you will print the results summary.

**Task**:

Import

`statsmodels.api`

Load the dtoc and breaches datasets

Treating breaches as the dependent variable and dtocs as the independent variable perform the regression.

What is the \(R^2\) of your (naive) model?

**Hints:**

The columns of data in the two files have a header (the first line is a string). You will need to remove that before you can create a numpy array. Take a look at the options in

`np.genfromtxt()`

Your model will need an

**intercept**. This is not automatically added in`statsmodels`

. To do this you need to call`statsmodels.api.add_constant(X)`

. There is a worked example in the statsmodels documentation.

```
# statsmodels contains the OLS class you will use
import statsmodels.api as sm
RESPONSE_SUCCESS = 200
DTOC_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/dtocs.csv'
BREACH_URL = 'https://raw.githubusercontent.com/health-data-science-OR/' \
+ 'hpdm139-datasets/main/breach.csv'
def download_datasets(url):
'''
Downloads the dtoc and breaches datasets to your local machine
'''
response = requests.get(DTOC_URL)
if response.status_code == RESPONSE_SUCCESS:
# write the file
with open("dtocs.csv", "wb") as f:
f.write(response.content)
print('success: dtocs.csv downloaded.')
else:
print('connection error for dtocs')
response = requests.get(BREACH_URL)
if response.status_code == RESPONSE_SUCCESS:
# write the file
with open("breach.csv", "wb") as f:
f.write(response.content)
print('success: breach.csv downloaded.')
else:
print('connection error for dtocs')
```

```
download_datasets(DTOC_URL)
```

```
success: dtocs.csv downloaded.
success: breach.csv downloaded.
```

```
# your code here ...
```

## Exercise 5:#

**Preprocessing data to detrend a timeseries**

In exercise 5, we conducted a simple linear regression to assess the relationship between two variables. Our initial analysis is problematic because both variables are time series and contain autocorrelation and trend. This means that we violate some of the assumptions of ordinary least squares regression and our estimates of the relationship are likely to be incorrect.

In practice, we would pre-process the data in the `numpy`

arrays before conducting the regression. A simple way to do this is to take the first difference of the time series. If \(Y_t\) represents the value of the time series at time \(t\) then the first difference is equal to:

**Tip**: If an array \(a\) has length \(n\) then an array of the first differences of \(a\) is \(n-1\)

**Task**:

Copy your solution code for exercise 4

Modify the code to take the first difference of the

`breach`

and`dtoc`

arraysConduct the regression analysis using the first differences (detrended data)

Look at the new \(R^2\). Is there still strong evidence of a relationship?

**Hints**:

If you have an array

`data`

then to take the first difference of the array using:

```
differenced = np.diff(data)
```

```
# your code here ...
```