# Arrays versus lists

Before we get too far into how to use `numpy` for health data science I want to spend a bit of time illustrating the importance of the `numpy.ndarray`. So far we have used standard python data structures such as a `list` for holding ‘arrays’ of data. As a reminder, lists are easy to use and very flexible.

```
my_list = ['spam', 3, 9.5, 'eggs', ['sub list', 3], int]
my_list.append('foo')
my_list[0] = 999
print(my_list)
```

```
[999, 3, 9.5, 'eggs', ['sub list', 3], <class 'int'>, 'foo']
```

The flexibility of a `list` means that it is not well suited to scientific computing. `numpy` provides optimised, efficient code for managing data (typically quantitative data). A favourite phrase of mine that I heard used to describe `numpy` is that it’s **closer to the metal**.

For scientific computing you should use `numpy` instead of Python lists.

## Importing

It is fairly standard to import `numpy` and give it the alias `np`:

```
import numpy as np
```

## Performance differences

The fundamental building block of `numpy` is the `numpy.ndarray`. This stands for **n-dimensional** array. Let’s create one manually:

```
my_arr = np.array([1, 2, 3, 4, 5, 6])
print(my_arr, end='; ') # print full array
print(my_arr[3]) # access an element by its zero-based index
```

```
[1 2 3 4 5 6]; 4
```

I know what you are thinking again! That looks and behaves just like a python `list`! If you take nothing else away from this section remember that **an array is NOTHING like a list!**
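One way to see the difference for yourself is to apply the same operators to both structures. The snippet below is a small illustrative sketch (the variable names are my own, not part of the original example): a list treats `*` and `+` as repetition and concatenation, while an array applies them element by element.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Lists: * repeats the list and + concatenates two lists
doubled_list = values * 2
joined_list = values + [10]

# Arrays: * and + are applied to every element
doubled_arr = arr * 2
shifted_arr = arr + 10

print(doubled_list, joined_list)
print(doubled_arr, shifted_arr)
```

The list results grow in length, whereas the array results keep the same length and hold new values.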

Let’s compare the speed of lists and arrays by summing 1 million values. First let’s create the data structures and values in memory.

The `np.arange` function works in a similar way to `range` (with the caveat that `range` returns a lazy sequence object rather than storing all of its values in memory). It creates an array holding a sequence of numeric values of a given datatype. For example, `np.arange(5)` is equivalent to `np.array([0, 1, 2, 3, 4])`. We will look at ways to create arrays in the next section.

```
python_list = list(range(1_000_000))
numpy_array = np.arange(1_000_000)
```
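If you want to convince yourself that `np.arange` really does build the same array as typing the values out by hand, a quick check along these lines should do it. This is only a sketch; `from_arange` and `from_literal` are names I have made up for illustration.

```python
import numpy as np

from_arange = np.arange(5)
from_literal = np.array([0, 1, 2, 3, 4])

# array_equal compares shape and every element
print(np.array_equal(from_arange, from_literal))

# the integer datatype is chosen for you (its width depends on the platform)
print(from_arange.dtype)
```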

Now let’s compare the average time of computation using the IPython magic `%timeit`:

```
%timeit sum(python_list)
```

```
4.47 ms ± 102 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```
%timeit np.sum(numpy_array)
```

```
260 μs ± 18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

If you are struggling to understand the time difference reported by `%timeit` I can assure you that it is substantial: more than an order of magnitude, in fact. The python list takes milliseconds per loop while the array takes microseconds (1 ms = 1000 µs), so 4.47 ms versus 260 µs is a speed-up of roughly 17 times. This improvement in performance has implications for your scientific coding.
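`%timeit` only works inside IPython or a notebook. If you are running a plain Python script, the standard library `timeit` module gives a rough equivalent. The sketch below assumes 20 repeats (an arbitrary choice of mine to keep it quick), and the timings you get will differ from those reported above depending on your machine.

```python
import timeit

import numpy as np

python_list = list(range(1_000_000))
numpy_array = np.arange(1_000_000)

n_runs = 20  # arbitrary number of repeats, chosen to keep the example fast

# timeit.timeit returns the total seconds for n_runs calls
list_time = timeit.timeit(lambda: sum(python_list), number=n_runs) / n_runs
array_time = timeit.timeit(lambda: np.sum(numpy_array), number=n_runs) / n_runs

print(f"list:  {list_time * 1e3:.2f} ms per sum")
print(f"array: {array_time * 1e6:.0f} µs per sum")
print(f"speed-up: {list_time / array_time:.1f}x")
```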

## Difference in usage and behaviour

For someone new to `numpy` it is helpful to remember the following:

Array size and datatype are declared **upfront** and data are stored efficiently in memory.

Let’s test this statement and try a few operations on an `ndarray` that you would routinely use with a `list`. First let’s see if we can dynamically add a new element to the array (change its size).

```
my_arr.append(7)
```

```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 my_arr.append(7)
AttributeError: 'numpy.ndarray' object has no attribute 'append'
```

Due to the way `numpy` works, the size of an array is **fixed**. There is no direct append method (although as we will see later there is a way to do it, with an associated performance penalty).
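One such route is the `np.append` function, although it may not be the only one covered later. The sketch below shows why it carries a penalty: rather than growing the array in place, it allocates a brand new array and copies everything across. The name `longer_arr` is mine, for illustration only.

```python
import numpy as np

my_arr = np.array([1, 2, 3, 4, 5, 6])

# np.append does not modify my_arr; it builds and returns a new array
longer_arr = np.append(my_arr, 7)

print(my_arr)      # the original array is unchanged
print(longer_arr)  # a copy with the extra element on the end
```

Calling this repeatedly inside a loop copies the data every time, which is where the cost comes from.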

Now let’s look at the datatype difference:

```
my_arr[0] = 'Zero'
```

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[8], line 1
----> 1 my_arr[0] = 'Zero'
ValueError: invalid literal for int() with base 10: 'Zero'
```

Here `numpy` raised a `ValueError`. This was because `my_arr` can only contain integer values. Let’s see what happens if we try to overwrite element zero with a float.

```
my_arr[0] = 99.999999
print(my_arr)
```

```
[99 2 3 4 5 6]
```

This time there was no `ValueError`, but a more subtle error was introduced: the value was truncated to an integer. This is because memory is carefully managed in `numpy`, again for efficiency.
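You can inspect an array’s datatype through its `dtype` attribute, which helps explain the truncation. The snippet below is a minimal sketch: it repeats the assignment and then shows one possible way around it, converting to a float array with `astype` (the name `float_arr` is my own).

```python
import numpy as np

my_arr = np.array([1, 2, 3, 4, 5, 6])
print(my_arr.dtype)  # an integer type; the exact width depends on the platform

# Assigning a float to an integer array drops the fractional part
my_arr[0] = 99.999999
print(my_arr[0])

# A float copy of the array can hold fractional values
float_arr = my_arr.astype(float)
float_arr[1] = 2.5
print(float_arr)
```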

## This efficiency isn’t magic: it’s C and memory management

`numpy` is efficient because all of the computations it performs are ‘closer to the metal’, i.e. implemented in highly optimised C where arrays have a known fixed size and use contiguous blocks of memory (this is why we specify type and size upfront).

This might be confusing at first because python itself is written in C. The difference is that python syntax has an overhead associated with it to make it very easy to use, forgiving of your mistakes at runtime, and flexible (that’s more magical than speed imo). We’ve also demonstrated that python lists are very different from arrays. The size of a list isn’t fixed (and expanding it isn’t particularly expensive) and the data stored can be of any type. The trade-off is that the items in a python list can sit at “random” locations in memory, and python has to perform lots of type checking behind the scenes.
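To make the memory story a little more concrete, the sketch below inspects an array and a list of the same values. The attributes used (`itemsize`, `nbytes`, `flags`) and `sys.getsizeof` are standard, but the exact numbers printed depend on your platform and Python version, so treat them as indicative only.

```python
import sys

import numpy as np

numpy_array = np.arange(1_000)
python_list = list(range(1_000))

# Every element of the array has the same fixed size in bytes
print(numpy_array.dtype, numpy_array.itemsize)

# The data buffer is exactly size * itemsize bytes
print(numpy_array.nbytes)

# The elements sit in one contiguous block of memory
print(numpy_array.flags['C_CONTIGUOUS'])

# getsizeof on a list counts only the list's own storage of references
print(sys.getsizeof(python_list))

# Each element is a separate Python object with its own type and size
print(type(python_list[0]), sys.getsizeof(python_list[0]))
```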