Arrays versus lists#

Before we get too far into how to use numpy for health data science, I want to spend a bit of time illustrating the importance of the numpy.ndarray. So far we have used standard Python data structures, such as a list, for holding ‘arrays’ of data. As a reminder, lists are easy to use and very flexible.

my_list = ['spam', 3, 9.5, 'eggs', ['sub list', 3], int]
my_list.append('foo')
my_list[0] = 999
print(my_list)
[999, 3, 9.5, 'eggs', ['sub list', 3], <class 'int'>, 'foo']

The very flexibility of a list means that it is not well suited to scientific computing. numpy provides optimised, efficient code for managing data (typically quantitative data). A favourite phrase of mine, which I once heard used to describe numpy, is that it is ‘closer to the metal’.

For scientific computing you should use numpy instead of Python lists.

Importing#

It is fairly standard to import numpy and give it the alias np:

import numpy as np

Performance differences#

The fundamental building block of numpy is the numpy.ndarray. This stands for n-dimensional array. Let’s create one manually:

my_arr = np.array([1, 2, 3, 4, 5, 6])
print(my_arr, end='; ')  # print the full array
print(my_arr[3])  # access the element at index 3 (arrays are zero-indexed)
[1 2 3 4 5 6]; 4

I know what you are thinking again: that looks and behaves just like a Python list! If you take nothing else away from this section, remember that an array is NOTHING like a list!

Let’s compare the speed of a list and an array by summing 1 million values. First let’s create the data structures and values in memory.

The np.arange function works in a similar way to range (with the caveat that range returns a lazy sequence rather than an array). It creates a sequence of numeric values of a given datatype. For example, np.arange(5) is equivalent to np.array([0, 1, 2, 3, 4]). We will look at ways to create arrays in the next section.
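As a quick illustration (a minimal sketch; both calls below use only np.arange itself):

print(np.arange(5))         # [0 1 2 3 4]
print(np.arange(2, 10, 2))  # start, stop (exclusive) and step -> [2 4 6 8]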

python_list = list(range(1_000_000))
numpy_array = np.arange(1_000_000)

Now let’s compare the average time of computation using the IPython magic %timeit.

%timeit sum(python_list)
4.51 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.sum(numpy_array)
382 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you are struggling to interpret the time difference reported by %timeit, I can assure you that it is substantial - an order of magnitude, in fact. The list sum takes milliseconds per loop while the array sum takes microseconds, and 1 ms = 1,000 µs. So 4.51 ms is roughly 4,510 µs, around twelve times slower than the 382 µs the array achieves. This improvement in performance has real implications for your scientific coding.

Difference in usage and behaviour#

For someone new to numpy it is helpful to remember the following:

  • Array size and datatype are declared upfront and data are stored efficiently in memory (see the quick check below).
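
You can inspect these properties directly on the array we created earlier. A minimal sketch; dtype, size, itemsize and nbytes are all standard ndarray attributes:

print(my_arr.dtype)     # datatype of every element, e.g. int64 on most 64-bit systems
print(my_arr.size)      # number of elements
print(my_arr.itemsize)  # bytes used per element
print(my_arr.nbytes)    # total bytes in the data buffer = size * itemsize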

Let’s test this statement and try a few operations on an ndarray that you would routinely use with a list. First let’s see if we can dynamically add a new element to the array (i.e. change its size).

my_arr.append(7)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-d7ebd5f93b3a> in <module>
----> 1 my_arr.append(7)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

Due to the way numpy works, the size of an array is fixed. There is no direct append method (although, as shown below, there is a way with an associated performance penalty).
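For completeness, the usual workaround is np.append, which allocates and returns a brand new array on each call rather than growing the existing one; that copying is the performance penalty. A minimal sketch:

bigger = np.append(my_arr, 7)  # copies my_arr's data into a new, larger array
print(bigger)                  # [1 2 3 4 5 6 7]
print(my_arr)                  # the original is unchanged: [1 2 3 4 5 6]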

Now let’s look at the datatype difference:

my_arr[0] = 'Zero'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-e9bed3bd39fc> in <module>
----> 1 my_arr[0] = 'Zero'

ValueError: invalid literal for int() with base 10: 'Zero'

Here numpy raised a ValueError. This was because my_arr can only contain integer values. Let’s see what happens if we try to overwrite element zero with a float.

my_arr[0] = 99.999999
print(my_arr)
[99  2  3  4  5  6]

This time there was no ValueError, but a more subtle error was introduced: the value was silently truncated to an integer. This is because memory is carefully managed in numpy, again for efficiency.
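If you genuinely need to store floats, the fix is to declare the datatype upfront. A minimal sketch, assuming the standard float64 datatype:

float_arr = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
float_arr[0] = 99.999999
print(float_arr)  # no truncation this time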

This efficiency isn’t magic, it’s C and memory management.#

numpy is efficient because all of the computations it performs are ‘closer to the metal’, i.e. implemented in highly optimised C, where arrays have a known fixed size and use contiguous blocks of memory (this is why we specify type and size upfront).

This might be confusing at first, because the standard Python interpreter (CPython) is itself written in C. The difference is that Python syntax carries an overhead to make it very easy to use, forgiving of your mistakes at runtime, and flexible (that’s more magical than speed, in my opinion). We’ve also demonstrated that Python lists are very different from arrays. The size of a list isn’t fixed (and expanding one isn’t particularly expensive), and it can store data of any type. The trade-off is that a Python list stores pointers to objects scattered across memory and has to perform lots of type checking behind the scenes.
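We can make this memory difference concrete. The sketch below is illustrative only: exact byte counts vary by platform and Python build.

import sys

values = list(range(1_000))
arr = np.arange(1_000)

# the array stores 1,000 fixed-size integers in one contiguous buffer
print(arr.nbytes)  # e.g. 8000 bytes on a typical 64-bit system

# the list stores 1,000 pointers, each referring to a separate Python int object
list_total = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_total)  # typically several times larger than arr.nbytes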