Do micro-optimisations matter?

In short, micro-optimisations of your Python code matter less than an efficient design. That is, 'trial and error' versus a 'prime sieve', or multiple versus single data structures (at least in this example), matters more than a for loop versus a list slice. The difference is that the former gave us an order-of-magnitude speed-up, while the latter helped, but to a much smaller extent.
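To make the loop-versus-slice contrast concrete, here is a small sketch (the function names are mine, not from the text) timing an element-by-element loop against a single slice assignment performing the same fill:

```python
import timeit

def zero_with_loop(n):
    # Fill a list with zeros one element at a time.
    out = [None] * n
    for i in range(n):
        out[i] = 0
    return out

def zero_with_slice(n):
    # Fill the same list using a single slice assignment.
    out = [None] * n
    out[:] = [0] * n
    return out

loop_t = timeit.timeit(lambda: zero_with_loop(10_000), number=200)
slice_t = timeit.timeit(lambda: zero_with_slice(10_000), number=200)
print(f"loop:  {loop_t:.4f}s")
print(f"slice: {slice_t:.4f}s")
```

The slice version is usually noticeably faster, but both are dwarfed by the savings from a better algorithm, which is the point of this section.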

Another example of a micro-optimisation (at least in standard Python) relates to the choice of data structure. prime_sieve2 made use of a list of boolean values. Let's have a look at its size in memory, in bytes.

import math
import sys
candidates_bool = [True] * 10
sys.getsizeof(candidates_bool)
136
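Most of those 136 bytes are pointers: on a common 64-bit CPython build, each list element is an 8-byte reference to the shared True object (which sys.getsizeof does not count), plus a fixed list header. A quick sketch, assuming 64-bit CPython:

```python
import sys

# Each extra element adds one pointer to the list's backing array,
# so the reported size grows by the pointer width (8 bytes on 64-bit).
small = sys.getsizeof([True] * 10)
large = sys.getsizeof([True] * 11)
print(large - small)  # typically 8 on 64-bit CPython
```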

As we only effectively track 0s (False) and 1s (True), there is an option to shrink the memory requirement substantially by using a Python bytearray.

candidates_bytes = bytearray(b"\x01") * 10
print(candidates_bytes)
print(sys.getsizeof(candidates_bytes))
bytearray(b'\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01')
67
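The saving grows with size: a bytearray stores one byte per flag, while the list stores one pointer per element. A small sketch comparing the growth of the two (exact numbers depend on the CPython build):

```python
import sys

# Growing both containers by 990 elements: the bytearray gains roughly
# one byte per element, the list roughly one pointer per element.
byte_growth = sys.getsizeof(bytearray(b"\x01") * 1000) - sys.getsizeof(bytearray(b"\x01") * 10)
list_growth = sys.getsizeof([True] * 1000) - sys.getsizeof([True] * 10)
print(f"bytearray growth: {byte_growth} bytes")
print(f"list growth:      {list_growth} bytes")
```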

That approximately halves memory usage at this small size. That's a nice feature. prime_sieve3 updates our sieve to use a bytearray. We will also compare its performance to prime_sieve2.

def prime_sieve3(n):
    # bytearray here instead of bools to reduce memory overhead.
    candidates = bytearray(b"\x01") * (n + 1)
    # 0 and 1 instead of False and True
    candidates[0] = candidates[1] = 0
    limit = int(math.sqrt(n)) + 1    
    
    for i in range(2, limit): 
        if candidates[i]:
            candidates[i+i::i] = [0] * ((n - i) // i)
            
    return [i for i in range(n+1) if candidates[i]]      
TEN_THOUSAND = 10_000
%timeit prime_sieve3(TEN_THOUSAND)
731 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Well, that's a bit of a surprise. Our memory-efficient prime_sieve3 is slightly slower on average than prime_sieve2. But, and it's a big but, that's only for the first 10,000 natural numbers. When we compute larger primes the pattern reverses. For example, prime_sieve3 is faster at computing the primes up to 10 million, and this performance gap widens as we compute larger and larger primes. So for very small primes this 'optimisation' doesn't help, but it may help for larger primes (large is, of course, relative).
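Whenever we swap the underlying data structure it is worth checking that the results are unchanged. The sketch below restates prime_sieve3 in self-contained form (writing the zero-fill as bytearray((n - i) // i), which is an equivalent all-zeros buffer) and verifies it against simple trial division:

```python
import math

def prime_sieve3(n):
    # Byte flags: index i is 1 while i is still a prime candidate.
    candidates = bytearray(b"\x01") * (n + 1)
    candidates[0] = candidates[1] = 0
    for i in range(2, int(math.sqrt(n)) + 1):
        if candidates[i]:
            # bytearray(k) is k zero bytes: cross off every multiple of i.
            candidates[i + i :: i] = bytearray((n - i) // i)
    return [i for i in range(n + 1) if candidates[i]]

def is_prime(n):
    # Simple trial division, used only to verify the sieve.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(math.sqrt(n)) + 1))

assert prime_sieve3(100) == [i for i in range(101) if is_prime(i)]
print(len(prime_sieve3(10_000)))  # 1229 primes below 10,000
```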

def prime_sieve2(n):
    # We will use booleans here instead of ints
    candidates = [True] * (n + 1)
    candidates[0] = candidates[1] = False
    limit = int(math.sqrt(n)) + 1    
    
    for i in range(2, limit): 
        if candidates[i]:
            candidates[i+i::i] = [False] * ((n - i) // i)
            
    return [i for i in range(n+1) if candidates[i]]        
TEN_MILLION = 10_000_000
%timeit prime_sieve2(TEN_MILLION)
%timeit prime_sieve3(TEN_MILLION)
1.06 s ± 6.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
801 ms ± 5.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
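The reversal at scale is consistent with the raw size of the two candidate buffers: at n = 10 million the boolean list needs roughly eight times the memory of the bytearray (on 64-bit CPython), which plausibly matters once the data no longer fits comfortably in the CPU caches. A quick sketch of the buffer sizes alone:

```python
import sys

N = 10_000_000
bool_list = [True] * (N + 1)               # one pointer per element
byte_flags = bytearray(b"\x01") * (N + 1)  # one byte per element
print(f"list of bools: {sys.getsizeof(bool_list) / 1e6:.1f} MB")
print(f"bytearray:     {sys.getsizeof(byte_flags) / 1e6:.1f} MB")
```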