Intro to profiling
Python's dirty little secret is that it can be made to run pretty fast.
The bare-metal HPC people will be angrily tweeting at me now, or rather, they would be if they could get their wireless drivers working.
Still, there are some things you really don't want to do in Python. Nested loops are usually a bad idea. But often you won't know where your code is slowing down just by looking at it, and trying to accelerate everything can be a waste of time. (Developer time, that is, both now and in the future: you incur technical debt if you unintentionally obfuscate code to make it faster when it doesn't need to be.)
The first step is always to find the bottlenecks in your code, via profiling: analyzing your code by measuring the execution time of its parts.
Tools
- cProfile
- line_profiler
- timeit
Note: If you haven't already installed line_profiler, you can do
conda install line_profiler
or
pip install line_profiler
Some bad code
Here's a bit of code guaranteed to perform poorly: it sleeps for 1.5 seconds after doing any work! We will profile it and see where we might be able to help.
import numpy
from time import sleep

def bad_call(dude):
    sleep(.5)

def worse_call(dude):
    sleep(1)

def sumulate(foo):
    if not isinstance(foo, int):
        return
    a = numpy.random.random((1000, 1000))
    a @ a
    ans = 0
    for i in range(foo):
        ans += i
    bad_call(ans)
    worse_call(ans)
    return ans

sumulate(150)
11175
using cProfile
cProfile is the built-in profiler in Python (available since Python 2.5). It provides a function-by-function report of execution time. First import the module, then usage is simply a call to cProfile.run() with your code as its argument. It will print out a list of all the functions that were called, with the number of calls and the time spent in each.
import cProfile
cProfile.run('sumulate(150)')
10 function calls in 1.549 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.023 0.023 1.549 1.549 <ipython-input-1-ab37b7e7eed9>:10(sumulate)
1 0.000 0.000 0.501 0.501 <ipython-input-1-ab37b7e7eed9>:4(bad_call)
1 0.000 0.000 1.001 1.001 <ipython-input-1-ab37b7e7eed9>:7(worse_call)
1 0.000 0.000 1.549 1.549 <string>:1(<module>)
1 0.000 0.000 1.549 1.549 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2 1.502 0.751 1.502 0.751 {built-in method time.sleep}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.024 0.024 0.024 0.024 {method 'random_sample' of 'mtrand.RandomState' objects}
You can see here that when our code sumulate() executes, it spends almost all its time in the method time.sleep (a bit over 1.5 seconds).
If your program is more complicated than this cute demo, you'll have a hard time parsing the long output of cProfile. In that case, you may want a profiling visualization tool, like SnakeViz. But that is outside the scope of this tutorial.
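If you want to stay in plain Python, you can also dump the profile to a file and slice it with the built-in pstats module. Here is a minimal sketch (the stats.prof filename is just an illustration):

import cProfile
import pstats

# Save the raw profile of sumulate() to a file instead of printing it
cProfile.run('sumulate(150)', 'stats.prof')

# Load the stats, sort by cumulative time, and show only the top 5 entries
stats = pstats.Stats('stats.prof')
stats.sort_stats('cumulative').print_stats(5)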
using line_profiler
line_profiler offers more granular information than cProfile: it will give timing information about each line of code in a profiled function.
Load the line_profiler extension:
%load_ext line_profiler
For a pop-up window with results in the notebook:
IPython has an %lprun magic to profile specific functions within an executed statement. Usage:
%lprun -f func_to_profile <statement>
(get more help by running %lprun? in IPython).
Profiling two functions
%lprun -f bad_call -f worse_call sumulate(13)
Write results to a text file
%lprun -T timings.txt -f sumulate sumulate(12)
*** Profile printout saved to text file 'timings.txt'.
%load timings.txt
Let's break down what these results are telling us.
Line # corresponds to the line in a given script or notebook cell. To toggle line numbers in a code cell, hit l in command mode.
Hits shows how many times a given line was encountered as a result of running the sumulate function. You can see that most lines in sumulate were hit only once, while the lines in the for loop were hit several times.
Time is the amount of time spent, in total, on a given line. Note at the top of the profiling report that the time unit here is microseconds.
Per hit is the amount of time, on average, that each hit on a given line took. This is the same as the Time column for lines that were hit only once, but you can see that the Per hit time has more meaning for the lines in the for loop.
% Time is what we are really interested in. It tells us what percentage of the total run time of sumulate was spent on a given line. There are lots of ways to optimize all sorts of operations, but you should focus your time and energy on optimizing the code that is costing you the most time.
You could try to further optimize the a @ a matrix multiplication to make it run a little bit faster. You might even shave off a microsecond (although probably not). But who cares? That matrix multiply takes up 2% of the runtime. Focus your efforts on the expensive parts, which here are bad_call and worse_call.
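To make the point concrete, here is one possible rewrite, assuming the sleeps stand in for work we can avoid entirely; the name sumulate_faster and the closed-form sum are just illustrations, not the one true fix:

def sumulate_faster(foo):
    # Same guard and matrix work as the original
    if not isinstance(foo, int):
        return
    a = numpy.random.random((1000, 1000))
    a @ a
    # Closed-form sum replaces the Python-level loop (a minor win)
    ans = foo * (foo - 1) // 2
    # The real savings: drop bad_call/worse_call, which only sleep
    return ans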
Profiling on the command line
Open the file, add the @profile decorator to any function you want to profile, then run
kernprof -l script_to_profile.py
which will generate script_to_profile.py.lprof (a pickled result). To view the results, run
python -m line_profiler script_to_profile.py.lprof
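For reference, a standalone script set up for kernprof might look like the sketch below; the file name and slow_function are hypothetical, and note that @profile is injected by kernprof at run time, so the script will fail under plain python unless you remove or guard the decorator:

# script_to_profile.py -- minimal sketch for kernprof
from time import sleep

@profile            # injected into builtins by kernprof -l
def slow_function(n):
    total = 0
    for i in range(n):
        total += i
    sleep(1)        # the obvious hotspot
    return total

if __name__ == '__main__':
    slow_function(1000)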
from IPython.display import IFrame
IFrame('http://localhost:8888/terminals/1', width=800, height=700)
timeit
timeit is not perfect, but it is helpful.
Potential concerns re: timeit (each can be worked around on the command line, as shown below):
- It reports the minimum time of the runs
- It only repeats the benchmark 3 times by default
- It disables garbage collection during timing
python -m timeit -v "print(42)"
python -m timeit -r 25 "print(42)"
python -m timeit -s "gc.enable()" "print(42)"
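The same knobs exist in the timeit module itself, which can be handier inside a script; a minimal sketch using timeit.repeat (the statement and the repeat/number values are arbitrary):

import timeit

# repeat=25 gives more samples than the default; the setup string
# re-enables garbage collection, which timeit normally turns off
times = timeit.repeat('sum(range(1000))',
                      setup='gc.enable()',
                      repeat=25,
                      number=10000)

# Look at the spread, not just the minimum
print(min(times), max(times))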
Line magic
%timeit x = 5
100000000 loops, best of 3: 11.3 ns per loop
Cell magic
%%timeit
x = 5
y = 6
x + y
10000000 loops, best of 3: 32.9 ns per loop
The -q flag quiets output. The -o flag allows outputting results to a variable. The -q flag sometimes disagrees with OSX, so please remove it if you're having issues.
a = %timeit -qo x = 5
print(a.all_runs)
[1.1260504741221666, 1.1415640730410814, 1.1300840568728745]
print(a.best)
1.1260504741221666e-08
print(a.worst)
4.5797787606716156e-07
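One handy use of -o is saving the results of two candidate snippets so you can compare them programmatically; a quick sketch (the two statements are arbitrary examples):

listcomp = %timeit -qo sum([i for i in range(1000)])
genexpr = %timeit -qo sum(i for i in range(1000))

# TimeitResult objects expose .best, .worst, and .all_runs
print(listcomp.best, genexpr.best)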