How to benchmark Python code with pytest-benchmark
Everett Pompeii
What is Benchmarking?
Benchmarking is the practice of testing the performance of your code to see how fast it runs (latency) or how much work it can do (throughput). This often overlooked step in software development is crucial for creating and maintaining fast and performant code. Benchmarking provides the necessary metrics for developers to understand how well their code performs under various workloads and conditions. For the same reasons that you write unit and integration tests to prevent feature regressions, you should write benchmarks to prevent performance regressions. Performance bugs are bugs!
Write FizzBuzz in Python
In order to write benchmarks, we need some source code to benchmark. To start off, we are going to write a very simple program: FizzBuzz.
The rules for FizzBuzz are as follows:
Write a program that prints the integers from 1 to 100 (inclusive):

- For multiples of three, print Fizz
- For multiples of five, print Buzz
- For multiples of both three and five, print FizzBuzz
- For all others, print the number
There are many ways to write FizzBuzz. So we’ll go with my favorite:
```python
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

- Iterate from 1 to 100, using a range of 101.
- For each number, calculate the modulus (remainder after division).
- If the remainder is 0, then the number is a multiple of the given factor:
  - If the remainder is 0 for 15, then print FizzBuzz.
  - If the remainder is 0 for 3, then print Fizz.
  - If the remainder is 0 for 5, then print Buzz.
- Otherwise, just print the number.
Follow Step-by-Step
In order to follow along with this step-by-step tutorial, you will need to install Python and install pipenv.
🐰 The source code for this post is available on GitHub.
Create a Python file named game.py,
and set its contents to the above FizzBuzz implementation.
Then run python game.py.
The output should look like:
```
$ python game.py
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
...
97
98
Fizz
Buzz
```

🐰 Boom! You’re cracking the coding interview!
Before going any further, it is important to discuss the differences between micro-benchmarking and macro-benchmarking.
Micro-Benchmarking vs Macro-Benchmarking
There are two major categories of software benchmarks: micro-benchmarks and macro-benchmarks.
Micro-benchmarks operate at a level similar to unit tests.
For example, a benchmark for a function that determines Fizz, Buzz, or FizzBuzz for a single number would be a micro-benchmark.
Macro-benchmarks operate at a level similar to integration tests.
For example, a benchmark for a function that plays the entire game of FizzBuzz, from 1 to 100, would be a macro-benchmark.
Generally, it is best to test at the lowest level of abstraction possible. In the case of benchmarks, this makes them both easier to maintain and helps to reduce the amount of noise in the measurements. However, just as having some end-to-end tests can be very useful for sanity checking that the entire system comes together as expected, having macro-benchmarks can be very useful for making sure that the critical paths through your software remain performant.
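To make that distinction concrete with pytest-benchmark, the tool we use below, a rough sketch might look like this. Note that fizz_buzz and play_game are the helper functions we extract later in this post, so this is a preview rather than runnable-right-now code:

```python
# Micro-benchmark: time the single-number logic, unit-test scale.
def test_fizz_buzz_micro(benchmark):
    benchmark(fizz_buzz, 15)

# Macro-benchmark: time a full game from 1 to 100, integration-test scale.
def test_play_game_macro(benchmark):
    def full_game():
        for i in range(1, 101):
            play_game(i, False)
    benchmark(full_game)
```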
Benchmarking in Python
The two popular options for benchmarking in Python are: pytest-benchmark and airspeed velocity (asv).
pytest-benchmark is a powerful benchmarking tool
integrated with the popular pytest testing framework.
It allows developers to measure and compare the performance of their code by running benchmarks alongside their unit tests.
Users can easily compare their benchmark results locally
and export their results in various formats, such as JSON.
airspeed velocity (asv) is another advanced benchmarking tool in the Python ecosystem.
One of the key benefits of asv is its ability to generate detailed and interactive HTML reports,
which make it easy to visualize performance trends and identify regressions.
Additionally, asv supports Relative Continuous Benchmarking out of the box.
Both are supported by Bencher.
So why choose pytest-benchmark?
pytest-benchmark integrates seamlessly with pytest,
which is the de facto standard unit test harness in the Python ecosystem.
I would suggest using pytest-benchmark for benchmarking your code’s latency,
especially if you are already using pytest.
That is, pytest-benchmark is great for measuring wall clock time.
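If you haven’t used it before, nearly the entire API surface you need is the benchmark fixture itself. A minimal sketch, independent of our game:

```python
# Minimal sketch of the pytest-benchmark fixture API.
# benchmark(fn, *args) calls fn repeatedly, records timing statistics,
# and returns fn's result so you can still assert on correctness.
def test_str_conversion(benchmark):
    result = benchmark(str, 42)
    assert result == "42"
```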
Refactor FizzBuzz
In order to test our FizzBuzz application, we need to decouple our logic from our program’s main execution. Benchmark harnesses can’t benchmark the main execution. In order to do this, we need to make a few changes.
Let’s refactor our FizzBuzz logic into a couple of functions:
```python
def play_game(n, should_print):
    result = fizz_buzz(n)
    if should_print:
        print(result)
    return result

def fizz_buzz(n):
    if n % 15 == 0:
        return "FizzBuzz"
    elif n % 3 == 0:
        return "Fizz"
    elif n % 5 == 0:
        return "Buzz"
    else:
        return str(n)
```

- play_game: Takes in an integer n, calls fizz_buzz with that number, and if should_print is True, prints the result.
- fizz_buzz: Takes in an integer n and performs the actual Fizz, Buzz, FizzBuzz, or number logic, returning the result as a string.
Then update the main execution to look like this:
```python
for i in range(1, 101):
    play_game(i, True)
```

The main execution for our program iterates through the numbers 1 to 100 inclusive and calls play_game for each number, with should_print set to True.
Benchmarking FizzBuzz
In order to benchmark our code, we need to create a test function that runs our benchmark.
At the bottom of game.py add the following code:
```python
def test_game(benchmark):
    def run_game():
        for i in range(1, 101):
            play_game(i, False)
    benchmark(run_game)
```

- Create a function named test_game that takes in a pytest-benchmark benchmark fixture.
- Create a run_game function that iterates from 1 to 100 inclusive.
  - For each number, call play_game, with should_print set to False.
- Pass the run_game function to the benchmark runner.
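As an aside, the closure isn’t strictly required. The benchmark fixture can also take a function plus its arguments directly, and it returns the function’s result, so you can even check correctness in the same test. A small optional variation:

```python
# Optional variation: pass the function and its arguments straight
# to the benchmark fixture instead of wrapping them in a closure.
def test_play_game_once(benchmark):
    result = benchmark(play_game, 100, False)
    assert result == "Buzz"  # 100 is a multiple of 5, but not of 3 or 15
```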
Now we need to configure our project to run our benchmarks.
Create a new virtual environment with pipenv:
```
$ pipenv shell
Creating a Pipfile for this project...
Launching subshell in virtual environment...
source /usr/bencher/.local/share/virtualenvs/test-xnizGmtA/bin/activate
```

Install pytest-benchmark inside of that new pipenv environment:

```
$ pipenv install pytest-benchmark
Creating a Pipfile for this project...
Installing pytest-benchmark...
Resolving pytest-benchmark...
Added pytest-benchmark to Pipfile's [packages] ...
✔ Installation Succeeded
Pipfile.lock not found, creating...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success!
Locking [dev-packages] dependencies...
Updated Pipfile.lock (be953321071292b6175f231c7e2e835a3cd26169a0d52b7b781b344d65e8cce3)!
Installing dependencies from Pipfile.lock (e8cce3)...
```

Now we’re ready to benchmark our code. Run pytest game.py:
```
$ pytest game.py
======================================================= test session starts ========================================================
platform darwin -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /usr/bencher/examples/python/pytest_benchmark
plugins: benchmark-4.0.0
collected 1 item

game.py .                                                                                                                     [100%]

------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in us)        Min       Max     Mean  StdDev   Median     IQR   Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_game            10.5416  237.7499  10.8307  1.3958  10.7088  0.1248  191;10096       92.3304   57280           1
-----------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================== 1 passed in 1.68s =========================================================
```

🐰 Lettuce turnip the beet! We’ve got our first benchmark metrics!
Finally, we can rest our weary developer heads… Just kidding, our users want a new feature!
Write FizzBuzzFibonacci in Python
Our Key Performance Indicators (KPIs) are down, so our Product Manager (PM) wants us to add a new feature. After much brainstorming and many user interviews, it is decided that good ole FizzBuzz isn’t enough. Kids these days want a new game, FizzBuzzFibonacci.
The rules for FizzBuzzFibonacci are as follows:
Write a program that prints the integers from 1 to 100 (inclusive):

- For multiples of three, print Fizz
- For multiples of five, print Buzz
- For multiples of both three and five, print FizzBuzz
- For numbers that are part of the Fibonacci sequence, only print Fibonacci
- For all others, print the number
The Fibonacci sequence is a sequence in which each number is the sum of the two preceding numbers.
For example, starting at 0 and 1, the next number in the Fibonacci sequence would be 1.
Followed by: 2, 3, 5, 8 and so on.
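To make that concrete, here is a quick throwaway snippet, separate from our game, that prints the first ten numbers in the sequence:

```python
# Print the first ten Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
previous, current = 0, 1
for _ in range(10):
    print(previous)
    previous, current = current, previous + current
```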
Numbers that are part of the Fibonacci sequence are known as Fibonacci numbers. So we’re going to have to write a function that detects Fibonacci numbers.
There are many ways to write the Fibonacci sequence and likewise many ways to detect a Fibonacci number. So we’ll go with my favorite:
```python
def is_fibonacci_number(n):
    for i in range(n + 1):
        previous, current = 0, 1
        while current < i:
            next_value = previous + current
            previous = current
            current = next_value
        if current == n:
            return True
    return False
```

- Create a function named is_fibonacci_number that takes in an integer and returns a boolean.
- Iterate over all numbers from 0 to our given number n inclusive.
- Initialize our Fibonacci sequence starting with 0 and 1 as the previous and current numbers respectively.
- Iterate while the current number is less than the current iteration i.
  - Add the previous and current numbers to get the next_value number.
  - Update the previous number to the current number.
  - Update the current number to the next_value number.
- Once current is greater than or equal to the current iteration i, we will exit the loop.
- Check to see if the current number is equal to the given number n, and if so return True.
- Otherwise, return False.
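Before wiring it into the game, a quick sanity check never hurts. These asserts are my own scratch test, not part of game.py:

```python
# Scratch sanity check for is_fibonacci_number.
# Fibonacci numbers up to 100: 0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89.
assert is_fibonacci_number(13) is True
assert is_fibonacci_number(55) is True
assert is_fibonacci_number(4) is False
assert is_fibonacci_number(100) is False
```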
Now we will need to update our fizz_buzz function:
```python
def fizz_buzz_fibonacci(n):
    if is_fibonacci_number(n):
        return "Fibonacci"
    elif n % 15 == 0:
        return "FizzBuzz"
    elif n % 3 == 0:
        return "Fizz"
    elif n % 5 == 0:
        return "Buzz"
    else:
        return str(n)
```

- Rename the fizz_buzz function to fizz_buzz_fibonacci to make it more descriptive.
- Call our is_fibonacci_number helper function.
- If the result from is_fibonacci_number is True, then return Fibonacci.
- If the result from is_fibonacci_number is False, then perform the same Fizz, Buzz, FizzBuzz, or number logic, returning the result.
Because we renamed fizz_buzz to fizz_buzz_fibonacci we also need to update our play_game function:
```python
def play_game(n, should_print):
    result = fizz_buzz_fibonacci(n)
    if should_print:
        print(result)
    return result
```

Both our main execution and the test_game function can stay exactly the same.
Benchmarking FizzBuzzFibonacci
Now we can rerun our benchmark:
```
$ pytest game.py
======================================================= test session starts ========================================================
platform darwin -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /usr/bencher/examples/python/pytest_benchmark
plugins: benchmark-4.0.0
collected 1 item

game.py .                                                                                                                     [100%]

--------------------------------------------------- benchmark: 1 tests --------------------------------------------------
Name (time in us)         Min       Max      Mean   StdDev    Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------
test_game            726.9592  848.2919  735.5682  13.4925  731.4999  4.7078   146;192        1.3595    1299           1
-------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================== 1 passed in 1.97s =========================================================
```

Scrolling back through our terminal history,
we can make an eyeball comparison between the performance of our FizzBuzz and FizzBuzzFibonacci games: 10.8307 us vs 735.5682 us.
Your numbers will be a little different than mine.
However, the difference between the two games should be somewhere around 70x (here, 735.5682 us / 10.8307 us is roughly 68x).
That seems good to me! Especially for adding a feature as fancy sounding as Fibonacci to our game.
The kids will love it!
Expand FizzBuzzFibonacci in Python
Our game is a hit! The kids do indeed love playing FizzBuzzFibonacci.
So much so that word has come down from the execs that they want a sequel.
But this is the modern world: we need Annual Recurring Revenue (ARR), not one-time purchases!
The new vision for our game is that it is open ended: no more living between the bounds of 1 and 100 (even if they are inclusive).
No, we’re on to new frontiers!
The rules for Open World FizzBuzzFibonacci are as follows:
Write a program that takes in any positive integer and prints:
- For multiples of three, print Fizz
- For multiples of five, print Buzz
- For multiples of both three and five, print FizzBuzz
- For numbers that are part of the Fibonacci sequence, only print Fibonacci
- For all others, print the number
In order to have our game work for any number, we will need to accept a command line argument. Update the main execution to look like this:
```python
import sys

args = sys.argv
if len(args) > 1 and args[1].isdigit():
    i = int(args[1])
    play_game(i, True)
else:
    print("Please, enter a positive integer to play...")
```

- Import the sys package.
- Collect all of the arguments (args) passed to our game from the command line.
- Get the first argument passed to our game and check to see if it is a digit.
  - If so, parse the first argument as an integer, i.
  - Play our game with the newly parsed integer i.
- If parsing fails or no argument is passed in, default to prompting for a valid input.
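If you’d rather not hand-roll the parsing, here is a hypothetical alternative using the standard library’s argparse; it isn’t used in the rest of this post, and it rejects missing or non-integer input with a usage error automatically:

```python
# Hypothetical alternative main execution using argparse.
# argparse handles the usage message and integer conversion for us.
# For brevity, this sketch does not enforce that the integer is positive.
import argparse

parser = argparse.ArgumentParser(description="Play Open World FizzBuzzFibonacci")
parser.add_argument("number", type=int, help="a positive integer to play with")
cli_args = parser.parse_args()
play_game(cli_args.number, True)
```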
Now we can play our game with any number!
Run python game.py followed by an integer to play our game:
```
$ python game.py 9
Fizz
$ python game.py 10
Buzz
$ python game.py 13
Fibonacci
```

And if we omit or provide an invalid number:
```
$ python game.py
Please, enter a positive integer to play...
$ python game.py bad
Please, enter a positive integer to play...
```

Wow, that was some thorough testing! CI passes. Our bosses are thrilled. Let’s ship it! 🚀
The End


🐰 … the end of your career maybe?
Just kidding! Everything is on fire! 🔥
Well, at first everything seemed to be going fine. And then at 02:07 AM on Saturday my pager went off:
📟 Your game is on fire! 🔥
After scrambling out of bed, I tried to figure out what was going on. I tried to search through the logs, but that was hard because everything kept crashing. Finally, I found the issue. The kids! They loved our game so much, they were playing it all the way up to a million! In a flash of brilliance, I added two new benchmarks:
```python
def test_game_100(benchmark):
    def run_game():
        play_game(100, False)
    benchmark(run_game)

def test_game_1_000_000(benchmark):
    def run_game():
        play_game(1_000_000, False)
    benchmark(run_game)
```

- A micro-benchmark test_game_100 for playing the game with the number one hundred (100)
- A micro-benchmark test_game_1_000_000 for playing the game with the number one million (1_000_000)
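As a quick aside, while iterating on a single benchmark you can use pytest’s standard -k filter to run just one of them:

```
$ pytest game.py -k test_game_1_000_000
```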
When I ran it, I got this:
```
$ pytest game.py
======================================================= test session starts ========================================================
platform darwin -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /usr/bencher/examples/python/pytest_benchmark
plugins: benchmark-4.0.0
collected 3 items

game.py ...                                                                                                                   [100%]
```

Wait for it… wait for it…

```
--------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------
Name (time in us)                 Min                       Max                      Mean                    StdDev                    Median                       IQR             Outliers                OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------
test_game_100                 15.4166 (1.0)            112.8749 (1.0)            15.8470 (1.0)          1.1725 (1.0)            15.6672 (1.0)           0.1672 (1.0)       1276;7201  63,103.3078 (1.0)    58970           1
test_game                    727.0002 (47.16)        1,074.3327 (9.52)          754.3231 (47.60)       33.2047 (28.32)         748.9999 (47.81)        33.7283 (201.76)      134;54    1,325.6918 (0.02)    1319            1
test_game_1_000_000      565,232.3328 (>1000.0)  579,829.1252 (>1000.0)  571,684.6334 (>1000.0)    6,365.1577 (>1000.0)    568,294.3747 (>1000.0)  10,454.0113 (>1000.0)       2;0         1.7492 (0.00)       5            1
---------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================== 3 passed in 7.01s =========================================================
```

What! Scaling the input by 10,000x means 15.8470 us x 10,000 should be around 158,470.0 us, not 571,684.6334 us 🤯
Even though I got my Fibonacci sequence code functionally correct, I must have a performance bug in there somewhere.
Fix FizzBuzzFibonacci in Python
Let’s take another look at that is_fibonacci_number function:
```python
def is_fibonacci_number(n):
    for i in range(n + 1):
        previous, current = 0, 1
        while current < i:
            next_value = previous + current
            previous = current
            current = next_value
        if current == n:
            return True
    return False
```

Now that I’m thinking about performance, I do realize that I have an unnecessary, extra loop.
We can completely get rid of the for i in range(n + 1): loop and
just compare the current value to the given number (n) 🤦
```python
def is_fibonacci_number(n):
    previous, current = 0, 1
    while current < n:
        next_value = previous + current
        previous = current
        current = next_value
    return current == n
```

- Update our is_fibonacci_number function.
- Initialize our Fibonacci sequence starting with 0 and 1 as the previous and current numbers respectively.
- Iterate while the current number is less than the given number n.
  - Add the previous and current numbers to get the next_value number.
  - Update the previous number to the current number.
  - Update the current number to the next_value number.
- Once current is greater than or equal to the given number n, we will exit the loop.
- Check to see if the current number is equal to the given number n and return that result.
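That single remaining loop is also why the fix is so dramatic: Fibonacci numbers grow exponentially, so the loop body only runs a handful of times even for very large n. A rough illustration, assuming the fixed function above:

```python
# Rough illustration: count how many times the fixed loop body runs
# for n = 1_000_000. Because Fibonacci numbers grow exponentially,
# this is about 30 iterations, versus ~1,000,000 outer iterations
# with the buggy version.
previous, current, steps = 0, 1, 0
while current < 1_000_000:
    previous, current = current, previous + current
    steps += 1
print(steps)
```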
Now let’s rerun those benchmarks and see how we did:
```
$ pytest game.py
======================================================= test session starts ========================================================
platform darwin -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /usr/bencher/examples/python/pytest_benchmark
plugins: benchmark-4.0.0
collected 3 items

game.py ...                                                                                                                   [100%]

------------------------------------------------------ benchmark: 3 tests ------------------------------------------------------
Name (time in ns)                  Min                      Max                   Mean                StdDev                Median                 IQR            Outliers  OPS (Kops/s)        Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_game_100               309.8685 (1.0)       40,197.8614 (2.38)       322.0815 (1.0)      101.7570 (1.0)       320.2877 (1.0)      5.1805 (1.0)      321;12616  3,104.8046 (1.0)    195120          16
test_game_1_000_000         724.9881 (2.34)      16,912.4920 (1.0)        753.1445 (2.34)     121.0458 (1.19)      741.7053 (2.32)    12.4797 (2.41)     656;13698  1,327.7664 (0.43)   123073          10
test_game                26,958.9946 (87.00)    129,667.1107 (7.67)    27,448.7719 (85.22)  1,555.0003 (15.28)  27,291.9424 (85.21)  165.7754 (32.00)     479;2372     36.4315 (0.01)    25918           1
-----------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================== 3 passed in 3.99s =========================================================
```

Oh, wow! Our test_game benchmark is back down to around where it was for the original FizzBuzz.
I wish I could remember exactly what that score was. It’s been three weeks though.
My terminal history doesn’t go back that far.
And pytest-benchmark only stores its results when we ask it to.
But I think it’s close!
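For next time: pytest-benchmark can persist results when asked. Per the pytest-benchmark docs, passing --benchmark-autosave saves each run locally (into a .benchmarks directory), and the pytest-benchmark compare subcommand can then diff saved runs; a sketch of that workflow:

```
$ pytest game.py --benchmark-autosave
$ pytest-benchmark compare
```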
The test_game_100 benchmark is down nearly 50x to 322.0815 ns.
And the test_game_1_000_000 benchmark is down more than 500,000x! 571,684,633.4 ns to 753.1445 ns!
🐰 Hey, at least we caught this performance bug before it made it to production… oh, right. Nevermind…
Catch Performance Regressions in CI
The execs weren’t happy about the deluge of negative reviews our game received due to my little performance bug. They told me not to let it happen again, and when I asked how, they just told me not to do it again. How am I supposed to manage that‽
Luckily, I’ve found this awesome open source tool called Bencher. There’s a super generous free tier, so I can just use Bencher Cloud for my personal projects. And at work where everything needs to be in our private cloud, I’ve started using Bencher Self-Hosted.
Bencher has built-in adapters, so it’s easy to integrate into CI. After following the Quick Start guide, I’m able to run my benchmarks and track them with Bencher.
```
$ bencher run --adapter python_pytest --file results.json "pytest --benchmark-json results.json game.py"
======================================================= test session starts ========================================================
platform darwin -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /usr/bencher/examples/python/pytest_benchmark
plugins: benchmark-4.0.0
collected 3 items

game.py ...

...

View results:
- test_game (Latency): https://bencher.dev/console/projects/game/perf?measures=52507e04-ffd9-4021-b141-7d4b9f1e9194&branches=3a27b3ce-225c-4076-af7c-75adbc34ef9a&testbeds=bc05ed88-74c1-430d-b96a-5394fdd18bb0&benchmarks=077449e5-5b45-4c00-bdfb-3a277413180d&start_time=1697224006000&end_time=1699816009000&upper_boundary=true
- test_game_100 (Latency): https://bencher.dev/console/projects/game/perf?measures=52507e04-ffd9-4021-b141-7d4b9f1e9194&branches=3a27b3ce-225c-4076-af7c-75adbc34ef9a&testbeds=bc05ed88-74c1-430d-b96a-5394fdd18bb0&benchmarks=96508869-4fa2-44ac-8e60-b635b83a17b7&start_time=1697224006000&end_time=1699816009000&upper_boundary=true
- test_game_1_000_000 (Latency): https://bencher.dev/console/projects/game/perf?measures=52507e04-ffd9-4021-b141-7d4b9f1e9194&branches=3a27b3ce-225c-4076-af7c-75adbc34ef9a&testbeds=bc05ed88-74c1-430d-b96a-5394fdd18bb0&benchmarks=ff014217-4570-42ea-8813-6ed0284500a4&start_time=1697224006000&end_time=1699816009000&upper_boundary=true
```

Using this nifty time travel device that a nice rabbit gave me, I was able to go back in time and replay what would have happened if we were using Bencher all along. You can see where we first pushed the buggy FizzBuzzFibonacci implementation. I immediately got failures in CI as a comment on my pull request. That same day, I fixed the performance bug, getting rid of that needless, extra loop. No fires. Just happy users.
Bencher: Continuous Benchmarking
Bencher is a suite of continuous benchmarking tools. Have you ever had a performance regression impact your users? Bencher could have prevented that from happening. Bencher allows you to detect and prevent performance regressions before they make it to production.
- Run: Run your benchmarks locally or in CI using your favorite benchmarking tools. The bencher CLI simply wraps your existing benchmark harness and stores its results.
- Track: Track the results of your benchmarks over time. Monitor, query, and graph the results using the Bencher web console based on the source branch, testbed, benchmark, and measure.
- Catch: Catch performance regressions in CI. Bencher uses state of the art, customizable analytics to detect performance regressions before they make it to production.
For the same reasons that unit tests are run in CI to prevent feature regressions, benchmarks should be run in CI with Bencher to prevent performance regressions. Performance bugs are bugs!
Start catching performance regressions in CI — try Bencher Cloud for free.