Patterns of flakey Python tests

Flakey tests fail intermittently, causing confusion and frustration for developers and delays in your deployment pipeline.

Flakey tests affect every large codebase; the Python codebases of Kraken Technologies are no exception.

This post details several patterns that cause flakey Python tests. Being aware of these common causes can help when investigating your own flakey tests.

Some advice on fixing flakey tests and general mitigation is also included.



Patterns

Here are the common causes of flakey tests we’ve encountered:

Anti-pattern 1: Tight coupling to current time

Some flakey tests only fail when run at a particular point in time, or at a particular time each day.

This can happen if the application code makes flawed assumptions about datetime arithmetic (e.g. assuming the date doesn’t change when a small delta is added to the current time, or when the current datetime is in a daylight saving time transition period).

In our experience, flawed assumptions about datetime arithmetic are the most common cause of flakey tests.
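To make the first flawed assumption concrete, here's a minimal sketch (the is_due_today function is hypothetical, not taken from our codebase):

import datetime

def is_due_today(due_at: datetime.datetime) -> bool:
    # Flawed assumption: "now plus a small delta" always falls on today's date.
    return due_at.date() == datetime.datetime.now().date()

def test_due_soon_counts_as_today():
    # Passes for most of the day, but fails when the test runs within
    # 30 minutes of midnight and the date rolls over.
    soon = datetime.datetime.now() + datetime.timedelta(minutes=30)
    assert is_due_today(soon)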

Example: ambiguous datetime form values

Consider this Django form:

from django import forms

class SomeForm(forms.Form):
    due_at = forms.DateTimeField(
        input_formats=["%Y-%m-%d %H:%M"]
    )

and related test:

from somewhere import forms
from django.utils import timezone

def test_valid_payload():
    due_at = timezone.now()
    form = forms.SomeForm(data={
        "due_at": due_at.strftime("%Y-%m-%d %H:%M")
    })
    assert form.is_valid()

This test will pass for most of the year but fail during the UK Daylight Saving Time transition in October, when the clocks go back and local time falls behind by an hour. For example, the value 2021-10-31 01:00:00 is ambiguous when the configured timezone is Europe/London.
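You can see the ambiguity directly with the standard library's zoneinfo module (Python 3.9+; an illustrative snippet, separate from the form code above):

from datetime import datetime
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")

# 01:00-01:59 on 2021-10-31 occurs twice in Europe/London: once before
# the clocks go back (BST, UTC+1) and once after (GMT, UTC+0). The fold
# attribute selects which occurrence is meant.
before = datetime(2021, 10, 31, 1, 0, tzinfo=london)
after = datetime(2021, 10, 31, 1, 0, fold=1, tzinfo=london)

print(before.utcoffset())  # 1:00:00
print(after.utcoffset())   # 0:00:00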

This isn’t an application bug per se. It’s reasonable for users to assume datetime values are in their local timezone but not sensible to extend the form widget to handle ambiguous datetimes that only occur for one hour per year in the middle of the night.

The appropriate fix for the test is not to use the system clock to generate the input data but to explicitly specify a fixed datetime:

from somewhere import forms
from django.utils import timezone
import datetime

def test_valid_payload():
    # Use a fixed point in time.
    due_at = timezone.make_aware(datetime.datetime(2020, 3, 4, 14, 30))
    form = forms.SomeForm(data={
        "due_at": due_at.strftime("%Y-%m-%d %H:%M")
    })
    assert form.is_valid()

There will be cases where the system clock call is in the application code rather than the test. In such cases, tests should control system clock calls via a library like time_machine.

import time_machine

@time_machine.travel("2020-03-04T14:30Z")
def test_some_use_case():
    ...

Anti-pattern 2: Calling the system clock at import time

If the system clock is called at import time, tests can fail when the test suite is started just before midnight (in the timezone your test suite uses). In such circumstances, the current date can change during the test run, exposing flawed assumptions about dates and datetimes and ultimately leading to flakey tests.

If you observe test flakiness at a particular time each day, this might be the cause; especially if the test fails due to something related to dates.

Example: factories

Watch out for this anti-pattern when declaring field values in test factories. Here’s an example using FactoryBoy:

import factory
from django.utils import timezone

class SomeFactory(factory.Factory):
    available_from = timezone.now()

Here the value of SomeFactory.available_from will be computed when the test module is imported (i.e. at collection time), but tests that use this factory may not run until several minutes later.

Prefer to use factory.LazyFunction to defer the system clock call until runtime:

import factory
from django.utils import timezone

class SomeFactory(factory.Factory):
    available_from = factory.LazyFunction(timezone.now)

Example: default values for function arguments

Similarly, avoid making system clock calls to provide default argument values:

from datetime import datetime
from django.utils import timezone

def get_active_things(active_at: datetime = timezone.now()):
    ...

In production code, the value of active_at here would correspond to the time the module is imported, which will commonly be when the Python process starts up. This is unlikely to be a relevant value for your application’s logic, and could lead to flakey tests.

Here we factor out the problem by either forcing clients to explicitly pass the argument value:

from datetime import datetime

def get_active_things(active_at: datetime):
    ...

or by using a sentinel value (like None) and adding a guard condition to compute the value if it hasn’t been passed in:

from datetime import datetime
from typing import Optional
from django.utils import timezone

def get_active_things(active_at: Optional[datetime] = None):
    if active_at is None:
        active_at = timezone.now()
    ...

Anti-pattern 3: Implicit ordering

Flakiness can occur in tests making equality assertions on lists where the order of the items isn’t explicitly specified.

For example, a test may fetch a list of results from a database and assert that the results match an expected list. But if the database query doesn’t include an explicit ORDER BY clause, it’s possible the order of the results can vary between test runs.

Example: Django QuerySets

Consider this test which doesn’t specify a sort order for the pizza.toppings.all() QuerySet:

# Factory functions
def _create_pizza(**kwargs):
    ...
def _create_topping(**kwargs):
    ...

def test_creates_toppings_correctly():
    # Create a pizza with some toppings.
    pizza = _create_pizza()
    for topping_name in ("ham", "pineapple"):
        _create_topping(
            pizza=pizza,
            topping_name=topping_name,
        )

    # Fetch all toppings associated with the pizza.
    toppings = pizza.toppings.all()

    assert toppings[0].topping_name == "ham"
    assert toppings[1].topping_name == "pineapple"

At some point, one of your colleagues will have their afternoon ruined when the first assertion finds toppings[0].topping_name is pineapple.

Fix by chaining an explicit order_by call to the QuerySet:

# Factory functions
def _create_pizza(**kwargs):
    ...
def _create_topping(**kwargs):
    ...

def test_creates_toppings_correctly():
    # Create a pizza with some toppings.
    pizza = _create_pizza()
    for topping_name in ("ham", "pineapple"):
        _create_topping(
            pizza=pizza,
            topping_name=topping_name,
        )

    # Fetch all toppings associated with the pizza. We now explicitly sort
    # the QuerySet to avoid future flakiness.
    toppings = pizza.toppings.all().order_by("topping_name")
    assert toppings[0].topping_name == "ham"
    assert toppings[1].topping_name == "pineapple"

Flakiness of this form will happen randomly and can be difficult to recreate locally.

Anti-pattern 4: Randomly generated inputs or fixtures

Tests that use randomly generated input or fixture data can fail intermittently when the generated value exposes a bug in the test or application code.

Of course, such “fuzz testing” can be useful for building robust code. However, intermittent failures of this form are only helpful when they fail for someone working on the code in question. When they fail in an unrelated pull request or in your deploy pipeline workflow, they generally cause frustration.

In such circumstances, the affected person or team is not motivated to fix the root problem as they are likely not familiar with the domain. Instead the path of least resistance is to rerun the test workflow in the hope that the failure doesn’t reappear.

This problem is more pertinent to large codebases, where different teams are responsible for separate domain areas.

Example: randomised names

Consider this test for search functionality that uses faker to randomly generate fixture data:

import factory
from myapp import models
from testclients import graphql_client

# Define a factory for generating user objects with randomly
# generated names.
class User(factory.django.DjangoModelFactory):
    first_name = factory.Faker("first_name")
    last_name = factory.Faker("last_name")

    class Meta:
        model = models.User


def test_graphql_query_finds_matching_users():
    # This is the search string we will use.
    search_query = "Kat"

    # Create two users who will match the search query...
    User(first_name="Kate", last_name="Smith")
    User(first_name="Barry", last_name="Katton")

    # ...and two users who won't.
    User(first_name="Catherine", last_name="Parr")
    User(first_name="Anne", last_name="Boleyn")

    # Make requests as an authenticated user (with a randomly
    # generated name...).
    graphql_client.as_logged_in_user(User())

    # Perform a GraphQL query to find matching users.
    query = """query Run($query: String!) {
        users(searchString: $query) {
            edges {
                node {
                    firstName
                    lastName
                }
            }
        }
    }"""
    response = graphql_client.post(query, variables={"query": search_query})

    # Check we get two results.
    assert len(response["data"]["users"]["edges"]) == 2

This is flakey as the randomly generated name for the requesting user can inadvertently match the search query and give three matching results instead of the expected two.

The fix here is to remove the randomness by explicitly specifying the name of the requesting user. So instead of:

graphql_client.as_logged_in_user(User())

use:

graphql_client.as_logged_in_user(
    User(first_name="Thomas", last_name="Cromwell")
)

As a general rule, you want the tests that run on pull requests and your deploy pipeline to be as deterministic as possible. Hence it’s best to avoid using randomly generated input or fixture data for these scenarios.
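If you do want to keep some randomised fixture data, one mitigation (a sketch, assuming Faker is used to generate the random values) is to pin the seed in a root-level conftest.py so that the generated data is at least reproducible between runs:

# conftest.py
from faker import Faker

# Seed Faker's shared random generator so the data it produces is the
# same on every run, making failures reproducible.
Faker.seed(1234)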

Anti-pattern 5: Test pollution

Some flakey tests pass when run individually but fail intermittently when run as part of a larger group. This can happen when tests are coupled in some way and the grouping or execution order changes, causing one test to “pollute” another and make it fail.

This is perhaps more prevalent when splitting up the test suite to run concurrently (using, say, pytest-xdist), as new tests may alter the way the test suite is divided.

Common sources of pollution include caches, environment variables, databases, the file system and stateful Python objects. Anything that isn’t explicitly restored to its original state after each test is a possible source of pollution.
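For example, a test that sets an environment variable directly on os.environ leaves it in place for every test that runs afterwards. Here's a small sketch (the variable name is made up) using pytest's built-in monkeypatch fixture, which automatically undoes the change when the test finishes:

import os

def test_feature_polluting():
    # Polluting: the variable stays set for all subsequent tests.
    os.environ["SOME_FEATURE_FLAG"] = "on"
    ...

def test_feature_self_cleaning(monkeypatch):
    # Self-cleaning: monkeypatch restores the original environment afterwards.
    monkeypatch.setenv("SOME_FEATURE_FLAG", "on")
    ...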

Moreover, beware that changing the order in which the test suite runs can expose flakiness of this form. It is advisable to keep the order of tests deterministic (i.e. don’t shuffle the order on each run).

Example: Django’s cache

Caches often couple tests together and cause this pattern of flakiness. For example, Django’s cache is not cleared after each test, which can lead to intermittent failures if tests assume they start with an empty cache.

This can be worked around with an auto-applied Pytest fixture:

from django.core.cache import cache
import pytest

@pytest.fixture(autouse=True)
def clear_django_cache():
    # Run the test...
    yield

    # ...then clear the cache.
    cache.clear()

Similarly, be careful with functools.lru_cache as it will need to be explicitly cleared between tests. A similar Pytest fixture can do this:

import pytest

# We have to explicitly import any relevant functions that are wrapped with the
# LRU cache decorator.
from somemodule import cached_function

@pytest.fixture(autouse=True)
def clear_lru_cache():
    # Execute the test...
    yield

    # ...then clear the cache.
    cached_function.cache_clear()

Alternatively, there’s the pytest-antilru Pytest plugin, which aims to do the same thing.

Fixing flakey tests

The above anti-patterns provide heuristics for your investigations. When examining a flakey test, ask yourself these questions:

Has the test started failing consistently since some point in time?

If so, look for a hard-coded date or datetime in the test or application code.

Could the failure be explained by bad assumptions around date arithmetic?

This might manifest itself in failure messages that refer to dates or objects associated with dates.

Does the test consistently fail at the same time each day? If so, examine closely the time when the failing test ran and any date logic in the code being executed.

Can you recreate the failure locally?

Try using time_machine to pin the system clock to the exact time when the flakey failure occurred. If this recreates it, rejoice! You can verify that your fix works, which isn’t always possible when working on flakey tests.

Could the failure be explained by randomly generated inputs or fixtures?

Examine the test set-up phase for anything non-deterministic and see if that can explain the failure.

Could the failure be explained by the order of things changing?

This can be harder to spot but look carefully at the error message to see if it’s related to the order of some iterable.

Does the test or application code share a resource with other tests?

Check if the application code uses a cache (especially functools.lru_cache), stateful objects or temporary files, and ensure these resources are explicitly restored or removed after each test.
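For temporary files, pytest's built-in tmp_path fixture gives each test its own empty directory, which avoids tests sharing files on disk (a sketch; write_report is a hypothetical function under test):

from myapp.reports import write_report  # Hypothetical function under test.

def test_writes_report(tmp_path):
    # tmp_path is a unique directory created just for this test, so no
    # other test can see or pollute the files written here.
    output_file = tmp_path / "report.csv"
    write_report(output_file)
    assert output_file.exists()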

Summary

Knowing the common causes of flakey tests is a huge advantage in mitigating and fixing them.

But they can still be elusive.

We recommend having a policy of immediately skipping flakey tests when they occur and starting an investigation so they can be rapidly fixed and restored to the test suite. This will avoid blocking your deploy pipeline, causing delays and frustration.

This can be done using Pytest’s pytest.mark.skip decorator:

import pytest

@pytest.mark.skip(
    "Test skipped due to flakey failures in primary branch - see "
    "https://some-ci-vendor/jobs/123 "
    "https://some-ci-vendor/jobs/456"
)
def test_something():
    ...

Include links to failing test runs to help recreate and fix the flakey test.

Finally, a theme underlying many flakey tests is a reliance on a non-deterministic factor like system clock calls or randomly generated data. Consequently, to minimise test flakiness, strive to make tests as deterministic as possible.


Thanks to David Seddon and Frederike Jaeger for improving early versions of this post.

Posted by David Winterbottom, Head of Software Engineering, on May 23, 2022