Formal MethodsJune 16, 20266 min read

Why Formal Methods Matter: The Calendar Problem

A leap-year bug that hides at the century boundary: why testing samples the haystack and formal verification finds the needle.

By David Kovar

How do you know two implementations of the same function are equivalent?

Not probably equivalent. Not equivalent on the inputs you tested. Actually equivalent - on every possible input, including the ones nobody thought to test.

Most of the time, we don't ask this question carefully. We write tests, do code review, and ship. For many functions, that works well enough. But some functions are wrong in ways that only show up at the boundaries, and those are exactly the cases where test suites are thinnest.

A concrete example

Consider a program that works with dates. At its core are two functions:

def day_number(m: int, d: int, y: int) -> int:
    """Returns a number uniquely identifying this date."""
    if m <= 2:
        m += 12
        y -= 1
    return (365 * y + y // 4 - y // 100 + y // 400
            + (153 * m + 8) // 5 + d)

def days_between_dates(m1: int, d1: int, y1: int,
                       m2: int, d2: int, y2: int) -> int:
    """Returns the number of calendar days between two dates."""
    return day_number(m2, d2, y2) - day_number(m1, d1, y1)

With some helpers:

def is_leap_year(y: int) -> bool:
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

def days_in_month(m: int, y: int) -> int:
    days = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if m == 2 and is_leap_year(y):
        return 29
    return days[m]

def next_date(m: int, d: int, y: int) -> tuple[int, int, int]:
    if d < days_in_month(m, y):
        return m, d + 1, y
    elif m < 12:
        return m + 1, 1, y
    else:
        return 1, 1, y + 1

How do you know this is correct?

You could write tests. But how many? There are roughly 3 million valid dates in a reasonable range. You can't test them all. And the tricky cases - century years, leap year boundaries, year-end rollovers - are exactly the ones most likely to have bugs, and the ones least likely to be in your test suite unless you specifically thought to include them.

The bug

Here's a version of is_leap_year with a common mistake:

def is_leap_year(y: int) -> bool:
    return y % 4 == 0  # Bug: missing century year check

This gets the right answer for the vast majority of years. A year divisible by 4 is a leap year - that's the rule most people carry in memory. The rest of the rule is: except years divisible by 100, which are not leap years, unless also divisible by 400, in which case they are. Year 2000: leap year. Year 1900: not. Year 2100: not.

The buggy version treats century years as leap years. On a test suite that doesn't include century years, it passes. On one that does, it fails with a specific counterexample: y = 100.

The failure cascades from there:

is_leap_year(100) returns True - incorrect, year 100 is not a leap year
days_in_month(2, 100) returns 29 instead of 28
next_date(2, 28, 100) returns (2, 29, 100) instead of (3, 1, 100)
days_between_dates starts returning wrong values for any date range that spans February of a century year

What's notable is that the bug is not in day_number. That closed-form formula correctly handles century years - the - y // 100 + y // 400 terms are right there. The bug is in is_leap_year, which day_number doesn't call directly. It propagates through days_in_month to next_date, and from there into any date comparison that involves those dates.

Two functions in the same codebase, both dealing with leap years, encoded the same rule differently. One correctly. One not.

The equivalence problem

This gets at what makes equivalence hard.

If you're porting this code to a new language, or refactoring day_number for performance, or an AI assistant rewrites something that looks plausible, the question is the same: did you preserve the behavior?

The test-based approach is to run both versions on a set of inputs and check that they agree. The problem is you're looking for the needle - the one input where they disagree - by sampling the haystack. If the needle is at a century-year boundary nobody thought to include, you don't find it.

Formal verification approaches this differently. Instead of searching for a counterexample, you ask a solver to prove one doesn't exist. The solver either returns a proof - the implementations are equivalent on every input - or returns the counterexample itself. For the century-year case, it finds y = 100 without being told to look there.

Formal methods are defined as "mathematically rigorous techniques for the specification, development, analysis, and verification of software and hardware systems." The rigor is what makes this possible: the solver isn't testing; it's reasoning over the entire input space at once.

The boundary insight

The calendar example illustrates something that generalizes:

The hardest bugs aren't in the algorithm - they're at the boundaries.

The day_number formula could be any of a dozen equivalent closed-form expressions. They're all correct. The bugs are in is_leap_year, in year-end rollovers, in month boundaries. These are the places where the domain has rules that are easy to get slightly wrong, or to encode correctly in one place and slightly differently in another.

Testing finds these bugs when you happen to test the right inputs. Formal verification finds them by reasoning over all inputs simultaneously. If a bug exists, the solver returns the specific input that exposes it - not because it guessed, but because it derived it.

What this means for software development

This isn't a problem unique to calendar arithmetic. Any function with domain-specific boundary conditions has this structure: the cases most likely to expose a bug are the ones least likely to appear in your test suite, because you have to know the edge cases in advance to test them.

Date arithmetic. Floating-point rounding at precision limits. Access control checks at role boundaries. Input validation for inputs that are almost-valid. In each case, the bugs that survive code review and testing tend to live at exactly the same kind of boundary the century-year rule represents.

The frequency of the equivalence question has also increased. Software gets refactored, ported, and - increasingly - rewritten by AI. Each time, you need to know whether the behavior was preserved. Test coverage helps, but it's a bet that your samples include the cases that matter.

Formal methods have historically been confined to academic research and safety-critical industries, because applying them in practice required significant expertise. That's the problem we're working on at URSA.

Want to go deeper?

Learn how URSA Secure brings formal verification to your most critical software.

Get in touch