Formal MethodsJune 16, 20269 min read

Trust but Verify: Using a Neurosymbolic Model in Code Optimization

We let an LLM optimize a well-known code library, then made it prove every change. Here is what survived, what got caught, and where high confidence stops being a proof.

By David Kovar

Ask a modern LLM to optimize a function and it hands you something faster, cleaner, and plausible. The plausibility is the problem. Every suggestion arrives with the same confident tone whether it is correct, subtly wrong on negative inputs, or wrong only when a divisor is zero. So every suggestion costs either a careful human review or a lurking failure, and the review is often harder than writing the optimization yourself, because now you are auditing someone else's reasoning about edge cases instead of your own.

This is the quiet tax on LLM-assisted coding: generation is cheap, trust is expensive. Tests help, but a test only checks the inputs someone thought to write down, and the inputs nobody wrote down are exactly where optimizations go wrong.

There is an older answer to this class of problem: instead of reviewing the change, prove something about it. Formal equivalence checking treats two versions of a function as mathematical objects and asks whether there exists any input on which they disagree. Where a test samples a handful of points, an equivalence check reasons about all of them at once, and when it finds a disagreement it hands back the specific input that triggers it, called a counterexample. (Why formal methods matter: the calendar problem walks through what that buys you.)

So we ran an experiment: wire an LLM and an equivalence checker into one automated loop, point it at real code, and see what survives. The LLM proposes; the checker disposes. That pairing is the neuro-symbolic pattern, neuro for the model that generates candidates, symbolic for the formal methods that check them. This is what happened, including the parts that did not go our way.

The experiment

The target was the maths/ directory of TheAlgorithms/Python, a widely read open-source collection of algorithm implementations. It is a good testbed: the code is real and heavily doctested, and it is the kind of deliberately clear, pedagogical code that often leaves performance on the table.

The pipeline is a single command. Given a target directory and a goal of "optimize," it runs five steps:

Audit coverage. FMToolkit (FMT), our formal equivalence checker, classifies every function by how completely it can model the function's behavior. This produces the candidate list and, just as important, the not-candidate list. (What is FMToolkit?)
Propose. For each candidate, an LLM (Claude) reads the source and proposes an optimized version.
Verify. FMT compares the original against the proposal and returns a per-function verdict: equivalent, not equivalent (with a concrete counterexample), or inconclusive, along with a confidence score. A rejected proposal is fed back with the verdict, for up to three attempts.
Commit or revert. Proposals proven equivalent at or above the confidence threshold for their tier are committed, with the verdict, tier, and confidence recorded in the commit message. Everything else is reverted. No human is in the loop.
Benchmark. Every accepted function is benchmarked against its original on typical and stress inputs.

The maths/ tree contains roughly 400 function definitions across 178 files. The coverage audit placed 214 of those in scope, and a single unattended run committed 28 optimized functions, each with a proof or a high-confidence verdict attached.

Two tiers of "provable"

Not every Python function can be translated completely into a mathematical model. Calls into code the checker cannot see, dynamic typing, and missing type information all limit what it can claim. Rather than treat provability as all or nothing, the audit assigns each function a tier:

Formally analyzable (FA): the function's full behavior can be modeled. A verdict of "equivalent" here is backed by a complete proof, and the pipeline requires confidence 1.0 to commit.
High confidence (HC): the function can be modeled except for small, well-understood gaps. Verdicts carry a confidence score, and the pipeline requires a score above 0.9 to commit. An HC commit is not a complete proof; it is a high-confidence equivalence verdict, and in this run every accepted HC change also passed every doctest in its file.

The rest of the corpus, just under half, fell outside both tiers:

Achievable with annotations: functions that would become analyzable once type annotations are added or tightened.
Achievable with modification: functions that need small structural changes before the checker can model them.
Out of scope today: constructs the checker does not model yet.

The number to watch is the in-scope fraction, because every gain in coverage multiplies everything downstream.

What the run produced

Of the 214 candidate functions across both tiers:

Outcome	Count
Proven equivalent, committed	28
Proven not equivalent, reverted (counterexample produced)	7
Equivalent verdict below tier threshold, reverted	11
Inconclusive, reverted	30
Skipped (proposal not verifiable)	53
Proposal removed the function, reverted	8
Tooling errors / verification timeouts	14
Unchanged (LLM declined to modify)	1
No proposal attempted	62

The 28 commits split evenly across the tiers: 14 FA, every one at confidence 1.0, and 14 HC at 0.95 or above. Each landed as its own commit with the verdict embedded:

3d95df53 FMT-approved: floor (confidence=1.0000, tier=formally_analyzable)
e2a7aab2 FMT-approved: binary_mod_multiply (confidence=0.9500, tier=high_confidence)

Two of the rejection categories are the system refusing to extend trust. The 53 skips are mostly cases where the proposed rewrite used a construct outside FMT's analyzable subset; FMT declined to rule, so the pipeline discarded the proposal unexecuted. A proposal that cannot be verified is treated exactly like one that failed verification. The 11 below-threshold rejections are equivalence verdicts whose confidence fell short of the tier requirement.

The benchmarks for a sample of the accepted changes, typical and stress inputs (these are highlights from the full set of standing changes):

Function	Tier	Typical	Stress
binary_mod_multiply	HC	39.5x	178.4x
additive_persistence	HC	4.4x	n/a
square_root_iterative	HC	4.3x	4.1x
multiplicative_persistence	HC	3.2x	2.9x
jacobi_symbol	HC	2.8x	2.2x
area_square	FA	2.5x	2.5x
ceil	FA	2.3x	2.5x
floor	FA	2.0x	2.3x

Honest readings: not every accepted change is a pure win. Elsewhere in the full results, one function got slower on stress inputs and another's stress speedup was a wash. Equivalence checking proves behavior is preserved; it says nothing about performance, which is why the pipeline benchmarks after committing rather than trusting the LLM's predicted speedup.

What the winning optimizations look like

The committed changes are not exotic. They are the changes an experienced Python developer would make, which is the point: the LLM already knows these idioms, and the proof is what lets you accept them without a careful line-by-line review.

The 39.5x winner replaced a bit-twiddling loop with the thing Python already does well.

Before:

def binary_mod_multiply(a: int, b: int, modulus: int) -> int:
    res = 0
    while b > 0:
        if b & 1:
            res = ((res % modulus) + (a % modulus)) % modulus
        a += a
        b >>= 1
    return res

After:

def binary_mod_multiply(a: int, b: int, modulus: int) -> int:
    if b <= 0:
        return 0
    return (a * b) % modulus

Binary multiplication earns its keep in fixed-width languages, where multiplying first and reducing afterward can overflow a machine integer. Python integers are arbitrary precision, so the loop is pure overhead, and on large inputs the gap widens to 178x. The detail worth noticing is the guard the LLM added: the original loop returns 0 whenever b <= 0, so a faithful replacement must preserve that, even though it is almost certainly not behavior anyone designed. A version without the guard would not have passed verification.

The ones that got caught

Seven proposals were proven not equivalent, each with a concrete counterexample. The most instructive is vol_hemisphere, where the LLM's own logged reasoning shows it trying hard to be safe:

"But multiplication order matters for IEEE 754. The original does pow(r,3) * pi * 2 / 3... To be safe and match exactly, I'll keep the same arithmetic but replace pow() with multiplication."

The LLM understood the hazard, reasoned about it explicitly, and concluded its rewrite was safe. FMT disagreed and returned the counterexample radius = 1000000.0. Floating-point arithmetic is not associative: regroup algebraically identical operations and the last bits of the result can change. No test suite written by hand checks for that, and an informal argument about "keeping the same arithmetic" is exactly the kind of reasoning that feels rigorous and is not. Five of the seven counterexamples were floating-point divergences on geometry functions; the other two were integer edge cases at zero modulus. In every case the divergent input is an edge nobody tests, and the file's doctests pass on every one of these rejected proposals.

Why the neuro-symbolic split earns its keep

The trust boundary is explicit and graded. Nothing reaches the git history on the strength of the LLM's self-assessment, the benchmark, or the test suite. It gets there with a machine-generated verdict, and the verdict's strength travels with it. A passing test suite says "we found no disagreement where we looked." An FA proof says "there is no disagreement to find." An HC verdict says "no disagreement within what we modeled, and here is a number for how much we modeled." The reader chooses how much trust each sentence earns.

The two halves fail differently, and that is the point. The neural half is creative and unreliable; the symbolic half is rigid and sound. The LLM does not need to be right, it needs to be right often enough to be worth checking, and 28 accepted proposals from a single unattended pass clears that bar. For everyday maintenance of existing code, nobody has a written specification, and the trick is that nobody needs one: the original implementation is the specification, and equivalence against it is checked over all inputs rather than the ones a test happens to run. Provenance comes for free: every commit carries its verdict, tier, and confidence, so if a change ever needs to be revisited, finding and reverting it is trivial.

Limitations, honestly stated

The funnel is wide at the top and narrow at the bottom, and it would be misleading to pretend otherwise.

Inconclusive is the most common outcome among completed verifications. Thirty ended inconclusive, all discarded. Some were probably correct; the pipeline does not care, which is the right default and an honest cost.
Coverage gates everything. Only functions the checker can model are candidates, just over half of this corpus. Richer type annotations widen the pool, which inverts a familiar incentive: annotations stop being documentation and start being optimization surface.
The HC tier admits gaps. A high-confidence verdict is not a proof. Closing those gaps takes other approaches, including human review, so HC results are best treated as a strong signal to confirm rather than a final word.
Equivalence warrants behavior, not speed. At least one accepted change regressed on stress inputs. The claim worth defending is not "this made the repository faster" but "every standing change preserved behavior at near-zero review cost, and the measured wins came along free."

Where this goes

The optimize goal is one instance of a more general loop: the LLM proposes, the formal method disposes. The same architecture applies to refactoring, dead-code removal, dependency upgrades, or any transformation whose contract is "behavior must not change." The neural side supplies intent and volume; the symbolic side supplies the warranty, graded by how much of the code it could model.

Stated plainly: an unattended process read 214 functions from a well-known public repository, committed 28 optimizations with machine-checked verdicts attached, rejected everything it could not verify, and produced counterexamples for seven plausible-looking proposals that were subtly wrong. Every commit carries the verdict, tier, and confidence that justified it. Fourteen of the changes are formally proven. No human reviewed anything until the evidence was already in the log.

Want to go deeper?

Learn how URSA Secure brings formal verification to your most critical software.

Get in touch