AI in the Evidence Pipeline: The Compounding Problem

A single AI-generated error does not stay contained. It spreads through continuous monitoring, the authorization decision, and the assessment meant to catch it.

Not long ago, a venture-backed compliance-automation company was caught delivering security reports that turned out to be fabricated: hundreds of them, accepted as the official record of systems they did not accurately describe.

The story got told as an accuracy problem. The AI was wrong. The AI made things up.

That is the wrong lesson.

The accuracy was never the interesting part. Every automated system is wrong sometimes. The interesting part is what those reports did after they were generated: they were signed, delivered to auditors, and acted on. They became the record everything else trusted.

A wrong answer in an evidence pipeline does not stay one wrong answer.

That is because an evidence pipeline is not built to display a result. It is built to carry one. Three properties make it that way:

Chained. Every result is signed and written into an append-only log. Once it is in, it is permanent.
Continuous. The pipeline does not run once. Under FedRAMP 20x it re-runs at least every 72 hours, and each cycle is measured against the one before it.
Trusted downstream. The POA&M, the System Security Plan, and the Authorizing Official all consume what the pipeline produces. They do not re-check it. Trusting it is the design.

Those three properties are the entire value of a modern evidence pipeline. They are also, exactly, its transmission path.

Put a language model inside that pipeline and you have not added a smart assistant. You have added a source of errors to a system engineered to spread them.

This is the compounding problem.

How One Wrong Answer Compounds Across the Compliance Stack

Follow a single false result through the system. Not a dramatic one, just an ordinary false pass: one control reported as met that is not.

It Poisons Continuous Monitoring

Continuous monitoring is not valuable because it produces a snapshot. It is valuable because it produces a trend: the same checks, cycle after cycle, so that change becomes visible.

That is what drift detection is. This cycle’s state is measured against the last known baseline. If the baseline is correct, real configuration drift stands out immediately.

If the baseline is a false pass, the comparison breaks in both directions. Genuine drift away from a control that was never actually met reads as “stable.” Or ordinary variation reads as drift, and the team chases noise.

And the error is not corrected on the next run. It becomes the baseline the next run trusts. The monitoring keeps going, on schedule and looking healthy. It is now monitoring a fiction.
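A minimal Python sketch makes the baseline failure concrete. The function name `detect_drift` and the settings shown are hypothetical, chosen for illustration and not taken from any FedRAMP tooling. The only point is that a comparison can never report more than its baseline allows.

```python
def detect_drift(baseline: dict, current: dict) -> dict:
    """Return every setting whose current value differs from the baseline."""
    keys = baseline.keys() | current.keys()
    return {
        k: (baseline.get(k), current.get(k))
        for k in sorted(keys)
        if baseline.get(k) != current.get(k)
    }

# Ground truth (assumed for the example): MFA has never been enforced.
actual_state = {"mfa_required": False, "tls_min_version": "1.2"}

# A correct baseline records what the control requires, so the gap is visible.
required_baseline = {"mfa_required": True, "tls_min_version": "1.2"}

# A false pass records the broken state itself as known-good.
false_pass_baseline = dict(actual_state)

drift_vs_required = detect_drift(required_baseline, actual_state)
drift_vs_false_pass = detect_drift(false_pass_baseline, actual_state)

print(drift_vs_required)    # {'mfa_required': (True, False)}
print(drift_vs_false_pass)  # {} which reads as "stable"
```

Both runs use the same comparison logic against the same system. Only the baseline differs, and the second run reports a clean cycle for a control that was never met.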

It Corrupts the Authorization Decision

Continuous authorization means the authority to operate is no longer a binder approved on a Tuesday. It is re-derived, continuously, from current evidence.

That is the goal. It is also the exposure.

A false pass is now a false input to a live authorization decision. The Authorizing Official is weighing risk against a posture that does not exist, accepting a system as authorized on the strength of a control that is not there.

The authorization decision sits at the top of the stack. It consumes everything beneath it. Every compounding error, whether it started in continuous monitoring, in the POA&M, or in the SSP, eventually arrives here: at the one decision with the highest consequence and the least ability to see that its inputs were wrong.

It Hides Risk in the POA&M

The Plan of Action and Milestones is the official register of known risk. Its job is to make sure nothing falls through.

A false pass closes a POA&M item that should be open, or stops one from ever being opened.

The risk did not go away. It went invisible. It is now unremediated and unrecorded, sitting quietly in the known-good column where no one will look for it.

A false fail is the cheaper mistake. It generates phantom remediation work, wasted hours, but the risk stays visible. The false pass is the dangerous one, because the entire purpose of a compliance program is to surface risk, and this is the failure mode where it does the opposite.
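The asymmetry can be sketched in a few lines of Python. This assumes a deliberately simplified rule, that every failing control gets an open POA&M item, and the function name `open_poam_items` is hypothetical. The control identifiers are used only as labels.

```python
def open_poam_items(results: dict) -> list:
    """Simplified rule (an assumption): each failing control opens one item."""
    return sorted(control for control, passed in results.items() if not passed)

# What is actually true of the system in this example.
truth = {"ac-2": True, "ia-2": False, "sc-7": True}

false_pass = {**truth, "ia-2": True}    # unmet control reported as met
false_fail = {**truth, "sc-7": False}   # met control reported as unmet

# Risk that exists but never reaches the register.
hidden_risk = set(open_poam_items(truth)) - set(open_poam_items(false_pass))

# Items in the register that correspond to no real risk.
phantom_work = set(open_poam_items(false_fail)) - set(open_poam_items(truth))

print(hidden_risk)   # {'ia-2'}
print(phantom_work)  # {'sc-7'}
```

The false fail leaves an extra item someone will eventually investigate and close. The false pass leaves nothing to investigate.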

It Drifts the SSP From Reality

The System Security Plan declares how each control is implemented. Under FedRAMP 20x, the implementation status attached to those controls is meant to be machine-readable, expressed in OSCAL, and backed by current evidence, not a claim typed in once and left.

A false pass writes “implemented” against a control that is not.

The SSP is not a terminal document. It is the source of truth the next assessment reads first. Once a control’s implementation status is corrupted there, the error is no longer one bad data point downstream. It is upstream of everything that happens next.

And the Assessment Can’t Catch It

The 3PAO is supposed to be the backstop: the independent assessment that catches what the system got wrong.

Under 20x, the assessor’s job changed. They no longer hand-verify every result. They evaluate the validation process and trust the evidence that process produces. That shift is deliberate and correct. It is what makes continuous assessment possible at all.

But it means the backstop only works if the evidence is sound.

If the evidence is AI-generated and compounding, the security assessment does not catch the error. It inherits it. The independent check is now standing inside the blast radius, certifying it.

Why It Can’t Be Contained

A fair objection: deterministic systems are wrong too. A human writes a bad policy, the engine evaluates it faithfully, and the result is wrong.

True. But there is a decisive difference, and it is the whole argument.

A deterministic error is containable.

A bad policy produces the same wrong result every time it runs. That means you can reproduce it. You can diff it against expected behavior. You can find the policy that caused it and fix it. Because the logic is fixed, you can enumerate exactly which results, across which systems and which cycles, were affected. The error has edges.
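Here is a rough sketch of what "the error has edges" means in practice. It assumes each logged result records the policy and policy version that produced it. The `Result` record and `affected_results` helper are hypothetical names for illustration, not a description of any particular product's log format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    system: str
    cycle: int
    policy_id: str
    policy_version: str
    passed: bool

def affected_results(log, policy_id, bad_version):
    """Every logged result produced by the faulty policy version."""
    return [
        r for r in log
        if r.policy_id == policy_id and r.policy_version == bad_version
    ]

log = [
    Result("sys-a", 1, "ac-2", "v1", True),
    Result("sys-a", 2, "ac-2", "v2", True),   # v2 is the bad policy
    Result("sys-b", 2, "ac-2", "v2", True),
    Result("sys-b", 2, "sc-7", "v1", False),
    Result("sys-a", 3, "ac-2", "v3", True),   # v3 is the fix
]

blast_radius = affected_results(log, "ac-2", "v2")
print([(r.system, r.cycle) for r in blast_radius])
# [('sys-a', 2), ('sys-b', 2)]
```

Because the faulty logic is a fixed artifact with an identity, a single filter over the log returns the complete list of results to re-run.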

An AI error has no edges.

A language model can return a different output for the same input. That is not a defect to be tuned away; it is how the technology works. And it means an AI error cannot be reproduced, cannot be bounded, cannot be audited after the fact. You cannot answer the only question that matters during a cleanup: which results are wrong?

Scale turns that from a flaw into a crisis. An AI-driven scanner that is 99% accurate, running tens of thousands of evaluations a month on a 72-hour cadence, produces hundreds of wrong results every month. Not one of them is reproducible. Not one is traceable.
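The arithmetic is short enough to check directly. The figure of 30,000 evaluations per month is an assumption standing in for "tens of thousands", and the month is taken as 30 days.

```python
evaluations_per_month = 30_000  # assumed stand-in for "tens of thousands"
error_rate_percent = 1          # 99% accurate means 1% wrong

# Integer arithmetic avoids floating-point rounding noise.
expected_wrong = evaluations_per_month * error_rate_percent // 100

hours_per_month = 30 * 24               # assumed 30-day month
cycles_per_month = hours_per_month // 72
wrong_per_cycle = expected_wrong // cycles_per_month

print(expected_wrong)    # 300
print(cycles_per_month)  # 10
print(wrong_per_cycle)   # 30
```

Under those assumptions, roughly 30 wrong results enter the log on every 72-hour cycle, and each one becomes an input to the next.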

That is the compounding problem stated completely: many errors enter, each one spreads through the stack, and none of them can be found and pulled back out.

A Deterministic Pipeline Contains Errors

This is why ProofLayer’s evidence pipeline is deterministic end to end. There is no model anywhere on it.

A policy is written in ESP, a declarative and human-readable format, and compiled by a strict seven-pass compiler that rejects anything malformed before it can run. The engine evaluates that policy against observed state, produces a pass or fail, hashes the result, signs it, and appends it to the transparency log. Same policy, same state, same result, every run.
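The evaluate, hash, sign, append sequence can be sketched in plain Python. This is an illustration only, not ProofLayer's implementation and not ESP syntax: the policy is an ordinary dict, an HMAC stands in for a real digital signature, and the names `evaluate`, `append_result`, and `verify_chain` are hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # assumption: stand-in signing key
GENESIS = "0" * 64                            # prev_hash for the first entry

def evaluate(policy: dict, state: dict) -> bool:
    """Pure function: pass only if every required setting matches the state."""
    return all(state.get(k) == v for k, v in policy["require"].items())

def _canonical(body: dict) -> bytes:
    """Stable byte encoding so the same body always hashes the same way."""
    return json.dumps(body, sort_keys=True).encode()

def append_result(log: list, policy: dict, state: dict) -> dict:
    """Evaluate, hash, sign, and append one result, chained to the last entry."""
    body = {
        "policy_id": policy["id"],
        "passed": evaluate(policy, state),
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    data = _canonical(body)
    entry = dict(body)
    entry["entry_hash"] = hashlib.sha256(data).hexdigest()
    entry["signature"] = hmac.new(SIGNING_KEY, data, hashlib.sha256).hexdigest()
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash and signature and check each link to its parent."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("policy_id", "passed", "prev_hash")}
        data = _canonical(body)
        sig = hmac.new(SIGNING_KEY, data, hashlib.sha256).hexdigest()
        if entry["prev_hash"] != prev:
            return False
        if entry["entry_hash"] != hashlib.sha256(data).hexdigest():
            return False
        if not hmac.compare_digest(entry["signature"], sig):
            return False
        prev = entry["entry_hash"]
    return True

policy = {"id": "tls-min", "require": {"tls_min_version": "1.2"}}
state = {"tls_min_version": "1.2"}

# Same policy, same state, same result: two independent runs agree exactly.
log_a, log_b = [], []
same_every_run = append_result(log_a, policy, state) == append_result(log_b, policy, state)

append_result(log_a, policy, {"tls_min_version": "1.0"})
chain_ok = verify_chain(log_a)

log_a[0]["passed"] = False          # tamper with a recorded result
chain_ok_after_tamper = verify_chain(log_a)

print(same_every_run, chain_ok, chain_ok_after_tamper)  # True True False
```

Note what the chain does and does not establish. Verification shows that a record was not altered after it was written. It says nothing about whether the result was right when it was written, which is why the evaluation step feeding the log has to be reproducible.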

ProofLayer can still be wrong. Write a bad policy and you get a bad result. We are not claiming the absence of error.

We are claiming the absence of compounding. A ProofLayer error is deterministic: reproducible, locatable, fixable, with a blast radius you can enumerate down to the exact result. You find it, you correct the policy, you know precisely what to re-run. The error has edges. It stays contained.

ESP, the language and engine at the core of the pipeline, is open source. You do not have to take our word for any of this. You can read exactly how a result is produced.

The Bar for the Pipeline

FedRAMP 20x makes the evidence pipeline more continuous and more chained than it has ever been. Those are the exact conditions under which compounding gets worse, not better: faster cycles, a longer chain, more downstream consumers trusting the result without re-checking it.

In a system that is chained, continuous, and trusted downstream, the evidence layer is the one place an error has to stay contained. Everything else in the compliance stack is built on the assumption that it does.

That is the bar for anything you put in the pipeline. A language model does not clear it.

If you are evaluating compliance automation and trying to tell the difference between a result you can verify and one you can only trust, that is a conversation worth having. Reach the ScanSet engineering team at contact@scanset.io.