You Can Measure "Judgment Work" — Here's How a Bank Would

A copywriter opened her AI’s draft on a Monday and felt it was off. Not wrong — off. By Friday she felt it again, and the week after that, and could never say by how much. For 35 years I watched banks face the same nameless unease about risk, and answer it the same way every time: not by trusting the feeling, but by scoring it. A standard, a number, a trend. The feeling she couldn’t name was a thing she could have measured.

How do I measure the quality of my AI’s output?

You measure it by scoring the output against a standard you define — what good means for your business — on a regular cadence. You do not need a perfect metric. You need a useful one: a weekly number, scored the same way each time, that lets you see a trend. The moment AI output is measured, drift cannot hide. This is the same discipline a bank uses to measure risk, applied to the AI you run your business on.

“But my work is judgment work. You can’t measure that.”

I have heard this a hundred times, and I understand exactly why people believe it. The only way you have ever experienced AI is without a standard, without a review, without a number. Just vibes. So measuring it sounds impossible, or worse, like pretending.

Here is the thing. I spent 35 years measuring things people swore could not be cleanly measured — risk, exposure, the probability that a number was wrong — at two of North America’s largest banks. Judgment-heavy, high-stakes, “you just have to feel it” work. It can be measured. Not perfectly. Usefully. Enough to see a trend. And the moment you can see a trend, something that was invisible becomes obvious. (The Real Enemy Is Drift)

The method: a standard, a score, a cadence

To measure AI output quality, define a written standard for what good looks like, score each output against it on a 1–5 scale, and track the score on a weekly cadence. The absolute number matters less than the trend — a measured trend is what makes drift visible and improvement provable.

Three moving parts, and none of them is hard.

A standard. Two sentences describing what good output looks like for one task. “A good client email is warm, under 120 words, no bullet lists, ends with one clear next step.” That is a standard. You just made it measurable. (AI That Learns From Your Corrections)
A score. Rate each output against the standard, 1 to 5. Be consistent, not precise. The same rater, the same standard, every week.
A cadence. Weekly. One task, one recorded score, every week; if you rated several outputs that week, record their average. Now you have a line on a graph instead of a feeling.
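
For readers who prefer to see the three parts written down, here is a minimal Python sketch. It is an illustration under stated assumptions, not a prescribed tool: the names Standard, Score, and weekly_averages are hypothetical, the choice to average a week's ratings is one option among several, and the sample scores are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Standard:
    """A written description of what good output looks like for one task."""
    task: str
    description: str


@dataclass(frozen=True)
class Score:
    """One 1-5 rating of one output against the standard."""
    scored_on: date
    value: int

    def __post_init__(self):
        if not 1 <= self.value <= 5:
            raise ValueError("score must be between 1 and 5")


def weekly_averages(scores):
    """Group scores by ISO week and return {(year, week): mean score}."""
    by_week = defaultdict(list)
    for s in scores:
        iso = s.scored_on.isocalendar()
        by_week[(iso[0], iso[1])].append(s.value)
    return {week: round(mean(values), 2) for week, values in sorted(by_week.items())}


standard = Standard(
    task="client email",
    description="Warm, under 120 words, no bullet lists, ends with one clear next step.",
)

# Illustrative scores only (not real data): two outputs rated in each of two weeks.
scores = [
    Score(date(2024, 1, 2), 3),
    Score(date(2024, 1, 4), 4),
    Score(date(2024, 1, 9), 4),
    Score(date(2024, 1, 11), 5),
]

for week, avg in weekly_averages(scores).items():
    print(week, avg)
```

With these made-up ratings the two weeks average 3.5 and 4.5. The numbers themselves mean little; the point is that each week yields exactly one comparable figure.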

Why “not perfectly, usefully” is the whole point

People reject measurement because they imagine it has to be exact. A bank does not measure risk to four decimal places of truth — it measures it usefully, consistently, enough to act on the trend. Your AI quality score works the same way. A rough-but-consistent 1–5 you actually track beats a perfect metric you never build. Consistency, not precision, is what reveals the trend. (The Month-Six Test)

What measurement unlocks

Once you can see the trend, three things become possible that were impossible before:

You catch drift the week it starts — not three months later when you finally notice the AI quietly stopped being useful.
You can prove improvement — “the score went from 3.1 to 4.2 over six weeks” is evidence, not a vibe. That is the difference between a pile and a system you can trust. (Stateful vs. Stateless AI)
You can improve at all — because if you cannot measure it, you cannot improve it. Measurement is the first move; everything else in an engineering-grade system depends on it.
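
Both "catching drift" and "proving improvement" reduce to simple arithmetic on the weekly numbers. The sketch below is one hedged way to do it: trend_slope fits a straight line through the scores, and drift_alert compares the latest week with the weeks before it. The function names, the four-week window, and the 0.5-point drop threshold are all assumptions chosen for illustration, and the score series are invented (the first simply spans the 3.1 to 4.2 range mentioned above).

```python
from statistics import mean


def trend_slope(weekly_scores):
    """Least-squares slope of score against week index (points per week)."""
    n = len(weekly_scores)
    if n < 2:
        raise ValueError("need at least two weekly scores to see a trend")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(weekly_scores)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, weekly_scores))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def drift_alert(weekly_scores, window=4, drop=0.5):
    """True if the latest score sits `drop` or more below the mean of the
    preceding `window` weeks. Both defaults are arbitrary assumptions."""
    if len(weekly_scores) < 2:
        return False
    previous = weekly_scores[-(window + 1):-1]
    return mean(previous) - weekly_scores[-1] >= drop


# Illustrative series only (not real data).
improving = [3.1, 3.3, 3.6, 3.8, 4.0, 4.2]
drifting = [4.2, 4.1, 4.2, 4.0, 3.4]

print(f"improving slope: {trend_slope(improving):+.2f} points/week")
print("drift in improving series:", drift_alert(improving))
print("drift in drifting series:", drift_alert(drifting))
```

A positive slope is the evidence of improvement; a triggered alert is the week drift starts. Anyone adapting this would want to tune the window and threshold to how noisy their own scoring is.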

Try this now (5 minutes)

Pick the single task you hand to AI most often.
Write two sentences: what does good output look like for it?
Score this week’s output against those two sentences, 1 to 5.
Put the number and the date somewhere you will see it next week.
Next week, score again. You now have two dated points: the start of a trend, and the first measurement of AI quality you have ever taken.
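
A notebook or a spreadsheet is enough for these steps. If a small script suits you better, the sketch below keeps the log in a CSV file. The function names, the column layout, and the file name ai_quality_log.csv are hypothetical, and the demo writes to a temporary folder with invented scores so that running it leaves nothing behind.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def record_score(log_path, task, score, scored_on=None):
    """Append one dated 1-5 score to a CSV log, creating the file if needed."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    scored_on = scored_on or date.today()
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "task", "score"])  # header row, written once
        writer.writerow([scored_on.isoformat(), task, score])


def read_scores(log_path):
    """Return the log as a list of (date, task, score) tuples, oldest first."""
    with Path(log_path).open(newline="") as f:
        return [
            (date.fromisoformat(row["date"]), row["task"], int(row["score"]))
            for row in csv.DictReader(f)
        ]


# Demo with invented scores, written to a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "ai_quality_log.csv"
    record_score(log, "client email", 3, date(2024, 1, 5))
    record_score(log, "client email", 4, date(2024, 1, 12))
    for row in read_scores(log):
        print(row)
```

In real use you would point record_score at a file you actually open each week and leave scored_on unset so it defaults to today.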

Stop — this counts. That two-sentence standard is the first brick of an engineering-grade system, and you built it in five minutes, for free.

Frequently asked questions

Isn’t a 1–5 score too crude to be meaningful? Crude and consistent beats precise and absent. The trend is the signal, and a consistent 1–5 reveals the trend perfectly well. Precision can come later; the trend is available today.

Who does the scoring — me or the AI? Start with you, because you own the standard. A mature engineering-grade system can help score against your standard automatically, but the standard is always yours to set, edit, and own. (AI Assistant vs. AI Operating System)
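
As a rough picture of what "help score automatically" could mean, the sketch below checks only the mechanical parts of the client-email standard quoted earlier: the word limit and the absence of bullet lists. It is an assumption about one possible first step, not a description of any particular product. The name mechanical_checks and the bullet-detection pattern are hypothetical, the sample draft is invented, and the judgment parts of the standard (warmth, a clear next step) are deliberately left to the human rater.

```python
import re

# Lines that start with "-", "*", a bullet character, or "1." / "1)" count as list items.
BULLET_LINE = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+", re.MULTILINE)


def mechanical_checks(email_text, max_words=120):
    """Check only the parts of the standard a script can verify reliably."""
    word_count = len(email_text.split())
    return {
        "under_word_limit": word_count < max_words,
        "no_bullet_lists": BULLET_LINE.search(email_text) is None,
    }


# Invented draft for illustration.
draft = (
    "Hi Sam,\n\n"
    "Thanks for the call today. Here is what we agreed:\n"
    "- new timeline\n"
    "- revised budget\n\n"
    "Can you confirm by Friday?"
)

print(mechanical_checks(draft))
```

This draft passes the word limit but fails the bullet-list check. A result like that can inform your 1 to 5 score; it does not replace it.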

How is this different from just “reviewing” the output like I already do? Reviewing is one-off and unrecorded — it vanishes. Measuring is recorded and tracked, so it accumulates into a trend you can act on. The difference between reviewing and measuring is the difference between a vibe and a number.