Technical Overview

A better model is not enough.
Data needs a better harness.

By Will Bunting and Forest Fang · April 2026

Frontier models can produce fluent SQL and confident prose. That still does not guarantee correct, reproducible analysis for business-critical questions.

A couple of years ago, we realized this firsthand: model output on real business data was often only mostly right. Data work is less deterministic than code work. In software, compilers, type checkers, linters, and tests expose mistakes quickly. In analytics, there is no equivalent default harness to catch subtle metric and logic errors before they are trusted.

We are packaging a formal benchmark for open release. This page describes our current internal methodology and explains why results from a naive Claude setup can plateau at "mostly correct" in enterprise scenarios, while our data science harness is designed to close that reliability gap.

Core thesis

Why the harness matters

When we started building Vuon, frontier models were already writing impressive SQL. That wasn't the problem. The problem was that they kept getting it subtly wrong — not in ways that crashed a query, but in ways that produced a confident, plausible, runnable answer that was incorrect for the specific business we were analyzing.

The failure pattern we kept seeing was this: a query that is completely correct for one company in a vertical is wrong for a different company in the same vertical, because of how that organization defines its metrics. Cohort analysis is a simple example. To avoid survivorship bias, you need to count users as of their conversion event, not as of the current date. But the details are not universal: different organizations define conversion differently, track it in different tables, and apply different exclusions. A model without access to those specifics has no reliable way to know which answer is correct. It will pick one, and it will sound confident either way.
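To make the cohort point concrete, here is a minimal sketch in Python. The record layout, the dates, and the function name `active_users_on` are hypothetical illustrations, not Vuon's schema or code. It counts one organization's active users as of the conversion date and as of the current date, and the two counts disagree.

```python
from datetime import date

# Hypothetical membership records for one organization: (user_id, joined, left).
# `left` is None while the user is still active.
MEMBERSHIPS = [
    ("u1", date(2025, 1, 10), None),
    ("u2", date(2025, 2, 1), date(2025, 6, 30)),
    ("u3", date(2025, 9, 15), None),
    ("u4", date(2025, 11, 1), None),
]


def active_users_on(memberships, as_of):
    """Count users whose membership covers the given date."""
    return sum(
        1
        for _, joined, left in memberships
        if joined <= as_of and (left is None or left > as_of)
    )


conversion_date = date(2025, 3, 1)  # hypothetical conversion event
today = date(2026, 4, 1)            # hypothetical "current date"

size_at_conversion = active_users_on(MEMBERSHIPS, conversion_date)  # u1 and u2
size_today = active_users_on(MEMBERSHIPS, today)                    # u1, u3, and u4

print(size_at_conversion, size_today)  # 2 3
```

Both counts come from the same data and both queries would run without error. Which one is correct depends on how the organization defines the cohort, which is the information a model cannot infer from the schema alone.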

We realized the gap wasn't the model. The gap was the absence of corrective infrastructure.

In software engineering, correctness has a harness around it. A compiler tells you when code is type-wrong before it runs. A linter flags structural problems. A test suite catches regressions. The codebase itself provides context that constrains what valid code looks like. These mechanisms don't just help developers write better code — they give agents a tight feedback loop. When a coding agent gets something wrong, the environment tells it.

Data has none of that. You can feed schema documentation and metric definitions into a prompt, and we do — but that turns out to be necessary, not sufficient. It doesn't scale without a system behind it. It can't enforce anything. And it can't tell a model that a query ran successfully but is still wrong for this business, on this question, at this point in time.

That's the layer we've spent the last year building.

Analysis

Deep dives

Why churn-pattern analysis fails without confound checks

Do organizations that churn show different dashboard creation or abandonment patterns before leaving?

Baseline miss

Claude tends to report a clean temporal trend (abandonment appears to decline near churn) without accounting for short-tenure churners that flood the final window and distort the signal.

Vuon correction

Vuon enforces cohort and time-window checks before finalizing the narrative, so confounded windows are surfaced and the analysis is pushed toward the stable signal: churned orgs show both lower volume and higher abandonment.
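The confound can be illustrated with a small Python sketch. The data is invented for illustration, and the function names and the `min_tenure` and `lookback` parameters are assumptions rather than Vuon's actual checks. It computes abandonment by months-before-churn twice, once over all churned orgs and once over orgs observed for the whole lookback, and flags windows where short-tenure orgs enter the sample.

```python
from collections import defaultdict

# Hypothetical churned orgs: tenure in months, plus (created, abandoned) dashboard
# counts keyed by months-before-churn (1 = final month before churn).
ORGS = [
    {"tenure": 6, "activity": {3: (10, 4), 2: (8, 4), 1: (6, 4)}},
    {"tenure": 6, "activity": {3: (12, 5), 2: (9, 5), 1: (5, 3)}},
    {"tenure": 1, "activity": {1: (20, 2)}},  # short-tenure churner
    {"tenure": 1, "activity": {1: (25, 3)}},  # short-tenure churner
]


def abandonment_by_window(orgs, min_tenure=0):
    """Abandonment rate per months-before-churn for orgs meeting min_tenure."""
    created = defaultdict(int)
    abandoned = defaultdict(int)
    for org in orgs:
        if org["tenure"] < min_tenure:
            continue
        for window, (c, a) in org["activity"].items():
            created[window] += c
            abandoned[window] += a
    return {w: abandoned[w] / created[w] for w in sorted(created, reverse=True)}


def windows_with_cohort_shift(orgs, lookback):
    """Flag windows that include orgs not observed for the full lookback."""
    flagged = []
    for window in range(1, lookback + 1):
        contributors = [o for o in orgs if window in o["activity"]]
        if any(o["tenure"] < lookback for o in contributors):
            flagged.append(window)
    return flagged


naive = abandonment_by_window(ORGS)                  # all churned orgs
stable = abandonment_by_window(ORGS, min_tenure=3)   # orgs seen in every window
flagged = windows_with_cohort_shift(ORGS, lookback=3)

print(naive)    # final-window rate drops because new orgs flood the sample
print(stable)   # same orgs throughout: the rate rises toward churn
print(flagged)  # [1]
```

In this made-up data the naive series falls in the final month while the fixed-cohort series rises, so the apparent decline is an artifact of who is in the window rather than a change in behavior.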

Why static-size comparisons hide the real retention driver

How do single-user organizations compare to multi-user organizations in engagement, conversion, and retention?

Baseline miss

Claude often classifies organizations by current size and leans on per-org averages, which bakes in survivorship bias and can miss that growth trajectory drives retention more than initial size.

Vuon correction

Vuon keeps trajectory cohorts explicit (single-to-single vs single-to-multi vs multi-to-multi) and normalizes engagement correctly, which preserves the key conclusion that post-conversion team growth is the strongest retention signal.
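A minimal Python sketch of explicit trajectory cohorts follows. The function name, the input fields, and the sample orgs are hypothetical, and the retention figures it produces are illustrative only. The source names three trajectories; the `multi-to-single` label that this function can also emit is our addition for completeness.

```python
def trajectory_cohort(users_at_conversion: int, users_latest: int) -> str:
    """Label an org by how its size changed between conversion and the latest observation."""
    if users_at_conversion < 1 or users_latest < 1:
        raise ValueError("user counts must be at least 1")
    start = "single" if users_at_conversion == 1 else "multi"
    end = "single" if users_latest == 1 else "multi"
    return f"{start}-to-{end}"


# Hypothetical orgs: (users at conversion, users at latest observation, retained?)
ORGS = [
    (1, 1, False),
    (1, 1, True),
    (1, 4, True),
    (1, 3, True),
    (5, 6, True),
    (2, 2, False),
]

# Tally (retained, total) per trajectory instead of per current size.
retention = {}
for at_conversion, latest, retained in ORGS:
    cohort = trajectory_cohort(at_conversion, latest)
    kept, total = retention.get(cohort, (0, 0))
    retention[cohort] = (kept + int(retained), total + 1)

for cohort, (kept, total) in sorted(retention.items()):
    print(f"{cohort}: {kept}/{total} retained")
```

Grouping by current size alone would merge `single-to-multi` and `multi-to-multi` into one "multi-user" bucket, which hides the growth trajectory the analysis is trying to measure.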

Architecture

Harness architecture at a glance

This is the execution path we evaluate. The model is one part of the system, but reliability comes from what happens around the model before and after query execution.

1. Task + domain context

The agent receives the question, schema metadata, and policy-aware metric definitions.

2. Context-aware SQL compilation

Candidate SQL is checked against business logic before execution.

3. Warehouse execution

Queries run on the same warehouse used by the baseline comparison.

4. Post-execution validation

Results are checked for denominator drift, impossible transitions, and unstable windows.

5. Tracked calculations + reruns

Intermediate artifacts are versioned and replayable before final delivery.
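The five steps can be read as a correction loop. The Python sketch below shows one possible shape for such a loop; it is an illustration under our own assumptions, not Vuon's implementation. `run_harness`, the stub model `generate_sql`, the two checks, and the canned warehouse rows are all hypothetical.

```python
def run_harness(question, generate_sql, pre_checks, execute, post_checks, max_attempts=3):
    """Generate SQL, check it, execute it, validate the rows, and record every attempt."""
    history = []   # step 5: each attempt is kept so it can be inspected or replayed
    feedback = []
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, feedback)                                # step 1
        feedback = [msg for check in pre_checks for msg in check(sql)]        # step 2
        rows = None
        if not feedback:
            rows = execute(sql)                                               # step 3
            feedback = [msg for check in post_checks for msg in check(rows)]  # step 4
        history.append({"attempt": attempt, "sql": sql, "rows": rows, "issues": feedback})
        if not feedback:
            return rows, history
    return None, history  # no attempt passed every check


# Hypothetical stand-ins for the model, the checks, and the warehouse.
def generate_sql(question, feedback):
    base = "SELECT paid, total FROM conversion_summary"
    if feedback:  # the stub "model" only adds the exclusion after being told
        return base + " WHERE is_internal = FALSE"
    return base


def requires_internal_exclusion(sql):
    return [] if "is_internal = FALSE" in sql else ["missing filter: is_internal = FALSE"]


def execute(sql):
    return [{"paid": 120, "total": 1000}]  # canned rows instead of a real warehouse


def rate_is_possible(rows):
    return [f"impossible rate in {r}" for r in rows if not 0 <= r["paid"] <= r["total"]]


rows, history = run_harness(
    "What is the free-to-paid conversion rate?",
    generate_sql,
    [requires_internal_exclusion],
    execute,
    [rate_is_possible],
)
print(len(history), history[0]["issues"])
```

The first attempt never reaches the warehouse because the pre-execution check rejects it. The second attempt passes both checks, and both attempts remain in `history`.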

Evaluation

Internal benchmark snapshot

The evaluation results confirmed what we were observing in practice: once tasks require business-definition fidelity, correction loops, and reproducibility, the gap between a naive setup and a purpose-built harness is large and consistent.

Each scenario compares Vuon with a Claude baseline that receives the same schema and metric context, but not Vuon's semantic compiler, validation passes, or calculation-versioning layer.

These are internal results, not the final benchmark package.

Average score across 6 eval cases

Vuon: 95
Naive baseline (Claude): 72

Scoring basis: adjudicated answer correctness, business-definition adherence, and reproducibility.

Eval cases

Free-to-paid conversion rate (SQL)
Question: What is the free-to-paid conversion rate?
Vuon: 99/100
Naive baseline (Claude): 97/100

Session concentration by tenure (SQL)
Question: What fraction of total sessions last year came from the top 10% of users, and how does this concentration vary by user tenure on the platform?
Vuon: 91/100
Naive baseline (Claude): 62/100

Churn abandonment patterns (Analysis)
Question: Do organizations that churn show different dashboard creation or abandonment patterns before leaving?
Vuon: 96/100
Naive baseline (Claude): 70/100

Org size impact on churn and upgrade (SQL)
Question: How does organization size (by user count) affect churn and upgrade rates?
Vuon: 89/100
Naive baseline (Claude): 44/100

Lifecycle upgrade vs churn timing (SQL)
Question: When in their lifecycle do organizations typically upgrade vs churn? Is there a critical retention window?
Vuon: 99/100
Naive baseline (Claude): 96/100

Single-user vs multi-user organizations (Analysis)
Question: How do single-user organizations compare to multi-user organizations in engagement, conversion, and retention?
Vuon: 95/100
Naive baseline (Claude): 60/100
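The headline averages can be reproduced from the six per-case scores above. The short Python check below uses only those published scores; the variable names are ours.

```python
# Per-case scores copied from the eval cases above: (Vuon, naive baseline).
SCORES = {
    "Free-to-paid conversion rate": (99, 97),
    "Session concentration by tenure": (91, 62),
    "Churn abandonment patterns": (96, 70),
    "Org size impact on churn and upgrade": (89, 44),
    "Lifecycle upgrade vs churn timing": (99, 96),
    "Single-user vs multi-user organizations": (95, 60),
}

vuon_avg = sum(v for v, _ in SCORES.values()) / len(SCORES)
baseline_avg = sum(b for _, b in SCORES.values()) / len(SCORES)
gaps = {case: v - b for case, (v, b) in SCORES.items()}

print(f"Vuon average: {vuon_avg:.1f}")          # 94.8
print(f"Baseline average: {baseline_avg:.1f}")  # 71.5
for case, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{gap:>3} points  {case}")
```

The unrounded means are 94.8 and 71.5, which match the reported 95 and 72 when rounded to the nearest integer with halves rounded up. The gap ranges from 2 points on the free-to-paid conversion rate to 45 points on org size impact.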

Methodology

How we run the comparison

The goal is not to show that prompting is useless. The goal is to isolate what infrastructure is required to turn a strong model into a reliable data analysis system.

01. Fix the task set

We define a fixed set of recurring data analysis tasks with a known or adjudicable answer: metric reconstruction, decomposition, cohorting, and variance analysis. These are tasks where correctness can be checked, not just tasks where prose can sound convincing.

02. Equal information, different execution

To ensure a fair comparison, we provide the Claude baseline with the same table schemas, metric definitions, and domain context that our agent receives. Both systems connect to the same warehouse. The comparison isolates the execution layer: what each system does with the same information.

03. Score against verifiable outputs

We score whether the system reached the accepted answer, whether it stayed inside business definitions, and whether the final analysis could be rerun to the same result. This favors systems that show their work instead of systems that optimize for persuasive narration.

04. Separate current snapshot from the future benchmark

The figures on this page reflect our current internal evaluation snapshot. In parallel, we are formalizing a more rigorous benchmark package that we expect to open source, including methodology, scoring rubrics, and reproducible task definitions.

A fair benchmark already assumes both systems can execute queries and access the same domain context. The harder question is whether the system can tell the agent its SQL is wrong even when the database runs it, or that a narrative is unsupported even when the numbers look superficially plausible.

System components

What sits inside the harness

The improvement does not come from one trick. It comes from stacking mechanisms that apply correction pressure before a wrong answer hardens into a final deliverable.

Contextually aware SQL compiler

We spent substantial research effort building a SQL compilation layer that understands warehouse structure and business definitions. It flags misaligned queries in the same spirit that a code compiler flags undefined variables or invalid types.

Semantic and policy graph

Metric definitions, ownership, exclusions, and approved logic are represented as operational context rather than as loose prose stuffed into a prompt. That turns semantic drift into something the agent can detect and correct.

Post-execution validation

After execution, the system evaluates outputs across multiple statistical and structural dimensions. It checks for impossible values, denominator drift, discontinuities, and other signatures that the query or the underlying data may be wrong.

Tracked and versioned calculations

Every calculation the agent makes is tracked, versioned, and available to rerun as the analysis develops. That gives the system a reproducibility loop and lets it catch its own inconsistencies before the answer is delivered.

We use the term SQL compiler intentionally. It is not a thin query transport. It is a context-aware compiler that can tell the agent when candidate SQL is out of alignment with metric definitions, segment semantics, or exclusion policy.
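As a toy illustration of definitions acting as checks rather than prompt text, the Python sketch below represents a metric definition as structured data and tests candidate SQL against it. This is not Vuon's compiler. `MetricDefinition`, `check_sql`, the table name, and the filter predicates are hypothetical, and plain substring matching stands in for the SQL parsing a real system would need.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    source_table: str
    required_filters: tuple  # predicates that must appear in the query


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return re.sub(r"\s+", " ", sql.strip().lower())


def check_sql(sql: str, metric: MetricDefinition) -> list:
    """Return readable issues; an empty list means this toy check found none."""
    text = normalize(sql)
    issues = []
    table = re.escape(metric.source_table.lower())
    if not re.search(rf"\b(from|join) {table}\b", text):
        issues.append(f"{metric.name}: expected source table {metric.source_table}")
    for predicate in metric.required_filters:
        if normalize(predicate) not in text:
            issues.append(f"{metric.name}: missing required filter '{predicate}'")
    return issues


# Hypothetical definition: conversion excludes internal and test accounts.
CONVERSION = MetricDefinition(
    name="free_to_paid_conversion",
    source_table="subscriptions",
    required_filters=("is_internal = false", "is_test_account = false"),
)

# Valid SQL that a database would run, but it skips one required exclusion.
candidate = """
SELECT SUM(CASE WHEN plan = 'paid' THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
FROM subscriptions
WHERE is_internal = FALSE
"""

issues = check_sql(candidate, CONVERSION)
print(issues)
```

The candidate query is syntactically valid and would return a number. The check still reports the missing test-account exclusion, which is the kind of signal a database error message never provides.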

Post-execution validation matters just as much. Some queries are syntactically correct and semantically plausible, yet still wrong because they produce impossible transitions, unstable denominators, or implausible distributions.
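A minimal Python sketch of two such checks follows: rates outside the possible range, and denominators that jump between periods. The function name, the row layout, and the 50% drift threshold are assumptions chosen for illustration, not Vuon's validation rules.

```python
def validate_rates(rows, max_denominator_change=0.5):
    """Check (period, numerator, denominator) rows for suspicious patterns."""
    issues = []
    previous = None
    for period, numerator, denominator in rows:
        if denominator <= 0:
            issues.append(f"{period}: non-positive denominator")
        elif not 0 <= numerator <= denominator:
            issues.append(f"{period}: rate outside [0, 1]")
        if previous and denominator > 0:
            change = abs(denominator - previous) / previous
            if change > max_denominator_change:
                issues.append(f"{period}: denominator moved {change:.0%} vs prior period")
        if denominator > 0:
            previous = denominator
    return issues


# Hypothetical monthly conversion results: (period, converted, eligible).
ROWS = [
    ("2026-01", 120, 1000),
    ("2026-02", 130, 1040),
    ("2026-03", 125, 2300),   # eligible base more than doubles: possible join fan-out
    ("2026-04", 2500, 2350),  # more conversions than eligible users: impossible
]

issues = validate_rates(ROWS)
for issue in issues:
    print(issue)
```

Neither flagged row would cause a query error. The March row looks like a plausible rate in isolation, and only the comparison against the prior period exposes the drift.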

Reproducibility is part of the correctness loop. Tracked and versioned calculations let the agent rerun parts of its own analysis and catch inconsistencies before delivery.
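One simple way to get a rerun loop is to store each calculation with its inputs and a fingerprint of its result. The Python sketch below assumes that design; `CalculationLog` and its methods are hypothetical names, and it illustrates the idea rather than describing Vuon's versioning layer.

```python
import hashlib
import json


class CalculationLog:
    """Append-only record of named calculations with fingerprinted results."""

    def __init__(self):
        self._versions = {}  # name -> list of (function, inputs, fingerprint)

    @staticmethod
    def _fingerprint(value):
        payload = json.dumps(value, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, name, func, inputs):
        """Run func(**inputs), store it as a new version, and return the result."""
        result = func(**inputs)
        entry = (func, inputs, self._fingerprint(result))
        self._versions.setdefault(name, []).append(entry)
        return result

    def version_count(self, name):
        return len(self._versions[name])

    def rerun_matches(self, name, version=-1):
        """Re-execute a stored version and compare it with the recorded fingerprint."""
        func, inputs, fingerprint = self._versions[name][version]
        return self._fingerprint(func(**inputs)) == fingerprint


def conversion_rate(paid, total):
    return round(paid / total, 4)


log = CalculationLog()
first = log.record("conversion_rate", conversion_rate, {"paid": 120, "total": 1000})
# The analysis later revises the eligible base; the earlier version is kept.
second = log.record("conversion_rate", conversion_rate, {"paid": 120, "total": 980})

print(first, second)                         # 0.12 0.1224
print(log.version_count("conversion_rate"))  # 2
print(log.rerun_matches("conversion_rate"))  # True
```

In this sketch a rerun only matches when the function is deterministic for its inputs. A calculation that reads live warehouse data would also need the query results pinned or snapshotted before a mismatch could be interpreted as an inconsistency.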

Comparison

System-level differences

Both systems can connect to a warehouse and generate SQL. The divergence appears when tasks require business-definition fidelity, repeated self-correction, and durable analytical state.

Information access
Naive baseline: Schema snippets and metric definitions provided via context window, with direct SQL execution.
Vuon: Same base access plus structured warehouse metadata, semantic graph, execution traces, and reusable artifacts.

Business definition handling
Naive baseline: Definitions are mostly passive text in prompts, which means the model can ignore, reinterpret, or partially apply them.
Vuon: Definitions are represented in a structured layer that can reject or redirect candidate work when logic departs from policy.

SQL error correction
Naive baseline: Correction starts after the database or a person complains. Many semantic errors survive execution because the SQL is syntactically valid.
Vuon: Correction starts before and after execution. The compiler and validation passes both contribute signals that the agent can act on.

Intermediate state management
Naive baseline: Intermediate calculations often live only in the model conversation, so the chain of reasoning is fragile and hard to replay.
Vuon: Intermediate calculations are explicit, versioned artifacts that can be reused, rerun, and audited.

Failure mode
Naive baseline: Plausible but wrong analysis that appears complete because the narrative is coherent.
Vuon: Visible correction loops that raise uncertainty, request another pass, or stop before a false answer is presented as final.

This is why we do not expect the baseline to simply "catch up" by waiting for a better model. The model can improve while the system still lacks the domain-specific repair loops required for trustworthy analytics.

What this implies for evaluators

If two systems share the same context and frontier model class, a persistent gap of this size points to infrastructure, not prompt wording. The relevant procurement question is whether the system can enforce definitions, challenge its own outputs, and replay its analytical path under scrutiny.

In our internal snapshot of recurring enterprise tasks with verifiable answers, Vuon scored 89 or higher on all six cases, with the knowledge graph populated and the harness able to exercise correction loops. The baseline scored 70 or lower on four of the six, and we have not seen model upgrades alone close that reliability gap.

Our benchmark release is intended to make that question testable, with task definitions and scoring that reward correctness and reproducibility rather than rhetorical confidence.