Technical Overview
A better model is not enough.
Data needs a better harness.
By Will Bunting and Forest Fang · April 2026
Frontier models can produce fluent SQL and confident prose. That still does not guarantee correct, reproducible analysis for business-critical questions.
A couple of years ago, we realized this firsthand: model output on real business data was often only mostly right. Data work is less deterministic than code work. In software, compilers, type checkers, linters, and tests expose mistakes quickly. In analytics, there is no equivalent default harness to catch subtle metric and logic errors before they are trusted.
We are packaging a formal benchmark for open release. This page shares our current internal methodology and explains why a naive Claude setup can plateau at "mostly correct" in enterprise scenarios, while our advanced data science harness is designed to close that reliability gap.
Core thesis
Why the harness matters
When we started building Vuon, frontier models were already writing impressive SQL. That wasn't the problem. The problem was that they kept getting it subtly wrong — not in ways that crashed a query, but in ways that produced a confident, plausible, runnable answer that was incorrect for the specific business we were analyzing.
The failure pattern we kept seeing was this: a query that is completely correct for one company in a vertical is wrong for a different company in the same vertical, because of how that organization defines its metrics. Cohort analysis is a simple example. To avoid survivorship bias you need to count users at the time of their conversion event, not at the current date. But that's not universal — different organizations define conversion differently, track it in different tables, and apply different exclusions. A model without access to those specifics has no reliable way to know which answer is correct. It will pick one, and it will sound confident either way.
We realized the gap wasn't the model. The gap was the absence of corrective infrastructure.
In software engineering, correctness has a harness around it. A compiler tells you when code is type-wrong before it runs. A linter flags structural problems. A test suite catches regressions. The codebase itself provides context that constrains what valid code looks like. These mechanisms don't just help developers write better code — they give agents a tight feedback loop. When a coding agent gets something wrong, the environment tells it.
Data has none of that. You can feed schema documentation and metric definitions into a prompt, and we do, but that turns out to be necessary rather than sufficient. A prompt doesn't scale without a system behind it. It can't enforce anything. And it can't tell a model that a query ran successfully but is still wrong for this business, on this question, at this point in time.
That's the layer we've spent the last year building.
Analysis
Deep dives
Why churn-pattern analysis fails without confound checks
Do organizations that churn show different dashboard creation or abandonment patterns before leaving?
Baseline miss
Claude tends to report a clean temporal trend (abandonment appears to decline near churn) without accounting for short-tenure churners that flood the final window and distort the signal.
Vuon correction
Vuon enforces cohort and time-window checks before finalizing the narrative, so confounded windows are surfaced and the analysis is pushed toward the stable signal: churned orgs show both lower volume and higher abandonment.
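To make the confound concrete, here is a minimal Python sketch of one possible time-window check. It is not Vuon's implementation: the row layout, the function names (`window_composition`, `confounded_windows`), and the thresholds (`short_tenure_days`, `max_shift`) are assumptions chosen for illustration, and the data is made up. The idea is to measure how the share of short-tenure organizations changes across pre-churn windows and to flag windows whose mix departs from the earliest one.

```python
from collections import defaultdict

def window_composition(rows, short_tenure_days=30):
    """Share of short-tenure orgs in each pre-churn window.

    rows: iterable of (org_id, tenure_days_at_churn, weeks_before_churn),
    one row per org per window in which the org was active (assumed layout).
    """
    totals = defaultdict(int)
    short = defaultdict(int)
    for _org_id, tenure_days, weeks_before in rows:
        totals[weeks_before] += 1
        if tenure_days < short_tenure_days:
            short[weeks_before] += 1
    return {week: short[week] / totals[week] for week in totals}

def confounded_windows(rows, short_tenure_days=30, max_shift=0.2):
    """Windows whose short-tenure share differs from the earliest window."""
    shares = window_composition(rows, short_tenure_days)
    baseline = shares[max(shares)]  # window furthest from churn
    return sorted(w for w, s in shares.items() if abs(s - baseline) > max_shift)

# Toy data: two long-tenure orgs active in all four weeks before churn,
# two short-tenure orgs that only exist in the final week.
rows = [(org, tenure, week)
        for org, tenure in (("a", 200), ("b", 150))
        for week in (4, 3, 2, 1)]
rows += [("c", 10, 1), ("d", 12, 1)]

flagged = confounded_windows(rows)
print(flagged)  # [1]
```

With this toy data the final week is flagged because half of its organizations did not exist in the earlier windows, so a trend read across windows would be comparing different populations.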
Why static-size comparisons hide the real retention driver
How do single-user organizations compare to multi-user organizations in engagement, conversion, and retention?
Baseline miss
Claude often classifies organizations by current size and leans on per-org averages, which bakes in survivorship bias and can miss that growth trajectory drives retention more than initial size.
Vuon correction
Vuon keeps trajectory cohorts explicit (single-to-single vs single-to-multi vs multi-to-multi) and normalizes engagement correctly, which preserves the key conclusion that post-conversion team growth is the strongest retention signal.
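The trajectory cohorts named above can be sketched in a few lines of Python. This is an illustrative assumption about how such cohorts might be assigned, not Vuon's code: the field names (`users_at_conversion`, `users_later`, `retained`) and the sample organizations are hypothetical. The point is that each organization is classified by where it started and where it ended, and that churned organizations stay in the denominator.

```python
from collections import defaultdict

def trajectory_cohort(users_at_conversion: int, users_later: int) -> str:
    """Label an org by its size trajectory rather than its current size."""
    start = "single" if users_at_conversion <= 1 else "multi"
    end = "single" if users_later <= 1 else "multi"
    # The article names three cohorts; "multi-to-single" can also occur here
    # for orgs that shrink, and is kept separate rather than dropped.
    return f"{start}-to-{end}"

def retention_by_cohort(orgs):
    """Retention rate per trajectory cohort, counting churned orgs too."""
    kept = defaultdict(int)
    total = defaultdict(int)
    for org in orgs:
        cohort = trajectory_cohort(org["users_at_conversion"], org["users_later"])
        total[cohort] += 1
        kept[cohort] += 1 if org["retained"] else 0
    return {cohort: kept[cohort] / total[cohort] for cohort in total}

# Made-up organizations, for illustration only.
orgs = [
    {"users_at_conversion": 1, "users_later": 1, "retained": False},
    {"users_at_conversion": 1, "users_later": 1, "retained": True},
    {"users_at_conversion": 1, "users_later": 4, "retained": True},
    {"users_at_conversion": 1, "users_later": 3, "retained": True},
    {"users_at_conversion": 5, "users_later": 6, "retained": True},
    {"users_at_conversion": 4, "users_later": 4, "retained": False},
]

rates = retention_by_cohort(orgs)
print(rates)
```

Classifying by current size alone would merge the single-to-multi organizations into the multi-user group and hide the growth signal.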
Architecture
Harness architecture at a glance
This is the execution path we evaluate. The model is one part of the system, but reliability comes from what happens around the model before and after query execution.
Task + domain context
The agent receives the question, schema metadata, and policy-aware metric definitions.
Context-aware SQL compilation
Candidate SQL is checked against business logic before execution.
Warehouse execution
Queries run on the same warehouse used by the baseline comparison.
Post-execution validation
Results are checked for denominator drift, impossible transitions, and unstable windows.
Tracked calculations + reruns
Intermediate artifacts are versioned and replayable before final delivery.
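The stages above can be read as a loop in which each check can send the agent back for another attempt. The sketch below is a hypothetical outline in Python, not Vuon's code: `run_with_correction` and its four callables are placeholder names, and the stubs stand in for the model, the pre-execution check, the warehouse, and the post-execution validators.

```python
def run_with_correction(generate_sql, pre_check, execute, post_check, max_attempts=3):
    """Generate, check, execute, validate; feed issues back until clean."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(feedback)
        issues = pre_check(sql)          # business-logic check before execution
        if not issues:
            rows = execute(sql)
            issues = post_check(rows)    # result validation after execution
            if not issues:
                return {"status": "final", "sql": sql, "rows": rows, "attempts": attempt}
        feedback = issues
    # Stop rather than present an unverified answer as final.
    return {"status": "unresolved", "issues": feedback, "attempts": max_attempts}

# Stubs for illustration only.
def generate_sql(feedback):
    if feedback:
        return "SELECT COUNT(*) FROM orgs WHERE is_test = 0"
    return "SELECT COUNT(*) FROM orgs"

def pre_check(sql):
    return [] if "is_test = 0" in sql else ["test orgs must be excluded"]

def execute(sql):
    return [(42,)]

def post_check(rows):
    return [] if rows and rows[0][0] >= 0 else ["negative count"]

result = run_with_correction(generate_sql, pre_check, execute, post_check)
print(result["status"], result["attempts"])  # final 2
```

The first candidate runs fine on a database but is rejected before execution, which is the kind of signal a plain query connection would not provide.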
Evaluation
Internal benchmark snapshot
The evaluation results confirmed what we were observing in practice: once tasks require business-definition fidelity, correction loops, and reproducibility, the gap between a naive setup and a purpose-built harness is large and consistent.
Each scenario compares Vuon with a Claude baseline that receives the same schema and metric context, but not Vuon's semantic compiler, validation passes, or calculation-versioning layer.
These are internal results, not the final benchmark package.
Average score across 6 eval cases
Scoring basis: adjudicated answer correctness, business-definition adherence, and reproducibility.
| Eval case | Type | Question | Vuon | Claude baseline |
|---|---|---|---|---|
| Free-to-paid conversion rate | SQL | What is the free-to-paid conversion rate? | 99/100 | 97/100 |
| Session concentration by tenure | SQL | What fraction of total sessions last year came from the top 10% of users, and how does this concentration vary by user tenure on the platform? | 91/100 | 62/100 |
| Churn abandonment patterns | Analysis | Do organizations that churn show different dashboard creation or abandonment patterns before leaving? | 96/100 | 70/100 |
| Org size impact on churn and upgrade | SQL | How does organization size (by user count) affect churn and upgrade rates? | 89/100 | 44/100 |
| Lifecycle upgrade vs churn timing | SQL | When in their lifecycle do organizations typically upgrade vs churn? Is there a critical retention window? | 99/100 | 96/100 |
| Single-user vs multi-user organizations | Analysis | How do single-user organizations compare to multi-user organizations in engagement, conversion, and retention? | 95/100 | 60/100 |
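The averages across the six cases can be recomputed directly from the per-case scores. The short Python snippet below does that arithmetic. It assumes the first score in each pair is Vuon's and the second is the Claude baseline's, which matches the surrounding text but is not labeled explicitly in the score cards; the variable names are ours.

```python
# (first score, second score) per eval case, copied from the scores above.
# Assumption: first = Vuon, second = Claude baseline.
scores = {
    "Free-to-paid conversion rate": (99, 97),
    "Session concentration by tenure": (91, 62),
    "Churn abandonment patterns": (96, 70),
    "Org size impact on churn and upgrade": (89, 44),
    "Lifecycle upgrade vs churn timing": (99, 96),
    "Single-user vs multi-user organizations": (95, 60),
}

vuon_avg = sum(v for v, _ in scores.values()) / len(scores)
baseline_avg = sum(b for _, b in scores.values()) / len(scores)
widest_gap = max(scores, key=lambda case: scores[case][0] - scores[case][1])

print(round(vuon_avg, 1))   # 94.8
print(baseline_avg)         # 71.5
print(widest_gap)           # Org size impact on churn and upgrade
```

Under that assumption, the two cases with the smallest gaps (2 and 3 points) are direct rate and timing questions, while the largest gaps appear on cohort and segmentation questions.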
Methodology
How we run the comparison
The goal is not to show that prompting is useless. The goal is to isolate what infrastructure is required to turn a strong model into a reliable data analysis system.
Fix the task set
We define a fixed set of recurring data analysis tasks with a known or adjudicable answer: metric reconstruction, decomposition, cohorting, and variance analysis. These are tasks where correctness can be checked, not just tasks where prose can sound convincing.
Equal information, different execution
To ensure a fair comparison, we provide the Claude baseline with the same table schemas, metric definitions, and domain context that our agent receives. Both systems connect to the same warehouse. The comparison isolates the execution layer: what each system does with the same information.
Score against verifiable outputs
We score whether the system reached the accepted answer, whether it stayed inside business definitions, and whether the final analysis could be rerun to the same result. This favors systems that show their work instead of systems that optimize for persuasive narration.
Separate current snapshot from the future benchmark
The figures on this page reflect our current internal evaluation snapshot. In parallel, we are formalizing a more rigorous benchmark package that we expect to open source, including methodology, scoring rubrics, and reproducible task definitions.
A fair benchmark already assumes both systems can execute queries and access the same domain context. The harder question is whether the system can tell the agent its SQL is wrong even when the database runs it, or that a narrative is unsupported even when the numbers look superficially plausible.
System components
What sits inside the harness
The improvement does not come from one trick. It comes from stacking mechanisms that apply correction pressure before a wrong answer hardens into a final deliverable.
Contextually aware SQL compiler
We spent substantial research effort building a SQL compilation layer that understands warehouse structure and business definitions. It flags misaligned queries in the same spirit that a code compiler flags undefined variables or invalid types.
Semantic and policy graph
Metric definitions, ownership, exclusions, and approved logic are represented as operational context rather than as loose prose stuffed into a prompt. That turns semantic drift into something the agent can detect and correct.
Post-execution validation
After execution, the system evaluates outputs across multiple statistical and structural dimensions. It checks for impossible values, denominator drift, discontinuities, and other signatures that the query or the underlying data may be wrong.
Tracked and versioned calculations
Every calculation the agent makes is tracked, versioned, and available to rerun as the analysis develops. That gives the system a reproducibility loop and lets it catch its own inconsistencies before the answer is delivered.
We use the term SQL compiler intentionally. It is not a thin query transport. It is a context-aware compiler that can tell the agent when candidate SQL is out of alignment with metric definitions, segment semantics, or exclusion policy.
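As a rough illustration of what "out of alignment with metric definitions" can mean, the Python sketch below checks candidate SQL against a metric's required tables and exclusion filters. It is a toy based on string matching, not a SQL parser and not Vuon's compiler; `MetricDefinition`, `check_sql`, and the `subscriptions` / `is_internal` names are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    required_tables: tuple[str, ...]
    required_filters: tuple[str, ...]

def _normalize(sql: str) -> str:
    """Lowercase and collapse whitespace so comparisons are layout-insensitive."""
    return re.sub(r"\s+", " ", sql.strip().lower())

def check_sql(sql: str, metric: MetricDefinition) -> list[str]:
    """Return alignment issues; an empty list means no issue was detected."""
    text = _normalize(sql)
    issues = []
    for table in metric.required_tables:
        if not re.search(rf"\b(from|join) {re.escape(table.lower())}\b", text):
            issues.append(f"{metric.name}: expected table '{table}' is not referenced")
    for condition in metric.required_filters:
        if _normalize(condition) not in text:
            issues.append(f"{metric.name}: missing required filter '{condition}'")
    return issues

# Hypothetical definition: conversion must exclude internal accounts.
conversion = MetricDefinition(
    name="free_to_paid_conversion",
    required_tables=("subscriptions",),
    required_filters=("is_internal = false",),
)

naive = "SELECT COUNT(*) FROM subscriptions WHERE plan = 'paid'"
aligned = "SELECT COUNT(*) FROM subscriptions WHERE plan = 'paid' AND is_internal = FALSE"

naive_issues = check_sql(naive, conversion)
aligned_issues = check_sql(aligned, conversion)
print(naive_issues)
print(aligned_issues)  # []
```

Both queries would run successfully on a database. Only the check against the definition distinguishes them, and a real implementation would need proper SQL parsing rather than substring matching.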
Post-execution validation matters just as much. Some queries are syntactically correct and semantically plausible, yet still wrong because they produce impossible transitions, unstable denominators, or implausible distributions.
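A minimal sketch of two such checks, impossible values and denominator drift, is shown below in Python. The function name `validate_rate_series`, the row layout, and the 50% drift threshold are assumptions for illustration, and the monthly figures are invented; this is not Vuon's validator.

```python
def validate_rate_series(rows, max_denominator_change=0.5):
    """Flag impossible rates and large denominator swings.

    rows: list of (period, numerator, denominator), ordered by period.
    """
    issues = []
    previous = None
    for period, numerator, denominator in rows:
        if denominator <= 0:
            issues.append(f"{period}: non-positive denominator {denominator}")
        elif numerator < 0 or numerator > denominator:
            issues.append(f"{period}: impossible rate {numerator}/{denominator}")
        if previous is not None and previous > 0 and denominator > 0:
            change = abs(denominator - previous) / previous
            if change > max_denominator_change:
                issues.append(
                    f"{period}: denominator moved {change:.0%} vs previous period"
                )
        previous = denominator
    return issues

# Invented monthly conversion counts: (period, converted, eligible).
monthly = [
    ("2025-01", 40, 400),
    ("2025-02", 45, 410),
    ("2025-03", 50, 1200),   # eligible population nearly triples
    ("2025-04", 130, 120),   # more conversions than eligible users
]

issues = validate_rate_series(monthly)
for issue in issues:
    print(issue)
```

Every query behind these rows could have executed without error. The flags come only from looking at the shape of the results.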
Reproducibility is part of the correctness loop. Tracked and versioned calculations let the agent rerun parts of its own analysis and catch inconsistencies before delivery.
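One simple way to picture a tracked, rerunnable calculation is a log that stores each query with a hash of its result and can replay it later. The Python sketch below uses the standard library's `sqlite3` as a stand-in warehouse. `CalculationLog`, `fingerprint`, and the `signups` table are hypothetical names, and this is an assumption about the general technique, not a description of Vuon's versioning layer.

```python
import hashlib
import json
import sqlite3

def fingerprint(rows) -> str:
    """Short, stable hash of a result set."""
    payload = json.dumps(rows, sort_keys=True, default=str).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

class CalculationLog:
    def __init__(self, conn):
        self.conn = conn
        self.records = {}  # calculation name -> list of versioned records

    def run(self, name, sql, params=()):
        """Execute a query and record it as a new version of `name`."""
        rows = self.conn.execute(sql, params).fetchall()
        versions = self.records.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "sql": sql,
            "params": list(params),
            "result_hash": fingerprint(rows),
        }
        versions.append(record)
        return rows, record

    def replay(self, name, version):
        """Rerun a recorded calculation; True if the result is unchanged."""
        record = self.records[name][version - 1]
        rows = self.conn.execute(record["sql"], record["params"]).fetchall()
        return fingerprint(rows) == record["result_hash"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (user_id INTEGER, plan TEXT)")
conn.executemany("INSERT INTO signups VALUES (?, ?)",
                 [(1, "free"), (2, "paid"), (3, "paid")])

log = CalculationLog(conn)
rows, record = log.run("paid_users",
                       "SELECT COUNT(*) FROM signups WHERE plan = ?", ("paid",))
matches_before = log.replay("paid_users", record["version"])

conn.execute("INSERT INTO signups VALUES (4, 'paid')")  # underlying data changes
matches_after = log.replay("paid_users", record["version"])
print(rows, matches_before, matches_after)  # [(2,)] True False
```

A failed replay does not say which answer is right. It says the earlier number can no longer be reproduced, which is a cue to investigate before the analysis is delivered.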
Comparison
System-level differences
Both systems can connect to a warehouse and generate SQL. The divergence appears when tasks require business-definition fidelity, repeated self-correction, and durable analytical state.
| Dimension | Naive baseline | Vuon |
|---|---|---|
| Information access | Schema snippets and metric definitions provided via context window, with direct SQL execution. | Same base access plus structured warehouse metadata, semantic graph, execution traces, and reusable artifacts. |
| Business definition handling | Definitions are mostly passive text in prompts, which means the model can ignore, reinterpret, or partially apply them. | Definitions are represented in a structured layer that can reject or redirect candidate work when logic departs from policy. |
| SQL error correction | Correction starts after the database or a person complains. Many semantic errors survive execution because the SQL is syntactically valid. | Correction starts before and after execution. The compiler and validation passes both contribute signals that the agent can act on. |
| Intermediate state management | Intermediate calculations often live only in the model conversation, so the chain of reasoning is fragile and hard to replay. | Intermediate calculations are explicit, versioned artifacts that can be reused, rerun, and audited. |
| Failure mode | Plausible but wrong analysis that appears complete because the narrative is coherent. | Visible correction loops that raise uncertainty, request another pass, or stop before a false answer is presented as final. |
This is why we do not expect the baseline to simply "catch up" by waiting for a better model. The model can improve while the system still lacks the domain-specific repair loops required for trustworthy analytics.
What this implies for evaluators
If two systems share the same context and frontier model class, a persistent gap of this size points to infrastructure, not prompt wording. The relevant procurement question is whether the system can enforce definitions, challenge its own outputs, and replay its analytical path under scrutiny.
In recurring enterprise tasks with verifiable answers, Vuon reaches the correct answer when the semantic and policy graph is populated and the harness can exercise correction loops. The baseline usually does not, and we have not seen model upgrades alone close that reliability gap.
Our benchmark release is intended to make that question testable, with task definitions and scoring that reward correctness and reproducibility rather than rhetorical confidence.