AI is already in every compliance workflow. People ask Claude, ChatGPT, Gemini, and Grok regulatory questions every day, and they get fluent, confident answers back. The unspoken problem is that nobody knows whether to trust them. The citation might not exist. The text might be two years stale. The model cannot tell you whether a rule is a draft, in consultation, or binding today.

So we did the obvious thing: we measured it. We built a benchmark that asks leading AI models real, current regulatory questions, twice. Once on their own. Once connected to Obsidian, our verified, tier-0 regulatory data layer. This article is the first read-out, and the result is not subtle.

39 → 94: Average regulatory accuracy, models alone vs. with Obsidian (0 to 100)
32% → 93%: Share of an answer's factual claims grounded in the official source, alone vs. with Obsidian
~100%: Answers that cited the correct official source, with Obsidian

Average accuracy across twelve models jumped from 39.1 to 93.8 out of 100, a 54.7-point lift, just by giving the model the right data to read. The best pairing, Claude Sonnet 4.6 with Obsidian, scored 99/100. A nano-tier model that scores 12 on its own, and ties for the lowest cost per question in the test, reached 98.6 once connected. The takeaway is the thesis of this whole project: the bottleneck in regulatory AI was never the model's intelligence. It was the data it could reach.

Accuracy has almost nothing to do with price

The first chart plots regulatory accuracy against price. Each model appears twice: the plain provider logo is the model answering alone, and the wider provider-and-Obsidian coin is the same model connected to the data layer. A faint line links the two so you can see the lift.

Scatter chart of regulatory accuracy versus price per million tokens for twelve AI models, each shown alone and connected to the Obsidian data layer. Connected models cluster near the top regardless of price.
Regulatory accuracy vs. price. Alone, models scatter from 11 to 69. Connected to Obsidian, they all land between 87 and 99, independent of how much they cost.

Read it from the bottom up. On their own, the orange and dark logos are scattered low and wide: a frontier model alone might score 67, a cheap one 12. Connect them, and every single one is pulled up into a tight band between 87 and 99, with little relationship to price left. A $0.18-per-million-token model and an $11.25-per-million-token model end up within eight points of each other (88.4 and 96.0). You are no longer paying for accuracy. You are paying for a data layer, and it costs a fraction of the model.

Three myths this benchmark breaks

Everyone in regulatory affairs has heard the same three objections to using AI. Each one is testable, and each one falls.

Myth 1: "AI is too inaccurate for regulatory work."

True for a raw model. Not true with a data layer. The average alone-score of 39 is exactly the inaccuracy people complain about, and they are right to. But the same models, given verified data to read, average 94 and top out at 99. Accuracy was never an inherent ceiling of the model. It was a missing input.

Myth 2: "Good regulatory AI has to be slow."

This chart settles it. It plots accuracy against average response time in seconds.

Scatter chart of regulatory accuracy versus average response time in seconds for twelve AI models, alone and connected to Obsidian. The fastest models, answering in under two seconds, reach the top accuracy band once connected.
Regulatory accuracy vs. average response time. The sub-two-second models reach the same 90s accuracy band as the slow reasoning models, once connected.

The fastest models in the test answer in two seconds or less and, connected to Obsidian, sit in the same accuracy band as the slow, heavy reasoning models. You do not trade speed for trustworthiness. A nano model answering in 2.0 seconds reaches 98.6/100, and a mini model answering in 1.9 seconds reaches 98.0. Speed and accuracy stopped being a trade-off.

Myth 3: "AI just makes up regulation numbers."

This is the real fear, and it is the most important to address honestly. A raw model, asked for a specific regulation, article, or date, will often invent one that looks right and does not exist. In a brainstorm that is harmless. In a compliance memo, a fabricated citation is a liability.

The fix is not a model that magically never errs. It is an answer you can verify, and we measured exactly how verifiable. We broke every answer into its atomic factual claims and checked each one against the official source and the verified facts. Answering alone, only 32% of a model's claims were backed by a verifiable source. The other 68% were unsupported, including the 8% of claims that flatly contradicted the official text. Connected to Obsidian, 93% of claims were grounded in the official source, and not a single one contradicted it. The dangerous failure, a confident statement that contradicts the official text, drops to zero. The citation is no longer something the model dreamed up. It is a tier-0 link the model retrieved, and you can click it.
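One way to read these percentages is as shares of a single pool of labeled claims. The sketch below is illustrative only: the label names, the `claim_shares` helper, and the toy pool of 25 claims are hypothetical, not the benchmark's rubric or data. It assumes each claim receives exactly one label and that a contradicted claim counts as not grounded.

```python
from collections import Counter

# Hypothetical label names; the benchmark's actual rubric is not published here.
GROUNDED = "grounded"
UNSUPPORTED = "unsupported"
CONTRADICTED = "contradicted"


def claim_shares(labels):
    """Return (grounded %, not-grounded %, contradicted %) over all claims."""
    if not labels:
        raise ValueError("need at least one labeled claim")
    counts = Counter(labels)
    total = len(labels)
    grounded = counts[GROUNDED]
    contradicted = counts[CONTRADICTED]
    # Assumption: "not grounded" covers both unsupported and contradicted claims.
    not_grounded = total - grounded

    def pct(n):
        return round(100 * n / total)

    return pct(grounded), pct(not_grounded), pct(contradicted)


# Toy pool of 25 claims, chosen only to mirror the reported "alone" split.
alone = [GROUNDED] * 8 + [UNSUPPORTED] * 15 + [CONTRADICTED] * 2
print(claim_shares(alone))  # (32, 68, 8)
```

Under this reading the grounded and not-grounded shares sum to 100%, and the contradicted share is a subset of the not-grounded one rather than a third slice.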

This is the thing no general-purpose AI does today: return the direct link to the official, tier-0 source, every time, with the legal status attached. That is what makes an answer defensible in front of an auditor.

The full data, for the purists

Here is every model, every metric, both conditions. "Alone" is the model with no data layer. "With Obsidian" is the same model connected. Accuracy is a 0 to 100 score from a blind judge against human-verified ground truth. "Cites source" is the share of answers that returned the correct official reference. "Status" is the share that got the legal status right. "Grounded claims" is the share of the answer's atomic factual claims that trace back to the official source or verified facts, alone versus with Obsidian.

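| # | Model | Provider | Tier | Acc. alone | Acc. + Obsidian | Lift | Cites source | Status correct | Grounded claims (alone → +Obs) | Latency | Speed | Price /1M | Cost / question |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | sonnet-4.6 | Anthropic | mid | 63.4 | 99.0 | +35.6 | 100% | 100% | 29% → 75% | 7.6s | 56 tok/s | $6.00 | $0.020 |
| 2 | gpt-5.4-nano | OpenAI | light | 12.4 | 98.6 | +86.2 | 100% | 100% | 30% → 100% | 2.0s | 107 tok/s | $0.46 | $0.008 |
| 3 | gpt-5.4-mini | OpenAI | mid | 68.8 | 98.0 | +29.2 | 100% | 100% | 57% → 100% | 1.9s | 96 tok/s | $0.70 | $0.008 |
| 4 | opus-4.8 | Anthropic | advanced | 67.4 | 98.0 | +30.6 | 100% | 100% | 34% → 85% | 6.5s | 75 tok/s | $10.00 | $0.033 |
| 5 | gpt-5.5 | OpenAI | advanced | 19.0 | 96.0 | +77.0 | 100% | 100% | 50% → 100% | 5.1s | 58 tok/s | $11.25 | $0.024 |
| 6 | grok-3-mini | xAI | light | 35.4 | 94.0 | +58.6 | 100% | 100% | 39% → 100% | 3.3s | 165 tok/s | $0.35 | $0.009 |
| 7 | gemini-3.1-pro | Google | advanced | 35.6 | 93.4 | +57.8 | 100% | 100% | 26% → 98% | 8.5s | 119 tok/s | $6.00 | $0.035 |
| 8 | grok-4.3 | xAI | mid | 37.2 | 93.4 | +56.2 | 100% | 100% | 43% → 96% | 3.3s | 162 tok/s | $1.56 | $0.016 |
| 9 | haiku-4.5 | Anthropic | light | 11.4 | 90.2 | +78.8 | 100% | 80% | 26% → 90% | 4.3s | 88 tok/s | $2.00 | $0.022 |
| 10 | grok-4.20-reasoning | xAI | advanced | 43.8 | 89.6 | +45.8 | 90% | 100% | 26% → 97% | 3.0s | 262 tok/s | $6.00 | $0.029 |
| 11 | gemini-3.1-flash-lite | Google | light | 38.4 | 88.4 | +50.0 | 100% | 100% | 18% → 100% | 1.2s | 214 tok/s | $0.18 | $0.013 |
| 12 | gemini-3.5-flash | Google | mid | 36.8 | 87.0 | +50.2 | 100% | 100% | 24% → 93% | 5.9s | 212 tok/s | $3.38 | $0.031 |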

A weak model connected to Obsidian beat a frontier model answering alone in 16 out of 16 head-to-head pairings. Across all twelve models, citation of the correct official source rose to roughly 100% with the data layer (eleven models at 100%, one at 90%), and legal status was 100% correct for 11 of the 12 models, with the remaining model at 80%. Pooled across every answer, grounded claims rose from 32% alone to 93% with Obsidian, and zero claims contradicted the official source, versus 8% alone. The few ungrounded claims that remain are extra context a model added beyond the source, not invented citations.
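The headline figures can be re-derived from the table above. The sketch below copies the two accuracy columns and recomputes the averages, the lift, and the head-to-head count. It assumes "weak" means the light tier and "frontier" means the advanced tier; the article does not define those terms, but four light models against four advanced models does give 16 pairings. The helper names are illustrative.

```python
from statistics import mean

# (model, tier, accuracy alone, accuracy with Obsidian), copied from the table.
ROWS = [
    ("sonnet-4.6", "mid", 63.4, 99.0),
    ("gpt-5.4-nano", "light", 12.4, 98.6),
    ("gpt-5.4-mini", "mid", 68.8, 98.0),
    ("opus-4.8", "advanced", 67.4, 98.0),
    ("gpt-5.5", "advanced", 19.0, 96.0),
    ("grok-3-mini", "light", 35.4, 94.0),
    ("gemini-3.1-pro", "advanced", 35.6, 93.4),
    ("grok-4.3", "mid", 37.2, 93.4),
    ("haiku-4.5", "light", 11.4, 90.2),
    ("grok-4.20-reasoning", "advanced", 43.8, 89.6),
    ("gemini-3.1-flash-lite", "light", 38.4, 88.4),
    ("gemini-3.5-flash", "mid", 36.8, 87.0),
]


def summarize(rows):
    """Return (mean alone, mean connected, lift), each rounded to one decimal."""
    alone = mean(r[2] for r in rows)
    connected = mean(r[3] for r in rows)
    return round(alone, 1), round(connected, 1), round(connected - alone, 1)


def light_vs_advanced(rows):
    """Count pairings where a connected light model beats an advanced model alone."""
    light = [r for r in rows if r[1] == "light"]
    advanced = [r for r in rows if r[1] == "advanced"]
    wins = sum(1 for lo in light for hi in advanced if lo[3] > hi[2])
    return wins, len(light) * len(advanced)


print(summarize(ROWS))          # (39.1, 93.8, 54.7)
print(light_vs_advanced(ROWS))  # (16, 16)
```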

How we measured it

Transparency is the point of a benchmark, so here is the method in full.

  • Twelve models from Anthropic, OpenAI, Google, and xAI, spanning light, mid, and advanced tiers.
  • Five real, current ESG regulatory questions, each in a different jurisdiction (Belgium, France, Italy, the Netherlands, and the United States) and with varied objectives: find the official transposition instrument, state what a rule requires, give the current legal status, and identify changed deadlines.
  • Two conditions per question: the model alone, and the same model with the Obsidian data layer injected as context. Nothing else changes.
  • A blind LLM judge scores each answer against human-verified ground truth, with every question tied to its official tier-0 source. It grades factual correctness, source precision, completeness, citation validity, legal status, and hallucination.
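The two-condition design in the list above can be sketched as a small loop. This is a minimal illustration, not the benchmark's actual harness: `Question`, `build_prompt`, `run_benchmark`, the prompt wording, and the stub model are all hypothetical, and the judging step is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    qid: str
    text: str
    context: str  # verified source text injected only in the connected condition


def build_prompt(question: Question, connected: bool) -> str:
    """Same question in both conditions; only the injected context differs."""
    if not connected:
        return question.text
    return (
        f"Official source material:\n{question.context}\n\n"
        f"Question: {question.text}"
    )


def run_benchmark(models: dict[str, Callable[[str], str]], questions: list[Question]):
    """Ask every model every question twice: alone, then with context."""
    results = []
    for name, ask in models.items():
        for q in questions:
            for connected in (False, True):
                answer = ask(build_prompt(q, connected))
                results.append(
                    {"model": name, "question": q.qid,
                     "connected": connected, "answer": answer}
                )
    return results


# Stub model for illustration; a real run would call a provider API here.
def stub_model(prompt: str) -> str:
    return f"answer based on {len(prompt)} characters of prompt"


demo_questions = [Question("q1", "What is the current legal status?", "Source text.")]
demo = run_benchmark({"stub": stub_model}, demo_questions)
print(len(demo))  # 2: one answer alone, one connected
```

If each model and question pair is run once per condition, twelve models and five questions would produce 120 answers for the judge to score.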

This first run is a five-question pilot, run end to end for $3.01 with zero failed calls. It is deliberately small so we could tune the harness and the grading. A full-scale benchmark with many more questions across more domains is in progress, and we will publish it the same way: every number reproducible, every question tied to its official source.

The grounded-claims figures come from a separate, stricter check: every answer is decomposed into its atomic factual claims and each one is verified against the official source and the ground truth, so the number reflects how much of an answer is actually backed by tier-0 data.

Turn your AI into the model in row one

Connect Obsidian to Claude, ChatGPT, Gemini, or Cursor and every regulatory answer comes back with its official source, date, and legal status. Free tier, two-minute setup.

Explore the Obsidian data layer

What this means for compliance and regulatory teams

If you already work through an AI assistant, this benchmark says something concrete: you do not need a more expensive model, and you do not need to accept guesses. The same assistant you use today, given verified regulatory data, answers with the accuracy of a specialist and the receipts of an auditor. The shift is from "the model probably knows" to "the model looked it up, and here is the official document."

That is the difference between an answer you have to re-check and an answer you can forward. For chemicals, ESG, and life sciences teams tracking change across jurisdictions, it is the difference between AI as a toy and AI as a tool you can stand behind.

If you want the background, we cover why AI hallucinates on regulatory questions, what tier-0 regulatory data actually means, and the broader idea of agentic regulatory intelligence. When you are ready to test it on your own questions, connect the Obsidian regulatory data layer and ask.