If you work in medical-device or pharma regulatory affairs and quality, one detail decides whether an answer is usable: the edition. ISO 13485, ISO 14971, ISO 14155, IEC 62304 and the ICH guidelines all get revised, and citing a superseded edition in a submission or an audit is not a small slip; it is a finding. Ask an AI which edition is current, what the latest good-clinical-practice revision changed, or which guidance covers AI and machine-learning devices, and it answers fluently with the edition it was trained on, which may since have been withdrawn.

The models reason about the standards perfectly well. What fails them is reach: a general model cannot know which edition is in force today. Give it the current text, and it stops guessing.

Obsidian supplies that text, with deep coverage of the global life-sciences standards. We put the models through hundreds of complex tasks across the ISO and IEC medtech standards, the ICH guidelines and the IMDRF guidance, running each task twice: once with the model alone and once with the same model connected to Obsidian.

  • 52 → 96: average regulatory accuracy, the same models alone vs. connected (out of 100)
  • 31% → 94%: share of an answer's factual claims grounded in the official source
  • 97%: connected answers that cited the correct official source

AI is inaccurate for life-sciences regulatory work

Alone, the models averaged 52 out of 100. Connect them to Obsidian and the average climbs to 96. The best pairing, gpt-5.5, reached 97.5. The models did not change between those two numbers. Only the data in front of them did.

[Figure: regulatory accuracy versus price per 1M tokens. Connected to Obsidian (the wider coins), every model converges near the top.]
[Figure: regulatory accuracy versus average response time in seconds. The same convergence appears against response time.]

Life sciences is the sharpest case for a data layer anywhere in this benchmark: editions move constantly, and a model working from memory cites the one it learned rather than the one in force. That single gap is where a submission gets held up. gemini-3.1-flash-lite, at $0.175 per million tokens, climbs from 56 to 97 once connected, into the band of models many times its price. A light-tier model connected to Obsidian beat a frontier model answering alone in 16 of 16 head-to-head pairings on the life-sciences set.
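The 16-of-16 figure can be checked against the results table further down. The sketch below is one way to do it, assuming that "frontier" means the four models the table labels advanced and that a pairing counts as a win when the connected light-tier score is higher than the frontier model's score alone. The variable names are illustrative.

```python
from itertools import product

# Accuracy with Obsidian for the four light-tier models (from the results table).
light_connected = {
    "grok-3-mini": 97.2,
    "gpt-5.4-nano": 97.0,
    "gemini-3.1-flash-lite": 96.8,
    "haiku-4.5": 96.3,
}

# Accuracy alone for the four advanced-tier models.
# Assumption: these are the "frontier" models the text refers to.
advanced_alone = {
    "gpt-5.5": 38.1,
    "grok-4.20-reasoning": 62.6,
    "opus-4.8": 65.3,
    "gemini-3.1-pro": 62.8,
}

# Every light-tier model is paired with every advanced-tier model: 4 x 4 = 16.
pairings = list(product(light_connected, advanced_alone))
wins = sum(light_connected[light] > advanced_alone[adv] for light, adv in pairings)

print(f"{wins} of {len(pairings)} pairings won by the connected light-tier model")
# prints: 16 of 16 pairings won by the connected light-tier model
```

The lowest connected light-tier score (96.3) sits well above the highest advanced-tier score alone (65.3), so every pairing goes the same way.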

AI cannot point you to the official standard

Here the edition is the answer. Connected to Obsidian, a response arrives with the standard, its current edition, the issuing body and a direct link. Alone, you get a plausible citation, often to the wrong edition, that you have to verify yourself. For a notified-body review or a regulatory submission, the difference between the current edition and a withdrawn one is the difference between a defensible file and a finding.
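The four parts of a connected answer (standard, current edition, issuing body, direct link) map naturally onto a small record that can be checked mechanically. The sketch below is illustrative only: the `Citation` fields, the `is_current` helper and the register are hypothetical and are not Obsidian's actual response format, and the standard name, editions and URLs are placeholders rather than real entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical shape of a citation attached to an answer."""
    standard: str
    edition: str
    issuing_body: str
    url: str


def is_current(citation: Citation, register: dict[str, str]) -> bool:
    """True only if the cited edition matches the edition the register lists as in force.

    A standard missing from the register is treated as not verified.
    """
    return register.get(citation.standard) == citation.edition


# Placeholder register: standard -> edition in force. Real entries would have
# to come from the issuing body, not from a model's memory.
register = {"EXAMPLE-STD 100": "2024"}

from_memory = Citation("EXAMPLE-STD 100", "2015", "Example Standards Body",
                       "https://example.org/std-100/2015")
connected = Citation("EXAMPLE-STD 100", "2024", "Example Standards Body",
                     "https://example.org/std-100/2024")

print(is_current(from_memory, register))  # False: superseded edition
print(is_current(connected, register))    # True: edition in force
```

The check is only as good as the register behind it, which is the point of the section: the comparison is trivial once the edition in force is known.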

An answer with the tier-0 source attached is one you can forward to an auditor without re-checking it. That is the difference between a draft a model imagined and an obligation you can act on.

AI hallucinates

We broke every answer into its individual factual claims and checked each against the official source. The gap between the two grounded-claim numbers above, 31% alone versus 94% connected, is the dangerous kind of error being removed, in a field where a single wrong edition undermines a submission. The ungrounded remainder is added context, not invented references.
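As a rough illustration of the per-claim metric, the sketch below computes the grounded share for one answer from a list of claim labels. The labels are made up for the example, and the function name is hypothetical. The benchmark's own claim extraction, and whether it averages per answer or pools all claims, is not described here.

```python
def grounded_share(claims: list[bool]) -> float:
    """Fraction of an answer's atomic claims that trace back to the official source.

    Each entry is True if the claim was found in the official source, else False.
    """
    if not claims:
        raise ValueError("answer has no factual claims to check")
    return sum(claims) / len(claims)


# Made-up labels for one question answered under the two conditions.
alone = [True, False, False, True, False]
connected = [True, True, True, True, False]

print(f"{grounded_share(alone):.0%} -> {grounded_share(connected):.0%}")
# prints: 40% -> 80%
```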

The full data, for the purists

Every model, both conditions. "Alone" is the model with no data layer; "with Obsidian" is the same model connected. Accuracy is a 0 to 100 score from a blind judge against human-verified ground truth. "Grounded claims" is the share of the answer's atomic factual claims that trace back to the official source, alone versus with Obsidian.

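| # | Model | Provider | Tier | Acc. alone | Acc. + Obsidian | Lift | Cites source | Status correct | Grounded claims (alone → +Obs) | Latency | Speed | Price /1M | Cost / question |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.5 | OpenAI | advanced | 38.1 | 97.5 | +59.4 | 96% | 100% | 42% → 98% | 4.73s | 49 tok/s | $11.25 | $0.026259 |
| 2 | grok-3-mini | xAI | light | 46.5 | 97.2 | +50.7 | 99% | 100% | 34% → 94% | 3.21s | 136 tok/s | $0.35 | $0.001342 |
| 3 | gpt-5.4-nano | OpenAI | light | 34.5 | 97.0 | +62.5 | 97% | 100% | 24% → 98% | 1.49s | 88 tok/s | $0.463 | $0.000922 |
| 4 | gemini-3.1-flash-lite | Google | light | 56.2 | 96.8 | +40.6 | 98% | 100% | 30% → 98% | 0.86s | 139 tok/s | $0.175 | $0.000469 |
| 5 | gpt-5.4-mini | OpenAI | mid | 63.1 | 96.7 | +33.6 | 97% | 100% | 38% → 95% | 1.28s | 87 tok/s | $0.7 | $0.001685 |
| 6 | grok-4.20-reasoning | xAI | advanced | 62.6 | 96.6 | +34.0 | 96% | 100% | 30% → 96% | 2.86s | 226 tok/s | $6.0 | $0.021106 |
| 7 | opus-4.8 | Anthropic | advanced | 65.3 | 96.6 | +31.3 | 96% | 100% | 28% → 93% | 5.86s | 69 tok/s | $10.0 | $0.039476 |
| 8 | gemini-3.5-flash | Google | mid | 57.2 | 96.3 | +39.1 | 99% | 100% | 34% → 98% | 3.62s | 183 tok/s | $3.375 | $0.012549 |
| 9 | grok-4.3 | xAI | mid | 44.5 | 96.3 | +51.8 | 98% | 100% | 32% → 97% | 3.21s | 132 tok/s | $1.562 | $0.005775 |
| 10 | haiku-4.5 | Anthropic | light | 32.7 | 96.3 | +63.6 | 98% | 100% | 22% → 90% | 3.98s | 64 tok/s | $2.0 | $0.005482 |
| 11 | sonnet-4.6 | Anthropic | mid | 63.3 | 95.0 | +31.7 | 96% | 100% | 26% → 85% | 9.57s | 42 tok/s | $6.0 | $0.019201 |
| 12 | gemini-3.1-pro | Google | advanced | 62.8 | 94.5 | +31.7 | 92% | 100% | 42% → 98% | 6.25s | 107 tok/s | $6.0 | $0.020789 |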
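The headline averages can be recomputed from the table above. The sketch below assumes they are simple unweighted means over the twelve models. The published method may weight by task instead, so treat this as a sanity check rather than the benchmark's own calculation.

```python
from statistics import mean

# (model, accuracy alone, accuracy with Obsidian), copied from the table above.
results = [
    ("gpt-5.5", 38.1, 97.5),
    ("grok-3-mini", 46.5, 97.2),
    ("gpt-5.4-nano", 34.5, 97.0),
    ("gemini-3.1-flash-lite", 56.2, 96.8),
    ("gpt-5.4-mini", 63.1, 96.7),
    ("grok-4.20-reasoning", 62.6, 96.6),
    ("opus-4.8", 65.3, 96.6),
    ("gemini-3.5-flash", 57.2, 96.3),
    ("grok-4.3", 44.5, 96.3),
    ("haiku-4.5", 32.7, 96.3),
    ("sonnet-4.6", 63.3, 95.0),
    ("gemini-3.1-pro", 62.8, 94.5),
]

alone_avg = mean(alone for _, alone, _ in results)
connected_avg = mean(conn for _, _, conn in results)
best = max(results, key=lambda row: row[2])  # highest connected score

print(round(alone_avg), round(connected_avg), best[0], best[2])
# prints: 52 96 gpt-5.5 97.5
```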

Life sciences shows the largest gap between a model working from memory and one reading the current standard, which is exactly where a maintained data layer earns its place.

How we measured it

  • The full model set from Anthropic, OpenAI, Google and xAI.
  • Hundreds of complex life-sciences tasks across the ISO and IEC medtech standards (quality, risk, clinical, software, biocompatibility), the ICH guidelines and the IMDRF guidance, each tied to its official source and current edition.
  • Two conditions: the model alone, and connected to Obsidian.
  • A blind judge scores each answer; grounded claims come from a separate per-claim check against the official source.

Put the right edition of every standard behind your AI

Connect Obsidian to the AI you already use and every standards answer comes back with the current edition and issuing body. Free tier, two-minute setup.

Explore the Obsidian data layer

What this means

For medical-device and pharma regulatory and quality teams, the result is practical: given verified data, the assistant you already use stops citing withdrawn editions and answers with the standard and edition that are actually in force, so a reviewer can rely on it in a submission. The background is here too: tier-0 regulatory data and agentic regulatory intelligence. The full cross-industry results are in the regulatory AI benchmark. To test it on your own questions, connect the Obsidian regulatory data layer.