If you handle chemicals compliance, the hard part is that nothing stands still. Substances move onto restriction and candidate lists, hazard classifications get revised, the global regulatory position on PFAS keeps shifting, and the international conventions add entries at every meeting. Ask an AI whether a substance is restricted under REACH, what its current GHS classification is, or whether a persistent pollutant has been listed, and the answer comes back confident but often a revision or two out of date, sometimes citing a rule that does not exist.

The models reason about chemicals rules perfectly well. What fails them is reach: a general model cannot open the current restriction list or the latest convention text, and has no way to know which revision is in force. Give it that text, and it stops guessing.

That text is what Obsidian supplies, with deep coverage of the global chemicals regimes. We put the models through hundreds of complex chemicals tasks across REACH, CLP, the UN GHS and the Stockholm, Basel, Rotterdam and Minamata Conventions, running each task twice: once with the model alone and once with the model connected to Obsidian.

53 → 95
Average regulatory accuracy, the same models alone vs. connected (out of 100)
24% → 91%
Share of an answer's factual claims grounded in the official source
96%
Connected answers that cited the correct official source

AI is inaccurate on chemicals regulation

Alone, the models averaged 53 out of 100. Connect them to Obsidian and the average climbs to 95. The top-ranked pairing, gpt-5.4-mini, reached 95.8, with grok-4.3 at the same rounded score. The models did not change between those two numbers. Only the data in front of them did.
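The headline averages can be reproduced from the per-model scores in the results table further down. This is a minimal Python sketch, assuming the headline is an unweighted mean across the twelve models (the article does not state the weighting). The names `SCORES` and `average_accuracy` are illustrative, not part of any Obsidian tooling.

```python
from statistics import mean

# (accuracy alone, accuracy connected), copied from the article's results table.
SCORES = {
    "gpt-5.4-mini": (63.0, 95.8),
    "grok-4.3": (53.1, 95.8),
    "gpt-5.4-nano": (38.3, 95.5),
    "opus-4.8": (58.4, 95.5),
    "gemini-3.1-flash-lite": (56.3, 95.4),
    "gpt-5.5": (40.7, 95.4),
    "grok-4.20-reasoning": (56.4, 95.0),
    "sonnet-4.6": (59.1, 95.0),
    "grok-3-mini": (49.7, 94.8),
    "gemini-3.5-flash": (60.0, 94.4),
    "gemini-3.1-pro": (61.1, 93.9),
    "haiku-4.5": (41.5, 93.6),
}


def average_accuracy(scores):
    """Return the unweighted mean accuracy (alone, connected) across models."""
    alone = mean(a for a, _ in scores.values())
    connected = mean(c for _, c in scores.values())
    return alone, connected


alone_avg, connected_avg = average_accuracy(SCORES)
print(f"{alone_avg:.1f} -> {connected_avg:.1f}")
```

Under that assumption the two means round to the 53 and 95 quoted above.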

[Figure: Regulatory accuracy versus price per 1M tokens. Connected to Obsidian (the wider coins), every model converges near the top.]
[Figure: Regulatory accuracy versus average response time in seconds. The same against response time.]

Chemicals work punishes stale knowledge harder than almost any field: if a restriction status, a hazard classification or a listing changed last quarter, an answer built on last year's revision is simply wrong. That is where the data layer earns its place. gemini-3.1-flash-lite, at $0.175 per million tokens, climbs from 56 to 95 once connected, into the band of models many times its price. A light-tier model connected to Obsidian beat a frontier model answering alone in 16 of 16 head-to-head pairings on the chemicals set.
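The 16-of-16 figure is consistent with the aggregate scores in the results table: four light-tier models against four advanced-tier models gives sixteen pairings. This is a rough check in Python, assuming "frontier" corresponds to the table's "advanced" tier and that aggregate scores stand in for the pairing comparison (the benchmark itself may have compared task by task). `ROWS` and `head_to_head` are illustrative names.

```python
from itertools import product

# (model, tier, accuracy alone, accuracy connected), from the article's results table.
ROWS = [
    ("gpt-5.4-nano", "light", 38.3, 95.5),
    ("gemini-3.1-flash-lite", "light", 56.3, 95.4),
    ("grok-3-mini", "light", 49.7, 94.8),
    ("haiku-4.5", "light", 41.5, 93.6),
    ("opus-4.8", "advanced", 58.4, 95.5),
    ("gpt-5.5", "advanced", 40.7, 95.4),
    ("grok-4.20-reasoning", "advanced", 56.4, 95.0),
    ("gemini-3.1-pro", "advanced", 61.1, 93.9),
]


def head_to_head(rows):
    """Count pairings where a connected light model outscores an advanced model alone."""
    light = [r for r in rows if r[1] == "light"]
    advanced = [r for r in rows if r[1] == "advanced"]
    pairings = list(product(light, advanced))
    # r[3] is the connected score, r[2] is the score when answering alone.
    wins = sum(1 for lo, hi in pairings if lo[3] > hi[2])
    return wins, len(pairings)


wins, total = head_to_head(ROWS)
print(f"{wins} of {total}")
```

The lowest connected light-tier score (93.6) sits well above the highest advanced-tier score alone (61.1), which is why every pairing goes the same way.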

AI cannot point you to the official chemicals source

For a product-stewardship or regulatory-affairs team the citation is the deliverable. Connected to Obsidian, an answer arrives with the regime, the current restriction or listing, the revision in force and a direct link to the official text. Alone, you get a plausible reference to verify yourself, on questions where the status and the revision are the entire answer, and where a wrong call can hold up a product.

An answer with the tier-0 source attached is one you can forward to an auditor without re-checking it. That is the difference between a draft a model imagined and an obligation you can act on.

AI hallucinates

We broke every answer into its individual factual claims and checked each against the official source. The gap between the two grounded-claim numbers above (24% alone, 91% connected) is, for a substance restriction, a classification or a listing, the difference between an answer you can act on and one you re-check line by line. The ungrounded remainder is added context, not invented references.
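The grounded-claim share described above is a simple ratio once each claim has been checked. This is a minimal Python sketch, assuming every atomic claim has already been marked as traced or not traced to the official source. The `Claim` type, the `grounded_share` function and the placeholder claims are hypothetical and carry no benchmark data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    grounded: bool  # True if a checker traced the claim to the official source


def grounded_share(claims: List[Claim]) -> float:
    """Share of an answer's atomic claims that trace back to the official source."""
    if not claims:
        return 0.0  # an answer with no factual claims has nothing to ground
    return sum(c.grounded for c in claims) / len(claims)


# Placeholder claims for illustration only; not taken from any real answer.
example_answer = [
    Claim("claim A", True),
    Claim("claim B", True),
    Claim("claim C", True),
    Claim("claim D", False),
]

share = grounded_share(example_answer)
print(f"{share:.0%} of claims grounded")
```

How individual claims are extracted and matched to source text is the hard part of the measurement, and this sketch does not attempt it.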

The full data, for the purists

Every model, both conditions. "Alone" is the model with no data layer; "with Obsidian" is the same model connected. Accuracy is a 0 to 100 score from a blind judge against human-verified ground truth. "Grounded claims" is the share of the answer's atomic factual claims that trace back to the official source, alone versus with Obsidian.

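| # | Model | Provider | Tier | Acc. alone | Acc. + Obsidian | Lift | Cites source | Status correct | Grounded claims (alone → +Obs) | Latency | Speed | Price /1M | Cost / question |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.4-mini | OpenAI | mid | 63.0 | 95.8 | +32.8 | 96% | 100% | 35% → 97% | 1.14s | 83 tok/s | $0.7 | $0.000486 |
| 2 | grok-4.3 | xAI | mid | 53.1 | 95.8 | +42.7 | 97% | 100% | 29% → 94% | 3.22s | 120 tok/s | $1.562 | $0.002179 |
| 3 | gpt-5.4-nano | OpenAI | light | 38.3 | 95.5 | +57.2 | 95% | 100% | 22% → 96% | 1.21s | 84 tok/s | $0.463 | $0.000302 |
| 4 | opus-4.8 | Anthropic | advanced | 58.4 | 95.5 | +37.1 | 97% | 100% | 20% → 85% | 3.7s | 71 tok/s | $10.0 | $0.013676 |
| 5 | gemini-3.1-flash-lite | Google | light | 56.3 | 95.4 | +39.1 | 93% | 100% | 22% → 98% | 0.74s | 118 tok/s | $0.175 | $0.000125 |
| 6 | gpt-5.5 | OpenAI | advanced | 40.7 | 95.4 | +54.7 | 95% | 100% | 46% → 96% | 4.76s | 33 tok/s | $11.25 | $0.009351 |
| 7 | grok-4.20-reasoning | xAI | advanced | 56.4 | 95.0 | +38.6 | 96% | 100% | 24% → 92% | 2.49s | 225 tok/s | $6.0 | $0.012179 |
| 8 | sonnet-4.6 | Anthropic | mid | 59.1 | 95.0 | +35.9 | 96% | 100% | 21% → 83% | 6.2s | 50 tok/s | $6.0 | $0.007406 |
| 9 | grok-3-mini | xAI | light | 49.7 | 94.8 | +45.1 | 95% | 98% | 32% → 91% | 3.19s | 118 tok/s | $0.35 | $0.000479 |
| 10 | gemini-3.5-flash | Google | mid | 60.0 | 94.4 | +34.4 | 99% | 100% | 22% → 94% | 2.9s | 178 tok/s | $3.375 | $0.006279 |
| 11 | gemini-3.1-pro | Google | advanced | 61.1 | 93.9 | +32.8 | 95% | 100% | 23% → 96% | 5.79s | 111 tok/s | $6.0 | $0.013897 |
| 12 | haiku-4.5 | Anthropic | light | 41.5 | 93.6 | +52.1 | 95% | 100% | 18% → 88% | 1.97s | 87 tok/s | $2.0 | $0.001841 |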

On a domain that punishes stale knowledge, the connected accuracy and the grounded-claim jump are the tests that matter, and the data layer clears both.

How we measured it

  • The full model set from Anthropic, OpenAI, Google and xAI.
  • Hundreds of complex chemicals tasks across REACH, CLP, the UN GHS, the Stockholm, Basel, Rotterdam and Minamata Conventions and the Global Framework on Chemicals, each tied to its official source and current revision.
  • Two conditions: the model alone, and connected to Obsidian.
  • A blind judge scores each answer; grounded claims come from a separate per-claim check against the official source.
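The two-condition, blind-judged setup listed above can be sketched as a small harness. This is an illustration under stated assumptions, not the benchmark's actual code: "blind" is taken to mean the judge never sees which condition produced an answer, and `run_blind_eval`, the task dictionary shape, the stub answer functions and the toy judge are all hypothetical.

```python
import random
from typing import Callable, Dict, List

Task = Dict[str, str]  # hypothetical shape: {"question": ..., "truth": ...}


def run_blind_eval(
    tasks: List[Task],
    conditions: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str], float],
    seed: int = 0,
) -> Dict[str, float]:
    """Average judge score per condition, with condition labels hidden from the judge."""
    if not tasks:
        return {name: 0.0 for name in conditions}
    # Collect (condition, answer, truth) for every task under every condition.
    items = [
        (name, answer_fn(task["question"]), task["truth"])
        for task in tasks
        for name, answer_fn in conditions.items()
    ]
    # Shuffle so the scoring order gives no hint of the condition.
    random.Random(seed).shuffle(items)
    totals = {name: 0.0 for name in conditions}
    for name, answer, truth in items:
        # The judge receives only the answer and the ground truth, never the label.
        totals[name] += judge(answer, truth)
    return {name: totals[name] / len(tasks) for name in conditions}


# Toy stand-ins for illustration only: placeholder tasks, stub models, exact-match judge.
tasks = [
    {"question": "task-1", "truth": "status-A"},
    {"question": "task-2", "truth": "status-A"},
]
conditions = {
    "alone": lambda question: "status-B",
    "connected": lambda question: "status-A",
}
scores = run_blind_eval(
    tasks, conditions, judge=lambda answer, truth: 100.0 if answer == truth else 0.0
)
print(scores)
```

In a real run the stub functions would call the model with and without the data layer, and the judge would score against the human-verified ground truth on the 0 to 100 scale.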

Put the official chemicals source behind every answer

Connect Obsidian to the AI you already use and every REACH, GHS or convention answer comes back with the regime, the current restriction and the revision in force. Free tier, two-minute setup.

Explore the Obsidian data layer

What this means

For chemicals and advanced-materials teams tracking restrictions, listings and classifications across jurisdictions, the takeaway is that the assistant you already use, once given verified data, answers with the official source attached, so a regulatory specialist can act on the answer rather than re-check it. For background, see tier-0 regulatory data and agentic regulatory intelligence. The full cross-industry results are in the regulatory AI benchmark. To test it on your own questions, connect the Obsidian regulatory data layer.