If you handle chemicals compliance, the hard part is that nothing stands still. Substances move onto restriction and candidate lists, hazard classifications get revised, the global approach to PFAS keeps shifting, and the international conventions add entries at every meeting. Ask an AI whether a substance is restricted under REACH, what its current GHS classification is, or whether a persistent pollutant has been listed, and the answer comes back confident but often a revision or two out of date, sometimes citing a rule that does not exist.
The models reason about chemicals rules perfectly well. What fails them is access: a general model cannot open the current restriction list or the latest convention text, and has no way to know which revision is in force. Give it that text, and it stops guessing.
That text is what Obsidian supplies, with deep coverage of the global chemicals regimes. We put the models through hundreds of complex chemicals tasks across REACH, CLP, the UN GHS and the Stockholm, Basel, Rotterdam and Minamata Conventions, running each task twice: once with the model alone and once connected to Obsidian.
AI is inaccurate on chemicals regulation
Alone, the models averaged 53 out of 100. Connect them to Obsidian and the average climbs to 95. The best pairing, gpt-5.4-mini, reached 95.8, level with grok-4.3 at one decimal place. The models did not change between those two averages. Only the data in front of them did.
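The two averages can be re-derived from the per-model scores in the full results table further down. The following is a minimal sketch, with the scores copied by hand from that table; the variable names are illustrative and not part of any Obsidian tooling.

```python
from statistics import mean

# (accuracy alone, accuracy with Obsidian), copied from the full results table
scores = {
    "gpt-5.4-mini": (63.0, 95.8),
    "grok-4.3": (53.1, 95.8),
    "gpt-5.4-nano": (38.3, 95.5),
    "opus-4.8": (58.4, 95.5),
    "gemini-3.1-flash-lite": (56.3, 95.4),
    "gpt-5.5": (40.7, 95.4),
    "grok-4.20-reasoning": (56.4, 95.0),
    "sonnet-4.6": (59.1, 95.0),
    "grok-3-mini": (49.7, 94.8),
    "gemini-3.5-flash": (60.0, 94.4),
    "gemini-3.1-pro": (61.1, 93.9),
    "haiku-4.5": (41.5, 93.6),
}

avg_alone = mean(alone for alone, _ in scores.values())
avg_connected = mean(connected for _, connected in scores.values())

# Highest connected score and every model that reaches it at one decimal place
best = max(connected for _, connected in scores.values())
top_models = sorted(m for m, (_, connected) in scores.items() if connected == best)

print(round(avg_alone), round(avg_connected))  # 53 95
print(best, top_models)  # 95.8 ['gpt-5.4-mini', 'grok-4.3']
```

The averages here are unweighted means over the twelve models, which is an assumption about how the headline figures were produced; they match the rounded values quoted above.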
Chemicals work punishes stale knowledge harder than almost any field: if a restriction status, a hazard classification or a listing changed last quarter, an answer built on last year's revision is simply wrong. That is where the data layer earns its place. gemini-3.1-flash-lite, at $0.175 per million tokens, climbs from 56 to 95 once connected, into the band of models many times its price. A light-tier model connected to Obsidian beat a frontier model answering alone in 16 of 16 head-to-head pairings on the chemicals set.
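The 16-of-16 figure can be checked against the aggregate scores in the full results table. The sketch below assumes that "frontier" corresponds to the "advanced" tier in that table, and that a pairing is one light-tier model against one advanced-tier model (4 × 4 = 16); the original comparison may have been run differently, for example per question.

```python
from itertools import product

# Accuracy with Obsidian for the light-tier models (from the results table)
light_connected = {
    "gpt-5.4-nano": 95.5,
    "gemini-3.1-flash-lite": 95.4,
    "grok-3-mini": 94.8,
    "haiku-4.5": 93.6,
}

# Accuracy alone for the advanced-tier models (assumed to be the "frontier" set)
advanced_alone = {
    "opus-4.8": 58.4,
    "gpt-5.5": 40.7,
    "grok-4.20-reasoning": 56.4,
    "gemini-3.1-pro": 61.1,
}

# Every light-tier model paired with every advanced-tier model
pairings = list(product(light_connected, advanced_alone))
wins = sum(light_connected[light] > advanced_alone[adv] for light, adv in pairings)

print(f"{wins} of {len(pairings)}")  # 16 of 16
```

On these aggregate numbers the result is not close: the lowest connected light-tier score (93.6) sits well above the highest advanced-tier score alone (61.1).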
AI cannot point you to the official chemicals source
For a product-stewardship or regulatory-affairs team the citation is the deliverable. Connected to Obsidian, an answer arrives with the regime, the current restriction or listing, the revision in force and a direct link to the official text. Alone, you get a plausible reference to verify yourself, on questions where the status and the revision are the entire answer, and where a wrong call can hold up a product.
An answer with the tier-0 source attached is one you can forward to an auditor without re-checking it. That is the difference between a draft a model imagined and an obligation you can act on.
AI hallucinates
We broke every answer into its individual factual claims and checked each against the official source. The gap between the two grounded-claim numbers, alone and with Obsidian, in the full table below is, for a substance restriction, a classification or a listing, the difference between an answer you can act on and one you re-check line by line. The ungrounded remainder is added context, not invented references.
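As an illustration of the per-claim check, the grounded-claim share for one answer is the number of claims that trace to the official source divided by the total number of claims. The sketch below is a hypothetical reconstruction: the `Claim` type, the `grounded_share` function and the example claims are assumptions for illustration, not the benchmark's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    grounded: bool  # True if the claim traces back to the official source


def grounded_share(claims: list[Claim]) -> float:
    """Percentage of an answer's atomic claims that are grounded."""
    if not claims:
        return 0.0  # an answer with no factual claims has nothing to ground
    return 100.0 * sum(c.grounded for c in claims) / len(claims)


# Hypothetical answer split into four atomic claims, three of them traceable
answer = [
    Claim("Substance X is listed under the convention", True),
    Claim("The listing took effect in the current revision", True),
    Claim("An exemption applies to a named use", True),
    Claim("Industry expects further amendments", False),
]

print(grounded_share(answer))  # 75.0
```

How the per-answer shares are combined into the per-model percentages is not stated here; a simple mean over answers is one plausible choice.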
The full data, for the purists
Every model, both conditions. "Alone" is the model with no data layer; "with Obsidian" is the same model connected. Accuracy is a 0 to 100 score from a blind judge against human-verified ground truth. "Grounded claims" is the share of the answer's atomic factual claims that trace back to the official source, alone versus with Obsidian.
| # | Model | Provider | Tier | Acc. alone | Acc. + Obsidian | Lift | Cites source | Status correct | Grounded claims (alone → + Obsidian) | Latency | Speed | Price / 1M tokens | Cost / question |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.4-mini | OpenAI | mid | 63.0 | 95.8 | +32.8 | 96% | 100% | 35% → 97% | 1.14s | 83 tok/s | $0.7 | $0.000486 |
| 2 | grok-4.3 | xAI | mid | 53.1 | 95.8 | +42.7 | 97% | 100% | 29% → 94% | 3.22s | 120 tok/s | $1.562 | $0.002179 |
| 3 | gpt-5.4-nano | OpenAI | light | 38.3 | 95.5 | +57.2 | 95% | 100% | 22% → 96% | 1.21s | 84 tok/s | $0.463 | $0.000302 |
| 4 | opus-4.8 | Anthropic | advanced | 58.4 | 95.5 | +37.1 | 97% | 100% | 20% → 85% | 3.7s | 71 tok/s | $10.0 | $0.013676 |
| 5 | gemini-3.1-flash-lite | Google | light | 56.3 | 95.4 | +39.1 | 93% | 100% | 22% → 98% | 0.74s | 118 tok/s | $0.175 | $0.000125 |
| 6 | gpt-5.5 | OpenAI | advanced | 40.7 | 95.4 | +54.7 | 95% | 100% | 46% → 96% | 4.76s | 33 tok/s | $11.25 | $0.009351 |
| 7 | grok-4.20-reasoning | xAI | advanced | 56.4 | 95.0 | +38.6 | 96% | 100% | 24% → 92% | 2.49s | 225 tok/s | $6.0 | $0.012179 |
| 8 | sonnet-4.6 | Anthropic | mid | 59.1 | 95.0 | +35.9 | 96% | 100% | 21% → 83% | 6.2s | 50 tok/s | $6.0 | $0.007406 |
| 9 | grok-3-mini | xAI | light | 49.7 | 94.8 | +45.1 | 95% | 98% | 32% → 91% | 3.19s | 118 tok/s | $0.35 | $0.000479 |
| 10 | gemini-3.5-flash | Google | mid | 60.0 | 94.4 | +34.4 | 99% | 100% | 22% → 94% | 2.9s | 178 tok/s | $3.375 | $0.006279 |
| 11 | gemini-3.1-pro | Google | advanced | 61.1 | 93.9 | +32.8 | 95% | 100% | 23% → 96% | 5.79s | 111 tok/s | $6.0 | $0.013897 |
| 12 | haiku-4.5 | Anthropic | light | 41.5 | 93.6 | +52.1 | 95% | 100% | 18% → 88% | 1.97s | 87 tok/s | $2.0 | $0.001841 |
On a domain that punishes stale knowledge, the connected accuracy and the grounded-claim jump are the tests that matter, and the data layer clears both.
How we measured it
- The full model set from Anthropic, OpenAI, Google and xAI.
- Hundreds of complex chemicals tasks across REACH, CLP, the UN GHS, the Stockholm, Basel, Rotterdam and Minamata Conventions and the Global Framework on Chemicals, each tied to its official source and current revision.
- Two conditions: the model alone, and connected to Obsidian.
- A blind judge scores each answer; grounded claims come from a separate per-claim check against the official source.
Put the official chemicals source behind every answer
Connect Obsidian to the AI you already use and every REACH, GHS or convention answer comes back with the regime, the current restriction and the revision in force. Free tier, two-minute setup.
Explore the Obsidian data layer
What this means
For chemicals and advanced-materials teams tracking restrictions, listings and classifications across jurisdictions, the point is simple: given verified data, the assistant you already use answers with the official source attached, so a regulatory specialist can act on the answer rather than re-check it. For background, see tier-0 regulatory data and agentic regulatory intelligence. The full cross-industry results are in the regulatory AI benchmark. To test it on your own questions, connect the Obsidian regulatory data layer.