Research

Our research. Your agent's education.

Three research-backed metrics measuring the failure modes that matter most: hallucination, score inflation, and sycophancy. Validated across 8 frontier models, 360 evaluations. Submitted to a top-tier ML conference (under review).

Test your agent →

The Inverted Bloom

For AI, knowing is harder than thinking.

Traditional education follows Bloom's taxonomy: first remember, then understand, then create. AI inverts this completely. Your agent can write excellent code (Create, L6) but might not know who your top client is (Remember, L1).

The gap on pure knowledge tasks:

d = 2.36

One of the largest effect sizes measured in AI evaluation.
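For readers unfamiliar with the statistic: Cohen's d expresses the difference between two group means in units of their pooled standard deviation, and 0.8 is conventionally already called "large". The sketch below shows the standard pooled-variance form. The page does not state which variant of d was used or which groups were compared, so the function name and the sample scores are illustrative assumptions, not study data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances, n - 1)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical task scores, NOT data from the study.
grounded = [0.90, 0.80, 0.85, 0.95]
vanilla = [0.50, 0.60, 0.55, 0.45]
d = cohens_d(grounded, vanilla)
print(f"d = {d:.2f}")
```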

This is why your agent sounds smart but gets your business wrong.

Metrics under peer review

Six metrics. One trust function.

Trust = F(Alignment, Reliability)

Three alignment metrics + three reliability metrics. Both inputs measured. Both inputs required.

ALIGNMENT — does the agent want the right things?

KRG

Knowledge-Reasoning Gradient

How big is the gap between your agent's ability to reason and its knowledge of your business?

+85%

Grounded agents score 85% higher on novel domain tasks.

d = 0.04 → 2.36 · ρ = 0.80 · 8 models · p < 10⁻¹⁰
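The ρ reported here is a Spearman rank correlation. As a rough illustration of what that statistic measures, the sketch below ranks two lists (ties share the average rank) and takes the Pearson correlation of the ranks. Which two quantities the study correlates is not stated on this page, so the inputs are hypothetical and the helper names are assumptions.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical paired observations, NOT data from the study.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"rho = {rho:.2f}")
```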

JIS

Honest Scoring

When your agent evaluates something, how much does it inflate the score?

1.45–2.64×

Untrained (vanilla) agents inflate scores by this factor. In one case, DeepSeek rated a workshop that merited 31% at 83%.

After training: JIS = 1.0
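One plausible reading of the inflation factor is the agent's score divided by a reference score, so that 1.0 means no inflation. The page does not give the exact JIS formula, so the function below is an assumption made for illustration; only the 31% and 83% figures come from the text.

```python
def inflation_ratio(agent_score, reference_score):
    """Ratio of the agent's score to the reference score (assumed definition)."""
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return agent_score / reference_score


# Figures from the workshop example on this page: reference 31%, agent 83%.
ratio = inflation_ratio(83, 31)
print(f"{ratio:.2f}x")
```

Under this reading the workshop example comes out near 2.7×, so the published 1.45–2.64× range may be aggregated differently, for example per model rather than per item.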

DEP

Divergence under External Pressure

When you're wrong, does your agent tell you — or agree to keep the peace?

74% → 26%

Share of responses that flip under pressure, before and after training.

p < 0.0001
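A flip rate of this kind can be sketched as the fraction of items where the agent's answer changes after the user pushes back. The exact DEP protocol is not described on this page, so the function name, the input format, and the sample answers below are assumptions.

```python
def flip_rate(initial_answers, answers_after_pushback):
    """Fraction of items whose answer changed after the user pushed back."""
    if len(initial_answers) != len(answers_after_pushback):
        raise ValueError("answer lists must have the same length")
    if not initial_answers:
        raise ValueError("at least one item is required")
    flips = sum(
        before != after
        for before, after in zip(initial_answers, answers_after_pushback)
    )
    return flips / len(initial_answers)


# Hypothetical answers to four questions, before and after pushback.
before = ["A", "B", "C", "D"]
after = ["A", "X", "C", "D"]
rate = flip_rate(before, after)  # one of four answers changed
print(f"{rate:.0%} of responses flipped")
```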

RELIABILITY — does the agent consistently act on what it wants?

AAL

Agent Assurance Level

Assurance class modeled on aviation's DO-178C levels (A to E).

AAL-C+

Today. Target: AAL-B by Q4 2026.

AAL-A: no catastrophic misalignment tolerated; AAL-E: minor risk allowed

FMC

Failure-Mode Coverage

Share of the 12-layer transition-graph failure modes covered by named guardrails.

82%

Today. Target: 95% on AAL-B scope.

Named guardrails or sentries
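Coverage in this sense is a set calculation: the failure modes addressed by at least one named guardrail, divided by all catalogued failure modes. The sketch below assumes a simple mapping from guardrail names to the failure modes each one covers; the identifiers are hypothetical and do not come from the 12-layer graph itself.

```python
def failure_mode_coverage(failure_modes, guardrails):
    """Return (coverage fraction, sorted list of uncovered failure modes).

    failure_modes: set of failure-mode ids.
    guardrails: dict mapping guardrail name -> set of failure-mode ids it covers.
    """
    if not failure_modes:
        raise ValueError("failure_modes must not be empty")
    # Ignore guardrail targets that are not in the catalogue.
    covered = set().union(*guardrails.values()) & failure_modes
    uncovered = sorted(failure_modes - covered)
    return len(covered) / len(failure_modes), uncovered


# Hypothetical catalogue and guardrails, for illustration only.
modes = {"fm-01", "fm-02", "fm-03", "fm-04", "fm-05"}
rails = {
    "input-sentry": {"fm-01", "fm-02"},
    "output-sentry": {"fm-02", "fm-03", "fm-04"},
}
coverage, gaps = failure_mode_coverage(modes, rails)
print(f"{coverage:.0%} covered, gaps: {gaps}")
```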

IST

Instinct Stability

Behavioral drift across the Markov instinct lifecycle. Measured as week-over-week Δ in axis scores.

<5%

drift/week. Target: 0% regressions on shipped guardrails.

Probationary → confirmed → habitual
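Week-over-week drift can be sketched as the relative change in each axis score between two consecutive weeks, flagged against a threshold. Whether the page's Δ is relative or absolute is not stated; the relative form, the 5% threshold parameter, and the axis names below are assumptions.

```python
def weekly_drift(last_week, this_week):
    """Relative change per axis between two weekly score snapshots."""
    return {
        axis: (this_week[axis] - last_week[axis]) / last_week[axis]
        for axis in last_week
    }


def axes_over_threshold(drift, threshold=0.05):
    """Axes whose absolute drift exceeds the threshold, in sorted order."""
    return sorted(axis for axis, delta in drift.items() if abs(delta) > threshold)


# Hypothetical axis scores for two consecutive weeks.
last_week = {"accuracy": 80.0, "honesty": 50.0}
this_week = {"accuracy": 82.0, "honesty": 45.0}
drift = weekly_drift(last_week, this_week)
flagged = axes_over_threshold(drift)
print(drift, flagged)
```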

Methodology you can reproduce.

• 360 evaluations across 8 frontier models
• 45 sub-items across 10 Bloom-level tasks
• Spearman ρ = 0.80, p < 0.001
• 94.9% inter-judge agreement (dual-judge protocol)
• Domain transfer validated (Singapore Companies Act, WHO Hygiene Protocol)
• Submitted to a top-tier ML conference (under review)

All data, scripts, and scoring rubrics are available for reproduction.
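When reproducing the inter-judge agreement figure, a simple starting point is the share of items on which two judges give matching scores. The dual-judge protocol's actual matching rule is not given here, so the tolerance parameter and the sample scores below are assumptions.

```python
def agreement_rate(scores_a, scores_b, tolerance=0.0):
    """Fraction of items where two judges' scores differ by at most `tolerance`."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    agreed = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return agreed / len(scores_a)


# Hypothetical scores from two judges on five sub-items.
judge_a = [3, 4, 2, 5, 1]
judge_b = [3, 4, 3, 5, 1]
exact = agreement_rate(judge_a, judge_b)                 # exact matches only
lenient = agreement_rate(judge_a, judge_b, tolerance=1)  # within one point
print(f"exact: {exact:.0%}, within one point: {lenient:.0%}")
```

Raw percent agreement does not correct for chance agreement, so a reproduction may also want a chance-corrected statistic alongside it.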

View methodology →

See these metrics applied to your agent.

The Alignment Test uses KRG, JIS, and DEP to score your agent across 4 axes. 2 minutes. No installation.

Test Your Agent →