Research
Our research. Your agent's education.
Three research-backed metrics measuring the failure modes that matter most: hallucination, score inflation, and sycophancy. Validated across 8 frontier models, 360 evaluations. Submitted to a top-tier ML conference (under review).
Test your agent →
The Inverted Bloom
For AI, knowing is harder than thinking.
Traditional education follows Bloom's taxonomy: first remember, then understand, then create. AI inverts this completely. Your agent can write excellent code (Create, L6) but might not know who your top client is (Remember, L1).
The gap on pure knowledge tasks:
d = 2.36
One of the largest effect sizes measured in AI evaluation.
This is why your agent sounds smart but gets your business wrong.
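For readers who want to see how an effect size like the one above is typically computed, here is a minimal sketch. It assumes d is Cohen's d with a pooled standard deviation, which this page does not state explicitly. The function name, group names, and scores are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation.

    Assumption: this is the common pooled-variance form of Cohen's d;
    the study may use a different variant.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical task scores for two conditions (illustration only).
condition_a = [0.90, 0.80, 0.85, 0.95]
condition_b = [0.50, 0.60, 0.40, 0.55]
d = cohens_d(condition_a, condition_b)
```

By convention, d near 0.2 is read as a small effect and d near 0.8 as a large one, which is why a value above 2 stands out.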
Research-backed metrics
Six metrics. One trust function.
Trust = F(Alignment, Reliability)
Three alignment metrics + three reliability metrics. Both inputs measured. Both inputs required.
ALIGNMENT — does the agent want the right things?
Knowledge-Reasoning Gradient
How big is the gap between your agent's ability to think and to know your business?
+85%
Grounded agents score 85% higher on novel domain tasks.
d = 0.04 → 2.36 · ρ = 0.80 · 8 models · p < 10⁻¹⁰
Honest Scoring
When your agent evaluates something, how much does it inflate the score?
1.45–2.64×
Untrained agents inflate scores by this factor. DeepSeek gave 83% to a workshop that scored 31%.
After training: JIS = 1.0
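One way to read the inflation figures above is as a ratio of the agent's score to a reference score, where 1.0 means no inflation. The sketch below illustrates that reading only. The exact definition of JIS is not given on this page, so the formula, the function name, and the example numbers are assumptions.

```python
def inflation_ratio(agent_score, reference_score):
    """Ratio of the agent's score to a trusted reference score.

    Assumption: inflation is a simple ratio; 1.0 means the agent
    matched the reference, values above 1.0 mean it scored too high.
    """
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return agent_score / reference_score


# Hypothetical example: the agent awards 60 points, the reference is 40.
ratio = inflation_ratio(60, 40)  # 1.5, i.e. 50% inflation
```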
Divergence under External Pressure
When you're wrong, does your agent tell you — or agree to keep the peace?
74% → 26%
of responses flip under pressure, before and after training.
p < 0.0001
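A flip rate of this kind can be sketched as the share of answers that change once the user pushes back. The snippet below is an illustration under that assumption. How the study decides that a response has flipped is not described here, and the function name and sample answers are hypothetical.

```python
def flip_rate(initial_answers, answers_under_pressure):
    """Fraction of answers that differ after the user pushes back.

    Assumption: a flip is any change between the paired answers;
    a real protocol would likely use a judge rather than string equality.
    """
    if len(initial_answers) != len(answers_under_pressure):
        raise ValueError("answer lists must have the same length")
    if not initial_answers:
        raise ValueError("at least one answer pair is required")
    flips = sum(
        1
        for before, after in zip(initial_answers, answers_under_pressure)
        if before != after
    )
    return flips / len(initial_answers)


# Hypothetical answers to four questions, before and after pushback.
before = ["A", "B", "C", "D"]
after = ["A", "X", "C", "Y"]
rate = flip_rate(before, after)  # 2 of 4 answers changed
```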
RELIABILITY — does the agent consistently act on what it wants?
Agent Assurance Level
Aviation DO-178C–equivalent assurance class.
AAL-C+
Today. Target: AAL-B by Q4 2026.
AAL-A: no catastrophic misalignment tolerated; AAL-E: minor risk allowed
Failure-Mode Coverage
Share of the 12-layer transition-graph failure modes covered by named guardrails.
82%
Today. Target: 95% on AAL-B scope.
Named guardrails or sentries
Instinct Stability
Behavioral drift across the Markov instinct lifecycle. Measured as week-over-week Δ in axis scores.
<5%
drift/week. Target: 0% regressions on shipped guardrails.
Probationary → confirmed → habitual
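Week-over-week drift can be sketched as the change in each axis score between two weekly snapshots. The example below assumes drift is the relative change per axis, which this page does not specify. The axis names, scores, and the `weekly_drift` helper are hypothetical.

```python
def weekly_drift(last_week, this_week):
    """Relative absolute change per axis between two weekly snapshots.

    Assumption: drift is |new - old| / old for each axis; the actual
    metric may use absolute deltas or a different normalization.
    """
    drift = {}
    for axis, old in last_week.items():
        if old == 0:
            raise ValueError(f"cannot compute relative drift for {axis!r}: score is 0")
        drift[axis] = abs(this_week[axis] - old) / old
    return drift


# Hypothetical axis scores on a 0-100 scale.
last_week = {"axis_1": 80, "axis_2": 50}
this_week = {"axis_1": 82, "axis_2": 50}
drift = weekly_drift(last_week, this_week)

# Check every axis against a 5% weekly threshold.
within_threshold = all(value < 0.05 for value in drift.values())
```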
Methodology you can reproduce.
- 360 evaluations across 8 frontier models
- 45 sub-items across 10 Bloom-level tasks
- Spearman ρ = 0.80, p < 0.001
- 94.9% inter-judge agreement (dual-judge protocol)
- Domain transfer validated (Singapore Companies Act, WHO Hygiene Protocol)
- Submitted to a top-tier ML conference (under review)
All data, scripts, and scoring rubrics are available for reproduction.
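As an illustration of the dual-judge figure in the list above, inter-judge agreement can be sketched as the share of items on which two judges give matching scores. This is an assumption about the protocol, not a description of it. The tolerance parameter, function name, and sample scores are hypothetical.

```python
def percent_agreement(judge_a, judge_b, tolerance=0):
    """Share of items where two judges' scores differ by at most `tolerance`.

    Assumption: agreement is simple percent agreement; chance-corrected
    statistics such as Cohen's kappa are a common alternative.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must score the same items")
    if not judge_a:
        raise ValueError("at least one scored item is required")
    matches = sum(1 for a, b in zip(judge_a, judge_b) if abs(a - b) <= tolerance)
    return matches / len(judge_a)


# Hypothetical rubric scores from two judges on four items.
scores_a = [3, 4, 5, 2]
scores_b = [3, 4, 4, 2]
exact = percent_agreement(scores_a, scores_b)                # 3 of 4 match
lenient = percent_agreement(scores_a, scores_b, tolerance=1)  # all within 1 point
```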
View methodology →
See these metrics applied to your agent.
The Alignment Test uses KRG (Knowledge-Reasoning Gradient), JIS (Honest Scoring), and DEP (Divergence under External Pressure) to score your agent across four axes. It takes 2 minutes. No installation.
Test Your Agent →