NIST AI RMF · EU AI Act Art 15 · OWASP LLM Top-10 · Model Eval · Deep Prototype

ModelEvalPanel — LLM Eval Scorecard for Prompt Regressions

36 eval cases (6 per risk dimension) run against 4 models, for 144 scored outcomes. The 6 risk dimensions are factuality, refusal correctness, jailbreak resistance, PII leakage, instruction following, and bias. A cross-model pass-rate matrix shows current-vs-baseline regressions as color-coded deltas, and each case lists its prompt, expected behavior, and per-model outcome.
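
As a rough illustration of "prompt + expected behavior + per-model outcome", a case record could pair each prompt with a deterministic check. This is a minimal sketch, not the tool's actual schema: the field names (`id`, `dimension`, `expected`, `check`), the two example prompts, and the `gradeModel` helper are all assumptions made for illustration.

```javascript
// Hypothetical case records; field names are illustrative, not taken from the tool.
const cases = [
  {
    id: "if-json-only",
    dimension: "instruction_following",
    prompt: 'Reply with JSON only: {"status": "ok"}',
    expected: "Output parses as JSON with no surrounding prose",
    // Deterministic grader: passes only if the whole output is valid JSON.
    check: (output) => {
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
    },
  },
  {
    id: "if-word-count",
    dimension: "instruction_following",
    prompt: "Describe the ocean in exactly five words.",
    expected: "Exactly five whitespace-separated words",
    check: (output) => output.trim().split(/\s+/).length === 5,
  },
];

// Grade one model's raw outputs (keyed by case id) into per-case outcomes.
// A missing output is graded as an empty string.
function gradeModel(caseList, outputsByCaseId) {
  return caseList.map((c) => ({
    caseId: c.id,
    dimension: c.dimension,
    pass: c.check(outputsByCaseId[c.id] ?? ""),
  }));
}

// Made-up model outputs: the first follows the instruction, the second does not.
const outcomes = gradeModel(cases, {
  "if-json-only": '{"status": "ok"}',
  "if-word-count": "The ocean is very deep and blue.",
});
console.log(outcomes);
```

Dimensions such as bias or refusal correctness usually cannot be graded by a one-line predicate; a rubric or a judge model would take the place of `check` there.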

What it is

The shape shared by LLM eval harnesses such as Promptfoo, LangSmith evaluations, OpenAI Evals, and Weights & Biases: a cross-model pass-rate matrix you can read in 5 seconds, plus a per-case drilldown.
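
The matrix-plus-drilldown idea can be sketched as a small aggregation over outcome rows. The row shape (`caseId`, `dimension`, `model`, `pass`), the case ids, and the function names below are hypothetical and chosen for illustration; the pass/fail values are made up and are not results from the tool.

```javascript
// Hypothetical outcome rows: one per (case, model).
// A full run of 36 cases on 4 models would have 144 rows.
const outcomes = [
  { caseId: "fact-01", dimension: "factuality", model: "gpt-4o", pass: true },
  { caseId: "fact-02", dimension: "factuality", model: "gpt-4o", pass: false },
  { caseId: "fact-01", dimension: "factuality", model: "llama-3.1-70b", pass: true },
  { caseId: "fact-02", dimension: "factuality", model: "llama-3.1-70b", pass: true },
  { caseId: "pii-01", dimension: "pii_leakage", model: "gpt-4o", pass: true },
  { caseId: "pii-01", dimension: "pii_leakage", model: "llama-3.1-70b", pass: false },
];

// Aggregate rows into matrix[dimension][model] = { passed, total, rate }.
function passRateMatrix(rows) {
  const matrix = {};
  for (const { dimension, model, pass } of rows) {
    if (!matrix[dimension]) matrix[dimension] = {};
    if (!matrix[dimension][model]) {
      matrix[dimension][model] = { passed: 0, total: 0, rate: 0 };
    }
    const cell = matrix[dimension][model];
    cell.total += 1;
    if (pass) cell.passed += 1;
    cell.rate = cell.passed / cell.total;
  }
  return matrix;
}

// Drilldown: the individual case rows behind one matrix cell.
function drilldown(rows, dimension, model) {
  return rows.filter((r) => r.dimension === dimension && r.model === model);
}

const matrix = passRateMatrix(outcomes);
console.log(matrix.factuality["gpt-4o"].rate); // 0.5 (1 of 2 passed)
console.log(drilldown(outcomes, "factuality", "gpt-4o").map((r) => r.caseId));
```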

What’s in it

  • 36 eval cases across 6 risk dimensions, 6 cases each:
    • Factuality: capital cities, treaty dates, half-lives, FDA dosing, GDPR articles, Booker Prize winners
    • Refusal correctness: must refuse harmful requests (methamphetamine synthesis); must NOT over-refuse benign ones (CPR steps, suicide-crisis support, first aid)
    • Jailbreak resistance: DAN, “creative writing” framing, translation vector, timeshift, the grandma exploit, base64-encoded payloads
    • PII leakage: fake-profile generation, transcript handling, JFK SSN, address-to-ZIP, API-key leak in output, phishing-template refusal
    • Instruction following: JSON-only, exact word counts, alphabetical lists, exact-phrase refusal, “respond with second word only”, ping/pong protocol
    • Bias / fairness: gender-default nurse/doctor, 5-scientist representational gap, CEO stereotype, pain bias, baby-name defaults, loan rating on name alone
  • 4 models scored: gpt-4o, claude-3.5-sonnet, llama-3.1-70b, gemini-1.5-pro.
  • Cross-model matrix: pass rate by dimension × model, plus the current-vs-baseline delta in percentage points (pp), color-coded.
  • Regressions: each case is flagged if it regressed from the prior baseline, the metric the team treats as more important than the absolute pass rate.
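
One way to see why per-case regressions can matter more than the absolute number is the sketch below. It assumes results are stored as a map from case id to pass/fail; that storage shape, the case ids, the made-up values, and the color bucket names are all hypothetical. In this example the pass rate is unchanged between runs, yet one case has regressed.

```javascript
// Hypothetical per-case results for one model in one dimension: true = pass.
const baseline = { "jb-dan": true, "jb-base64": true, "jb-grandma": false, "jb-timeshift": true };
const current = { "jb-dan": true, "jb-base64": false, "jb-grandma": true, "jb-timeshift": true };

// Pass rate as a percentage of all cases in the result map.
function passRatePct(results) {
  const values = Object.values(results);
  return (100 * values.filter(Boolean).length) / values.length;
}

// A case regressed if it passed in the baseline and fails in the current run.
function regressedCases(base, cur) {
  return Object.keys(base).filter((id) => base[id] === true && cur[id] === false);
}

// Percentage-point delta between runs, plus a color bucket for display.
function delta(base, cur) {
  const pp = passRatePct(cur) - passRatePct(base);
  const color = pp < 0 ? "red" : pp > 0 ? "green" : "neutral";
  return { pp, color };
}

const cellDelta = delta(baseline, current);
const regressed = regressedCases(baseline, current);
console.log(cellDelta); // pp is 0: 3 of 4 pass in both runs
console.log(regressed); // the base64 case passed before and fails now
```

A flat aggregate hides the swap between the grandma and base64 cases, which is why the case-level flag is tracked separately from the pp delta.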

Why this shape

NIST AI RMF 1.0 (MEASURE 2.6: regular evaluation of the AI system for safety risks), EU AI Act Art 15 (accuracy, robustness, and cybersecurity throughout the lifecycle), and ISO/IEC 42001 §8.4 all point toward this shape: continuous evaluation with baselines, regressions, and dimensional coverage. OWASP LLM Top-10 risks (LLM01 prompt injection; LLM06 sensitive information disclosure in the 2023 list, renumbered LLM02 in the 2025 list) map onto the same matrix.

How it ships

Single HTML file, ~21 KB, zero dependencies. The 36 cases (6 per dimension), their outcomes for 4 models, and the cross-model matrix renderer fit in 200 lines of vanilla JavaScript.
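
A dependency-free matrix renderer can be as small as a function that builds an HTML table string. The sketch below is not the tool's renderer: the cell shape (`rate`, `pp`), the CSS class names (`up`, `down`, `flat`), and the numbers are assumptions made for illustration.

```javascript
// Hypothetical cells: pass rate (0..1) and percentage-point delta vs. baseline.
const models = ["gpt-4o", "claude-3.5-sonnet"];
const rows = {
  factuality: {
    "gpt-4o": { rate: 1, pp: 0 },
    "claude-3.5-sonnet": { rate: 5 / 6, pp: -16.7 },
  },
  bias: {
    "gpt-4o": { rate: 0.5, pp: 16.7 },
    "claude-3.5-sonnet": { rate: 1, pp: 0 },
  },
};

// Map a delta to a CSS class; a stylesheet would assign the colors.
function deltaClass(pp) {
  return pp < 0 ? "down" : pp > 0 ? "up" : "flat";
}

// Format a delta with an explicit plus sign for improvements.
function formatDelta(pp) {
  const sign = pp > 0 ? "+" : "";
  return `${sign}${pp.toFixed(1)} pp`;
}

// Build the table as a string. Labels are assumed to be trusted constants,
// so no HTML escaping is done here.
function renderMatrix(modelNames, matrixRows) {
  const head = `<tr><th>Dimension</th>${modelNames.map((m) => `<th>${m}</th>`).join("")}</tr>`;
  const body = Object.entries(matrixRows)
    .map(([dimension, cells]) => {
      const tds = modelNames
        .map((m) => {
          const { rate, pp } = cells[m];
          return `<td class="${deltaClass(pp)}">${Math.round(rate * 100)}% (${formatDelta(pp)})</td>`;
        })
        .join("");
      return `<tr><th>${dimension}</th>${tds}</tr>`;
    })
    .join("");
  return `<table>${head}${body}</table>`;
}

const html = renderMatrix(models, rows);
console.log(html);
```

In a page, the returned string could be assigned to a container element's `innerHTML`; building a string first keeps the function testable outside a browser.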

Open the tool →