
Benchmark + Monitoring (SPEC-07)

Two complementary deliverables:

  • Benchmark harness — offline / pre-deploy / regression-gate testing of retrieval quality
  • Azure Monitor — online / production / real-time observability

Benchmark harness

The 5 strategies

| Strategy | Purpose |
| --- | --- |
| `bm25_only` | Pre-SPEC-05 baseline path |
| `graph_only` | Pure CKG — tests the "graph can't answer T1" claim |
| `hybrid` | SPEC-05 classifier → best route (production path) |
| `hybrid_forced_graph` | Bypass classifier → GRAPH (exposes classifier under-routing to graph) |
| `hybrid_forced_text` | Bypass classifier → TEXT (exposes classifier over-routing to text) |

Metrics per query-type × strategy

  • F1 — token-level precision + recall vs gold answer
  • EM — exact match
  • HR — hit rate (gold entity id in top-K)
  • RDS — retrieval decision score (composite quality)
  • Latency — p50, p95, p99
  • Tokens — serializer output size

Each with 95% CI via bootstrap (seed=42, 1000 resamples).
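
A minimal sketch of how the F1 metric and its bootstrap CI could be computed, assuming whitespace tokenization and a percentile bootstrap (function names are illustrative, not the harness's real API):

```python
import random

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer (whitespace tokens)."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    # Count of tokens shared between prediction and gold (with multiplicity).
    common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(pred_toks))
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(scores: list[float], n_resamples: int = 1000,
                 seed: int = 42, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile-bootstrap 95% CI of the mean over per-query scores."""
    rng = random.Random(seed)  # seed=42 for reproducible CIs across runs
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]
```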

2 regression gates

  • Gate-1 — hybrid F1 ≥ max(other strategies) per query type (ensures hybrid dominates)
  • Gate-2 — HR invariants (no strategy has HR=0 on a type where others do)

Gate failure → non-zero exit code → CI fails → compliance-checklist.yml fatal probe.
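
The two gates can be sketched as predicate functions feeding the CLI exit code. The dict shapes, function names, and the exact Gate-2 reading are assumptions for illustration, not the harness's real implementation:

```python
def gate1_hybrid_dominates(f1: dict[str, dict[str, float]]) -> bool:
    """Gate-1: for every query type, hybrid F1 >= the best other strategy's F1.

    `f1` maps strategy -> {query_type: F1}.
    """
    for qt in f1["hybrid"]:
        best_other = max(scores[qt] for s, scores in f1.items() if s != "hybrid")
        if f1["hybrid"][qt] < best_other:
            return False
    return True

def gate2_hr_invariant(hr: dict[str, dict[str, float]]) -> bool:
    """Gate-2 (one literal reading): no strategy scores HR=0 on a query type
    where another strategy scores HR>0."""
    for qt in next(iter(hr.values())):
        values = [scores[qt] for scores in hr.values()]
        if any(v == 0.0 for v in values) and any(v > 0.0 for v in values):
            return False
    return True

def gate_exit_code(f1, hr) -> int:
    """0 = gates pass, 1 = gate failure (matching the CLI's exit-code contract)."""
    return 0 if gate1_hybrid_dominates(f1) and gate2_hr_invariant(hr) else 1
```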

CLI

```shell
cd apps/research/benchmark

# Smoke test (tiny dataset)
python3 -m benchmark.cli --smoke

# Full run
python3 -m benchmark.cli --out /tmp/benchmark-results.md

# Single strategy
python3 -m benchmark.cli --strategy hybrid --out /tmp/hybrid-only.md

# Exit codes: 0 = gates pass · 1 = gate fail · 2 = run error
```

Rate-limit + runtime

250 queries × 5 strategies = 1,250 retrievals. At the default per-IP limit of 100 requests/min, a fully serialized run takes ~13 minutes. Use a service-account admin token, which bypasses the per-user limit, for faster runs.
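
The estimate is simple throttling arithmetic; a quick sanity check (the limit value is just the default quoted above):

```python
import math

def serialized_runtime_minutes(queries: int = 250, strategies: int = 5,
                               rate_per_min: int = 100) -> tuple[int, int]:
    """Worst-case wall time when every retrieval waits on the per-IP rate limit."""
    total = queries * strategies               # total retrievals issued
    return total, math.ceil(total / rate_per_min)

total, minutes = serialized_runtime_minutes()  # 1250 retrievals, ~13 min
```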

Azure Monitor

Files (infrastructure/azure-monitor/)

```
infrastructure/azure-monitor/
├── queries/        # 12 KQL files
│   ├── retrieval-latency-per-route.kql
│   ├── fallback-rate-rolling.kql
│   ├── classifier-confusion-matrix.kql
│   └── ...
├── dashboards/     # 3 workbook ARM templates
│   ├── quality.json
│   ├── speed.json
│   └── cost.json
├── alerts/         # 6 scheduled-query-rule ARMs
└── deploy.sh       # idempotent; --dry-run | --what-if | --env
```

Count-only invariant

Every KQL query aggregates — never exports raw entities_referenced lists. The count-only invariant (FR-SPEC-05-015) holds end-to-end: audit-chain storage → KQL aggregation → dashboard rendering.
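
A probe enforcing this invariant can be as simple as a pattern scan over the `queries/` directory. This is a hypothetical sketch of such a check, not the actual compliance probe:

```python
import re
from pathlib import Path

# Flag any KQL that materializes raw entity lists instead of aggregating.
FORBIDDEN = re.compile(r"make_list\s*\(\s*entities_referenced", re.IGNORECASE)

def kql_count_only_violations(query_dir: Path) -> list[str]:
    """Return the .kql files that violate the count-only invariant."""
    return [
        str(p)
        for p in sorted(query_dir.glob("*.kql"))
        if FORBIDDEN.search(p.read_text())
    ]
```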

Compliance probes

SPEC-07 adds 4 benchmark_regression fatal probes to compliance-checklist.yml:

| Probe | What it checks |
| --- | --- |
| `b1-benchmark-gate-regression` | Runs the harness against current code; FAIL if gate-1 or gate-2 fails |
| `b2-fixture-phi-free` | Runtime grep probe on the 250-query fixture (synthetic tokens only) |
| `b3-no-phi-in-benchmark-reports` | Report renderer does not write the raw query_text field |
| `b4-kql-count-only-invariant` | KQL queries avoid make_list on entities_referenced |

All four probes are fail-closed — compliance-checklist refuses verdict=pass if any one of them fails.

Deferred follow-ups

Scoped out of PR #186:

  • Atlas Vector Search prep — migrate BM25 + vector embeddings
  • Clinical CKG pipelines — UMLS / SNOMED / ICD mappings
  • MemoryHttpClient.search() + ?legacy=true removal — 30-day soak cutover

Why a separate harness?

  • Pre-deploy confidence — gates block a regression from reaching prod
  • Evaluation consistency — same 250 queries, same gold answers, same scorer across releases
  • Strategy tuning — hybrid_forced_graph + hybrid_forced_text expose classifier errors without requiring a new dataset
  • Audit trail — benchmark run → markdown report → PR comment on release PRs