# Benchmark + Monitoring (SPEC-07)
Two complementary deliverables:
- Benchmark harness — offline / pre-deploy / regression-gate testing of retrieval quality
- Azure Monitor — online / production / real-time observability
## Benchmark harness

### The 5 strategies
| Strategy | Purpose |
|---|---|
| `bm25_only` | Pre-SPEC-05 baseline path |
| `graph_only` | Pure CKG — tests the "graph can't answer T1" claim |
| `hybrid` | SPEC-05 classifier → best route (production path) |
| `hybrid_forced_graph` | Bypass classifier → GRAPH (exposes classifier under-routing to graph) |
| `hybrid_forced_text` | Bypass classifier → TEXT (exposes classifier over-routing to text) |
### Metrics per query-type × strategy
- F1 — token-level precision + recall vs gold answer
- EM — exact match
- HR — hit rate (gold entity id in top-K)
- RDS — retrieval decision score (composite quality)
- Latency — p50, p95, p99
- Tokens — serializer output size
Each with 95% CI via bootstrap (seed=42, 1000 resamples).
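The percentile-bootstrap CI can be sketched as follows (a minimal illustration under the stated seed and resample count, not the harness's actual scorer; the function name and score list are made up):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)          # fixed seed => reproducible CI
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n   # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]          # 2.5th percentile
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]  # 97.5th percentile
    return lo, hi

# Hypothetical F1 scores for one strategy on one query type
f1 = [0.82, 0.91, 0.77, 0.88, 0.95, 0.80, 0.85]
low, high = bootstrap_ci(f1)
```

Because the generator is seeded, re-running the harness on the same scores yields identical intervals.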
### 2 regression gates
- Gate-1 — hybrid F1 ≥ max(other strategies) per query type (ensures hybrid dominates)
- Gate-2 — HR invariants (no strategy has HR=0 on a type where others do)
Gate failure → non-zero exit code → CI fails → compliance-checklist.yml fatal probe.
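The two gate checks can be sketched like this (illustrative only; the harness's real data structures and function names may differ):

```python
def check_gates(f1, hit_rate):
    """Evaluate gate-1 and gate-2 over per-query-type aggregates.

    f1, hit_rate: {query_type: {strategy: value}}.
    Returns a list of failure messages; empty means both gates pass.
    """
    failures = []
    # Gate-1: hybrid F1 must match or beat the best other strategy per type
    for qtype, by_strategy in f1.items():
        others = [v for s, v in by_strategy.items() if s != "hybrid"]
        if others and by_strategy["hybrid"] < max(others):
            failures.append(f"gate-1: hybrid F1 below best strategy on {qtype}")
    # Gate-2: no strategy may have HR=0 on a type where others retrieve hits
    for qtype, by_strategy in hit_rate.items():
        zeroed = [s for s, v in by_strategy.items() if v == 0]
        if zeroed and len(zeroed) < len(by_strategy):
            failures.append(f"gate-2: HR=0 for {sorted(zeroed)} on {qtype}")
    return failures
```

A non-empty return would map to the non-zero exit code that fails CI.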
### CLI
```bash
cd apps/research/benchmark

# Smoke test (tiny dataset)
python3 -m benchmark.cli --smoke

# Full run
python3 -m benchmark.cli --out /tmp/benchmark-results.md

# Single strategy
python3 -m benchmark.cli --strategy hybrid --out /tmp/hybrid-only.md

# Exit codes: 0 = gates pass · 1 = gate fail · 2 = run error
```
### Rate-limit + runtime
250 queries × 5 strategies = 1,250 retrievals. At the default 100 requests/min per-IP limit, a serialized run takes ~13 min. Use a service-account admin token (which bypasses the per-user limit) for faster runs.
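The runtime estimate is simple arithmetic:

```python
queries, strategies = 250, 5
rate_limit_per_min = 100                     # default per-IP limit

total_retrievals = queries * strategies      # 1250
serialized_minutes = total_retrievals / rate_limit_per_min
print(total_retrievals, serialized_minutes)  # 1250 12.5  (~13 min wall clock)
```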
## Azure Monitor

### Files (`infrastructure/azure-monitor/`)
```text
infrastructure/azure-monitor/
├── queries/      # 12 KQL files
│   ├── retrieval-latency-per-route.kql
│   ├── fallback-rate-rolling.kql
│   ├── classifier-confusion-matrix.kql
│   └── ...
├── dashboards/   # 3 workbook ARM templates
│   ├── quality.json
│   ├── speed.json
│   └── cost.json
├── alerts/       # 6 scheduled-query-rule ARMs
└── deploy.sh     # idempotent; --dry-run | --what-if | --env
```
### Count-only invariant
Every KQL query aggregates — it never exports raw `entities_referenced` lists. The count-only invariant (FR-SPEC-05-015) holds end-to-end: audit-chain storage → KQL aggregation → dashboard rendering.
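One way to enforce this mechanically is a static scan of the query files, in the spirit of the b4 probe described below (a hypothetical checker; the regex and function name are illustrative, not the probe's actual implementation):

```python
import re
from pathlib import Path

# Flags any KQL file that builds a raw list from entities_referenced,
# which would violate the count-only invariant (FR-SPEC-05-015).
FORBIDDEN = re.compile(r"make_list\s*\(\s*entities_referenced", re.IGNORECASE)

def violating_queries(query_dir):
    """Return the names of .kql files that match the forbidden pattern."""
    return [
        p.name
        for p in sorted(Path(query_dir).glob("*.kql"))
        if FORBIDDEN.search(p.read_text())
    ]
```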
## Compliance probes

SPEC-07 adds 4 `benchmark_regression` fatal probes to `compliance-checklist.yml`:
| Probe | What it checks |
|---|---|
| `b1-benchmark-gate-regression` | Runs the harness against current code; FAIL if gate-1 or gate-2 fails |
| `b2-fixture-phi-free` | Runtime grep probe on the 250-query fixture (synthetic tokens only) |
| `b3-no-phi-in-benchmark-reports` | Report renderer does not write the raw `query_text` field |
| `b4-kql-count-only-invariant` | KQL queries avoid `make_list` on `entities_referenced` |
All 4 are fail-closed — compliance-checklist refuses `verdict=pass` if any fails.
## Deferred follow-ups

Scoped out of PR #186:
- Atlas Vector Search prep — migrate BM25 + vector embeddings
- Clinical CKG pipelines — UMLS / SNOMED / ICD mappings
- `MemoryHttpClient.search()` `?legacy=true` removal — 30-day soak cutover
## Why a separate harness?
- Pre-deploy confidence — gates block a regression from reaching prod
- Evaluation consistency — same 250 queries, same gold answers, same scorer across releases
- Strategy tuning — `hybrid_forced_graph` + `hybrid_forced_text` expose classifier errors without requiring a new dataset
- Audit trail — benchmark run → markdown report → PR comment on release PRs