# Benchmark + Monitoring (SPEC-07)
Two complementary deliverables:
- Benchmark harness — offline / pre-deploy / regression-gate testing of retrieval quality
- Azure Monitor — online / production / real-time observability
## Benchmark harness

### The 5 strategies
| Strategy | Purpose |
|---|---|
| `bm25_only` | Pre-SPEC-05 baseline path |
| `graph_only` | Pure CKG — tests the "graph can't answer T1" claim |
| `hybrid` | SPEC-05 classifier → best route (production path) |
| `hybrid_forced_graph` | Bypass classifier → GRAPH (exposes classifier under-routing to graph) |
| `hybrid_forced_text` | Bypass classifier → TEXT (exposes classifier over-routing to text) |
### Metrics per query-type × strategy
- F1 — token-level precision + recall vs gold answer
- EM — exact match
- HR — hit rate (gold entity id in top-K)
- RDS — retrieval decision score (composite quality)
- Latency — p50, p95, p99
- Tokens — serializer output size
Each with 95% CI via bootstrap (seed=42, 1000 resamples).
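The percentile-bootstrap CI can be sketched as follows (a minimal illustration under the stated seed and resample count, not the harness's actual scorer; the function name and score list are made up):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)          # fixed seed => reproducible CI
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n   # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]          # 2.5th percentile
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]  # 97.5th percentile
    return lo, hi

# Hypothetical F1 scores for one strategy on one query type
f1 = [0.82, 0.91, 0.77, 0.88, 0.95, 0.80, 0.85]
low, high = bootstrap_ci(f1)
```

Because the generator is seeded, re-running the harness on the same scores yields identical intervals.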
### 2 regression gates
- Gate-1 — hybrid F1 ≥ max(other strategies) per query type (ensures hybrid dominates)
- Gate-2 — HR invariants (no strategy has HR=0 on a type where others do)
Gate failure → non-zero exit code → CI fails → compliance-checklist.yml fatal probe.
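The two gate checks can be sketched like this (illustrative only; the harness's real data structures and function names may differ):

```python
def check_gates(f1, hit_rate):
    """Evaluate gate-1 and gate-2 over per-query-type aggregates.

    f1, hit_rate: {query_type: {strategy: value}}.
    Returns a list of failure messages; empty means both gates pass.
    """
    failures = []
    # Gate-1: hybrid F1 must match or beat the best other strategy per type
    for qtype, by_strategy in f1.items():
        others = [v for s, v in by_strategy.items() if s != "hybrid"]
        if others and by_strategy["hybrid"] < max(others):
            failures.append(f"gate-1: hybrid F1 below best strategy on {qtype}")
    # Gate-2: no strategy may have HR=0 on a type where others retrieve hits
    for qtype, by_strategy in hit_rate.items():
        zeroed = [s for s, v in by_strategy.items() if v == 0]
        if zeroed and len(zeroed) < len(by_strategy):
            failures.append(f"gate-2: HR=0 for {sorted(zeroed)} on {qtype}")
    return failures
```

A non-empty return would map to the non-zero exit code that fails CI.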
### CLI
```bash
cd apps/research/benchmark

# Smoke test (tiny dataset)
python3 -m benchmark.cli --smoke

# Full run
python3 -m benchmark.cli --out /tmp/benchmark-results.md

# Single strategy
python3 -m benchmark.cli --strategy hybrid --out /tmp/hybrid-only.md

# Exit codes: 0 = gates pass · 1 = gate fail · 2 = run error
```
### Rate-limit + runtime
250 queries × 5 strategies = 1,250 retrievals. At the default 100 requests/min per-IP limit, a serialized run takes ~13 min. Use a service-account admin token (which bypasses the per-user limit) for faster runs.
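The runtime estimate is simple arithmetic:

```python
queries, strategies = 250, 5
rate_limit_per_min = 100                     # default per-IP limit

total_retrievals = queries * strategies      # 1250
serialized_minutes = total_retrievals / rate_limit_per_min
print(total_retrievals, serialized_minutes)  # 1250 12.5  (~13 min wall clock)
```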
## Azure Monitor

### Files (`infrastructure/azure-monitor/`)
```text
infrastructure/azure-monitor/
├── queries/      # 12 KQL files
│   ├── retrieval-latency-per-route.kql
│   ├── fallback-rate-rolling.kql
│   ├── classifier-confusion-matrix.kql
│   └── ...
├── dashboards/   # 3 workbook ARM templates
│   ├── quality.json
│   ├── speed.json
│   └── cost.json
├── alerts/       # 6 scheduled-query-rule ARMs
└── deploy.sh     # idempotent; --dry-run | --what-if | --env
```
### Count-only invariant
Every KQL query aggregates — it never exports raw `entities_referenced` lists. The count-only invariant (FR-SPEC-05-015) holds end-to-end: audit-chain storage → KQL aggregation → dashboard rendering.
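One way to enforce this mechanically is a static scan of the query files, in the spirit of the b4 probe described below (a hypothetical checker; the regex and function name are illustrative, not the probe's actual implementation):

```python
import re
from pathlib import Path

# Flags any KQL file that builds a raw list from entities_referenced,
# which would violate the count-only invariant (FR-SPEC-05-015).
FORBIDDEN = re.compile(r"make_list\s*\(\s*entities_referenced", re.IGNORECASE)

def violating_queries(query_dir):
    """Return the names of .kql files that match the forbidden pattern."""
    return [
        p.name
        for p in sorted(Path(query_dir).glob("*.kql"))
        if FORBIDDEN.search(p.read_text())
    ]
```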
## Compliance probes

SPEC-07 adds 4 `benchmark_regression` fatal probes to `compliance-checklist.yml`:
| Probe | What it checks |
|---|---|
| `b1-benchmark-gate-regression` | Runs the harness against current code; FAIL if gate-1 or gate-2 fails |
| `b2-fixture-phi-free` | Runtime grep probe on the 250-query fixture (synthetic tokens only) |
| `b3-no-phi-in-benchmark-reports` | Report renderer does not write the raw `query_text` field |
| `b4-kql-count-only-invariant` | KQL queries avoid `make_list` on `entities_referenced` |
All 4 are fail-closed — compliance-checklist refuses `verdict=pass` if any fails.
## Deferred follow-ups

Scoped out of PR #186:
- Atlas Vector Search prep — migrate BM25 + vector embeddings
- Clinical CKG pipelines — UMLS / SNOMED / ICD mappings
- `MemoryHttpClient.search()` `?legacy=true` removal — 30-day soak cutover
## Why a separate harness?
- Pre-deploy confidence — gates block a regression from reaching prod
- Evaluation consistency — same 250 queries, same gold answers, same scorer across releases
- Strategy tuning — `hybrid_forced_graph` + `hybrid_forced_text` expose classifier errors without requiring a new dataset
- Audit trail — benchmark run → markdown report → PR comment on release PRs