Observability And Replay

Preserve state transitions, compact logs, token usage, and reproduction artifacts for Mesmer runs.

Mesmer treats replayable evidence as a core output, not a nice-to-have.

Successful runs should emit a reproduction artifact with:

  • Exact target replay messages.
  • Target metadata.
  • Target capabilities and target-call count.
  • Judgement details.
  • Operator transition trace.
  • Replay-critical candidate provenance.

Rich logs are optimized for humans. Compact logs are machine-readable JSONL and useful for automation or AI-assisted diagnosis.

export MESMER_LOG_FORMAT=compact

The runtime also preserves compact transition history for every operator. Logs alone are not enough for reproducible experiments.

ops.QueryTarget snapshots the candidate when the target is called. That matters in iterative techniques, because later operators may mutate the frontier while the replay artifact still needs the exact messages that produced a response.

For benchmark-level inspection, EvidenceMatrix keeps row-level attempt and evaluator data, while BudgetCurve makes query-budget tradeoffs visible.