Decisions

  • Pending: Evaluation frequency — per-task, daily, weekly?
  • Pending: Metrics storage — SQLite, markdown, or both?
  • Pending: Who acts on evaluation results — human review or auto-tuning?

User Tasks


Summary

Measure, evaluate, and continuously improve the autonomous agency’s end-to-end workflow — from task intake to delivery — using structured metrics, retrospectives, and feedback loops.

Problem / Motivation

  • The agency pipeline (intake → classify → plan → code → test → review → deploy) has many stages, each with its own failure modes.
  • Without measurement, you can’t know: Is the agency actually saving time? Where do tasks get stuck? Which stages fail most? Are auto-advanced tasks succeeding?
  • FR-059 (Escalation Policy) Phase 3 wants to learn from approval patterns — it needs evaluation data.
  • FR-060 (Definition of Done) checks individual task completion, but nothing evaluates the workflow itself.
  • Autonomous systems degrade silently. A task that took 1 iteration last week might take 5 this week because of a subtle change — without metrics, you won’t notice.
  • The difference between a toy demo and a production agency is continuous evaluation and improvement.

Proposed Solution

An evaluation framework that tracks metrics across every pipeline stage, generates periodic reports, identifies bottlenecks and failure patterns, and proposes workflow improvements. Combines quantitative metrics (time, iterations, cost, success rate) with qualitative assessment (LLM-judged quality of outputs).


Open Questions

1. Evaluation Granularity

Question: At what level should the agency be evaluated?

| Option | Description |
| --- | --- |
| A) Per-task + aggregate | Track each task through the pipeline, aggregate for trends |
| B) Aggregate only | Weekly/monthly summaries, less storage |
| C) Per-stage + per-task | Detailed per-stage metrics for each task (most data) |

Recommendation: Option C for data collection, Option A for reporting. Collect everything, surface summaries.

Decision:

2. Metrics Storage

Question: Where should evaluation data live?

| Option | Description |
| --- | --- |
| A) SQLite + markdown reports | SQLite for raw data, periodic markdown reports in vault |
| B) Markdown only | Simple but hard to query at scale |
| C) SQLite only | Queryable but not visible in Obsidian |

Recommendation: Option A — SQLite for the data, markdown reports for human consumption.

Decision:

3. Improvement Actions

Question: How should evaluation findings be acted on?

| Option | Description |
| --- | --- |
| A) Propose improvements as FRs/proposals | Findings generate proposals in vault/90_inbox/proposals/ |
| B) Auto-tune parameters | Automatically adjust thresholds, routing rules, etc. |
| C) Dashboard only | Show metrics, let human decide |

Recommendation: Option A for Phase 1-2 (human reviews proposals), Option B for well-understood parameters in Phase 3.

Decision:


Phase Overview

| Phase | Description | Status |
| --- | --- | --- |
| Phase 1 | Metrics collection + per-task tracking | |
| Phase 2 | Periodic reports + bottleneck detection | |
| Phase 3 | Auto-tuning + improvement proposals | |
| Phase 4 | Comparative evaluation (A/B testing workflow changes) | |

Phase 1: Metrics Collection

Goal: Instrument every pipeline stage to collect performance data.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/metrics.py | Metric definitions and collection API | opus | |
| src/opus/eval/storage.py | SQLite storage for metrics data | opus | |
| src/opus/eval/tracker.py | PipelineTracker: wraps orchestrator stages with timing/outcome capture | opus | |
| DB schema | tasks, stage_runs, metrics tables | opus | |
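The `tasks` / `stage_runs` / `metrics` tables above could look something like the sketch below. This is a hypothetical Phase 1 schema, not the finalized one: the column names and the `init_db` helper are assumptions for illustration.

```python
# Hypothetical sketch of the Phase 1 schema: one row per task, one per
# stage run, one per metric sample. Column names are assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id     TEXT PRIMARY KEY,
    created_at  TEXT NOT NULL,     -- ISO-8601 intake timestamp
    outcome     TEXT               -- success / reverted / abandoned / stuck
);
CREATE TABLE IF NOT EXISTS stage_runs (
    run_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id     TEXT NOT NULL REFERENCES tasks(task_id),
    stage       TEXT NOT NULL,     -- dispatch / planning / coding / ...
    started_at  TEXT NOT NULL,
    finished_at TEXT,
    succeeded   INTEGER            -- 1 / 0, NULL while running
);
CREATE TABLE IF NOT EXISTS metrics (
    metric_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id     TEXT NOT NULL REFERENCES tasks(task_id),
    name        TEXT NOT NULL,     -- e.g. total_tokens_used
    value       REAL NOT NULL,
    recorded_at TEXT NOT NULL
);
"""

def init_db(path: str = "eval.db") -> sqlite3.Connection:
    """Create (or open) the metrics database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping metrics as generic name/value rows avoids a schema migration every time a new metric from the table below is added.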

Metrics to collect per task:

| Metric | Description | Stage |
| --- | --- | --- |
| intake_to_dispatch_hours | Time from task arrival to orchestrator pickup | Dispatch |
| classification_accuracy | Was auto-classification overridden by human? | Dispatch |
| planning_iterations | How many plan revisions before coding started | Planning |
| coding_duration_minutes | Wall-clock time in coding stage | Coding |
| test_pass_rate | % of tests passing on first run | Testing |
| review_iterations | Feedback loop count before approval | Review |
| review_findings_by_severity | Count of critical/error/warning/info per review | Review |
| total_tokens_used | Token consumption across all stages | All |
| total_cost_usd | Dollar cost of the full pipeline run | All |
| end_to_end_hours | Total time from intake to PR merged | All |
| human_interventions | Number of times escalated to human | All |
| outcome | success / reverted / abandoned / stuck | Final |
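To capture the per-stage timing and outcome data above, the PipelineTracker could wrap each orchestrator stage in a context manager. A minimal sketch, assuming an in-memory record list (the real implementation would write to the SQLite storage layer):

```python
# Minimal PipelineTracker sketch: a context manager per stage captures
# wall-clock duration and success/failure. Names are assumptions.
import time
from contextlib import contextmanager

class PipelineTracker:
    def __init__(self):
        # (task_id, stage, duration_seconds, succeeded)
        self.records = []

    @contextmanager
    def stage(self, task_id: str, stage: str):
        start = time.monotonic()
        ok = False
        try:
            yield
            ok = True
        finally:
            # Record even when the stage body raises, so failures are counted.
            self.records.append((task_id, stage, time.monotonic() - start, ok))

# Usage: wrap each orchestrator stage with the tracker.
tracker = PipelineTracker()
with tracker.stage("task-123", "coding"):
    pass  # run the coding stage here
```

Recording in `finally` means a crashed stage still produces a data point, which is exactly the case bottleneck detection needs to see.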

Phase 2: Reports & Bottleneck Detection

Goal: Generate periodic reports identifying where the pipeline is slow, failing, or expensive.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/reporter.py | ReportGenerator: query metrics, produce markdown report | opus | |
| vault/00_system/dashboards/agency-performance.md | Auto-generated performance dashboard | opus | |
| /agency-report skill | Generate on-demand performance report | opus | |
| Bottleneck detection | Flag stages where median time > threshold or failure rate > X% | opus | |
| Trend analysis | Compare this week vs last week, detect degradation | opus | |
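The bottleneck rule above (median time over a threshold, or failure rate over X%) could be sketched as below; the thresholds are illustrative defaults, not decided values.

```python
# Hedged sketch of bottleneck detection: flag any stage whose median
# duration or failure rate exceeds a configured limit. Thresholds are
# illustrative assumptions, not tuned values.
from statistics import median

def find_bottlenecks(stage_runs,
                     max_median_minutes: float = 60.0,
                     max_failure_rate: float = 0.25):
    """stage_runs: iterable of (stage, duration_minutes, succeeded) tuples."""
    by_stage = {}
    for stage, minutes, ok in stage_runs:
        by_stage.setdefault(stage, []).append((minutes, ok))

    flagged = []
    for stage, runs in by_stage.items():
        med = median(m for m, _ in runs)
        fail_rate = sum(1 for _, ok in runs if not ok) / len(runs)
        if med > max_median_minutes or fail_rate > max_failure_rate:
            flagged.append((stage, med, fail_rate))
    return flagged
```

Using the median rather than the mean keeps one pathological task from flagging an otherwise healthy stage.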

Report sections:

| Section | Content |
| --- | --- |
| Summary | Tasks completed, in-progress, stuck. Overall success rate. |
| Pipeline health | Per-stage average duration, failure rate, iteration count |
| Bottlenecks | Stages exceeding thresholds, with specific task examples |
| Cost analysis | Total spend, cost per task, cost per stage, trend |
| Escalation analysis | Most common escalation reasons, auto-advance accuracy |
| Recommendations | Specific suggestions based on data patterns |
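Rendering those sections into the vault dashboard is mostly string assembly. An illustrative sketch (section names follow the table above; a real ReportGenerator would query SQLite rather than take plain dicts):

```python
# Illustrative report renderer: turns summary stats and flagged
# bottlenecks into the markdown dashboard layout described above.
def render_report(summary: dict, bottlenecks: list) -> str:
    lines = ["# Agency Performance Report", "", "## Summary"]
    for key, value in summary.items():
        lines.append(f"- {key}: {value}")

    lines += ["", "## Bottlenecks"]
    if not bottlenecks:
        lines.append("- none detected")
    for stage, med_minutes, fail_rate in bottlenecks:
        lines.append(
            f"- {stage}: median {med_minutes:.0f} min, "
            f"failure rate {fail_rate:.0%}"
        )
    return "\n".join(lines)
```

Writing the result to `vault/00_system/dashboards/agency-performance.md` keeps the human-readable view in Obsidian while SQLite stays the source of truth.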

Phase 3: Auto-Tuning & Improvement Proposals

Goal: Use evaluation data to automatically improve the agency’s workflow.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/tuner.py | ParameterTuner: adjust thresholds based on data | opus | |
| Escalation tuning | Feed approval/rejection data to FR-059 Phase 3 | mv | |
| Classification tuning | Feed override data to FR-047 classifier | opus | |
| Improvement proposals | Generate proposals in vault/90_inbox/proposals/ for structural changes | opus | |
| Proposal types | "Review stage takes 3x longer for TypeScript — add language-specific review rules" | mv | |

Auto-tunable parameters:

| Parameter | Source | Tuning Logic |
| --- | --- | --- |
| Auto-advance threshold | FR-047 dispatcher | Tasks always approved by human → lower threshold |
| Review strictness | FR-057 reviewer | Warnings never acted on → downgrade to info |
| Escalation triggers | FR-059 policy | Actions always approved → reduce risk level |
| Iteration limits | FR-056 orchestrator | Tasks converging in 1 iteration → reduce max |
| Coverage threshold | FR-060 done criteria | Projects consistently at 95% → raise from 80% |
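One tuning rule from the table, the auto-advance threshold, could be sketched as below. All numbers (sample window, step size, floor) and the function name are assumptions; the real ParameterTuner would surface this as a proposal rather than apply it directly in Phase 1-2.

```python
# Hedged sketch of one ParameterTuner rule: if humans approved every
# auto-advance candidate over a window, propose lowering the threshold;
# if they reject often, propose raising it. All constants are assumptions.
def propose_auto_advance_threshold(approvals: list,
                                   current: float,
                                   min_samples: int = 20,
                                   step: float = 0.05,
                                   floor: float = 0.5) -> float:
    """approvals: recent human decisions on auto-advance candidates
    (True = approved). Returns a proposed new threshold."""
    if len(approvals) < min_samples:
        return current                      # not enough data to tune
    if all(approvals):
        return max(floor, current - step)   # humans always agree: advance more
    if sum(approvals) / len(approvals) < 0.8:
        return min(1.0, current + step)     # frequent rejections: be stricter
    return current
```

The minimum-sample guard and the floor keep a quiet week or a single noisy batch from swinging the threshold, which is the "degrade silently" failure mode this FR exists to catch.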

Phase 4: Comparative Evaluation

Goal: A/B test workflow changes to measure their impact before committing.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/experiment.py | ExperimentRunner: route tasks to variant pipelines | opus | |
| Experiment tracking | Track which variant each task used, compare outcomes | opus | |
| Statistical significance | Require N tasks before declaring a variant better | opus | |
| Auto-rollback | Revert variant if metrics degrade beyond threshold | opus | |
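Variant routing and the minimum-sample gate could look like the sketch below. The hash-based assignment is an assumption (any deterministic scheme works); it ensures a retried task always lands in the same arm, so retries don't contaminate the comparison.

```python
# Minimal ExperimentRunner sketch: deterministic variant assignment plus
# a minimum-sample gate before comparing arms. Scheme is an assumption.
import hashlib

def assign_variant(task_id: str, variants=("control", "candidate")) -> str:
    """Map a task to a variant deterministically via a hash of its id."""
    digest = hashlib.sha256(task_id.encode()).digest()
    return variants[digest[0] % len(variants)]

def experiment_ready(outcomes_by_variant: dict, min_per_arm: int = 30) -> bool:
    """True once every arm has at least min_per_arm recorded outcomes."""
    return all(len(v) >= min_per_arm for v in outcomes_by_variant.values())
```

A real significance test (and the auto-rollback threshold) would sit behind `experiment_ready`; the gate just prevents declaring a winner on a handful of tasks.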

Prerequisites / Gap Analysis

Requirements

| Requirement | Description |
| --- | --- |
| REQ-0 | Design doc reviewed and approved |
| REQ-1 | FR-056 (Orchestrator) — instrument pipeline stages |
| REQ-2 | FR-009 (Python scaffold) — code infrastructure |
| REQ-3 | FR-053 (Cost Tracking) — token/cost data |

Current State

| Component | Status | Details |
| --- | --- | --- |
| Pipeline instrumentation | | Nothing exists |
| Metrics storage | | Nothing exists |
| Reporting | | FR-053 covers cost only |
| Self-improvement | new | FR-064 is high-level vision; this FR is the mechanism |

Gap (What’s missing?)

| Gap | Effort | Blocker? |
| --- | --- | --- |
| Metrics collection framework | Medium | No |
| SQLite schema + storage | Low | No |
| Pipeline instrumentation | Medium | Depends on FR-056 |
| Report generation | Medium | No |
| Auto-tuning logic | High | Depends on Phase 1-2 data |

Test

Manual tests

| Test | Expected | Actual | Last |
| --- | --- | --- | --- |
| Human overrides auto-advance 5x | Tuner proposes raising threshold for that task type | pending | - |
| Compare two review configurations | Experiment shows which produces fewer iterations | pending | - |

AI-verified tests

| Scenario | Expected behavior | Verification method |
| --- | --- | --- |

E2E tests

| Scenario | Assertion |
| --- | --- |

Integration tests

| Component | Coverage |
| --- | --- |

Unit tests

| Component | Tests | Coverage |
| --- | --- | --- |

History

| Date | Event | Details |
| --- | --- | --- |
| 2026-03-12 | Created | Identified as critical for production-grade autonomous agency |

References

  • FR-056 (Autonomous Coding Orchestrator) — primary instrumentation target
  • FR-047 (Task Dispatcher) — classification accuracy metrics
  • FR-059 (Escalation Policy) — approval pattern data feeds Phase 3 tuning
  • FR-060 (Definition of Done) — done-criteria thresholds as tunable parameters
  • FR-057 (Code Review Pipeline) — review metrics and strictness tuning
  • FR-053 (Cost & Token Tracking) — cost data integration
  • FR-064 (Self-Improving System) — high-level vision, this FR is the concrete mechanism