Decisions

  • Pending: Evaluation frequency — per-task, daily, weekly?
  • Pending: Metrics storage — SQLite, markdown, or both?
  • Pending: Who acts on evaluation results — human review or auto-tuning?

User Tasks


Summary

Measure, evaluate, and continuously improve the autonomous agency’s end-to-end workflow — from task intake to delivery — using structured metrics, retrospectives, and feedback loops.

Problem / Motivation

  • The agency pipeline (intake → classify → plan → code → test → review → deploy) has many stages, each with its own failure modes.
  • Without measurement, you can’t know: Is the agency actually saving time? Where do tasks get stuck? Which stages fail most? Are auto-advanced tasks succeeding?
  • FR-059 (Escalation Policy) Phase 3 wants to learn from approval patterns — it needs evaluation data.
  • FR-060 (Definition of Done) checks individual task completion, but nothing evaluates the workflow itself.
  • Autonomous systems degrade silently. A task that took 1 iteration last week might take 5 this week because of a subtle change — without metrics, you won’t notice.
  • The difference between a toy demo and a production agency is continuous evaluation and improvement.

Proposed Solution

An evaluation framework that tracks metrics across every pipeline stage, generates periodic reports, identifies bottlenecks and failure patterns, and proposes workflow improvements. Combines quantitative metrics (time, iterations, cost, success rate) with qualitative assessment (LLM-judged quality of outputs).


Open Questions

1. Evaluation Granularity

Question: At what level should the agency be evaluated?

| Option | Description |
| --- | --- |
| A) Per-task + aggregate | Track each task through the pipeline, aggregate for trends |
| B) Aggregate only | Weekly/monthly summaries, less storage |
| C) Per-stage + per-task | Detailed per-stage metrics for each task (most data) |

Recommendation: Option C for data collection, Option A for reporting. Collect everything, surface summaries.

Decision:

2. Metrics Storage

Question: Where should evaluation data live?

| Option | Description |
| --- | --- |
| A) SQLite + markdown reports | SQLite for raw data, periodic markdown reports in vault |
| B) Markdown only | Simple but hard to query at scale |
| C) SQLite only | Queryable but not visible in Obsidian |

Recommendation: Option A — SQLite for the data, markdown reports for human consumption.

Decision:

3. Improvement Actions

Question: How should evaluation findings be acted on?

| Option | Description |
| --- | --- |
| A) Propose improvements as FRs/proposals | Findings generate proposals in vault/90_inbox/proposals/ |
| B) Auto-tune parameters | Automatically adjust thresholds, routing rules, etc. |
| C) Dashboard only | Show metrics, let human decide |

Recommendation: Option A for Phase 1-2 (human reviews proposals), Option B for well-understood parameters in Phase 3.

Decision:


Phase Overview

| Phase | Description | Status |
| --- | --- | --- |
| Phase 1 | Metrics collection + per-task tracking | |
| Phase 2 | Periodic reports + bottleneck detection | |
| Phase 3 | Auto-tuning + improvement proposals | |
| Phase 4 | Comparative evaluation (A/B testing workflow changes) | |

Phase 1: Metrics Collection

Goal: Instrument every pipeline stage to collect performance data.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/metrics.py | Metric definitions and collection API | opus | |
| src/opus/eval/storage.py | SQLite storage for metrics data | opus | |
| src/opus/eval/tracker.py | PipelineTracker: wraps orchestrator stages with timing/outcome capture | opus | |
| DB schema | tasks, stage_runs, metrics tables | opus | |
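The `tasks` / `stage_runs` / `metrics` tables above could look something like the sketch below. This is a hypothetical Phase 1 schema, not the finalized one: the column names and the `init_db` helper are assumptions for illustration.

```python
# Hypothetical sketch of the Phase 1 schema: one row per task, one per
# stage run, one per metric sample. Column names are assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id     TEXT PRIMARY KEY,
    created_at  TEXT NOT NULL,     -- ISO-8601 intake timestamp
    outcome     TEXT               -- success / reverted / abandoned / stuck
);
CREATE TABLE IF NOT EXISTS stage_runs (
    run_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id     TEXT NOT NULL REFERENCES tasks(task_id),
    stage       TEXT NOT NULL,     -- dispatch / planning / coding / ...
    started_at  TEXT NOT NULL,
    finished_at TEXT,
    succeeded   INTEGER            -- 1 / 0, NULL while running
);
CREATE TABLE IF NOT EXISTS metrics (
    metric_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id     TEXT NOT NULL REFERENCES tasks(task_id),
    name        TEXT NOT NULL,     -- e.g. total_tokens_used
    value       REAL NOT NULL,
    recorded_at TEXT NOT NULL
);
"""

def init_db(path: str = "eval.db") -> sqlite3.Connection:
    """Create (or open) the metrics database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping metrics as generic name/value rows avoids a schema migration every time a new metric from the table below is added.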

Metrics to collect per task:

| Metric | Description | Stage |
| --- | --- | --- |
| intake_to_dispatch_hours | Time from task arrival to orchestrator pickup | Dispatch |
| classification_accuracy | Was auto-classification overridden by human? | Dispatch |
| planning_iterations | How many plan revisions before coding started | Planning |
| coding_duration_minutes | Wall-clock time in coding stage | Coding |
| test_pass_rate | % of tests passing on first run | Testing |
| review_iterations | Feedback loop count before approval | Review |
| review_findings_by_severity | Count of critical/error/warning/info per review | Review |
| total_tokens_used | Token consumption across all stages | All |
| total_cost_usd | Dollar cost of the full pipeline run | All |
| end_to_end_hours | Total time from intake to PR merged | All |
| human_interventions | Number of times escalated to human | All |
| outcome | success / reverted / abandoned / stuck | Final |
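To capture the per-stage timing and outcome data above, the PipelineTracker could wrap each orchestrator stage in a context manager. A minimal sketch, assuming an in-memory record list (the real implementation would write to the SQLite storage layer):

```python
# Minimal PipelineTracker sketch: a context manager per stage captures
# wall-clock duration and success/failure. Names are assumptions.
import time
from contextlib import contextmanager

class PipelineTracker:
    def __init__(self):
        # (task_id, stage, duration_seconds, succeeded)
        self.records = []

    @contextmanager
    def stage(self, task_id: str, stage: str):
        start = time.monotonic()
        ok = False
        try:
            yield
            ok = True
        finally:
            # Record even when the stage body raises, so failures are counted.
            self.records.append((task_id, stage, time.monotonic() - start, ok))

# Usage: wrap each orchestrator stage with the tracker.
tracker = PipelineTracker()
with tracker.stage("task-123", "coding"):
    pass  # run the coding stage here
```

Recording in `finally` means a crashed stage still produces a data point, which is exactly the case bottleneck detection needs to see.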

Phase 2: Reports & Bottleneck Detection

Goal: Generate periodic reports identifying where the pipeline is slow, failing, or expensive.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/reporter.py | ReportGenerator: query metrics, produce markdown report | opus | |
| vault/00_system/dashboards/agency-performance.md | Auto-generated performance dashboard | opus | |
| /agency-report skill | Generate on-demand performance report | opus | |
| Bottleneck detection | Flag stages where median time > threshold or failure rate > X% | opus | |
| Trend analysis | Compare this week vs last week, detect degradation | opus | |
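The bottleneck rule above (median time over a threshold, or failure rate over X%) could be sketched as below; the thresholds are illustrative defaults, not decided values.

```python
# Hedged sketch of bottleneck detection: flag any stage whose median
# duration or failure rate exceeds a configured limit. Thresholds are
# illustrative assumptions, not tuned values.
from statistics import median

def find_bottlenecks(stage_runs,
                     max_median_minutes: float = 60.0,
                     max_failure_rate: float = 0.25):
    """stage_runs: iterable of (stage, duration_minutes, succeeded) tuples."""
    by_stage = {}
    for stage, minutes, ok in stage_runs:
        by_stage.setdefault(stage, []).append((minutes, ok))

    flagged = []
    for stage, runs in by_stage.items():
        med = median(m for m, _ in runs)
        fail_rate = sum(1 for _, ok in runs if not ok) / len(runs)
        if med > max_median_minutes or fail_rate > max_failure_rate:
            flagged.append((stage, med, fail_rate))
    return flagged
```

Using the median rather than the mean keeps one pathological task from flagging an otherwise healthy stage.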

Report sections:

| Section | Content |
| --- | --- |
| Summary | Tasks completed, in-progress, stuck. Overall success rate. |
| Pipeline health | Per-stage average duration, failure rate, iteration count |
| Bottlenecks | Stages exceeding thresholds, with specific task examples |
| Cost analysis | Total spend, cost per task, cost per stage, trend |
| Escalation analysis | Most common escalation reasons, auto-advance accuracy |
| Recommendations | Specific suggestions based on data patterns |
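Rendering those sections into the vault dashboard is mostly string assembly. An illustrative sketch (section names follow the table above; a real ReportGenerator would query SQLite rather than take plain dicts):

```python
# Illustrative report renderer: turns summary stats and flagged
# bottlenecks into the markdown dashboard layout described above.
def render_report(summary: dict, bottlenecks: list) -> str:
    lines = ["# Agency Performance Report", "", "## Summary"]
    for key, value in summary.items():
        lines.append(f"- {key}: {value}")

    lines += ["", "## Bottlenecks"]
    if not bottlenecks:
        lines.append("- none detected")
    for stage, med_minutes, fail_rate in bottlenecks:
        lines.append(
            f"- {stage}: median {med_minutes:.0f} min, "
            f"failure rate {fail_rate:.0%}"
        )
    return "\n".join(lines)
```

Writing the result to `vault/00_system/dashboards/agency-performance.md` keeps the human-readable view in Obsidian while SQLite stays the source of truth.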

Phase 3: Auto-Tuning & Improvement Proposals

Goal: Use evaluation data to automatically improve the agency’s workflow.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/tuner.py | ParameterTuner: adjust thresholds based on data | opus | |
| Escalation tuning | Feed approval/rejection data to FR-059 Phase 3 | mv | |
| Classification tuning | Feed override data to FR-047 classifier | opus | |
| Improvement proposals | Generate proposals in vault/90_inbox/proposals/ for structural changes | opus | |
| Proposal types | "Review stage takes 3x longer for TypeScript — add language-specific review rules" | mv | |

Auto-tunable parameters:

| Parameter | Source | Tuning Logic |
| --- | --- | --- |
| Auto-advance threshold | FR-047 dispatcher | Tasks always approved by human → lower threshold |
| Review strictness | FR-057 reviewer | Warnings never acted on → downgrade to info |
| Escalation triggers | FR-059 policy | Actions always approved → reduce risk level |
| Iteration limits | FR-056 orchestrator | Tasks converging in 1 iteration → reduce max |
| Coverage threshold | FR-060 done criteria | Projects consistently at 95% → raise from 80% |
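One tuning rule from the table, the auto-advance threshold, could be sketched as below. All numbers (sample window, step size, floor) and the function name are assumptions; the real ParameterTuner would surface this as a proposal rather than apply it directly in Phase 1-2.

```python
# Hedged sketch of one ParameterTuner rule: if humans approved every
# auto-advance candidate over a window, propose lowering the threshold;
# if they reject often, propose raising it. All constants are assumptions.
def propose_auto_advance_threshold(approvals: list,
                                   current: float,
                                   min_samples: int = 20,
                                   step: float = 0.05,
                                   floor: float = 0.5) -> float:
    """approvals: recent human decisions on auto-advance candidates
    (True = approved). Returns a proposed new threshold."""
    if len(approvals) < min_samples:
        return current                      # not enough data to tune
    if all(approvals):
        return max(floor, current - step)   # humans always agree: advance more
    if sum(approvals) / len(approvals) < 0.8:
        return min(1.0, current + step)     # frequent rejections: be stricter
    return current
```

The minimum-sample guard and the floor keep a quiet week or a single noisy batch from swinging the threshold, which is the "degrade silently" failure mode this FR exists to catch.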

Phase 4: Comparative Evaluation

Goal: A/B test workflow changes to measure their impact before committing.

| File / Feature | Details | Owner | Status |
| --- | --- | --- | --- |
| src/opus/eval/experiment.py | ExperimentRunner: route tasks to variant pipelines | opus | |
| Experiment tracking | Track which variant each task used, compare outcomes | opus | |
| Statistical significance | Require N tasks before declaring a variant better | opus | |
| Auto-rollback | Revert variant if metrics degrade beyond threshold | opus | |
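Variant routing and the minimum-sample gate could look like the sketch below. The hash-based assignment is an assumption (any deterministic scheme works); it ensures a retried task always lands in the same arm, so retries don't contaminate the comparison.

```python
# Minimal ExperimentRunner sketch: deterministic variant assignment plus
# a minimum-sample gate before comparing arms. Scheme is an assumption.
import hashlib

def assign_variant(task_id: str, variants=("control", "candidate")) -> str:
    """Map a task to a variant deterministically via a hash of its id."""
    digest = hashlib.sha256(task_id.encode()).digest()
    return variants[digest[0] % len(variants)]

def experiment_ready(outcomes_by_variant: dict, min_per_arm: int = 30) -> bool:
    """True once every arm has at least min_per_arm recorded outcomes."""
    return all(len(v) >= min_per_arm for v in outcomes_by_variant.values())
```

A real significance test (and the auto-rollback threshold) would sit behind `experiment_ready`; the gate just prevents declaring a winner on a handful of tasks.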

Prerequisites / Gap Analysis

Requirements

| Requirement | Description |
| --- | --- |
| REQ-0 | Design doc reviewed and approved |
| REQ-1 | FR-056 (Orchestrator) — instrument pipeline stages |
| REQ-2 | FR-009 (Python scaffold) — code infrastructure |
| REQ-3 | FR-053 (Cost Tracking) — token/cost data |

Current State

| Component | Status | Details |
| --- | --- | --- |
| Pipeline instrumentation | | Nothing exists |
| Metrics storage | | Nothing exists |
| Reporting | | FR-053 covers cost only |
| Self-improvement | new | FR-064 is high-level vision; this FR is the mechanism |

Gap (What’s missing?)

| Gap | Effort | Blocker? |
| --- | --- | --- |
| Metrics collection framework | Medium | No |
| SQLite schema + storage | Low | No |
| Pipeline instrumentation | Medium | Depends on FR-056 |
| Report generation | Medium | No |
| Auto-tuning logic | High | Depends on Phase 1-2 data |

Test

Manual tests

| Test | Expected | Actual | Last |
| --- | --- | --- | --- |
| Human overrides auto-advance 5x | Tuner proposes raising threshold for that task type | pending | - |
| Compare two review configurations | Experiment shows which produces fewer iterations | pending | - |

AI-verified tests

| Scenario | Expected behavior | Verification method |
| --- | --- | --- |

E2E tests

| Scenario | Assertion |
| --- | --- |

Integration tests

| Component | Coverage |
| --- | --- |

Unit tests

| Component | Tests | Coverage |
| --- | --- | --- |

History

| Date | Event | Details |
| --- | --- | --- |
| 2026-03-12 | Created | Identified as critical for production-grade autonomous agency |

References

  • FR-056 (Autonomous Coding Orchestrator) — primary instrumentation target
  • FR-047 (Task Dispatcher) — classification accuracy metrics
  • FR-059 (Escalation Policy) — approval pattern data feeds Phase 3 tuning
  • FR-060 (Definition of Done) — done-criteria thresholds as tunable parameters
  • FR-057 (Code Review Pipeline) — review metrics and strictness tuning
  • FR-053 (Cost & Token Tracking) — cost data integration
  • FR-064 (Self-Improving System) — high-level vision, this FR is the concrete mechanism