- Pending: Evaluation frequency — per-task, daily, weekly?
- Pending: Metrics storage — SQLite, markdown, or both?
- Pending: Who acts on evaluation results — human review or auto-tuning?
## Summary
Measure, evaluate, and continuously improve the autonomous agency’s end-to-end workflow — from task intake to delivery — using structured metrics, retrospectives, and feedback loops.
## Problem / Motivation
- The agency pipeline (intake → classify → plan → code → test → review → deploy) has many stages, each with its own failure modes.
- Without measurement, you can't answer the basic questions: Is the agency actually saving time? Where do tasks get stuck? Which stages fail most? Are auto-advanced tasks succeeding?
- FR-059 (Escalation Policy) Phase 3 wants to learn from approval patterns — it needs evaluation data.
- FR-060 (Definition of Done) checks individual task completion, but nothing evaluates the workflow itself.
- Autonomous systems degrade silently. A task that took 1 iteration last week might take 5 this week because of a subtle change — without metrics, you won't notice.
- The difference between a toy demo and a production agency is continuous evaluation and improvement.
## Proposed Solution
An evaluation framework that tracks metrics across every pipeline stage, generates periodic reports, identifies bottlenecks and failure patterns, and proposes workflow improvements. It combines quantitative metrics (time, iterations, cost, success rate) with qualitative assessment (LLM-judged quality of outputs).
## Open Questions
### 1. Evaluation Granularity
**Question:** At what level should the agency be evaluated?
| Option | Description |
| --- | --- |
| A) Per-task + aggregate | Track each task through the pipeline; aggregate for trends |
| B) Aggregate only | Weekly/monthly summaries; less storage |
| C) Per-stage + per-task | Detailed per-stage metrics for each task (most data) |
**Recommendation:** Option C for data collection, Option A for reporting. Collect everything, surface summaries.

**Decision:**
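If the recommendation holds (collect per-stage, report aggregates), the roll-up is a straightforward group-by. A sketch, assuming per-stage records are available as dicts with `stage`, `duration_s`, and `succeeded` keys:

```python
from collections import defaultdict
from statistics import mean


def aggregate(records: list[dict]) -> dict:
    """Collapse per-stage records (Option C) into per-stage summaries (Option A).

    Each record is a dict with keys: task_id, stage, duration_s, succeeded.
    """
    by_stage = defaultdict(list)
    for r in records:
        by_stage[r["stage"]].append(r)

    summary = {}
    for stage, rs in by_stage.items():
        summary[stage] = {
            "n": len(rs),
            "mean_duration_s": mean(r["duration_s"] for r in rs),
            "success_rate": sum(r["succeeded"] for r in rs) / len(rs),
        }
    return summary
```

Running the same aggregation over two time windows and diffing the summaries is also the cheapest way to catch the silent degradation described under Problem / Motivation.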
### 2. Metrics Storage
**Question:** Where should evaluation data live?
| Option | Description |
| --- | --- |
| A) SQLite + markdown reports | SQLite for raw data; periodic markdown reports in vault |
| B) Markdown only | Simple but hard to query at scale |
| C) SQLite only | Queryable but not visible in Obsidian |
**Recommendation:** Option A — SQLite for the data, markdown reports for human consumption.

**Decision:**
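A sketch of Option A, assuming the per-stage record shape from the sketch above; the table and column names are illustrative. Raw rows live in SQLite, and a periodic job renders a markdown table for the vault:

```python
import sqlite3

# Illustrative schema; mirrors the per-stage metrics fields discussed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_metrics (
    task_id     TEXT NOT NULL,
    stage       TEXT NOT NULL,
    started_at  REAL NOT NULL,
    finished_at REAL,
    iterations  INTEGER DEFAULT 0,
    cost_usd    REAL DEFAULT 0.0,
    succeeded   INTEGER            -- 1/0, NULL while in flight
);
"""


def weekly_report(conn: sqlite3.Connection) -> str:
    """Render completed stage rows as a markdown table for human consumption."""
    rows = conn.execute(
        """SELECT stage,
                  COUNT(*) AS n,
                  AVG(finished_at - started_at) AS avg_s,
                  AVG(succeeded) AS success_rate
           FROM stage_metrics
           WHERE finished_at IS NOT NULL
           GROUP BY stage"""
    ).fetchall()
    lines = [
        "| Stage | Tasks | Avg duration (s) | Success rate |",
        "| --- | --- | --- | --- |",
    ]
    for stage, n, avg_s, rate in rows:
        lines.append(f"| {stage} | {n} | {avg_s:.1f} | {rate:.0%} |")
    return "\n".join(lines)
```

The report string can be written into the vault on whatever cadence the "evaluation frequency" question settles on, keeping the raw data queryable and the summary visible in Obsidian.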
### 3. Improvement Actions
**Question:** How should evaluation findings be acted on?
| Option | Description |
| --- | --- |
| A) Propose improvements as FRs/proposals | Findings generate proposals in vault/90_inbox/proposals/ |
| B) Auto-tune parameters | Automatically adjust thresholds, routing rules, etc. |
| C) Dashboard only | Show metrics; let a human decide |
**Recommendation:** Option A for Phases 1–2 (a human reviews proposals), Option B for well-understood parameters in Phase 3.

**Decision:**
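Option A amounts to turning a finding plus its supporting evidence into a note under vault/90_inbox/proposals/. A sketch; the filename convention and note template are assumptions, only the directory comes from the table above:

```python
from datetime import date
from pathlib import Path


def write_proposal(vault: Path, finding: str, evidence: str) -> Path:
    """Turn an evaluation finding into a proposal note for human review (Option A).

    The vault/90_inbox/proposals/ location matches the path named in the
    options table; the date-slug filename and section layout are illustrative.
    """
    proposals = vault / "90_inbox" / "proposals"
    proposals.mkdir(parents=True, exist_ok=True)
    slug = finding.lower().replace(" ", "-")[:40]
    note = proposals / f"{date.today().isoformat()}-{slug}.md"
    note.write_text(
        f"# Proposal: {finding}\n\n"
        f"## Evidence\n{evidence}\n\n"
        f"## Status\nPending human review\n"
    )
    return note
```

Keeping proposals as plain markdown means they flow through the same intake pipeline as any other task, so accepted improvements are themselves tracked by the evaluation framework.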