Decisions
- Pending: Storage backend — JSON file, SQLite, or in-memory?
- Pending: Maximum concurrent agents?
- Pending: Priority formula — fixed categories or weighted scoring?
- Pending: Starvation prevention threshold — how quickly should low-priority jobs escalate?
User Tasks
Summary
Build a central job registry that tracks all work in the system and a priority queue that dispatches jobs to available agents based on urgency, dependencies, and resource availability.
Problem / Motivation
Opus currently works one task at a time in a single session. As the system grows with multiple agents (FR-043), complexity routing (FR-045), and automated triggers (FR-050), tasks will pile up and need centralized management. Without a job registry:
- No visibility into what is queued, running, or completed
- No way to prioritize urgent work over background tasks
- No dependency tracking between jobs
- No concurrent execution — agents sit idle while work waits
- User messages compete with automated tasks without priority differentiation
A job registry and priority queue solve this by becoming the central nervous system for all work in Opus.
Proposed Solution
Build a Python module with two core components:
Job Registry — A persistent store tracking every piece of work:
- Job ID, type, source (user, automated, agent-spawned)
- Status: queued, running, completed, failed, blocked
- Assigned agent (if running)
- Created, started, completed timestamps
- Dependencies (job B can’t start until job A finishes)
- Result/output summary
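The registry fields above can be sketched as a dataclass. Names and types here are illustrative, not a final schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    BLOCKED = "blocked"


@dataclass
class Job:
    id: str
    type: str
    source: str                          # "user", "automated", or "agent-spawned"
    status: JobStatus = JobStatus.QUEUED
    agent: Optional[str] = None          # assigned agent while running
    created_at: Optional[float] = None
    started_at: Optional[float] = None
    completed_at: Optional[float] = None
    dependencies: List[str] = field(default_factory=list)  # job IDs that must finish first
    result: Optional[str] = None         # output summary
```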
Priority Queue — Intelligent job ordering:
- Human messages always jump to the front — interactive requests take absolute priority
- Priority tiers: critical (user-interactive) > urgent (bug fixes) > normal (features) > low (maintenance, cleanup)
- Dependency-aware: jobs with unsatisfied dependencies stay blocked regardless of priority
- Starvation prevention: low-priority jobs gradually increase in priority over time so they eventually get processed
- Concurrency management: dispatch jobs to idle agents, respect max-concurrent limits
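The ordering rules above (tiers, aging, dependency blocking) can be sketched as a selection function. The tier weights and aging rate are assumed values for illustration, not decided ones:

```python
import time

# Assumed tier weights; lower number = higher priority.
TIERS = {"critical": 0, "urgent": 1, "normal": 2, "low": 3}
AGING_RATE = 1 / 600  # assumed: one tier of boost per 10 minutes of waiting


def effective_priority(tier: str, enqueued_at: float, now: float) -> float:
    """Lower is better. Waiting jobs slowly climb toward higher tiers."""
    return TIERS[tier] - (now - enqueued_at) * AGING_RATE


def next_job(jobs, completed_ids, now=None):
    """Pick the highest-priority job whose dependencies are all satisfied."""
    now = now if now is not None else time.time()
    ready = [j for j in jobs if set(j["deps"]) <= completed_ids]
    if not ready:
        return None
    return min(ready, key=lambda j: effective_priority(j["tier"], j["enqueued_at"], now))
```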
What This Enables:
- Kick off multiple features before bed and wake up to find them all completed
- While one agent implements FR-040, another reviews FR-050, and a third researches FR-039
- The system self-manages workload without manual scheduling
- Full audit trail of all work performed
Open Questions
1. Storage Backend
Question: Where should job data be persisted?
| Option | Description |
|---|---|
| A) JSON file | Simple, human-readable, easy to debug. May have concurrency issues with multiple writers |
| B) SQLite | Built into Python, proper querying, handles concurrent reads, single-file storage |
| C) In-memory only | Fastest, but data lost on restart. Only suitable if jobs are short-lived |
Recommendation: Option B — SQLite is the sweet spot. Built-in, fast, queryable, persists across restarts, handles concurrent access.
Decision:
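If Option B is adopted, a minimal schema might look like the following. Column names mirror the registry fields from the Proposed Solution and are illustrative only:

```python
import sqlite3

# Hypothetical schema, assuming the SQLite option is chosen.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id           TEXT PRIMARY KEY,
    type         TEXT NOT NULL,
    source       TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'queued',
    agent        TEXT,
    created_at   REAL,
    started_at   REAL,
    completed_at REAL,
    dependencies TEXT,   -- JSON-encoded list of job IDs
    result       TEXT
);
CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs (status);
"""

conn = sqlite3.connect(":memory:")  # a file path in production
conn.executescript(SCHEMA)
```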
2. Maximum Concurrent Agents
Question: How many agents should run simultaneously?
| Option | Description |
|---|---|
| A) 1 (sequential) | Simplest, no coordination needed. Defeats the purpose for Phase 3 |
| B) 3-4 concurrent | Good parallelism without overwhelming system resources |
| C) Unlimited | Maximum throughput, but could exhaust API limits or system resources |
Recommendation: Option B — start with 3, make it configurable. Respect API rate limits.
Decision:
3. Priority Formula
Question: How should job priority be calculated?
| Option | Description |
|---|---|
| A) Tier-based with aging | Fixed tiers (critical/urgent/normal/low) + time-based priority boost. Simple and predictable |
| B) Weighted scoring | Multi-factor score (urgency * impact * age * dependencies). More nuanced but harder to debug |
| C) User-assigned only | User manually sets priority for each job. Maximum control, minimum automation |
Recommendation: Option A — tier-based is transparent and easy to reason about. Aging prevents starvation naturally.
Decision:
Phase Overview
| Phase | Description | Status |
|---|---|---|
| Phase 1 | Job data model + registry | — |
| Phase 2 | Priority scoring + queue | — |
| Phase 3 | Concurrent dispatch | — |
| Phase 4 | Monitoring dashboard | — |
Phase 1: Job Data Model + Registry —
Goal: Define the job data model and build a persistent registry for tracking all work.
| File / Feature | Details | Owner | Status |
|---|---|---|---|
| src/opus/jobs/models.py | Job dataclass: id, type, source, status, agent, timestamps, dependencies, result | opus | — |
| src/opus/jobs/registry.py | Registry class: create, update, query, list jobs. Persistent storage backend | opus | — |
| src/opus/jobs/storage.py | Storage adapter (SQLite or JSON, based on decision) | opus | — |
| Job status transitions | Define valid status transitions: queued → running → completed/failed, queued → blocked → queued | opus | — |
| Unit tests | CRUD operations, status transitions, persistence across restarts | mv | — |
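The transition rule in the table above can be captured with a small lookup; the guard function itself is a sketch:

```python
# Valid edges taken from the Phase 1 table:
# queued -> running -> completed/failed, queued -> blocked -> queued
VALID_TRANSITIONS = {
    "queued":    {"running", "blocked"},
    "blocked":   {"queued"},
    "running":   {"completed", "failed"},
    "completed": set(),
    "failed":    set(),
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the edge is not allowed."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```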
Phase 2: Priority Scoring + Queue —
Goal: Implement priority-based job ordering with starvation prevention.
| File / Feature | Details | Owner | Status |
|---|---|---|---|
| src/opus/jobs/priority.py | Priority tiers, scoring logic, aging algorithm | opus | — |
| src/opus/jobs/queue.py | Priority queue: enqueue, dequeue (highest priority non-blocked job), peek | opus | — |
| Dependency tracking | Jobs blocked by incomplete dependencies stay in queue but are not dispatched | opus | — |
| Starvation prevention | Low-priority jobs gain priority points over time | opus | — |
| Human message priority | Interactive user messages always get critical priority | mv | — |
| Unit tests | Priority ordering, aging, dependency blocking, human message preemption | mv | — |
Phase 3: Concurrent Dispatch —
Goal: Dispatch jobs to multiple agents in parallel, managing agent availability and resource limits.
| File / Feature | Details | Owner | Status |
|---|---|---|---|
| src/opus/jobs/dispatcher.py | Dispatch engine: assign jobs to idle agents, track agent availability | opus | — |
| Agent pool management | Track which agents are busy/idle, respect max-concurrent limit | opus | — |
| Job completion handling | On agent completion: update registry, check if blocked jobs are now unblocked | opus | — |
| Failure handling | On agent failure: mark job failed, retry policy, escalation to user | mv | — |
| Integration with FR-043 agents | Wire dispatcher to actual agent invocations | opus | — |
| Integration tests | Multi-job dispatch, concurrent execution, dependency resolution | mv | — |
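A minimal version of the dispatch loop, assuming the recommended max-concurrent limit of 3. The queue and agent structures here are placeholders for whatever Phases 1-2 actually produce:

```python
MAX_CONCURRENT = 3  # assumed, per the "Maximum Concurrent Agents" recommendation


def dispatch(queue, busy_agents, idle_agents):
    """Assign queued jobs to idle agents until the pool, queue, or limit runs dry."""
    assigned = []
    while idle_agents and queue and len(busy_agents) < MAX_CONCURRENT:
        job = queue.pop(0)        # highest-priority ready job, per Phase 2 ordering
        agent = idle_agents.pop()
        busy_agents[agent] = job  # mark the agent busy
        assigned.append((agent, job))
    return assigned
```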
Phase 4: Monitoring Dashboard —
Goal: Provide visibility into the job system state for the user.
| File / Feature | Details | Owner | Status |
|---|---|---|---|
| /status jobs query | Show current queue state: running, queued, completed, failed | opus | — |
| vault/00_system/dashboards/job-dashboard.md | Auto-generated dashboard: recent jobs, agent utilization, throughput stats | opus | — |
| Job history | Query completed jobs by date range, agent, type | opus | — |
| Performance metrics | Average job duration, success rate, queue wait time | opus | — |
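Assuming the SQLite backend, the performance-metrics row above could be computed with plain SQL; the table layout here repeats the hypothetical Phase 1 schema in reduced form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id TEXT PRIMARY KEY, status TEXT,
    created_at REAL, started_at REAL, completed_at REAL)""")
# Sample rows for illustration only.
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [("j1", "completed", 0.0, 1.0, 5.0),
     ("j2", "completed", 0.0, 2.0, 4.0),
     ("j3", "failed",    0.0, 3.0, 6.0)],
)

# Average job duration over completed jobs.
avg_duration, = conn.execute(
    "SELECT AVG(completed_at - started_at) FROM jobs WHERE status = 'completed'"
).fetchone()
# Success rate: SQLite comparisons yield 0/1, so AVG gives the fraction.
success_rate, = conn.execute(
    "SELECT AVG(status = 'completed') FROM jobs"
).fetchone()
```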
Prerequisites / Gap Analysis
Requirements
| Requirement | Description |
|---|---|
| REQ-1 | Python project scaffold (FR-009) — jobs module is Python code |
| REQ-2 | Custom agents (FR-043) — dispatcher dispatches to agents |
Current State
| Component | Status | Details |
|---|---|---|
| Python scaffold | — | FR-009 not started |
| Agent system | — | FR-043 not started |
| Job tracking | — | No centralized job management exists |
Gap (What’s missing?)
| Gap | Effort | Blocker? |
|---|---|---|
| Python scaffold (FR-009) | Med | Yes — code needs a home |
| Agent system (FR-043) | High | Yes for Phase 3 — dispatch needs agents |
| Storage backend decision | Low | No — Phase 1 can start with either option |
Test
Manual tests
| Test | Expected | Actual | Last run |
|---|---|---|---|
| Human message preempts queued jobs | User message jumps to front | pending | - |
AI-verified tests
| Scenario | Expected behavior | Verification method |
|---|---|---|
| … | … | … |
E2E tests
| Scenario | Assertion |
|---|---|
| … | … |
Integration tests
| Component | Coverage |
|---|---|
| … | … |
Unit tests
| Component | Tests | Coverage |
|---|---|---|
| … | … | … |
History
| Date | Event | Details |
|---|---|---|
| 2026-03-04 | Created | Inspired by Nexie’s job registry and priority queue architecture |
References
- FR-009 (Python Project Scaffold) — code infrastructure prerequisite
- FR-043 (Custom Agents) — agents are the workers that execute dispatched jobs
- FR-045 (Complexity Routing) — routing feeds scored jobs into the queue
- FR-031 (Workflow State Machine) — state machine triggers job creation on transitions
- FR-050 (Proactive Monitoring) — monitoring creates jobs when issues detected
- Nexie (Sven Hennig) — original inspiration for centralized job management