PRD — Agent Orchestration Harness
Reusable orchestration layer that triggers Claude Managed Agent sessions, streams their events, and routes their outputs. Built once for SEO Sentinel v1 (the first agent in the Local SEO Automation workflow) but designed as a generic harness that every future agent — Content Catalyst, Revenue Relay, Ad Arbitrage, Build Bot, PM Pulse — plugs into via a per-agent config file. This PRD pairs with prd_seo_sentinel_v1.html; the Sentinel PRD defines WHAT runs, this PRD defines HOW it gets triggered and where its output goes.
This orchestration harness is built as a separate, reusable service because it's not Sentinel-specific. Every future Managed Agent (Catalyst, Revenue Relay, etc.) will register with this harness via its agent config. Building it properly once saves rebuilding it 5 times later.
Source of truth split:
- `prd_seo_sentinel_v1.html`: agent logic, skills, rubrics, modules, output contract
- `prd_orchestration_harness.html`: triggers, session lifecycle, HITL routing, deployment, monitoring (this doc)
Contents
- Executive summary
- Scope — v1 vs v2 vs v3
- Architecture overview
- Tech stack decisions
- Agent config schema
- Trigger sources
- Session lifecycle state machine
- SSE event handling
- Output routing
- Error handling & retries
- Idempotency
- Logging & monitoring
- HITL gate handler (v2 stub)
- VPS setup & deployment
- Project structure
- Environment variables
- Test plan
- Definition of Done
- Open questions
- Risks
- Appendix — reference code
1 Executive summary
A small Node.js service running on a VPS that (1) accepts triggers from ClickUp / Slack / cron / manual CLI, (2) reads a per-agent config to determine which Managed Agent to invoke, (3) resolves the client handoff payload from ClickUp or a passed ID, (4) creates an Anthropic session, (5) streams SSE events, (6) fetches output files when the session ends, (7) routes deliverables to Slack / ClickUp / Drive. Future agents plug in by adding a JSON config file — no code changes required for new agents that follow the standard pattern.
2 Scope — v1 vs v2 vs v3 Ship small, design for growth
| Feature | Version | Why this phasing |
|---|---|---|
| Manual CLI trigger (`./run-agent sentinel --client shamrock`) | v1 | Lets Trung + Jake dogfood from day 1 without webhook infrastructure dependencies. |
| ClickUp webhook trigger (task status → "Ready (Automate)") | v1 | Primary production trigger. Fires Sentinel automatically when a client is ready for audit. |
| Session create + SSE stream | v1 | Core capability. Can't ship without this. |
| Output fetch + Slack/ClickUp/Drive delivery | v1 | The whole point of running the agent. |
| Agent config file pattern | v1 | Design-time cheap. Retrofitting later is expensive. Do it right the first time. |
| Basic error handling + Slack alerts | v1 | Required for production confidence. |
| Idempotency (prevent double-triggers) | v1 | Webhooks retry. Without idempotency, one task can spin up 3 sessions. |
| Slack slash command (`/sentinel audit <client>`) | v2 | Nice-to-have. Ad-hoc reruns. Build after production stability. |
| Cron / scheduled triggers (monthly recurring) | v2 | Needed for Workflow 4 (Monthly Report). Not Sentinel v1 scope. |
| HITL gate handler (Slack Approve/Deny buttons) | v2 | Sentinel v1 is read-only — no gated actions. Becomes critical when PM Pulse ships client-facing outputs. |
| Multi-agent fanout (one trigger → multiple sessions) | v2 | Needed when PM Pulse delegates. Not v1 scope. |
| Retry queue with backoff (BullMQ or similar) | v2 | In-memory retry is fine for v1 volume (few runs per day). Upgrade when volume justifies. |
| Cost dashboard (per-session, per-client, per-agent) | v3 | Can read from Anthropic Console manually for now. Build when we have >5 agents running. |
| Multi-tenant / multi-workspace support | v3 | Only needed if we productize this for other agencies. YAGNI for now. |
v1 scope is everything needed to run Sentinel on a real client by Day 22. Everything else is explicit v2/v3.
3 Architecture overview
Design principles
- Single process, synchronous for v1. No workers, no queues. One HTTP server handles trigger → session → delivery in one flow. Resilient enough for <10 runs per day. Simpler = fewer failure modes.
- Agent-agnostic dispatcher. The dispatcher doesn't know what "Sentinel" does. It reads the agent's config file to determine which Managed Agent ID to invoke, which ClickUp fields to read from the triggering task, and which Slack/Drive channels to post to. Adding Catalyst later = new config file, zero code change.
- Local state in SQLite. No external database for v1. SQLite file on VPS tracks run history + idempotency keys + cost. Upgrade to Postgres when we need multi-process concurrency.
- Fail loud in Slack, not silent. Any unexpected error posts a formatted alert to `#seo-automation-alerts` with session ID, agent name, error, and last event. Never swallow exceptions.
- Event log for post-mortems. Every SSE event is persisted to `./logs/sessions/{session_id}.jsonl`. Size is manageable (~MB per session). Kept 90 days.
4 Tech stack decisions Opinionated defaults — swap if Trung prefers
| Component | Default choice | Why | Swap if |
|---|---|---|---|
| Language | Node.js 20+ (TypeScript) | Anthropic's TS SDK is the most feature-complete. Webhook handling + SSE streaming is idiomatic in Node. Jake's team already uses JS for GHL/ClickUp custom integrations. | Trung prefers Python — Python SDK is also fully supported; just use anthropic package and FastAPI. |
| HTTP framework | Express 4.x | Minimal, well-known, webhook-friendly. | Trung prefers Fastify (faster, better TS ergonomics) or Hono (edge-ready). Functionally equivalent for our scale. |
| State / persistence | better-sqlite3 (synchronous SQLite) | Single-file DB, no server, transactional, fast enough for our volume. | Expected volume >50 runs/day → upgrade to Postgres on same VPS. |
| Anthropic SDK | @anthropic-ai/sdk | Official. Sets beta headers automatically. TypeScript types for events. | No swap — this is the canonical client. |
| Config format | JSON files in ./agents/ | Diffable, version-controllable, no build step, zero tooling. | Larger team + environment differentiation → YAML with schema validation via Ajv. |
| Hosting | Hetzner CX22 (~$5/mo) or DigitalOcean droplet (~$12/mo) | Cheap. Enough CPU/memory. Single small service. Docker not needed for v1. | Already running a Kubernetes cluster — deploy there. Overkill otherwise. |
| Process manager | systemd service | Native on Ubuntu. Auto-restart on crash. Logs to journalctl. Simple. | Using PM2 elsewhere — fine to use here too. |
| HTTPS / webhook endpoint | Caddy reverse proxy | Auto-provisions Let's Encrypt cert. 5-line config. Zero fuss. | Using nginx elsewhere — also fine. |
| Secret management | systemd EnvironmentFile with 0600 perms OR .env + dotenv-safe | No external secret store needed at this scale. File-based with restrictive perms is adequate. | Organizational policy requires Vault / AWS Secrets Manager. |
5 Agent config schema The core abstraction that makes this reusable
One JSON file per agent in ./agents/{agent_name}.json. The dispatcher reads this to know how to handle the agent's session.
To add Content Catalyst later, Trung drops in ./agents/catalyst.json with different agent_id, kickoff prompt, required ClickUp fields, and delivery channels. Zero code changes to the dispatcher. Same for Revenue Relay, Ad Arbitrage, Build Bot. The abstraction pays for itself on the second agent.
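The schema itself is not reproduced in this document, so the following is only a sketch of what a config and its loader might look like. `agent_id`, `required_fields`, `delivery`, `max_session_cost_usd`, `alert_if_cost_exceeds_usd`, the version field, and the `{payload_json}` / `{client_id}` placeholders are mentioned elsewhere in this PRD. Every other field name, the example values, and the helper names `validateAgentConfig` and `renderKickoff` are assumptions.

```typescript
// Hypothetical shape of ./agents/{agent_name}.json. Fields not named in the PRD are illustrative.
interface AgentConfig {
  version: string;
  name: string;
  agent_id: string;
  kickoff_prompt: string; // may contain {client_id} and {payload_json}
  required_fields: string[]; // ClickUp custom fields the trigger task must have
  max_session_cost_usd: number;
  alert_if_cost_exceeds_usd: number;
  delivery: Record<string, { enabled: boolean }>;
}

// Throws on the first missing or mistyped field so a bad config fails at load time.
function validateAgentConfig(raw: unknown): AgentConfig {
  if (typeof raw !== "object" || raw === null) throw new Error("config must be an object");
  const c = raw as Record<string, unknown>;
  for (const key of ["version", "name", "agent_id", "kickoff_prompt"]) {
    if (typeof c[key] !== "string" || c[key] === "") throw new Error(`missing or invalid field: ${key}`);
  }
  const fields = c["required_fields"];
  if (!Array.isArray(fields) || !fields.every((f) => typeof f === "string")) {
    throw new Error("required_fields must be an array of strings");
  }
  for (const key of ["max_session_cost_usd", "alert_if_cost_exceeds_usd"]) {
    if (typeof c[key] !== "number") throw new Error(`missing or invalid field: ${key}`);
  }
  if (typeof c["delivery"] !== "object" || c["delivery"] === null) throw new Error("delivery must be an object");
  return c as unknown as AgentConfig;
}

// Substitutes the two placeholders used in kickoff prompts.
function renderKickoff(template: string, clientId: string, payload: unknown): string {
  return template
    .split("{client_id}").join(clientId)
    .split("{payload_json}").join(JSON.stringify(payload));
}

// Example values are placeholders, not the real Sentinel config.
const exampleConfig = {
  version: "1.0",
  name: "sentinel",
  agent_id: "agent_placeholder_id",
  kickoff_prompt: "Audit client {client_id}. Handoff: {payload_json}",
  required_fields: ["client_id", "website_url"],
  max_session_cost_usd: 5,
  alert_if_cost_exceeds_usd: 3,
  delivery: { slack: { enabled: true }, clickup: { enabled: true }, drive: { enabled: false } },
};
```

Validating at load time rather than at dispatch time means a malformed `catalyst.json` is caught on deploy, not on the first production trigger.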
6 Trigger sources How a run starts
Manual CLI v1
ClickUp webhook v1
ClickUp fires a webhook on task status change. Our endpoint receives it, validates, and triggers the run.
Handler responsibilities
- Verify HMAC signature using shared secret. Return 401 on mismatch.
- Idempotency check: has this `task_id + status_change_id` already been processed? If yes, return 200 immediately (don't re-run).
- Resolve agent config: the URL path tells us which agent (`/webhook/sentinel` → `sentinel.json`).
- Fetch task details from the ClickUp API. Extract `required_fields` from task custom fields. If any are missing, post an error to Slack, update task status to "Blocked", return 200.
- Build kickoff payload, dispatch to session creation logic.
- Return 200 to ClickUp within 3s (webhook timeout). The actual session runs asynchronously — we acknowledge fast, work slow.
ClickUp webhooks time out at ~3 seconds and retry 3 times on failure. Don't wait for the full session to complete before responding. Return 200 immediately, run the session in a background promise. If the session fails later, alert via Slack, not via HTTP response.
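The signature check in the first handler step could look roughly like this. It assumes HMAC-SHA256 over the raw request body with a hex-encoded signature, which is the same unverified assumption listed under Open questions. The function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes the hex signature we expect for a given raw body (assumed scheme: HMAC-SHA256).
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Constant-time comparison so response timing does not leak how many bytes matched.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(signBody(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject those up front.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The check must run against the raw body bytes, before any JSON parsing, because re-serializing the parsed body can change whitespace or key order and break the comparison.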
Future triggers (v2)
Both follow the same dispatcher pattern — just different payload sources.
- Slack slash command `/sentinel audit <client-name>`: POST to `/slack/command`, Slack HMAC verification, look up client by name, trigger run.
- Cron: a systemd timer fires a binary that reads a schedule config (which agent, which clients, which frequency) and dispatches runs. Used for monthly reporting.
7 Session lifecycle state machine The spine of the dispatcher
| State | Trigger event (from Anthropic SSE) | Dispatcher action |
|---|---|---|
| `CREATED` | After `POST /v1/sessions` returns 201 | Record session_id + trigger_id in SQLite. Move to `SENDING_KICKOFF`. |
| `SENDING_KICKOFF` | (none) | POST `user.message` event with kickoff prompt. Move to `STREAMING`. |
| `STREAMING` | Any `agent.*` event | Log event to `logs/sessions/{id}.jsonl`. Extract text from `agent.message` events. |
| `STREAMING` | `session.status_idle` with `stop_reason: end_turn` | Close stream. Move to `FETCHING_OUTPUTS`. |
| `STREAMING` | `session.status_idle` with `stop_reason: requires_action` | v1: log unexpected, alert. v2: move to `REQUIRES_ACTION`, route to Slack approval. |
| `FETCHING_OUTPUTS` | (none) | `GET /v1/files?scope_id={session_id}` with Files beta header. Download each to `./tmp/{session_id}/`. Move to `DELIVERING`. |
| `DELIVERING` | (none) | Execute each enabled delivery channel in config (Slack, ClickUp, Drive). Collect errors but don't abort. |
| `DONE` | (none) | Update SQLite run log with final status + cost. Slack success message. Clean up tmp files. |
| `ERROR` | Exception thrown anywhere | Log full error + session context. Slack alert to `#seo-automation-alerts` tagging @trung. Attempt to mark ClickUp task "Blocked". |
| `BUDGET_EXCEEDED` | Cost sampler detects session exceeds `max_session_cost_usd` | Send `user.interrupt` event to halt agent. Move to `TERMINATED`. Alert Slack. |
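One way to keep the table honest in code is a pure transition function the dispatcher calls after each step. This is a sketch. The `Signal` names are invented for illustration, and treating an unexpected `requires_action` as `ERROR` in v1 is an assumption, since the table only says "log unexpected, alert".

```typescript
type State =
  | "CREATED" | "SENDING_KICKOFF" | "STREAMING" | "FETCHING_OUTPUTS" | "DELIVERING"
  | "DONE" | "ERROR" | "BUDGET_EXCEEDED" | "TERMINATED" | "REQUIRES_ACTION";

// Hypothetical internal signals the dispatcher raises as it completes each step.
type Signal =
  | { kind: "session_recorded" }
  | { kind: "kickoff_sent" }
  | { kind: "idle"; stopReason: "end_turn" | "requires_action" }
  | { kind: "outputs_fetched" }
  | { kind: "delivered" }
  | { kind: "budget_exceeded" }
  | { kind: "interrupt_sent" }
  | { kind: "exception" };

// Returns the next state, or throws if the table has no row for this combination.
function nextState(state: State, signal: Signal, hitlEnabled = false): State {
  if (signal.kind === "exception") return "ERROR"; // "Exception thrown anywhere"
  switch (state) {
    case "CREATED":
      if (signal.kind === "session_recorded") return "SENDING_KICKOFF";
      break;
    case "SENDING_KICKOFF":
      if (signal.kind === "kickoff_sent") return "STREAMING";
      break;
    case "STREAMING":
      if (signal.kind === "budget_exceeded") return "BUDGET_EXCEEDED";
      if (signal.kind === "idle") {
        if (signal.stopReason === "end_turn") return "FETCHING_OUTPUTS";
        // Assumption: v1 has no gated tools, so requires_action is treated as an error.
        return hitlEnabled ? "REQUIRES_ACTION" : "ERROR";
      }
      break;
    case "BUDGET_EXCEEDED":
      if (signal.kind === "interrupt_sent") return "TERMINATED";
      break;
    case "FETCHING_OUTPUTS":
      if (signal.kind === "outputs_fetched") return "DELIVERING";
      break;
    case "DELIVERING":
      if (signal.kind === "delivered") return "DONE";
      break;
  }
  throw new Error(`invalid transition: ${state} on ${signal.kind}`);
}
```

Keeping the function free of I/O makes every row of the table unit-testable without an Anthropic session.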
Every ~30 seconds during STREAMING, fetch the session via GET /v1/sessions/{id} and compute cost from usage.input_tokens / output_tokens / cache_read_input_tokens. If projected exceeds alert_if_cost_exceeds_usd, Slack warning. If exceeds max_session_cost_usd, interrupt the session. Protects against runaway loops.
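The arithmetic behind the sampler can be sketched as below. Per-token rates are passed in rather than hard-coded because this PRD does not state them. The `Rates` shape and the function names are assumptions, and the sketch compares current rather than projected cost.

```typescript
// Token counts as read from the session's usage object.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens: number;
}

// USD per million tokens. Values must come from current pricing, not from this sketch.
interface Rates {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadPerMTok: number;
}

function sessionCostUsd(usage: Usage, rates: Rates): number {
  const total =
    usage.input_tokens * rates.inputPerMTok +
    usage.output_tokens * rates.outputPerMTok +
    usage.cache_read_input_tokens * rates.cacheReadPerMTok;
  return total / 1e6;
}

type BudgetVerdict = "ok" | "warn" | "interrupt";

// Maps a cost sample onto the two thresholds from the agent config.
function checkBudget(
  costUsd: number,
  limits: { alert_if_cost_exceeds_usd: number; max_session_cost_usd: number },
): BudgetVerdict {
  if (costUsd > limits.max_session_cost_usd) return "interrupt";
  if (costUsd > limits.alert_if_cost_exceeds_usd) return "warn";
  return "ok";
}
```

A `warn` verdict should probably fire the Slack warning only once per session, so the dispatcher would need to remember that it already warned.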
8 SSE event handling The event loop
Critical rules from the Managed Agents docs
- Send the user event BEFORE opening the stream. The API buffers events until a stream attaches. If you open the stream first, you may miss the initial burst.
- Always check `session.status_idle` events for `stop_reason`. `end_turn` = done. `requires_action` = session paused waiting on you.
- Reconnect on stream drop. A network blip shouldn't kill a 30-minute session. On connection error, wait 2s and reopen the stream; Anthropic buffers events so you don't lose them.
- Log every event. Post-mortem debugging is impossible without the event log.
Event types we care about in v1
| Event type | What to do |
|---|---|
| `agent.message` | Extract `content[].text`. Write to console in CLI mode. Nothing else: final output is in files, not messages. |
| `agent.tool_use` | Log tool name + input summary. Useful for cost debugging. No action. |
| `agent.mcp_tool_use` | Log MCP name + tool + input. Same: observability only. |
| `agent.custom_tool_use` | v1: we have no custom tools. If this fires, it's a bug, so alert. |
| `session.status_idle` | Check `stop_reason`. Drive state machine. |
| `session.thread_created` | v1: single-thread only. If this fires, log. v2: multi-agent fanout handling. |
| `span.*` | Telemetry. Log for debugging. Ignore in v1 logic. |
Minimal handler implementation (TypeScript)
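The original listing is not included here. As a stand-in, this sketch shows only the per-event decision logic from the table above, written as a pure function so it does not depend on any particular SDK streaming call. The event payload shape, in particular a top-level `stop_reason` string, and the `Action` type are assumptions to check against the Managed Agents docs.

```typescript
// Minimal event shape: only `type` is relied on; everything else is read defensively.
type SessionEvent = { type: string; [key: string]: unknown };

// What the dispatcher should do beyond appending the raw event to the JSONL log.
type Action =
  | { kind: "print"; text: string }
  | { kind: "log" }
  | { kind: "alert"; reason: string }
  | { kind: "idle"; stopReason: string };

function handleEvent(event: SessionEvent): Action {
  switch (event.type) {
    case "agent.message": {
      // Concatenate every text block; non-text blocks contribute nothing.
      const content = event["content"];
      const blocks: unknown[] = Array.isArray(content) ? content : [];
      const text = blocks
        .map((b) => {
          if (typeof b !== "object" || b === null) return "";
          const t = (b as { text?: unknown }).text;
          return typeof t === "string" ? t : "";
        })
        .join("");
      return { kind: "print", text };
    }
    case "agent.custom_tool_use":
      return { kind: "alert", reason: "custom tool use is unexpected in v1" };
    case "session.status_idle":
      // Assumed field name; the caller feeds stopReason into the state machine.
      return { kind: "idle", stopReason: String(event["stop_reason"]) };
    default:
      // agent.tool_use, agent.mcp_tool_use, session.thread_created, span.*: log only.
      return { kind: "log" };
  }
}
```

The caller is expected to write every event to the session log before calling `handleEvent`, so the "log every event" rule holds regardless of the returned action.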
9 Output routing Where deliverables go
The flow
- List session files: `GET /v1/files?scope_id={session_id}` with the `files-api-2025-04-14` beta.
- Download each file via `GET /v1/files/{id}/content`. Save to `./tmp/{session_id}/`.
- For each enabled delivery channel in `config.delivery`, execute in parallel. Collect results.
- Post a final summary to the main Slack channel with links to all deliverables.
- Clean up tmp files.
Channel handlers (abbreviated)
| Channel | What it does | Failure mode |
|---|---|---|
| Google Drive | Upload files to folder resolved from template (/Clients/{client_name}/Audits/). Create folder if missing. Record returned web_view_link for linking in other channels. | Log error, continue with other channels. Don't block Slack/ClickUp on Drive failure. |
| ClickUp | Post task comment with run summary + Drive links. Update task status per config. Attach output files directly to the task. | Retry once on 5xx. On persistent fail, Slack alert. |
| Slack | Post to #seo-automation: structured message with client name, agent name, run duration, cost, 3-bullet summary extracted from markdown output, link to Drive audit, link to ClickUp task. | Retry once. Slack is usually reliable. |
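The "execute in parallel, collect errors, don't abort" behavior maps naturally onto `Promise.allSettled`. A minimal sketch, assuming each channel is wrapped in a handler that resolves to a link or message ID; the `ChannelHandler` and `DeliveryReport` names are hypothetical, and per-channel retries are left out.

```typescript
// A channel handler resolves to something linkable (URL, message ID) or rejects.
type ChannelHandler = () => Promise<string>;

interface DeliveryReport {
  delivered: Record<string, string>; // channel name -> returned link or ID
  failed: Record<string, string>; // channel name -> error message
}

// Runs every handler concurrently; one channel failing never blocks the others.
async function deliverAll(handlers: Record<string, ChannelHandler>): Promise<DeliveryReport> {
  const names = Object.keys(handlers);
  const results = await Promise.allSettled(
    // Promise.resolve().then(...) also captures handlers that throw synchronously.
    names.map((name) => Promise.resolve().then(() => handlers[name]())),
  );
  const report: DeliveryReport = { delivered: {}, failed: {} };
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      report.delivered[names[i]] = result.value;
    } else {
      const reason: unknown = result.reason;
      report.failed[names[i]] = reason instanceof Error ? reason.message : String(reason);
    }
  });
  return report;
}
```

Note that the Slack and ClickUp messages link to the Drive upload, so in practice Drive may need to finish first, with the other channels falling back to no link when it fails.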
Slack message format (design)
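The design mock is not reproduced here. As a placeholder, this sketch builds a Block Kit payload from the fields listed in the Slack row of the channel table: client, agent, duration, cost, three summary bullets, Drive link, ClickUp link. The layout, wording, and `RunSummary` type are assumptions, not the agreed design.

```typescript
interface RunSummary {
  clientName: string;
  agentName: string;
  durationSec: number;
  costUsd: number;
  bullets: string[]; // summary lines extracted from the markdown output
  driveUrl: string;
  clickupUrl: string;
}

// Returns a body suitable for Slack's chat.postMessage: fallback text plus four section blocks.
function buildSlackMessage(channel: string, run: RunSummary) {
  const minutes = Math.floor(run.durationSec / 60);
  const seconds = run.durationSec % 60;
  const header = `*${run.agentName}* finished for *${run.clientName}*`;
  const stats = `Duration: ${minutes}m ${seconds}s | Cost: $${run.costUsd.toFixed(2)}`;
  const summary = run.bullets.slice(0, 3).map((b) => `• ${b}`).join("\n");
  // Slack mrkdwn link syntax is <url|label>.
  const links = `<${run.driveUrl}|Drive audit> | <${run.clickupUrl}|ClickUp task>`;
  return {
    channel,
    text: `${run.agentName} finished for ${run.clientName}`, // shown in notifications
    blocks: [header, stats, summary, links].map((text) => ({
      type: "section",
      text: { type: "mrkdwn", text },
    })),
  };
}
```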
10 Error handling & retries
| Failure | Detection | Response | Retry? |
|---|---|---|---|
| ClickUp webhook signature invalid | HMAC mismatch | Return 401, log source IP | No |
| ClickUp task missing required fields | Config schema validation on task fetch | Post clear error to Slack with list of missing fields. Update task status → "Blocked". Return 200 to webhook. | No (human fixes fields + re-triggers) |
| Anthropic session creation 5xx | HTTP status | Retry 3x with exponential backoff (1s, 4s, 16s). Log to SQLite run table. | Yes |
| Anthropic 4xx on session creation | HTTP status | Log, alert. Don't retry — it's our bug (bad agent_id, quota, etc.) | No |
| SSE stream dropped mid-run | Connection error on for await | Wait 2s, reopen stream. Anthropic buffers events, so we don't lose anything. | Yes (transparent) |
| Agent session exceeds budget | Cost sampler | Send user.interrupt, mark session TERMINATED, Slack alert | No |
| Agent session exceeds duration | Wall clock | Same as budget exceeded | No |
| Output file download fails | Files API error | Retry 3x. If persistent fail, Slack alert with session ID so human can manually fetch from Console. | Yes |
| Slack/ClickUp/Drive delivery fails | API error per channel | Retry once. Collect all delivery errors, post combined failure summary. Outputs remain in tmp/ until manually cleared. | Yes (once) |
| SQLite write failure | better-sqlite3 throws | Log to file. Alert. Run continues — state is reconstructable from event log. | No |
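The "retry 3x with exponential backoff (1s, 4s, 16s)" row can be sketched as a small wrapper. The helper names are hypothetical, and the `isRetryable` predicate is how the caller would separate 5xx (retry) from 4xx (don't). The sleep function is injectable so tests don't wait on real timers.

```typescript
// Delay schedule: base * factor^i, so the defaults give 1s, 4s, 16s.
function backoffDelaysMs(retries: number, baseMs = 1000, factor = 4): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * factor ** i);
}

// Calls fn once, then retries after each delay while the error is retryable.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  delaysMs: number[] = backoffDelaysMs(3),
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Out of retries, or an error class we must not retry (e.g. 4xx): surface it.
      if (attempt >= delaysMs.length || !isRetryable(err)) throw err;
      await sleep(delaysMs[attempt]);
    }
  }
}
```

With three delays the function makes at most four attempts in total: the initial call plus three retries.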
11 Idempotency
ClickUp webhooks retry on non-2xx response. Without idempotency, a slow response + retry can spawn duplicate sessions — two parallel Sentinel runs on the same client, wasting ~$3.50 and producing conflicting deliverables.
Idempotency key derivation
For webhooks: SHA256(agent_name + task_id + status_change_timestamp). Stored in idempotency_keys SQLite table with 24h TTL.
Flow
- On webhook arrival, compute key.
- INSERT OR IGNORE into table with key + session_id (or null if still creating).
- If INSERT succeeded → new run, proceed. If IGNORED → duplicate, return 200 with `{already_processed: true, original_session: ...}`.
- After session creation succeeds, UPDATE row with session_id.
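The key derivation and claim flow above can be sketched as follows. An in-memory `Map` stands in for the `idempotency_keys` SQLite table so the example stays self-contained; `claim` mimics the INSERT OR IGNORE step. The `|` delimiter between key parts is an added assumption to avoid ambiguous concatenation, and the class and method names are hypothetical.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL from the key derivation rule

// SHA256 over the three parts, joined with a delimiter (assumption) so "a"+"bc" != "ab"+"c".
function idempotencyKey(agentName: string, taskId: string, statusChangeTimestamp: string): string {
  return createHash("sha256")
    .update([agentName, taskId, statusChangeTimestamp].join("|"))
    .digest("hex");
}

// In-memory stand-in for the SQLite table.
class IdempotencyStore {
  private rows = new Map<string, { sessionId: string | null; expiresAt: number }>();

  // Equivalent of INSERT OR IGNORE: true means this caller owns the run.
  claim(key: string, now: number = Date.now()): boolean {
    const existing = this.rows.get(key);
    if (existing && existing.expiresAt > now) return false;
    this.rows.set(key, { sessionId: null, expiresAt: now + TTL_MS });
    return true;
  }

  // Equivalent of the UPDATE after session creation succeeds.
  attachSession(key: string, sessionId: string): void {
    const row = this.rows.get(key);
    if (row) row.sessionId = sessionId;
  }

  // Used to fill original_session in the duplicate response.
  sessionFor(key: string): string | null {
    return this.rows.get(key)?.sessionId ?? null;
  }
}
```

In the real service the claim has to be a single atomic statement against SQLite; a read followed by a write would reintroduce the race this section exists to prevent.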
12 Logging & monitoring
Three log streams
- Application log (pino) → stdout → journalctl. Structured JSON. Level: info in prod, debug if `DEBUG=1`.
- Session event log → `./logs/sessions/{session_id}.jsonl`. Raw SSE events. 90-day retention; a cron job cleans up older files.
- Run history → SQLite `runs` table: session_id, agent, trigger_source, payload_hash, started_at, ended_at, status, total_cost, error.
Monitoring — day-1 minimum
- Process alive: systemd auto-restart on crash. Uptime Kuma or equivalent hitting a `/health` endpoint every 5min.
- Disk space: cron checks session log dir, Slack alert if >80% full.
- Failed runs: cron queries the `runs` table hourly, Slack summary of any ERROR status.
- Cost anomalies: daily cron sums runs from SQLite, Slack alert if daily total >1.5× baseline.
Monitoring — later (v2+)
Grafana + Prometheus feels excessive for a 1-service VPS. Revisit when we have >3 agents + high volume.
13 HITL gate handler Stub for v2
Sentinel v1 is read-only — it audits, it doesn't publish anything. So no Human-In-The-Loop gates are required. HITL becomes critical in Sprint 2 when PM Pulse coordinates agents that produce client-facing deliverables (email drafts, GBP posts, published pages).
Design for when we add it
When the agent fires agent.tool_use with a tool that requires confirmation (configured in the Anthropic agent definition), the session emits session.status_idle with stop_reason: requires_action and requires_action.event_ids[] listing the blocking events.
Our dispatcher's REQUIRES_ACTION state handler will:
- Post a Slack interactive message with Approve / Deny buttons and the tool call context.
- Store pending session state in SQLite.
- Return control (release the stream connection).
- When the human responds via Slack interaction, POST a `user.tool_confirmation` event with `result: allow` or `deny` + optional `deny_message`.
- Reopen stream, resume STREAMING state.
For v1: stub that logs unexpected requires_action + Slack alerts. Full implementation in Sprint 2.
14 VPS setup & deployment
15 Project structure
16 Environment variables
17 Test plan
Unit tests (Vitest)
- Config loader: valid config parses, missing required field throws, unknown agent returns null.
- Idempotency: first call inserts, duplicate call returns existing.
- Payload resolver: builds payload from mock ClickUp task, handles missing fields.
- Template rendering: `{payload_json}` and `{client_id}` substitute correctly.
- Cost sampler: computes correct USD from token counts.
Integration tests (run against Anthropic)
- OT1: Manual CLI trigger with synthetic payload → session creates, event stream flows, at least one `agent.message` received, session reaches `end_turn`. Uses a minimal throwaway test agent to avoid Sentinel cost.
- OT2: Same but with bad `SENTINEL_AGENT_ID` → errors gracefully, Slack alert fires, no session record leaked.
- OT3: Simulate stream drop (kill connection mid-session) → reconnect happens, events don't repeat in log, session completes.
- OT4: Webhook endpoint with valid HMAC → returns 200 within 3s, session starts in background.
- OT5: Webhook endpoint with invalid HMAC → returns 401, no session created.
- OT6: Idempotency: fire same webhook 2x within 1s → second returns `already_processed: true`, no duplicate session.
- OT7: Delivery routing: mock session complete with fake output files → Slack post, ClickUp comment, Drive upload all succeed. Verify content manually on first run.
End-to-end (alongside Sentinel T1)
- Full Sentinel run via orchestrator CLI with synthetic client.
- Full Sentinel run via ClickUp webhook on a test task.
18 Definition of Done
- Service running on VPS under systemd with auto-restart, HTTPS via Caddy.
- `/health` endpoint returns 200.
- `agents/sentinel.json` exists and passes schema validation.
- CLI trigger (`./bin/run-agent sentinel ...`) works end-to-end with synthetic payload.
- ClickUp webhook trigger works end-to-end on a test task.
- All 7 integration tests (OT1–OT7) pass.
- Successful Sentinel T1 run → Slack post received, ClickUp task commented, Drive file visible.
- Failed run injection (invalid agent ID) → Slack alert to `#seo-automation-alerts` tagging @trung.
- README in repo documents: how to deploy, how to add a new agent, how to read logs, how to rotate keys.
- SQLite `runs` table + event log working. Can query `SELECT * FROM runs WHERE agent='sentinel' ORDER BY started_at DESC` and see history.
- Jake + Trung have walked through the deployment together once (knowledge transfer).
19 Open questions
Things I don't know for certain — verify before committing.
- ClickUp webhook HMAC algorithm and header name. I've assumed `X-Signature` with HMAC-SHA256 but Trung should verify from ClickUp API docs. Blocker for: webhook security. Low risk, just look it up.
- Session reconnect behavior after long disconnect. If the VPS reboots mid-session, can we reconnect to the stream on recovery? Or is the session lost? Docs suggest events persist server-side. Worth verifying with a deliberate test. Not blocking v1: rare edge case.
- Concurrent session limit per org. Rate limits state 60/min for create endpoints, but is there a cap on concurrent running sessions? We'd only hit it with high volume, but worth knowing. Ask Anthropic sales.
- ClickUp API rate limits when reading task details. Free tier is stricter. Verify our tier allows enough reads. Non-blocker: upgrade plan if needed.
- How Google Drive service account permissions propagate to shared drives. If client folders live in a shared drive, the SA needs explicit membership. Verify first client's folder setup before assuming this works. Non-blocker: standard Drive setup question.
- Slack bot token scopes needed. `chat:write` and `files:write` at minimum. Interactive message support needs `interactivity` in app config (v2 only).
20 Risks & mitigations
| Risk | Severity | Mitigation |
|---|---|---|
| VPS goes down; webhooks fail silently | High | Uptime monitor (UptimeRobot free) hits /health every 5min. SMS Trung on 3 consecutive fails. ClickUp will retry 3x; gives us ~15min to recover before data loss. |
| Webhook signature secret leaks (e.g., committed to git) | High | Pre-commit hook scans for CLICKUP_WEBHOOK_SECRET pattern. Rotate on suspicion. Use .env with chmod 600, gitignored. |
| Runaway session burns through budget | Med | Per-session cost cap + interrupt. Daily total cost alert. Both enforced client-side. |
| Duplicate sessions from webhook retries | Med | Idempotency (Section 11). Tested as part of OT6. |
| Stream disconnect causes partial delivery (Slack posted, Drive not uploaded) | Med | Idempotent delivery: each channel writes a completion marker to SQLite. Re-run of delivery phase is safe. |
| Disk fills from session event logs | Low | Daily cron cleans logs older than 90 days. Disk usage alert at 80%. |
| Anthropic API outage | Low | Session creation retries with backoff. If still failing, Slack alert + ClickUp task stays in "Ready" — human re-triggers when up. |
| Breaking change in Managed Agents beta | Low-Med | Pin SDK version. Subscribe to Anthropic release notes. Integration tests run weekly to catch drift. |
| Adding a second agent reveals that config schema is insufficient | Med | Design schema from Sentinel + at least think through what Content Catalyst will need. Explicit v1.0 version field on config so we can handle migrations. |
| Trung as single owner (bus factor) | Med | README must be detailed enough that a competent dev could take over. Walk Jake through the codebase once at Sprint 1 end. |
21 Appendix — reference code
A1. package.json (minimal)
A2. SQLite schema
A3. Health endpoint
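The original listing for this appendix item is not included. A minimal sketch of a `/health` endpoint, using `node:http` instead of Express purely to stay dependency-free; the response body shape and the function names are assumptions.

```typescript
import { createServer, Server } from "node:http";

// Body returned by /health. Uptime helps spot restart loops from the monitor side.
function healthPayload(startedAtMs: number, nowMs: number = Date.now()) {
  return { status: "ok", uptime_sec: Math.floor((nowMs - startedAtMs) / 1000) };
}

// Builds the server without listening, so the caller picks the port.
function createHealthServer(startedAtMs: number = Date.now()): Server {
  return createServer((req, res) => {
    if (req.method === "GET" && req.url === "/health") {
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify(healthPayload(startedAtMs)));
    } else {
      res.writeHead(404).end();
    }
  });
}
```

Calling `createHealthServer().listen(3000)` would expose it locally; the port is a placeholder, and in the deployed service the route would live on the same Express app as the webhook handlers, behind Caddy.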
- Managed Agents — events & streaming
- Managed Agents — quickstart
- ClickUp API + webhook docs
- Slack — verifying request signatures
- Paired PRD: `prd_seo_sentinel_v1.html`