Product Requirements Document · v1.0 · April 19, 2026

PRD — Agent Orchestration Harness

Reusable orchestration layer that triggers Claude Managed Agent sessions, streams their events, and routes their outputs. Built once for SEO Sentinel v1 (the first agent in the Local SEO Automation workflow) but designed as a generic harness that every future agent — Content Catalyst, Revenue Relay, Ad Arbitrage, Build Bot, PM Pulse — plugs into via a per-agent config file. This PRD pairs with prd_seo_sentinel_v1.html; the Sentinel PRD defines WHAT runs, this PRD defines HOW it gets triggered and where its output goes.

Owner: Trung (responsible) · Jake (accountable)
Target ship date: End of Sprint 1 W3 (Day 22)
Est. IT effort: ~14–18 hours
Reusable by: all 6 agents in the swarm roadmap

🔗 Relationship to the Sentinel PRD

This orchestration harness is built as a separate, reusable service because it's not Sentinel-specific. Every future Managed Agent (Catalyst, Revenue Relay, etc.) will register with this harness via its agent config. Building it properly once saves rebuilding it 5 times later.

Source of truth split:

prd_seo_sentinel_v1.html — agent logic, skills, rubrics, modules, output contract
prd_orchestration_harness.html — triggers, session lifecycle, HITL routing, deployment, monitoring (this doc)

1 Executive summary

A small Node.js service running on a VPS that (1) accepts triggers from ClickUp / Slack / cron / manual CLI, (2) reads a per-agent config to determine which Managed Agent to invoke, (3) resolves the client handoff payload from ClickUp or a passed ID, (4) creates an Anthropic session, (5) streams SSE events, (6) fetches output files when the session ends, (7) routes deliverables to Slack / ClickUp / Drive. Future agents plug in by adding a JSON config file — no code changes required for new agents that follow the standard pattern.

2 Scope — v1 vs v2 vs v3 Ship small, design for growth

Feature	Version	Why this phasing
Manual CLI trigger (`./bin/run-agent sentinel --client shamrock`)	v1	Lets Trung + Jake dogfood from day 1 without webhook infrastructure dependencies.
ClickUp webhook trigger (task status → "Ready (Automate)")	v1	Primary production trigger. Fires Sentinel automatically when a client is ready for audit.
Session create + SSE stream	v1	Core capability. Can't ship without this.
Output fetch + Slack/ClickUp/Drive delivery	v1	The whole point of running the agent.
Agent config file pattern	v1	Design-time cheap. Retrofitting later is expensive. Do it right the first time.
Basic error handling + Slack alerts	v1	Required for production confidence.
Idempotency (prevent double-triggers)	v1	Webhooks retry. Without idempotency, one task can spin up 3 sessions.
Slack slash command (`/sentinel audit <client>`)	v2	Nice-to-have. Ad-hoc reruns. Build after production stability.
Cron / scheduled triggers (monthly recurring)	v2	Needed for Workflow 4 (Monthly Report). Not Sentinel v1 scope.
HITL gate handler (Slack Approve/Deny buttons)	v2	Sentinel v1 is read-only — no gated actions. Becomes critical when PM Pulse ships client-facing outputs.
Multi-agent fanout (one trigger → multiple sessions)	v2	Needed when PM Pulse delegates. Not v1 scope.
Retry queue with backoff (BullMQ or similar)	v2	In-memory retry is fine for v1 volume (few runs per day). Upgrade when volume justifies.
Cost dashboard (per-session, per-client, per-agent)	v3	Can read from Anthropic Console manually for now. Build when we have >5 agents running.
Multi-tenant / multi-workspace support	v3	Only needed if we productize this for other agencies. YAGNI for now.

v1 scope is everything needed to run Sentinel on a real client by Day 22. Everything else is explicit v2/v3.

3 Architecture overview

# Single service, single process, multiple entry points

               ┌───────────────────────────────────┐
┌──────────┐   │                                   │
│ ClickUp  │───┤  HTTP server (Express)            │
│ webhook  │   │  POST /webhook/:agent_name        │
└──────────┘   │                                   │
               │  CLI: ./bin/run-agent <name> ...  │
┌──────────┐   │                                   │
│ Manual   │───┤                                   │
│ CLI      │   │  ┌───────────────────────────┐    │
└──────────┘   │  │ Dispatcher                │    │
               │  │ 1. Load agent config      │    │
               │  │ 2. Resolve client payload │    │
               │  │ 3. Create session         │    │
               │  │ 4. Stream + process events│    │
               │  │ 5. Fetch outputs          │    │
               │  │ 6. Route deliverables     │    │
               │  └─────────┬─────────────────┘    │
               │            │                      │
               │            ▼                      │
               │  ┌────────────────────┐           │
               │  │ SQLite (local)     │           │
               │  │ - run log          │           │
               │  │ - idempotency keys │           │
               │  │ - cost tracking    │           │
               │  └────────────────────┘           │
               └─────────────┬─────────────────────┘
                             │
               ┌─────────────┼──────────────┐
               ▼             ▼              ▼
         Anthropic API   ClickUp API    Slack API
         (sessions,      (task read +   (post results)
          events,         update)
          files)

                         Google Drive API
                         (upload audit files)

Design principles

Single process, no workers or queues for v1. One HTTP server handles trigger → session → delivery in one in-process flow; webhook-triggered runs continue in a background promise after the 200 is sent. Adequate for <10 runs per day. Simpler = fewer failure modes.
Agent-agnostic dispatcher. The dispatcher doesn't know what "Sentinel" does. It reads the agent's config file to determine which Managed Agent ID to invoke, which ClickUp fields to read from the triggering task, and which Slack/Drive channels to post to. Adding Catalyst later = new config file, zero code change.
Local state in SQLite. No external database for v1. SQLite file on VPS tracks run history + idempotency keys + cost. Upgrade to Postgres when we need multi-process concurrency.
Fail loud in Slack, not silent. Any unexpected error posts a formatted alert to #seo-automation-alerts with session ID, agent name, error, and last event. Never swallow exceptions.
Event log for post-mortems. Every SSE event persisted to ./logs/sessions/{session_id}.jsonl. Size is manageable (~MB per session). Kept 90 days.

4 Tech stack decisions Opinionated defaults — swap if Trung prefers

Component	Default choice	Why	Swap if
Language	Node.js 20+ (TypeScript)	Anthropic's TS SDK is the most feature-complete. Webhook handling + SSE streaming is idiomatic in Node. Jake's team already uses JS for GHL/ClickUp custom integrations.	Trung prefers Python — Python SDK is also fully supported; just use `anthropic` package and FastAPI.
HTTP framework	Express 4.x	Minimal, well-known, webhook-friendly.	Trung prefers Fastify (faster, better TS ergonomics) or Hono (edge-ready). Functionally equivalent for our scale.
State / persistence	better-sqlite3 (synchronous SQLite)	Single-file DB, no server, transactional, fast enough for our volume.	Expected volume >50 runs/day → upgrade to Postgres on same VPS.
Anthropic SDK	`@anthropic-ai/sdk`	Official. Sets beta headers automatically. TypeScript types for events.	No swap — this is the canonical client.
Config format	JSON files in `./agents/`	Diffable, version-controllable, no build step, zero tooling.	Larger team + environment differentiation → YAML with schema validation via Ajv.
Hosting	Hetzner CX22 (~$5/mo) or DigitalOcean droplet (~$12/mo)	Cheap. Enough CPU/memory. Single small service. Docker not needed for v1.	Already running a Kubernetes cluster — deploy there. Overkill otherwise.
Process manager	systemd service	Native on Ubuntu. Auto-restart on crash. Logs to journalctl. Simple.	Using PM2 elsewhere — fine to use here too.
HTTPS / webhook endpoint	Caddy reverse proxy	Auto-provisions Let's Encrypt cert. 5-line config. Zero fuss.	Using nginx elsewhere — also fine.
Secret management	systemd `EnvironmentFile` with 0600 perms OR `.env` + `dotenv-safe`	No external secret store needed at this scale. File-based with restrictive perms is adequate.	Organizational policy requires Vault / AWS Secrets Manager.

5 Agent config schema The core abstraction that makes this reusable

One JSON file per agent in ./agents/{agent_name}.json. The dispatcher reads this to know how to handle the agent's session.

// ./agents/sentinel.json: v1 config for SEO Sentinel
// The comments annotate this PRD; strip them from the real file, which must be plain JSON.
{
  "agent_name": "sentinel",
  "display_name": "SEO Sentinel",
  "description": "Local SEO audit: runs 5 modules on a single client",
  "version": "1.0.0",

  // Anthropic IDs, from agent + environment creation (Sentinel PRD §7, §8)
  "anthropic": {
    "agent_id_env": "SENTINEL_AGENT_ID",       // read from env var
    "environment_id_env": "SEONAV_ENV_ID",     // read from env var
    "beta_headers": ["managed-agents-2026-04-01"],
    "additional_beta_headers_for_output_fetch": ["files-api-2025-04-14"]
  },

  // Where the kickoff payload comes from
  "payload_source": {
    "type": "clickup_task",                    // or "manual" or "slack_command"
    "clickup": {
      // Custom fields on the ClickUp task that we read into the payload
      "required_fields": [
        "client_id", "client_name", "business_address", "gbp_url",
        "website_url", "service_area_cities", "seed_keywords", "competitors"
      ],
      "optional_fields": ["priority_services_ranked", "business_usps"]
    }
  },

  // What triggers a run
  "triggers": {
    "clickup_webhook": {
      "enabled": true,
      "list_id": "<LIST_ID_FOR_SENTINEL_TASKS>",
      "trigger_status": "Ready (Automate)"
    },
    "slack_command": { "enabled": false },     // v2
    "cron": { "enabled": false },              // v2
    "manual_cli": { "enabled": true }
  },

  // How the user.message is constructed
  "kickoff_template": {
    "prompt": "Run a full Local SEO audit for this client. Execute all 5 modules. Write structured output to /mnt/session/outputs/sentinel-audit-{client_id}.json and .md.\n\nCLIENT PAYLOAD:\n{payload_json}"
  },

  // Limits (duration and cost)
  "limits": {
    "max_session_duration_minutes": 60,
    "max_session_cost_usd": 5.00,
    "alert_if_cost_exceeds_usd": 3.00
  },

  // Where outputs go
  "delivery": {
    "slack": {
      "enabled": true,
      "channel": "#seo-automation",
      "post_summary": true,
      "post_links_to_outputs": true
    },
    "clickup": {
      "enabled": true,
      "update_task_status": "Review",
      "post_comment": true,
      "attach_output_files": true
    },
    "google_drive": {
      "enabled": true,
      "folder_template": "/Clients/{client_name}/Audits/",
      "file_name_template": "sentinel-audit-{timestamp}"
    }
  },

  // Alerting
  "alerts": {
    "on_failure_channel": "#seo-automation-alerts",
    "tag_on_failure": ["@trung"]
  }
}

✨ Why this schema pays off

To add Content Catalyst later, Trung drops in ./agents/catalyst.json with different agent_id, kickoff prompt, required ClickUp fields, and delivery channels. Zero code changes to the dispatcher. Same for Revenue Relay, Ad Arbitrage, Build Bot. The abstraction pays for itself on the second agent.
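Because every new agent arrives as a hand-written JSON file, the loader could reject a malformed config before any session is created. The sketch below is one hedged way to do that in plain TypeScript. `validateAgentConfig` and `AgentConfigCore` are hypothetical names, only a subset of the schema is checked, and the `zod` dependency listed in the appendix would be the natural replacement in real code.

```typescript
// Hypothetical subset of the agent config that the dispatcher cannot run without.
interface AgentConfigCore {
  agent_name: string;
  anthropic: { agent_id_env: string; environment_id_env: string; beta_headers: string[] };
  limits: { max_session_duration_minutes: number; max_session_cost_usd: number };
}

type ValidationResult =
  | { ok: true; config: AgentConfigCore }
  | { ok: false; errors: string[] };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Collects every problem instead of stopping at the first, so one alert can list them all.
function validateAgentConfig(raw: unknown): ValidationResult {
  if (!isRecord(raw)) return { ok: false, errors: ['config must be a JSON object'] };
  const errors: string[] = [];

  if (typeof raw.agent_name !== 'string' || raw.agent_name.length === 0) {
    errors.push('agent_name must be a non-empty string');
  }

  const anthropic = raw.anthropic;
  if (!isRecord(anthropic)) {
    errors.push('anthropic must be an object');
  } else {
    for (const key of ['agent_id_env', 'environment_id_env']) {
      if (typeof anthropic[key] !== 'string') errors.push(`anthropic.${key} must be a string`);
    }
    const headers = anthropic.beta_headers;
    if (!Array.isArray(headers) || !headers.every((h) => typeof h === 'string')) {
      errors.push('anthropic.beta_headers must be an array of strings');
    }
  }

  const limits = raw.limits;
  if (!isRecord(limits)) {
    errors.push('limits must be an object');
  } else {
    for (const key of ['max_session_duration_minutes', 'max_session_cost_usd']) {
      const value = limits[key];
      if (typeof value !== 'number' || !(value > 0)) {
        errors.push(`limits.${key} must be a positive number`);
      }
    }
  }

  return errors.length === 0
    ? { ok: true, config: raw as unknown as AgentConfigCore }
    : { ok: false, errors };
}
```

Returning the full error list rather than throwing on the first problem fits the "fail loud in Slack" principle: a single alert can name every missing field in a new agent's config.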

6 Trigger sources How a run starts

Manual CLI v1

# Primary day-1 trigger for Trung + SEO Lead dogfooding
./bin/run-agent sentinel \
  --client shamrock-detailing-columbus \
  --clickup-task tk_abc123   # optional: resolves payload from task

# Or with inline payload (for T1 synthetic client)
./bin/run-agent sentinel --payload-file ./fixtures/synthetic-client.json

# Prints: session ID, live SSE stream output, final deliverable paths
# Exits: 0 on success (session ended with end_turn), non-zero on failure

ClickUp webhook v1

ClickUp fires a webhook on task status change. Our endpoint receives it, validates, and triggers the run.

# HTTP endpoint
POST https://orch.seonavigator.online/webhook/sentinel
Content-Type: application/json
X-Signature: <ClickUp HMAC signature>

{
  "event": "taskStatusUpdated",
  "task_id": "tk_abc123",
  "history_items": [{ "after": { "status": "Ready (Automate)" } }]
}

Handler responsibilities

Verify HMAC signature using shared secret. Return 401 on mismatch.
Idempotency check: has the key for this agent_name + task_id + status-change timestamp (Section 11) already been processed? If yes, return 200 immediately (don't re-run).
Resolve agent config — URL path tells us which agent (/webhook/sentinel → sentinel.json).
Fetch task details from ClickUp API. Extract required_fields from task custom fields. If any missing, post error to Slack, update task status to "Blocked", return 200.
Build kickoff payload, dispatch to session creation logic.
Return 200 to ClickUp within 3s (webhook timeout). The actual session runs asynchronously — we acknowledge fast, work slow.
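The signature check in the first step can be sketched with Node's built-in `crypto` module. This is a minimal illustration under the same assumptions the open questions flag as unverified: that the signature is an HMAC-SHA256 of the raw request body, and (a further assumption here) that it arrives hex-encoded. `isValidSignature` is a hypothetical helper name.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumption: the header carries a hex-encoded HMAC-SHA256 of the raw request body.
function isValidSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody, 'utf8').digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws when lengths differ, so reject unequal lengths first.
  if (received.length !== expected.length) return false;
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(received, expected);
}
```

The check has to run over the exact bytes ClickUp sent, not over re-serialized JSON, so the handler would need to capture the raw body before parsing (for example through the `verify` option of `express.json()`).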

⚠️ Webhook must return fast

ClickUp webhooks time out at ~3 seconds and retry 3 times on failure. Don't wait for the full session to complete before responding. Return 200 immediately, run the session in a background promise. If the session fails later, alert via Slack, not via HTTP response.

Future triggers (v2)

Both follow the same dispatcher pattern — just different payload sources.

Slack slash command /sentinel audit <client-name> — POST to /slack/command, Slack HMAC verification, lookup client by name, trigger run.
Cron — systemd timer fires a binary that reads a schedule config (which agent, which clients, which frequency), dispatches runs. Used for monthly reporting.

7 Session lifecycle state machine The spine of the dispatcher

CREATED ──▶ SENDING_KICKOFF ──▶ STREAMING ──▶ FETCHING_OUTPUTS ──▶ DELIVERING ──▶ DONE
                                    │
                                    ├──▶ REQUIRES_ACTION (v2: HITL gate) ──▶ AWAITING_HUMAN ──▶ STREAMING
                                    │
                                    ├──▶ BUDGET_EXCEEDED ──▶ TERMINATED
                                    │
                                    └──▶ ERROR ──▶ ALERTED

State	Trigger event (from Anthropic SSE)	Dispatcher action
`CREATED`	After `POST /v1/sessions` returns 201	Record session_id + trigger_id in SQLite. Move to SENDING_KICKOFF.
`SENDING_KICKOFF`	—	POST user.message event with kickoff prompt. Move to STREAMING.
`STREAMING`	Any `agent.*` event	Log event to `logs/sessions/{id}.jsonl`. Extract text from `agent.message` events.
`STREAMING`	`session.status_idle` with `stop_reason: end_turn`	Close stream. Move to FETCHING_OUTPUTS.
`STREAMING`	`session.status_idle` with `stop_reason: requires_action`	v1: log unexpected, alert. v2: move to REQUIRES_ACTION, route to Slack approval.
`FETCHING_OUTPUTS`	—	`GET /v1/files?scope_id={session_id}` with Files beta header. Download each to `./tmp/{session_id}/`. Move to DELIVERING.
`DELIVERING`	—	Execute each enabled delivery channel in config (Slack, ClickUp, Drive). Collect errors but don't abort.
`DONE`	—	Update SQLite run log with final status + cost. Slack success message. Clean up tmp files.
`ERROR`	Exception thrown anywhere	Log full error + session context. Slack alert to `#seo-automation-alerts` tagging @trung. Attempt to mark ClickUp task "Blocked".
`BUDGET_EXCEEDED`	Cost sampler detects session exceeds `max_session_cost_usd`	Send `user.interrupt` event to halt agent. Move to TERMINATED. Alert Slack.
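One way to keep the dispatcher honest about this table is to encode it as data and reject any transition that is not listed. The sketch below is illustrative only. The signal names are hypothetical internal labels (only the two `idle_*` signals correspond directly to SSE events), the v2 states are omitted, and mapping an unexpected `requires_action` to `ERROR` in v1 is an assumption drawn from the "log unexpected, alert" row.

```typescript
type RunState =
  | 'CREATED' | 'SENDING_KICKOFF' | 'STREAMING' | 'FETCHING_OUTPUTS'
  | 'DELIVERING' | 'DONE' | 'BUDGET_EXCEEDED' | 'TERMINATED' | 'ERROR';

// Hypothetical internal signals raised by the dispatcher as each step finishes.
type RunSignal =
  | 'session_recorded' | 'kickoff_sent' | 'idle_end_turn' | 'idle_requires_action'
  | 'outputs_fetched' | 'delivered' | 'budget_exceeded' | 'interrupt_sent' | 'exception';

const TRANSITIONS: Record<RunState, Partial<Record<RunSignal, RunState>>> = {
  CREATED: { session_recorded: 'SENDING_KICKOFF' },
  SENDING_KICKOFF: { kickoff_sent: 'STREAMING' },
  STREAMING: {
    idle_end_turn: 'FETCHING_OUTPUTS',
    idle_requires_action: 'ERROR', // v1 assumption: unexpected, treated as a failure
    budget_exceeded: 'BUDGET_EXCEEDED',
  },
  FETCHING_OUTPUTS: { outputs_fetched: 'DELIVERING' },
  DELIVERING: { delivered: 'DONE' },
  BUDGET_EXCEEDED: { interrupt_sent: 'TERMINATED' },
  DONE: {},
  TERMINATED: {},
  ERROR: {},
};

const TERMINAL_STATES = new Set<RunState>(['DONE', 'TERMINATED', 'ERROR']);

function nextState(state: RunState, signal: RunSignal): RunState {
  if (TERMINAL_STATES.has(state)) {
    throw new Error(`run already finished in state ${state}`);
  }
  // An exception anywhere in a live run moves it to ERROR.
  if (signal === 'exception') return 'ERROR';
  const target = TRANSITIONS[state][signal];
  if (target === undefined) {
    throw new Error(`illegal transition: ${state} on ${signal}`);
  }
  return target;
}
```

Persisting the returned state to the `runs` table after each call would give post-mortems a record of exactly where a run stopped.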

💡 Cost sampling for budget exceeded

Every ~30 seconds during STREAMING, fetch the session via GET /v1/sessions/{id} and compute cost from usage.input_tokens / output_tokens / cache_read_input_tokens. If the computed cost exceeds alert_if_cost_exceeds_usd, post a Slack warning. If it exceeds max_session_cost_usd, interrupt the session. This protects against runaway loops.
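The arithmetic behind the sampler is small enough to sketch. The example below assumes per-million-token rates are supplied by the caller; the rate values used in the usage note are placeholders, not real pricing, and `computeCostUsd` / `checkBudget` are hypothetical names.

```typescript
interface SessionUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens: number;
}

// USD per one million tokens. Real values must come from current pricing; none are assumed here.
interface TokenRates {
  input: number;
  output: number;
  cacheRead: number;
}

function computeCostUsd(usage: SessionUsage, rates: TokenRates): number {
  const weighted =
    usage.input_tokens * rates.input +
    usage.output_tokens * rates.output +
    usage.cache_read_input_tokens * rates.cacheRead;
  return weighted / 1_000_000;
}

type BudgetDecision = 'ok' | 'warn' | 'interrupt';

// Maps a sampled cost onto the two thresholds from the agent config's "limits" block.
function checkBudget(
  costUsd: number,
  limits: { alert_if_cost_exceeds_usd: number; max_session_cost_usd: number },
): BudgetDecision {
  if (costUsd > limits.max_session_cost_usd) return 'interrupt';
  if (costUsd > limits.alert_if_cost_exceeds_usd) return 'warn';
  return 'ok';
}
```

With placeholder rates of 2 / 10 / 1 USD per million tokens, a sample of 1,000,000 input, 200,000 output and 500,000 cache-read tokens works out to (2,000,000 + 2,000,000 + 500,000) / 1,000,000 = 4.50 USD, which falls between Sentinel's 3.00 alert threshold and 5.00 cap and so yields `'warn'`.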

8 SSE event handling The event loop

Critical rules from the Managed Agents docs

Send the user event BEFORE opening the stream. The API buffers events until a stream attaches. If you open the stream first, you may miss the initial burst.
Always check session.status_idle events for stop_reason. end_turn = done. requires_action = session paused waiting on you.
Reconnect on stream drop. Network blip shouldn't kill a 30-minute session. On connection error, wait 2s and reopen the stream — Anthropic buffers events so you don't lose them.
Log every event. Post-mortem debugging is impossible without the event log.

Event types we care about in v1

Event type	What to do
`agent.message`	Extract `content[].text`. Write to console in CLI mode. Nothing else — final output is in files, not messages.
`agent.tool_use`	Log tool name + input summary. Useful for cost debugging. No action.
`agent.mcp_tool_use`	Log MCP name + tool + input. Same — observability only.
`agent.custom_tool_use`	v1: we have no custom tools. If this fires, it's a bug — alert.
`session.status_idle`	Check `stop_reason`. Drive state machine.
`session.thread_created`	v1: single-thread only. If this fires, log. v2: multi-agent fanout handling.
`span.*`	Telemetry. Log for debugging. Ignore in v1 logic.

Minimal handler implementation (TypeScript)

// src/dispatcher.ts: core event loop (abbreviated)
// db, logger, renderTemplate, appendEventLog, extractText, alertSlack,
// interruptSession, sleep and deliverOutputs are local helpers, omitted here.
import Anthropic from '@anthropic-ai/sdk';

export async function runAgentSession(config: AgentConfig, payload: ClientPayload) {
  const anthropic = new Anthropic();

  // 1. Create session
  const session = await anthropic.beta.sessions.create({
    agent: process.env[config.anthropic.agent_id_env]!,
    environment_id: process.env[config.anthropic.environment_id_env]!,
    title: `${config.display_name} · ${payload.client_name}`
  }, {
    headers: { 'anthropic-beta': config.anthropic.beta_headers.join(',') }
  });
  await db.recordRunStart({ session_id: session.id, agent: config.agent_name, payload });

  // 2. Send kickoff BEFORE streaming
  const kickoffText = renderTemplate(config.kickoff_template.prompt, {
    payload_json: JSON.stringify(payload, null, 2),
    client_id: payload.client_id
  });
  await anthropic.beta.sessions.events.create(session.id, {
    events: [{ type: 'user.message', content: [{ type: 'text', text: kickoffText }] }]
  });

  // 3. Stream events with reconnect
  let outcome: 'end_turn' | 'halted' | null = null;
  const startedAt = Date.now();
  while (outcome === null) {
    try {
      const stream = await anthropic.beta.sessions.stream(session.id);
      for await (const event of stream) {
        await appendEventLog(session.id, event);
        switch (event.type) {
          case 'agent.message':
            logger.info({ session: session.id, text: extractText(event) }, 'agent.message');
            break;
          case 'agent.tool_use':
          case 'agent.mcp_tool_use':
            logger.debug({ session: session.id, tool: event.name }, 'tool_use');
            break;
          case 'session.status_idle':
            if (event.stop_reason?.type === 'end_turn') {
              outcome = 'end_turn';
            } else if (event.stop_reason?.type === 'requires_action') {
              // v1: not expected, alert and halt
              await alertSlack(config, session.id, 'Unexpected requires_action in v1');
              outcome = 'halted';
            }
            break;
        }
        // Duration check on every event (cheap); the cost sampler runs separately
        if (outcome === null &&
            Date.now() - startedAt > config.limits.max_session_duration_minutes * 60_000) {
          await interruptSession(anthropic, session.id, 'duration_exceeded');
          outcome = 'halted';
        }
        // `break` inside the switch only exits the switch, so leave the stream loop here
        if (outcome !== null) break;
      }
    } catch (err) {
      logger.warn({ err, session: session.id }, 'stream dropped, reconnecting in 2s');
      await sleep(2000);
    }
  }

  // 4. Fetch outputs + deliver, only for runs that ended with end_turn
  if (outcome !== 'end_turn') return;
  await deliverOutputs(config, session.id, payload);
  await db.recordRunComplete(session.id);
}

9 Output routing Where deliverables go

The flow

List session files: GET /v1/files?scope_id={session_id} with files-api-2025-04-14 beta.
Download each file via GET /v1/files/{id}/content. Save to ./tmp/{session_id}/.
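For each enabled delivery channel in config.delivery, execute and collect results. Drive runs first because its web_view_link feeds the Slack and ClickUp messages; Slack and ClickUp then run in parallel.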
Post a final summary to the main Slack channel with links to all deliverables.
Clean up tmp files.
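The delivery step in this flow can be sketched as follows. Because the Slack and ClickUp messages link to the Drive upload, this sketch runs Drive first and then fans out to the remaining channels with `Promise.allSettled`, so one failing channel never aborts the others. `deliverAll`, `DeliveryResult` and the handler signature are hypothetical; the real channel handlers would wrap the Slack, ClickUp and Drive clients.

```typescript
interface DeliveryResult {
  channel: string;
  ok: boolean;
  error?: string;
}

// Each non-Drive handler receives the Drive links so it can reference them in its message.
type ChannelHandler = (driveLinks: string[]) => Promise<void>;

async function deliverAll(
  uploadToDrive: (() => Promise<string[]>) | null, // null when Drive is disabled in config
  handlers: Record<string, ChannelHandler>,
): Promise<DeliveryResult[]> {
  const results: DeliveryResult[] = [];
  let driveLinks: string[] = [];

  if (uploadToDrive !== null) {
    try {
      driveLinks = await uploadToDrive();
      results.push({ channel: 'google_drive', ok: true });
    } catch (err) {
      // A Drive failure is recorded but must not block Slack or ClickUp.
      results.push({ channel: 'google_drive', ok: false, error: String(err) });
    }
  }

  const names = Object.keys(handlers);
  const settled = await Promise.allSettled(names.map((name) => handlers[name](driveLinks)));
  settled.forEach((outcome, i) => {
    results.push(
      outcome.status === 'fulfilled'
        ? { channel: names[i], ok: true }
        : { channel: names[i], ok: false, error: String(outcome.reason) },
    );
  });
  return results;
}
```

The returned array is what a combined failure summary could be built from, and each entry maps naturally onto a row in the `delivery_attempts` table.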

Channel handlers (abbreviated)

Channel	What it does	Failure mode
Google Drive	Upload files to folder resolved from template (`/Clients/{client_name}/Audits/`). Create folder if missing. Record returned `web_view_link` for linking in other channels.	Log error, continue with other channels. Don't block Slack/ClickUp on Drive failure.
ClickUp	Post task comment with run summary + Drive links. Update task status per config. Attach output files directly to the task.	Retry once on 5xx. On persistent fail, Slack alert.
Slack	Post to `#seo-automation`: structured message with client name, agent name, run duration, cost, 3-bullet summary extracted from markdown output, link to Drive audit, link to ClickUp task.	Retry once. Slack is usually reliable.

Slack message format (design)

╭────────────────────────────────────────────────╮
│ 🛰️ SEO Sentinel audit complete                 │
│                                                │
│ Client: Shamrock Detailing (Columbus, OH)      │
│ Duration: 23m 14s · Cost: $1.82                │
│                                                │
│ ⭐ Overall Score: 71/100                       │
│                                                │
│ Top 3 priorities:                              │
│ • P1: Add 8 missing GBP services               │
│ • P1: Fix NAP inconsistency in 6 directories   │
│ • P2: Build 5 city pages for service area      │
│                                                │
│ 📄 Full audit: [Drive link]                    │
│ 📋 ClickUp task: [Task link]                   │
│ 🔍 Session trace: [Console link]               │
╰────────────────────────────────────────────────╯

10 Error handling & retries

Failure	Detection	Response	Retry?
ClickUp webhook signature invalid	HMAC mismatch	Return 401, log source IP	No
ClickUp task missing required fields	Config schema validation on task fetch	Post clear error to Slack with list of missing fields. Update task status → "Blocked". Return 200 to webhook.	No (human fixes fields + re-triggers)
Anthropic session creation 5xx	HTTP status	Retry 3x with exponential backoff (1s, 4s, 16s). Log to SQLite run table.	Yes
Anthropic 4xx on session creation	HTTP status	Log, alert. Don't retry — it's our bug (bad agent_id, quota, etc.)	No
SSE stream dropped mid-run	Connection error on `for await`	Wait 2s, reopen stream. Anthropic buffers events, so we don't lose anything.	Yes (transparent)
Agent session exceeds budget	Cost sampler	Send `user.interrupt`, mark session TERMINATED, Slack alert	No
Agent session exceeds duration	Wall clock	Same as budget exceeded	No
Output file download fails	Files API error	Retry 3x. If persistent fail, Slack alert with session ID so human can manually fetch from Console.	Yes
Slack/ClickUp/Drive delivery fails	API error per channel	Retry once. Collect all delivery errors, post combined failure summary. Outputs remain in tmp/ until manually cleared.	Yes (once)
SQLite write failure	better-sqlite3 throws	Log to file. Alert. Run continues — state is reconstructable from event log.	No
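The "retry 3x with exponential backoff (1s, 4s, 16s)" rows could share one small helper. The sketch below is an assumption about how that might look, not a description of any SDK feature: `retryWithBackoff` and `isRetryable` are hypothetical names, and the wait function is injectable so tests do not have to sleep.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs the operation once, then retries after each delay in delaysMs while the error is retryable.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  delaysMs: number[] = [1_000, 4_000, 16_000],
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up when retries are exhausted or the error is a "don't retry" case (e.g. a 4xx).
      if (attempt >= delaysMs.length || !isRetryable(err)) throw err;
      await wait(delaysMs[attempt]);
    }
  }
}
```

With the default delays this makes at most four attempts in total (the initial call plus three retries). Passing a predicate that only accepts 5xx statuses keeps the "4xx is our bug, don't retry" rule from the table intact.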

11 Idempotency

🔁 Why this matters

ClickUp webhooks retry on non-2xx response. Without idempotency, a slow response + retry can spawn duplicate sessions — two parallel Sentinel runs on the same client, wasting ~$3.50 and producing conflicting deliverables.

Idempotency key derivation

For webhooks: SHA256(agent_name + task_id + status_change_timestamp). Stored in idempotency_keys SQLite table with 24h TTL.

Flow

On webhook arrival, compute key.
INSERT OR IGNORE into table with key + session_id (or null if still creating).
If INSERT succeeded → new run, proceed. If IGNORED → duplicate, return 200 with {already_processed: true, original_session: ...}.
After session creation succeeds, UPDATE row with session_id.

// Simplified
import { createHash } from 'node:crypto';

const sha256 = (input: string) => createHash('sha256').update(input).digest('hex');

const key = sha256(`${agent_name}:${task_id}:${status_change_ts}`);
const inserted = db.prepare(
  `INSERT OR IGNORE INTO idempotency_keys (key, created_at) VALUES (?, ?)`
).run(key, Date.now());

if (inserted.changes === 0) {
  logger.info({ key }, 'duplicate webhook, ignoring');
  return res.json({ already_processed: true });
}
// ... proceed with session creation

12 Logging & monitoring

Three log streams

Application log (pino) → stdout → journalctl. Structured JSON. Level: info in prod, debug if DEBUG=1.
Session event log → ./logs/sessions/{session_id}.jsonl. Raw SSE events. 90-day retention, cron cleans up older.
Run history → SQLite runs table: session_id, agent, trigger_source, payload_hash, started_at, ended_at, status, total_cost, error.

Monitoring — day-1 minimum

Process alive: systemd auto-restart on crash. Uptime Kuma or equivalent hitting a /health endpoint every 5min.
Disk space: cron checks session log dir, Slack alert if >80% full.
Failed runs: cron queries runs table hourly, Slack summary of any ERROR status.
Cost anomalies: daily cron sums runs from SQLite, Slack alert if daily total >1.5× baseline.

Monitoring — later (v2+)

Grafana + Prometheus feels excessive for a 1-service VPS. Revisit when we have >3 agents + high volume.

13 HITL gate handler Stub for v2

🚧 Intentionally deferred

Sentinel v1 is read-only — it audits, it doesn't publish anything. So no Human-In-The-Loop gates are required. HITL becomes critical in Sprint 2 when PM Pulse coordinates agents that produce client-facing deliverables (email drafts, GBP posts, published pages).

Design for when we add it

When the agent fires agent.tool_use with a tool that requires confirmation (configured in the Anthropic agent definition), the session emits session.status_idle with stop_reason: requires_action and requires_action.event_ids[] listing the blocking events.

Our dispatcher's REQUIRES_ACTION state handler will:

Post a Slack interactive message with Approve / Deny buttons and the tool call context.
Store pending session state in SQLite.
Return control (release the stream connection).
When human responds via Slack interaction, POST user.tool_confirmation event with result: allow or deny + optional deny_message.
Reopen stream, resume STREAMING state.

For v1: stub that logs unexpected requires_action + Slack alerts. Full implementation in Sprint 2.

14 VPS setup & deployment

# One-time VPS provisioning: Ubuntu 24.04 LTS on Hetzner CX22

# Base
apt update && apt upgrade -y
apt install -y build-essential git curl ufw fail2ban

# Node 20 LTS
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt install -y nodejs

# Firewall
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable

# User for the service (not root)
useradd -m -s /bin/bash orch
usermod -aG sudo orch   # temp, removed at the end of this setup

# Caddy reverse proxy for HTTPS
apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
apt update && apt install -y caddy

# As user 'orch':
su - orch
cd ~
git clone <repo> orchestrator
cd orchestrator
npm ci
cp .env.example .env
# edit .env with secrets (ANTHROPIC_API_KEY, SENTINEL_AGENT_ID, etc.)
chmod 600 .env
npm run build

# As root:
exit   # back to root

# systemd service
cat > /etc/systemd/system/orchestrator.service <<'EOF'
[Unit]
Description=SEO Navigator Agent Orchestrator
After=network.target

[Service]
Type=simple
User=orch
WorkingDirectory=/home/orch/orchestrator
EnvironmentFile=/home/orch/orchestrator/.env
ExecStart=/usr/bin/node dist/server.js
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now orchestrator
systemctl status orchestrator   # verify running

# Caddy config
cat > /etc/caddy/Caddyfile <<'EOF'
orch.seonavigator.online {
    reverse_proxy localhost:3000
    encode gzip
    log {
        output file /var/log/caddy/access.log
    }
}
EOF

systemctl reload caddy   # auto-provisions Let's Encrypt cert

# Remove the temporary sudo grant from the user setup step
gpasswd -d orch sudo

# Done. Test:
curl https://orch.seonavigator.online/health
# {"status":"ok","uptime":42}

15 Project structure

orchestrator/
├── src/
│   ├── server.ts             # Express app, HTTP routes
│   ├── dispatcher.ts         # Core session lifecycle
│   ├── config.ts             # Load + validate agent configs
│   ├── payload-resolver.ts   # Build client payload from ClickUp/manual
│   ├── delivery/
│   │   ├── slack.ts
│   │   ├── clickup.ts
│   │   └── drive.ts
│   ├── db.ts                 # SQLite wrapper
│   ├── logger.ts             # pino setup
│   └── util/
│       ├── idempotency.ts
│       ├── sse-stream.ts
│       └── cost-sampler.ts
│
├── agents/                   # Per-agent configs: the extensibility point
│   └── sentinel.json
│   # Future: catalyst.json, revenue-relay.json, etc.
│
├── bin/
│   └── run-agent             # CLI for manual triggers
│
├── fixtures/
│   └── synthetic-client.json # T1 test payload
│
├── logs/                     # Runtime, gitignored
│   └── sessions/
│
├── data/
│   └── orchestrator.db       # SQLite, gitignored
│
├── migrations/
│   └── 001_initial_schema.sql
│
├── tests/
│   ├── dispatcher.test.ts
│   └── idempotency.test.ts
│
├── .env.example
├── package.json
├── tsconfig.json
└── README.md

16 Environment variables

# .env.example: Trung customizes this on the VPS
# Keep comments on their own lines. systemd's EnvironmentFile only treats
# whole lines starting with # as comments, so a trailing comment would
# become part of the value.

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
SENTINEL_AGENT_ID=agent_abc123
SEONAV_ENV_ID=env_xyz789

# ClickUp
CLICKUP_API_TOKEN=pk_...
CLICKUP_WEBHOOK_SECRET=<shared secret for HMAC>
CLICKUP_WORKSPACE_ID=9018614428

# Slack
SLACK_BOT_TOKEN=xoxb-...
# SLACK_SIGNING_SECRET is for the v2 slash command HMAC
SLACK_SIGNING_SECRET=...
SLACK_ALERT_CHANNEL_ID=C...
SLACK_RESULTS_CHANNEL_ID=C...

# Google Drive (service account or OAuth)
GOOGLE_SERVICE_ACCOUNT_JSON_PATH=/home/orch/orchestrator/gdrive-sa.json
GOOGLE_DRIVE_ROOT_FOLDER_ID=1abc...

# Service config
PORT=3000
NODE_ENV=production
LOG_LEVEL=info
DATABASE_PATH=/home/orch/orchestrator/data/orchestrator.db
SESSION_LOG_DIR=/home/orch/orchestrator/logs/sessions

# Optional: cost alert thresholds (fallback if not in agent config)
DEFAULT_MAX_SESSION_COST_USD=5.00
DAILY_COST_ALERT_THRESHOLD_USD=30.00

17 Test plan

Unit tests (Vitest)

Config loader: valid config parses, missing required field throws, unknown agent returns null.
Idempotency: first call inserts, duplicate call returns existing.
Payload resolver: builds payload from mock ClickUp task, handles missing fields.
Template rendering: {payload_json} and {client_id} substitute correctly.
Cost sampler: computes correct USD from token counts.
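For the template-rendering case, a helper along these lines would be the unit under test. This is a hedged sketch of what `renderTemplate` might look like; the dispatcher excerpt only shows its call signature, so the body, including the choice to throw on an unknown placeholder, is an assumption.

```typescript
// Replaces {name} placeholders in a single pass over the template.
// Unknown placeholders throw, so a typo in an agent config fails loudly instead of
// sending a prompt with a literal "{client_idd}" in it.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{([a-z_][a-z0-9_]*)\}/gi, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, name)) {
      throw new Error(`template placeholder {${name}} has no value`);
    }
    return values[name];
  });
}
```

Using a replacer function rather than a replacement string matters here: the substituted `payload_json` is arbitrary JSON, and a function's return value is inserted verbatim, so `$1`-style sequences or brace-wrapped text inside the payload are never reinterpreted.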

Integration tests (run against Anthropic)

OT1: Manual CLI trigger with synthetic payload → session creates, event stream flows, at least one agent.message received, session reaches end_turn. Uses a minimal throwaway test agent to avoid Sentinel cost.
OT2: Same but with bad SENTINEL_AGENT_ID → errors gracefully, Slack alert fires, no session record leaked.
OT3: Simulate stream drop (kill connection mid-session) → reconnect happens, events don't repeat in log, session completes.
OT4: Webhook endpoint with valid HMAC → returns 200 within 3s, session starts in background.
OT5: Webhook endpoint with invalid HMAC → returns 401, no session created.
OT6: Idempotency: fire same webhook 2x within 1s → second returns already_processed: true, no duplicate session.
OT7: Delivery routing: mock session complete with fake output files → Slack post, ClickUp comment, Drive upload all succeed. Verify content manually on first run.

End-to-end (alongside Sentinel T1)

Full Sentinel run via orchestrator CLI with synthetic client.
Full Sentinel run via ClickUp webhook on a test task.

18 Definition of Done

✅ Orchestration harness v1 ships when

Service running on VPS under systemd with auto-restart, HTTPS via Caddy.
/health endpoint returns 200.
agents/sentinel.json exists and passes schema validation.
CLI trigger (./bin/run-agent sentinel ...) works end-to-end with synthetic payload.
ClickUp webhook trigger works end-to-end on a test task.
All 7 integration tests (OT1–OT7) pass.
Successful Sentinel T1 run → Slack post received, ClickUp task commented, Drive file visible.
Failed run injection (invalid agent ID) → Slack alert to #seo-automation-alerts tagging @trung.
README in repo documents: how to deploy, how to add a new agent, how to read logs, how to rotate keys.
SQLite runs table + event log working. Can query SELECT * FROM runs WHERE agent='sentinel' ORDER BY started_at DESC and see history.
Jake + Trung have walked through the deployment together once (knowledge transfer).

19 Open questions

Things I don't know for certain — verify before committing.

ClickUp webhook HMAC algorithm and header name. I've assumed X-Signature with HMAC-SHA256 but Trung should verify from ClickUp API docs. Blocker for: webhook security. Low risk — just look it up.
Session reconnect behavior after long disconnect. If the VPS reboots mid-session, can we reconnect to the stream on recovery? Or is the session lost? Docs suggest events persist server-side. Worth verifying with a deliberate test. Not blocking v1 — rare edge case.
Concurrent session limit per org. Rate limits state 60/min for create endpoints, but is there a cap on concurrent running sessions? We'd only hit it with high volume, but worth knowing. Ask Anthropic sales.
ClickUp API rate limits when reading task details. Free tier is stricter. Verify our tier allows enough reads. Non-blocker — upgrade plan if needed.
How Google Drive service account permissions propagate to shared drives. If client folders live in a shared drive, the SA needs explicit membership. Verify first client's folder setup before assuming this works. Non-blocker — standard Drive setup question.
Slack bot token scopes needed. chat:write and files:write at minimum. Interactive message support needs interactivity in app config (v2 only).

20 Risks & mitigations

Risk	Severity	Mitigation
VPS goes down; webhooks fail silently	High	Uptime monitor (UptimeRobot free) hits `/health` every 5min. SMS Trung on 3 consecutive fails. ClickUp will retry 3x; gives us ~15min to recover before data loss.
Webhook signature secret leaks (e.g., committed to git)	High	Pre-commit hook scans for `CLICKUP_WEBHOOK_SECRET` pattern. Rotate on suspicion. Use `.env` with `chmod 600`, gitignored.
Runaway session burns through budget	Med	Per-session cost cap + interrupt. Daily total cost alert. Both enforced client-side.
Duplicate sessions from webhook retries	Med	Idempotency (Section 11). Tested as part of OT6.
Stream disconnect causes partial delivery (Slack posted, Drive not uploaded)	Med	Idempotent delivery: each channel writes a completion marker to SQLite. Re-run of delivery phase is safe.
Disk fills from session event logs	Low	Daily cron cleans logs older than 90 days. Disk usage alert at 80%.
Anthropic API outage	Low	Session creation retries with backoff. If still failing, Slack alert + ClickUp task stays in "Ready (Automate)"; human re-triggers when the API is back up.
Breaking change in Managed Agents beta	Low-Med	Pin SDK version. Subscribe to Anthropic release notes. Integration tests run weekly to catch drift.
Adding a second agent reveals that config schema is insufficient	Med	Design schema from Sentinel + at least think through what Content Catalyst will need. Explicit v1.0 version field on config so we can handle migrations.
Trung as single owner (bus factor)	Med	README must be detailed enough that a competent dev could take over. Walk Jake through the codebase once at Sprint 1 end.

21 Appendix — reference code

A1. package.json (minimal)

{
  "name": "seonav-orchestrator",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/server.js",
    "dev": "tsx watch src/server.ts",
    "test": "vitest",
    "migrate": "node dist/migrate.js"
  },
  "dependencies": {
    "@anthropic-ai/sdk": "latest",
    "express": "^4.21.0",
    "better-sqlite3": "^11",
    "pino": "^9",
    "dotenv": "^16",
    "zod": "^3",
    "commander": "^12",
    "googleapis": "^140",
    "@slack/web-api": "^7"
  },
  "devDependencies": {
    "typescript": "^5",
    "tsx": "^4",
    "vitest": "^1",
    "@types/node": "^20",
    "@types/express": "^4"
  }
}

Notes: zod is for config schema validation, commander for CLI parsing, googleapis for Drive. package.json must be strict JSON, so these notes live outside the file. Replace "latest" with an exact @anthropic-ai/sdk version before shipping, in line with the pin-the-SDK mitigation in Section 20.

A2. SQLite schema

-- migrations/001_initial_schema.sql

CREATE TABLE runs (
  session_id      TEXT PRIMARY KEY,
  agent_name      TEXT NOT NULL,
  trigger_source  TEXT NOT NULL,     -- 'webhook' | 'cli' | 'slack' | 'cron'
  client_id       TEXT,
  payload_hash    TEXT,
  started_at      INTEGER NOT NULL,  -- unix ms
  ended_at        INTEGER,
  status          TEXT NOT NULL,     -- see state machine
  total_cost_usd  REAL,
  error_message   TEXT,
  metadata_json   TEXT
);

CREATE INDEX idx_runs_agent_started ON runs(agent_name, started_at DESC);
CREATE INDEX idx_runs_client ON runs(client_id);

CREATE TABLE idempotency_keys (
  key         TEXT PRIMARY KEY,
  session_id  TEXT,
  created_at  INTEGER NOT NULL
);
-- TTL cleanup: DELETE FROM idempotency_keys WHERE created_at < ? (24h ago)
-- Run daily via cron.

CREATE TABLE delivery_attempts (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id    TEXT NOT NULL,
  channel       TEXT NOT NULL,       -- 'slack' | 'clickup' | 'drive'
  status        TEXT NOT NULL,       -- 'success' | 'failed'
  response      TEXT,
  attempted_at  INTEGER NOT NULL,
  FOREIGN KEY (session_id) REFERENCES runs(session_id)
);

A3. Health endpoint

// src/server.ts (excerpt)
app.get('/health', (_, res) => {
  res.json({
    status: 'ok',
    uptime: process.uptime(),
    version: pkg.version,
    agents_loaded: Object.keys(loadedConfigs),
    db_ok: db.prepare('SELECT 1').get() !== undefined
  });
});

📚 References