Related: For the Sentinel PRD defining WHAT runs, see PRD Local Automation · This doc = orchestration harness (HOW it triggers) · That doc = agent definition (WHAT runs)
Product Requirements Document · v1.0 · April 19, 2026

PRD — Agent Orchestration Harness

Reusable orchestration layer that triggers Claude Managed Agent sessions, streams their events, and routes their outputs. Built once for SEO Sentinel v1 (the first agent in the Local SEO Automation workflow) but designed as a generic harness that every future agent — Content Catalyst, Revenue Relay, Ad Arbitrage, Build Bot, PM Pulse — plugs into via a per-agent config file. This PRD pairs with prd_seo_sentinel_v1.html; the Sentinel PRD defines WHAT runs, this PRD defines HOW it gets triggered and where its output goes.

Owner: Trung (responsible) · Jake (accountable)
Target ship date: End of Sprint 1 W3 (Day 22)
Est. IT effort: ~14–18 hours
Reusable by: all 6 agents in the swarm roadmap
🔗 Relationship to the Sentinel PRD

This orchestration harness is built as a separate, reusable service because it's not Sentinel-specific. Every future Managed Agent (Catalyst, Revenue Relay, etc.) will register with this harness via its agent config. Building it properly once saves rebuilding it 5 times later.

Source of truth split:

Contents

  1. Executive summary
  2. Scope — v1 vs v2 vs v3
  3. Architecture overview
  4. Tech stack decisions
  5. Agent config schema
  6. Trigger sources
  7. Session lifecycle state machine
  8. SSE event handling
  9. Output routing
  10. Error handling & retries
  11. Idempotency
  12. Logging & monitoring
  13. HITL gate handler (v2 stub)
  14. VPS setup & deployment
  15. Project structure
  16. Environment variables
  17. Test plan
  18. Definition of Done
  19. Open questions
  20. Risks
  21. Appendix — reference code

1 Executive summary

A small Node.js service running on a VPS that (1) accepts triggers from ClickUp / Slack / cron / manual CLI, (2) reads a per-agent config to determine which Managed Agent to invoke, (3) resolves the client handoff payload from ClickUp or a passed ID, (4) creates an Anthropic session, (5) streams SSE events, (6) fetches output files when the session ends, (7) routes deliverables to Slack / ClickUp / Drive. Future agents plug in by adding a JSON config file — no code changes required for new agents that follow the standard pattern.

2 Scope: v1 vs v2 vs v3 · Ship small, design for growth

| Feature | Version | Why this phasing |
|---|---|---|
| Manual CLI trigger (./bin/run-agent sentinel --client shamrock) | v1 | Lets Trung + Jake dogfood from day 1 without webhook infrastructure dependencies. |
| ClickUp webhook trigger (task status → "Ready (Automate)") | v1 | Primary production trigger. Fires Sentinel automatically when a client is ready for audit. |
| Session create + SSE stream | v1 | Core capability. Can't ship without this. |
| Output fetch + Slack/ClickUp/Drive delivery | v1 | The whole point of running the agent. |
| Agent config file pattern | v1 | Cheap at design time, expensive to retrofit. Do it right the first time. |
| Basic error handling + Slack alerts | v1 | Required for production confidence. |
| Idempotency (prevent double-triggers) | v1 | Webhooks retry. Without idempotency, one task can spin up 3 sessions. |
| Slack slash command (/sentinel audit <client>) | v2 | Nice-to-have for ad-hoc reruns. Build after production stability. |
| Cron / scheduled triggers (monthly recurring) | v2 | Needed for Workflow 4 (Monthly Report). Not Sentinel v1 scope. |
| HITL gate handler (Slack Approve/Deny buttons) | v2 | Sentinel v1 is read-only, so it has no gated actions. Becomes critical when PM Pulse ships client-facing outputs. |
| Multi-agent fanout (one trigger → multiple sessions) | v2 | Needed when PM Pulse delegates. Not v1 scope. |
| Retry queue with backoff (BullMQ or similar) | v2 | In-memory retry is fine for v1 volume (few runs per day). Upgrade when volume justifies. |
| Cost dashboard (per-session, per-client, per-agent) | v3 | Can read from Anthropic Console manually for now. Build when we have >5 agents running. |
| Multi-tenant / multi-workspace support | v3 | Only needed if we productize this for other agencies. YAGNI for now. |

v1 scope is everything needed to run Sentinel on a real client by Day 22. Everything else is explicit v2/v3.

3 Architecture overview

# Single service, single process, multiple entry points

               ┌─────────────────────────────────┐
┌──────────┐   │                                 │
│ ClickUp  │───┤  HTTP server (Express)          │
│ webhook  │   │  POST /webhook/:agent_name      │
└──────────┘   │                                 │
               │  CLI: ./bin/run-agent <name> ...│
┌──────────┐   │                                 │
│ Manual   │───┤                                 │
│ CLI      │   │  ┌───────────────────────────┐  │
└──────────┘   │  │ Dispatcher                │  │
               │  │ 1. Load agent config      │  │
               │  │ 2. Resolve client payload │  │
               │  │ 3. Create session         │  │
               │  │ 4. Stream + process events│  │
               │  │ 5. Fetch outputs          │  │
               │  │ 6. Route deliverables     │  │
               │  └─────────┬─────────────────┘  │
               │            │                    │
               │            ▼                    │
               │  ┌────────────────────┐         │
               │  │ SQLite (local)     │         │
               │  │ - run log          │         │
               │  │ - idempotency keys │         │
               │  │ - cost tracking    │         │
               │  └────────────────────┘         │
               └────────────┬────────────────────┘
                            │
              ┌─────────────┼──────────────┐
              ▼             ▼              ▼
        Anthropic API   ClickUp API    Slack API
        (sessions,      (task read +   (post results)
         events,         update)
         files)

        Google Drive API
        (upload audit files)

Design principles

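  1. Single process, in-process flow for v1. No workers, no queues. One Node process handles trigger → session → delivery as a single flow; for webhooks, the HTTP 200 is sent first and the rest continues in the background (Section 6). Resilient enough for <10 runs per day. Simpler = fewer failure modes.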
  2. Agent-agnostic dispatcher. The dispatcher doesn't know what "Sentinel" does. It reads the agent's config file to determine which Managed Agent ID to invoke, which ClickUp fields to read from the triggering task, and which Slack/Drive channels to post to. Adding Catalyst later = new config file, zero code change.
  3. Local state in SQLite. No external database for v1. SQLite file on VPS tracks run history + idempotency keys + cost. Upgrade to Postgres when we need multi-process concurrency.
  4. Fail loud in Slack, not silent. Any unexpected error posts a formatted alert to #seo-automation-alerts with session ID, agent name, error, and last event. Never swallow exceptions.
  5. Event log for post-mortems. Every SSE event persisted to ./logs/sessions/{session_id}.jsonl. Size is manageable (~MB per session). Kept 90 days.

4 Tech stack decisions · Opinionated defaults, swap if Trung prefers

| Component | Default choice | Why | Swap if |
|---|---|---|---|
| Language | Node.js 20+ (TypeScript) | Anthropic's TS SDK is the most feature-complete. Webhook handling + SSE streaming is idiomatic in Node. Jake's team already uses JS for GHL/ClickUp custom integrations. | Trung prefers Python. The Python SDK is also fully supported; just use the anthropic package and FastAPI. |
| HTTP framework | Express 4.x | Minimal, well-known, webhook-friendly. | Trung prefers Fastify (faster, better TS ergonomics) or Hono (edge-ready). Functionally equivalent for our scale. |
| State / persistence | better-sqlite3 (synchronous SQLite) | Single-file DB, no server, transactional, fast enough for our volume. | Expected volume >50 runs/day → upgrade to Postgres on same VPS. |
| Anthropic SDK | @anthropic-ai/sdk | Official. Sets beta headers automatically. TypeScript types for events. | No swap. This is the canonical client. |
| Config format | JSON files in ./agents/ | Diffable, version-controllable, no build step, zero tooling. | Larger team + environment differentiation → YAML with schema validation via Ajv. |
| Hosting | Hetzner CX22 (~$5/mo) or DigitalOcean droplet (~$12/mo) | Cheap. Enough CPU/memory. Single small service. Docker not needed for v1. | Already running a Kubernetes cluster: deploy there. Overkill otherwise. |
| Process manager | systemd service | Native on Ubuntu. Auto-restart on crash. Logs to journalctl. Simple. | Using PM2 elsewhere: fine to use here too. |
| HTTPS / webhook endpoint | Caddy reverse proxy | Auto-provisions Let's Encrypt cert. 5-line config. Zero fuss. | Using nginx elsewhere: also fine. |
| Secret management | systemd EnvironmentFile with 0600 perms OR .env + dotenv-safe | No external secret store needed at this scale. File-based with restrictive perms is adequate. | Organizational policy requires Vault / AWS Secrets Manager. |

5 Agent config schema · The core abstraction that makes this reusable

One JSON file per agent in ./agents/{agent_name}.json. The dispatcher reads this to know how to handle the agent's session.

// ./agents/sentinel.json: v1 config for SEO Sentinel
// The // comments are annotations for this PRD. Strip them (or parse the file as JSONC)
// before loading, because JSON.parse rejects comments.
{
  "agent_name": "sentinel",
  "display_name": "SEO Sentinel",
  "description": "Local SEO audit — runs 5 modules on a single client",
  "version": "1.0.0",

  // Anthropic IDs, from agent + environment creation (Sentinel PRD §7, §8)
  "anthropic": {
    "agent_id_env": "SENTINEL_AGENT_ID",        // read from env var
    "environment_id_env": "SEONAV_ENV_ID",      // read from env var
    "beta_headers": ["managed-agents-2026-04-01"],
    "additional_beta_headers_for_output_fetch": ["files-api-2025-04-14"]
  },

  // Where the kickoff payload comes from
  "payload_source": {
    "type": "clickup_task",                     // or "manual" or "slack_command"
    "clickup": {
      // Custom fields on the ClickUp task that we read into the payload
      "required_fields": [
        "client_id", "client_name", "business_address", "gbp_url",
        "website_url", "service_area_cities", "seed_keywords", "competitors"
      ],
      "optional_fields": ["priority_services_ranked", "business_usps"]
    }
  },

  // What triggers a run
  "triggers": {
    "clickup_webhook": {
      "enabled": true,
      "list_id": "<LIST_ID_FOR_SENTINEL_TASKS>",
      "trigger_status": "Ready (Automate)"
    },
    "slack_command": { "enabled": false },      // v2
    "cron": { "enabled": false },               // v2
    "manual_cli": { "enabled": true }
  },

  // How the user.message is constructed (one JSON string; newlines are escaped as \n)
  "kickoff_template": {
    "prompt": "Run a full Local SEO audit for this client. Execute all 5 modules. Write structured output to /mnt/session/outputs/sentinel-audit-{client_id}.json and .md.\n\nCLIENT PAYLOAD:\n{payload_json}"
  },

  // Duration and cost limits
  "limits": {
    "max_session_duration_minutes": 60,
    "max_session_cost_usd": 5.00,
    "alert_if_cost_exceeds_usd": 3.00
  },

  // Where outputs go
  "delivery": {
    "slack": {
      "enabled": true,
      "channel": "#seo-automation",
      "post_summary": true,
      "post_links_to_outputs": true
    },
    "clickup": {
      "enabled": true,
      "update_task_status": "Review",
      "post_comment": true,
      "attach_output_files": true
    },
    "google_drive": {
      "enabled": true,
      "folder_template": "/Clients/{client_name}/Audits/",
      "file_name_template": "sentinel-audit-{timestamp}"
    }
  },

  // Alerting
  "alerts": {
    "on_failure_channel": "#seo-automation-alerts",
    "tag_on_failure": ["@trung"]
  }
}
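
The kickoff prompt in the config relies on {client_id} and {payload_json} placeholders. A minimal sketch of how they might be filled, assuming a hypothetical renderTemplate helper (the name appears in the reference dispatcher code, but its behavior is not specified in this PRD). Unknown placeholders are left untouched here so that a typo in a config stays visible in the rendered prompt.

```typescript
// Hypothetical helper: fills {name} placeholders in a kickoff template.
// The replacement is single-pass, so braces inside an inserted value
// (for example the payload JSON) are never re-scanned as placeholders.
export function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match
  );
}

// Example with an illustrative payload (not a real client record).
const demoPayload = { client_id: 'shamrock', client_name: 'Shamrock Detailing' };
const kickoffText = renderTemplate(
  'Write output to sentinel-audit-{client_id}.json\n\nCLIENT PAYLOAD:\n{payload_json}',
  { client_id: demoPayload.client_id, payload_json: JSON.stringify(demoPayload, null, 2) }
);
console.log(kickoffText);
```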
✨ Why this schema pays off

To add Content Catalyst later, Trung drops in ./agents/catalyst.json with different agent_id, kickoff prompt, required ClickUp fields, and delivery channels. Zero code changes to the dispatcher. Same for Revenue Relay, Ad Arbitrage, Build Bot. The abstraction pays for itself on the second agent.
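
Because every new agent arrives as a config file, load-time validation is what keeps "zero code changes" safe. A minimal sketch of that check, assuming a hypothetical validateAgentConfig function and covering only a few fields. A real implementation would more likely use the zod schema listed in the appendix dependencies.

```typescript
// Sketch only: field names mirror sentinel.json; the validator itself is hypothetical.
interface AgentConfigLite {
  agent_name: string;
  anthropic: { agent_id_env: string; environment_id_env: string };
}

// Returns a list of human-readable problems; an empty list means the config is usable.
export function validateAgentConfig(
  raw: unknown,
  env: Record<string, string | undefined>
): string[] {
  const errors: string[] = [];
  const cfg = raw as Partial<AgentConfigLite> | null;

  if (typeof cfg?.agent_name !== 'string' || !/^[a-z][a-z0-9-]*$/.test(cfg.agent_name)) {
    errors.push('agent_name must be a lowercase slug');
  }
  // The config stores env var NAMES, so check both the name and the resolved value.
  for (const key of ['agent_id_env', 'environment_id_env'] as const) {
    const envName = cfg?.anthropic?.[key];
    if (typeof envName !== 'string') errors.push(`anthropic.${key} is missing`);
    else if (!env[envName]) errors.push(`env var ${envName} is not set`);
  }
  return errors;
}

console.log(
  validateAgentConfig(
    { agent_name: 'sentinel', anthropic: { agent_id_env: 'SENTINEL_AGENT_ID', environment_id_env: 'SEONAV_ENV_ID' } },
    { SENTINEL_AGENT_ID: 'agent_abc123', SEONAV_ENV_ID: 'env_xyz789' }
  )
);
```

Failing fast at startup (refusing to boot when the returned list is non-empty) surfaces a bad catalyst.json before the first webhook arrives.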

6 Trigger sources · How a run starts

Manual CLI v1

# Primary day-1 trigger for Trung + SEO Lead dogfooding
./bin/run-agent sentinel \
  --client shamrock-detailing-columbus \
  --clickup-task tk_abc123        # optional: resolves payload from task

# Or with inline payload (for T1 synthetic client)
./bin/run-agent sentinel --payload-file ./fixtures/synthetic-client.json

# Prints: session ID, live SSE stream output, final deliverable paths
# Exits: 0 on success (session ended with end_turn), non-zero on failure

ClickUp webhook v1

ClickUp fires a webhook on task status change. Our endpoint receives it, validates, and triggers the run.

# HTTP endpoint
POST https://orch.seonavigator.online/webhook/sentinel
Content-Type: application/json
X-Signature: <ClickUp HMAC signature>

{
  "event": "taskStatusUpdated",
  "task_id": "tk_abc123",
  "history_items": [{ "after": { "status": "Ready (Automate)" } }]
}

Handler responsibilities

  1. Verify HMAC signature using shared secret. Return 401 on mismatch.
  2. Idempotency check: has this agent_name + task_id + status-change timestamp (the key defined in Section 11) already been processed? If yes, return 200 immediately (don't re-run).
  3. Resolve agent config: the URL path tells us which agent (/webhook/sentinel → sentinel.json).
  4. Fetch task details from ClickUp API. Extract required_fields from task custom fields. If any missing, post error to Slack, update task status to "Blocked", return 200.
  5. Build kickoff payload, dispatch to session creation logic.
  6. Return 200 to ClickUp within 3s (webhook timeout). The actual session runs asynchronously — we acknowledge fast, work slow.
⚠️ Webhook must return fast

ClickUp webhooks time out at ~3 seconds and retry 3 times on failure. Don't wait for the full session to complete before responding. Return 200 immediately, run the session in a background promise. If the session fails later, alert via Slack, not via HTTP response.
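
Step 1 of the handler (signature verification) can be sketched as below. This assumes HMAC-SHA256 over the raw request body with a hex-encoded signature, which is exactly the detail flagged as unverified in the open questions; the function name verifySignature is hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// ASSUMPTION: HMAC-SHA256 over the raw body, hex-encoded. Confirm against ClickUp's docs.
export function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Example with an illustrative secret and body.
const demoBody = '{"event":"taskStatusUpdated","task_id":"tk_abc123"}';
const demoSignature = createHmac('sha256', 'demo-secret').update(demoBody).digest('hex');
console.log(verifySignature(demoBody, demoSignature, 'demo-secret')); // true
console.log(verifySignature(demoBody, demoSignature, 'wrong-secret')); // false
```

The signature has to be computed over the bytes as received, not over a re-serialized JSON object. In Express, the verify option of express.json() exposes the raw buffer for this purpose.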

Future triggers (v2)

Both follow the same dispatcher pattern — just different payload sources.

7 Session lifecycle state machine · The spine of the dispatcher

CREATED ──▶ SENDING_KICKOFF ──▶ STREAMING ──▶ FETCHING_OUTPUTS ──▶ DELIVERING ──▶ DONE
                                    │
                                    ├──▶ REQUIRES_ACTION (v2: HITL gate) ──▶ AWAITING_HUMAN ──▶ STREAMING
                                    │
                                    ├──▶ BUDGET_EXCEEDED ──▶ TERMINATED
                                    │
                                    └──▶ ERROR ──▶ ALERTED

| State | Trigger event (from Anthropic SSE) | Dispatcher action |
|---|---|---|
| CREATED | After POST /v1/sessions returns 201 | Record session_id + trigger_id in SQLite. Move to SENDING_KICKOFF. |
| SENDING_KICKOFF | | POST user.message event with kickoff prompt. Move to STREAMING. |
| STREAMING | Any agent.* event | Log event to logs/sessions/{id}.jsonl. Extract text from agent.message events. |
| STREAMING | session.status_idle with stop_reason: end_turn | Close stream. Move to FETCHING_OUTPUTS. |
| STREAMING | session.status_idle with stop_reason: requires_action | v1: log unexpected, alert. v2: move to REQUIRES_ACTION, route to Slack approval. |
| FETCHING_OUTPUTS | | GET /v1/files?scope_id={session_id} with Files beta header. Download each to ./tmp/{session_id}/. Move to DELIVERING. |
| DELIVERING | | Execute each enabled delivery channel in config (Slack, ClickUp, Drive). Collect errors but don't abort. |
| DONE | | Update SQLite run log with final status + cost. Slack success message. Clean up tmp files. |
| ERROR | Exception thrown anywhere | Log full error + session context. Slack alert to #seo-automation-alerts tagging @trung. Attempt to mark ClickUp task "Blocked". |
| BUDGET_EXCEEDED | Cost sampler detects session exceeds max_session_cost_usd | Send user.interrupt event to halt agent. Move to TERMINATED. Alert Slack. |
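
The states above can be encoded as a transition table so that an illegal jump fails loudly instead of silently corrupting the run log. This is a sketch under two assumptions not stated explicitly in the PRD: the REQUIRES_ACTION and BUDGET_EXCEEDED branches leave from STREAMING, and ERROR is reachable from every non-terminal state ("exception thrown anywhere"). The names RunState, TRANSITIONS, and transition are hypothetical.

```typescript
type RunState =
  | 'CREATED' | 'SENDING_KICKOFF' | 'STREAMING' | 'FETCHING_OUTPUTS' | 'DELIVERING' | 'DONE'
  | 'REQUIRES_ACTION' | 'AWAITING_HUMAN' | 'BUDGET_EXCEEDED' | 'TERMINATED' | 'ERROR' | 'ALERTED';

// Allowed next states for each state. Terminal states map to an empty list.
const TRANSITIONS: Record<RunState, RunState[]> = {
  CREATED: ['SENDING_KICKOFF', 'ERROR'],
  SENDING_KICKOFF: ['STREAMING', 'ERROR'],
  STREAMING: ['FETCHING_OUTPUTS', 'REQUIRES_ACTION', 'BUDGET_EXCEEDED', 'ERROR'],
  FETCHING_OUTPUTS: ['DELIVERING', 'ERROR'],
  DELIVERING: ['DONE', 'ERROR'],
  REQUIRES_ACTION: ['AWAITING_HUMAN', 'ERROR'], // v2 only
  AWAITING_HUMAN: ['STREAMING', 'ERROR'],       // v2 only
  BUDGET_EXCEEDED: ['TERMINATED'],
  ERROR: ['ALERTED'],
  DONE: [],
  TERMINATED: [],
  ALERTED: [],
};

// Returns the new state, or throws if the move is not in the table.
export function transition(from: RunState, to: RunState): RunState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}

let state: RunState = 'CREATED';
state = transition(state, 'SENDING_KICKOFF');
state = transition(state, 'STREAMING');
console.log(state); // STREAMING
```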
💡 Cost sampling for budget exceeded

Every ~30 seconds during STREAMING, fetch the session via GET /v1/sessions/{id} and compute cost from usage.input_tokens / output_tokens / cache_read_input_tokens. If projected exceeds alert_if_cost_exceeds_usd, Slack warning. If exceeds max_session_cost_usd, interrupt the session. Protects against runaway loops.
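
The cost arithmetic behind the sampler can be sketched as follows. The per-million-token rates are placeholders, not real pricing, and the names estimateCostUsd and budgetAction are hypothetical; only the usage field names and the two limit fields come from this PRD.

```typescript
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens: number;
}
// USD per one million tokens. PLACEHOLDER values; take real rates from the pricing page.
interface RatesPerMTok { input: number; output: number; cacheRead: number }
interface Limits { max_session_cost_usd: number; alert_if_cost_exceeds_usd: number }

export function estimateCostUsd(u: Usage, r: RatesPerMTok): number {
  const weighted =
    u.input_tokens * r.input +
    u.output_tokens * r.output +
    u.cache_read_input_tokens * r.cacheRead;
  return weighted / 1_000_000;
}

// 'interrupt' above the hard cap, 'warn' above the alert threshold, otherwise 'ok'.
export function budgetAction(costUsd: number, l: Limits): 'ok' | 'warn' | 'interrupt' {
  if (costUsd > l.max_session_cost_usd) return 'interrupt';
  if (costUsd > l.alert_if_cost_exceeds_usd) return 'warn';
  return 'ok';
}

const demoRates: RatesPerMTok = { input: 2, output: 10, cacheRead: 1 }; // placeholder rates
const demoCost = estimateCostUsd(
  { input_tokens: 500_000, output_tokens: 250_000, cache_read_input_tokens: 0 },
  demoRates
);
console.log(demoCost, budgetAction(demoCost, { max_session_cost_usd: 5, alert_if_cost_exceeds_usd: 3 }));
```

With the placeholder rates, the demo usage costs (500,000 × 2 + 250,000 × 10) / 1,000,000 = 3.5 USD, which is above the 3.00 alert threshold but below the 5.00 cap.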

8 SSE event handling · The event loop

Critical rules from the Managed Agents docs

  1. Send the user event BEFORE opening the stream. The API buffers events until a stream attaches. If you open the stream first, you may miss the initial burst.
  2. Always check session.status_idle events for stop_reason. end_turn = done. requires_action = session paused waiting on you.
  3. Reconnect on stream drop. Network blip shouldn't kill a 30-minute session. On connection error, wait 2s and reopen the stream — Anthropic buffers events so you don't lose them.
  4. Log every event. Post-mortem debugging is impossible without the event log.

Event types we care about in v1

| Event type | What to do |
|---|---|
| agent.message | Extract content[].text. Write to console in CLI mode. Nothing else: final output is in files, not messages. |
| agent.tool_use | Log tool name + input summary. Useful for cost debugging. No action. |
| agent.mcp_tool_use | Log MCP name + tool + input. Observability only, same as above. |
| agent.custom_tool_use | v1: we have no custom tools. If this fires, it's a bug, so alert. |
| session.status_idle | Check stop_reason. Drive state machine. |
| session.thread_created | v1: single-thread only. If this fires, log. v2: multi-agent fanout handling. |
| span.* | Telemetry. Log for debugging. Ignore in v1 logic. |

Minimal handler implementation (TypeScript)

// src/dispatcher.ts: core event loop (abbreviated)
import Anthropic from '@anthropic-ai/sdk';
// Assumed to come from the project modules in Section 15 (imports omitted here):
// db, logger, renderTemplate, appendEventLog, extractText, alertSlack,
// interruptSession, deliverOutputs, sleep, and the AgentConfig / ClientPayload types.
// Check the beta.sessions method names against the pinned SDK version.

export async function runAgentSession(config: AgentConfig, payload: ClientPayload) {
  const anthropic = new Anthropic();

  // 1. Create session
  const session = await anthropic.beta.sessions.create({
    agent: process.env[config.anthropic.agent_id_env]!,
    environment_id: process.env[config.anthropic.environment_id_env]!,
    title: `${config.display_name} · ${payload.client_name}`
  }, {
    headers: { 'anthropic-beta': config.anthropic.beta_headers.join(',') }
  });
  await db.recordRunStart({ session_id: session.id, agent: config.agent_name, payload });

  // 2. Send kickoff BEFORE streaming
  const kickoffText = renderTemplate(config.kickoff_template.prompt, {
    payload_json: JSON.stringify(payload, null, 2),
    client_id: payload.client_id
  });
  await anthropic.beta.sessions.events.create(session.id, {
    events: [{ type: 'user.message', content: [{ type: 'text', text: kickoffText }] }]
  });

  // 3. Stream events with reconnect
  let outcome: 'end_turn' | 'requires_action' | 'duration_exceeded' | null = null;
  const deadline = Date.now() + config.limits.max_session_duration_minutes * 60_000;

  while (outcome === null) {
    // Also check the deadline between reconnects, in case no events arrive at all.
    if (Date.now() > deadline) {
      await interruptSession(anthropic, session.id, 'duration_exceeded');
      outcome = 'duration_exceeded';
      break;
    }
    try {
      const stream = await anthropic.beta.sessions.stream(session.id);
      for await (const event of stream) {
        await appendEventLog(session.id, event);

        switch (event.type) {
          case 'agent.message':
            logger.info({ session: session.id, text: extractText(event) }, 'agent.message');
            break;
          case 'agent.tool_use':
          case 'agent.mcp_tool_use':
            logger.debug({ session: session.id, tool: event.name }, 'tool_use');
            break;
          case 'session.status_idle':
            if (event.stop_reason?.type === 'end_turn') {
              outcome = 'end_turn';
            } else if (event.stop_reason?.type === 'requires_action') {
              // v1: not expected, alert and halt
              await alertSlack(config, session.id, 'Unexpected requires_action in v1');
              outcome = 'requires_action';
            }
            break;
        }

        // Wall-clock duration check on every event (cheap)
        if (outcome === null && Date.now() > deadline) {
          await interruptSession(anthropic, session.id, 'duration_exceeded');
          outcome = 'duration_exceeded';
        }
        // `break` inside the switch only exits the switch, so leave the stream loop here.
        if (outcome !== null) break;
      }
    } catch (err) {
      logger.warn({ err, session: session.id }, 'stream dropped, reconnecting in 2s');
      await sleep(2000);
    }
  }

  // 4. Fetch outputs + deliver, but only after a clean finish
  if (outcome !== 'end_turn') {
    logger.error({ session: session.id, outcome }, 'session did not end cleanly, skipping delivery');
    return; // run-log update for failed runs omitted in this abbreviated version
  }
  await deliverOutputs(config, session.id, payload);
  await db.recordRunComplete(session.id);
}

9 Output routing · Where deliverables go

The flow

  1. List session files: GET /v1/files?scope_id={session_id} with files-api-2025-04-14 beta.
  2. Download each file via GET /v1/files/{id}/content. Save to ./tmp/{session_id}/.
  3. For each enabled delivery channel in config.delivery, execute in parallel. Collect results.
  4. Post a final summary to the main Slack channel with links to all deliverables.
  5. Clean up tmp files.
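
Step 3 (run every enabled channel in parallel and collect results rather than aborting on the first failure) maps naturally onto Promise.allSettled. A minimal sketch with hypothetical handler signatures; real handlers would call the Slack, ClickUp, and Drive APIs. Note that if the Slack and ClickUp messages need the Drive link, Drive would have to finish first; this sketch ignores that ordering.

```typescript
type Channel = 'slack' | 'clickup' | 'google_drive';
// Hypothetical handler shape: resolves to a link or ID used in the final summary.
type Handler = () => Promise<string>;

export async function deliverAll(handlers: Partial<Record<Channel, Handler>>) {
  const entries = Object.entries(handlers) as [Channel, Handler][];
  // Promise.resolve().then(run) also converts a synchronous throw into a rejection.
  const settled = await Promise.allSettled(entries.map(([, run]) => Promise.resolve().then(run)));

  const delivered: Partial<Record<Channel, string>> = {};
  const failed: Partial<Record<Channel, string>> = {};
  settled.forEach((result, i) => {
    const channel = entries[i][0];
    if (result.status === 'fulfilled') delivered[channel] = result.value;
    else failed[channel] = String(result.reason);
  });
  return { delivered, failed };
}

// Example with stub handlers: Drive fails, Slack still succeeds.
deliverAll({
  slack: async () => 'message-posted',
  google_drive: async () => { throw new Error('upload failed'); },
}).then((result) => console.log(result));
```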

Channel handlers (abbreviated)

| Channel | What it does | Failure mode |
|---|---|---|
| Google Drive | Upload files to folder resolved from template (/Clients/{client_name}/Audits/). Create folder if missing. Record returned web_view_link for linking in other channels. | Log error, continue with other channels. Don't block Slack/ClickUp on Drive failure. |
| ClickUp | Post task comment with run summary + Drive links. Update task status per config. Attach output files directly to the task. | Retry once on 5xx. On persistent fail, Slack alert. |
| Slack | Post to #seo-automation: structured message with client name, agent name, run duration, cost, 3-bullet summary extracted from markdown output, link to Drive audit, link to ClickUp task. | Retry once. Slack is usually reliable. |

Slack message format (design)

╭────────────────────────────────────────────────╮
│ 🛰️ SEO Sentinel audit complete                 │
│                                                │
│ Client: Shamrock Detailing (Columbus, OH)      │
│ Duration: 23m 14s · Cost: $1.82                │
│                                                │
│ ⭐ Overall Score: 71/100                       │
│                                                │
│ Top 3 priorities:                              │
│ • P1: Add 8 missing GBP services               │
│ • P1: Fix NAP inconsistency in 6 directories   │
│ • P2: Build 5 city pages for service area      │
│                                                │
│ 📄 Full audit: [Drive link]                    │
│ 📋 ClickUp task: [Task link]                   │
│ 🔍 Session trace: [Console link]               │
╰────────────────────────────────────────────────╯
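
The text portion of that message could be produced by a small pure function, which keeps it unit-testable without touching Slack. This is a sketch only: the RunSummary shape and both function names are hypothetical, and links are omitted.

```typescript
interface RunSummary {
  agentName: string;
  clientName: string;
  durationMs: number;
  costUsd: number;
  score: number;
  priorities: string[];
}

// 1394000 ms -> "23m 14s"
export function formatDuration(ms: number): string {
  const totalSeconds = Math.round(ms / 1000);
  return `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`;
}

export function renderSlackSummary(s: RunSummary): string {
  return [
    `🛰️ ${s.agentName} audit complete`,
    `Client: ${s.clientName}`,
    `Duration: ${formatDuration(s.durationMs)} · Cost: $${s.costUsd.toFixed(2)}`,
    `⭐ Overall Score: ${s.score}/100`,
    'Top 3 priorities:',
    ...s.priorities.slice(0, 3).map((p) => `• ${p}`), // never more than three bullets
  ].join('\n');
}

console.log(
  renderSlackSummary({
    agentName: 'SEO Sentinel',
    clientName: 'Shamrock Detailing (Columbus, OH)',
    durationMs: 1_394_000,
    costUsd: 1.82,
    score: 71,
    priorities: ['P1: Add 8 missing GBP services', 'P1: Fix NAP inconsistency in 6 directories'],
  })
);
```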

10 Error handling & retries

| Failure | Detection | Response | Retry? |
|---|---|---|---|
| ClickUp webhook signature invalid | HMAC mismatch | Return 401, log source IP | No |
| ClickUp task missing required fields | Config schema validation on task fetch | Post clear error to Slack with list of missing fields. Update task status → "Blocked". Return 200 to webhook. | No (human fixes fields + re-triggers) |
| Anthropic session creation 5xx | HTTP status | Retry 3x with exponential backoff (1s, 4s, 16s). Log to SQLite run table. | Yes |
| Anthropic 4xx on session creation | HTTP status | Log, alert. Don't retry: it's our bug (bad agent_id, quota, etc.) | No |
| SSE stream dropped mid-run | Connection error on for await | Wait 2s, reopen stream. Anthropic buffers events, so we don't lose anything. | Yes (transparent) |
| Agent session exceeds budget | Cost sampler | Send user.interrupt, mark session TERMINATED, Slack alert | No |
| Agent session exceeds duration | Wall clock | Same as budget exceeded | No |
| Output file download fails | Files API error | Retry 3x. If persistent fail, Slack alert with session ID so human can manually fetch from Console. | Yes |
| Slack/ClickUp/Drive delivery fails | API error per channel | Retry once. Collect all delivery errors, post combined failure summary. Outputs remain in tmp/ until manually cleared. | Yes (once) |
| SQLite write failure | better-sqlite3 throws | Log to file. Alert. Run continues, since state is reconstructable from event log. | No |
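
The "retry 3x with exponential backoff (1s, 4s, 16s)" rows can share one helper. A minimal sketch, assuming a hypothetical withRetry function; the sleep function is injectable so tests don't wait 21 seconds, and isRetryable lets callers skip retries for 4xx responses.

```typescript
export async function withRetry<T>(
  fn: () => Promise<T>,
  delaysMs: number[] = [1000, 4000, 16000], // one delay per retry: 3 retries after the first attempt
  isRetryable: (err: unknown) => boolean = () => true,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up when retries are exhausted or the error is not worth retrying.
      if (attempt >= delaysMs.length || !isRetryable(err)) throw err;
      await sleep(delaysMs[attempt]);
    }
  }
}

// Hypothetical predicate: retry only when the thrown error carries a 5xx status.
export const isServerError = (err: unknown): boolean => {
  const status = (err as { status?: number } | null)?.status;
  return typeof status === 'number' && status >= 500;
};
```

Session creation would then be wrapped as withRetry(() => createSession(), undefined, isServerError), where createSession stands in for whatever function makes the API call.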

11 Idempotency

🔁 Why this matters

ClickUp webhooks retry on non-2xx response. Without idempotency, a slow response + retry can spawn duplicate sessions — two parallel Sentinel runs on the same client, wasting ~$3.50 and producing conflicting deliverables.

Idempotency key derivation

For webhooks: SHA256(agent_name + task_id + status_change_timestamp). Stored in idempotency_keys SQLite table with 24h TTL.

Flow

  1. On webhook arrival, compute key.
  2. INSERT OR IGNORE into table with key + session_id (or null if still creating).
  3. If INSERT succeeded → new run, proceed. If IGNORED → duplicate, return 200 with {already_processed: true, original_session: ...}.
  4. After session creation succeeds, UPDATE row with session_id.
// Simplified
import { createHash } from 'node:crypto';

const sha256 = (input: string) => createHash('sha256').update(input).digest('hex');

const key = sha256(`${agent_name}:${task_id}:${status_change_ts}`);
const inserted = db.prepare(
  `INSERT OR IGNORE INTO idempotency_keys (key, created_at) VALUES (?, ?)`
).run(key, Date.now());

if (inserted.changes === 0) {
  logger.info({ key }, 'duplicate webhook, ignoring');
  return res.json({ already_processed: true });
}
// ... proceed with session creation

12 Logging & monitoring

Three log streams

  1. Application log (pino) → stdout → journalctl. Structured JSON. Level: info in prod, debug if DEBUG=1.
  2. Session event log → ./logs/sessions/{session_id}.jsonl. Raw SSE events. 90-day retention; a cron job removes older files.
  3. Run history → SQLite runs table: session_id, agent, trigger_source, payload_hash, started_at, ended_at, status, total_cost, error.
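
The session event log (stream 2) is just one JSON object per line, appended as events arrive. A sketch under assumptions: the signature differs from the two-argument appendEventLog call in the reference dispatcher (the log directory is passed explicitly here), and the logged_at wrapper field is an illustration, not part of the PRD.

```typescript
import { appendFileSync, mkdirSync, mkdtempSync, readFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Appends one event as a single JSONL line and returns the log file path.
export function appendEventLog(logDir: string, sessionId: string, event: unknown): string {
  mkdirSync(logDir, { recursive: true });
  const file = join(logDir, `${sessionId}.jsonl`);
  appendFileSync(file, JSON.stringify({ logged_at: Date.now(), event }) + '\n');
  return file;
}

// Reads a session log back for post-mortems, skipping blank lines.
export function readEventLog(file: string): unknown[] {
  return readFileSync(file, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}

// Example against a throwaway temp directory rather than ./logs/sessions.
const demoDir = mkdtempSync(join(tmpdir(), 'orch-demo-'));
const demoFile = appendEventLog(demoDir, 'sess_demo', { type: 'agent.message' });
console.log(readEventLog(demoFile).length);
```

Synchronous appends keep event order trivially correct; at a few runs per day the blocking cost is negligible.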

Monitoring — day-1 minimum

Monitoring — later (v2+)

Grafana + Prometheus feels excessive for a 1-service VPS. Revisit when we have >3 agents + high volume.

13 HITL gate handler · Stub for v2

🚧 Intentionally deferred

Sentinel v1 is read-only — it audits, it doesn't publish anything. So no Human-In-The-Loop gates are required. HITL becomes critical in Sprint 2 when PM Pulse coordinates agents that produce client-facing deliverables (email drafts, GBP posts, published pages).

Design for when we add it

When the agent fires agent.tool_use with a tool that requires confirmation (configured in the Anthropic agent definition), the session emits session.status_idle with stop_reason: requires_action and requires_action.event_ids[] listing the blocking events.

Our dispatcher's REQUIRES_ACTION state handler will:

  1. Post a Slack interactive message with Approve / Deny buttons and the tool call context.
  2. Store pending session state in SQLite.
  3. Return control (release the stream connection).
  4. When human responds via Slack interaction, POST user.tool_confirmation event with result: allow or deny + optional deny_message.
  5. Reopen stream, resume STREAMING state.

For v1: stub that logs unexpected requires_action + Slack alerts. Full implementation in Sprint 2.

14 VPS setup & deployment

# One-time VPS provisioning: Ubuntu 24.04 LTS on Hetzner CX22

# Base
apt update && apt upgrade -y
apt install -y build-essential git curl ufw fail2ban

# Node 20 LTS
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt install -y nodejs

# Firewall
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable

# User for the service (not root)
useradd -m -s /bin/bash orch
usermod -aG sudo orch   # temp, remove after initial setup

# Caddy reverse proxy for HTTPS
apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
apt update && apt install -y caddy

# As user 'orch':
su - orch
cd ~
git clone <repo> orchestrator
cd orchestrator
npm ci
cp .env.example .env
# edit .env with secrets (ANTHROPIC_API_KEY, SENTINEL_AGENT_ID, etc.)
chmod 600 .env
npm run build

# As root:
exit   # back to root

# systemd service
cat > /etc/systemd/system/orchestrator.service <<'EOF'
[Unit]
Description=SEO Navigator Agent Orchestrator
After=network.target

[Service]
Type=simple
User=orch
WorkingDirectory=/home/orch/orchestrator
EnvironmentFile=/home/orch/orchestrator/.env
ExecStart=/usr/bin/node dist/server.js
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now orchestrator
systemctl status orchestrator   # verify running

# Caddy config
cat > /etc/caddy/Caddyfile <<'EOF'
orch.seonavigator.online {
    reverse_proxy localhost:3000
    encode gzip
    log {
        output file /var/log/caddy/access.log
    }
}
EOF

systemctl reload caddy   # auto-provisions Let's Encrypt cert

# Done. Test:
curl https://orch.seonavigator.online/health
# {"status":"ok","uptime":42}

15 Project structure

orchestrator/
├── src/
│   ├── server.ts              # Express app, HTTP routes
│   ├── dispatcher.ts          # Core session lifecycle
│   ├── config.ts              # Load + validate agent configs
│   ├── payload-resolver.ts    # Build client payload from ClickUp/manual
│   ├── delivery/
│   │   ├── slack.ts
│   │   ├── clickup.ts
│   │   └── drive.ts
│   ├── db.ts                  # SQLite wrapper
│   ├── logger.ts              # pino setup
│   └── util/
│       ├── idempotency.ts
│       ├── sse-stream.ts
│       └── cost-sampler.ts
│
├── agents/                    # Per-agent configs: the extensibility point
│   └── sentinel.json
│   # Future: catalyst.json, revenue-relay.json, etc.
│
├── bin/
│   └── run-agent              # CLI for manual triggers
│
├── fixtures/
│   └── synthetic-client.json  # T1 test payload
│
├── logs/                      # Runtime, gitignored
│   └── sessions/
│
├── data/
│   └── orchestrator.db        # SQLite, gitignored
│
├── migrations/
│   └── 001_initial_schema.sql
│
├── tests/
│   ├── dispatcher.test.ts
│   └── idempotency.test.ts
│
├── .env.example
├── package.json
├── tsconfig.json
└── README.md

16 Environment variables

# .env.example: Trung customizes this on the VPS
# Keep comments on their own lines. systemd's EnvironmentFile does not strip
# trailing "# ..." text, so an inline comment would become part of the value.

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
SENTINEL_AGENT_ID=agent_abc123
SEONAV_ENV_ID=env_xyz789

# ClickUp
CLICKUP_API_TOKEN=pk_...
CLICKUP_WEBHOOK_SECRET=<shared secret for HMAC>
CLICKUP_WORKSPACE_ID=9018614428

# Slack
SLACK_BOT_TOKEN=xoxb-...
# SLACK_SIGNING_SECRET is for the v2 slash command HMAC
SLACK_SIGNING_SECRET=...
SLACK_ALERT_CHANNEL_ID=C...
SLACK_RESULTS_CHANNEL_ID=C...

# Google Drive (service account or OAuth)
GOOGLE_SERVICE_ACCOUNT_JSON_PATH=/home/orch/orchestrator/gdrive-sa.json
GOOGLE_DRIVE_ROOT_FOLDER_ID=1abc...

# Service config
PORT=3000
NODE_ENV=production
LOG_LEVEL=info
DATABASE_PATH=/home/orch/orchestrator/data/orchestrator.db
SESSION_LOG_DIR=/home/orch/orchestrator/logs/sessions

# Optional: cost alert thresholds (fallback if not in agent config)
DEFAULT_MAX_SESSION_COST_USD=5.00
DAILY_COST_ALERT_THRESHOLD_USD=30.00

17 Test plan

Unit tests (Vitest)

Integration tests (run against Anthropic)

End-to-end (alongside Sentinel T1)

18 Definition of Done

✅ Orchestration harness v1 ships when
  1. Service running on VPS under systemd with auto-restart, HTTPS via Caddy.
  2. /health endpoint returns 200.
  3. agents/sentinel.json exists and passes schema validation.
  4. CLI trigger (./bin/run-agent sentinel ...) works end-to-end with synthetic payload.
  5. ClickUp webhook trigger works end-to-end on a test task.
  6. All 7 integration tests (OT1–OT7) pass.
  7. Successful Sentinel T1 run → Slack post received, ClickUp task commented, Drive file visible.
  8. Failed run injection (invalid agent ID) → Slack alert to #seo-automation-alerts tagging @trung.
  9. README in repo documents: how to deploy, how to add a new agent, how to read logs, how to rotate keys.
  10. SQLite runs table + event log working. Can query SELECT * FROM runs WHERE agent='sentinel' ORDER BY started_at DESC and see history.
  11. Jake + Trung have walked through the deployment together once (knowledge transfer).

19 Open questions

Things I don't know for certain — verify before committing.

  1. ClickUp webhook HMAC algorithm and header name. I've assumed X-Signature with HMAC-SHA256 but Trung should verify from ClickUp API docs. Blocker for: webhook security. Low risk — just look it up.
  2. Session reconnect behavior after long disconnect. If the VPS reboots mid-session, can we reconnect to the stream on recovery? Or is the session lost? Docs suggest events persist server-side. Worth verifying with a deliberate test. Not blocking v1 — rare edge case.
  3. Concurrent session limit per org. Rate limits state 60/min for create endpoints, but is there a cap on concurrent running sessions? We'd only hit it with high volume, but worth knowing. Ask Anthropic sales.
  4. ClickUp API rate limits when reading task details. Free tier is stricter. Verify our tier allows enough reads. Non-blocker — upgrade plan if needed.
  5. How Google Drive service account permissions propagate to shared drives. If client folders live in a shared drive, the SA needs explicit membership. Verify first client's folder setup before assuming this works. Non-blocker — standard Drive setup question.
  6. Slack bot token scopes needed. chat:write and files:write at minimum. Interactive message support needs interactivity in app config (v2 only).

20 Risks & mitigations

| Risk | Severity | Mitigation |
|---|---|---|
| VPS goes down; webhooks fail silently | High | Uptime monitor (UptimeRobot free) hits /health every 5min. SMS Trung on 3 consecutive fails. ClickUp will retry 3x; gives us ~15min to recover before data loss. |
| Webhook signature secret leaks (e.g., committed to git) | High | Pre-commit hook scans for CLICKUP_WEBHOOK_SECRET pattern. Rotate on suspicion. Use .env with chmod 600, gitignored. |
| Runaway session burns through budget | Med | Per-session cost cap + interrupt. Daily total cost alert. Both enforced client-side. |
| Duplicate sessions from webhook retries | Med | Idempotency (Section 11). Tested as part of OT6. |
| Stream disconnect causes partial delivery (Slack posted, Drive not uploaded) | Med | Idempotent delivery: each channel writes a completion marker to SQLite. Re-run of delivery phase is safe. |
| Disk fills from session event logs | Low | Daily cron cleans logs older than 90 days. Disk usage alert at 80%. |
| Anthropic API outage | Low | Session creation retries with backoff. If still failing, Slack alert + ClickUp task stays in "Ready"; a human re-triggers when the API is back up. |
| Breaking change in Managed Agents beta | Low-Med | Pin SDK version. Subscribe to Anthropic release notes. Integration tests run weekly to catch drift. |
| Adding a second agent reveals that config schema is insufficient | Med | Design schema from Sentinel + at least think through what Content Catalyst will need. Explicit v1.0 version field on config so we can handle migrations. |
| Trung as single owner (bus factor) | Med | README must be detailed enough that a competent dev could take over. Walk Jake through the codebase once at Sprint 1 end. |

21 Appendix — reference code

A1. package.json (minimal)

{
  "name": "seonav-orchestrator",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/server.js",
    "dev": "tsx watch src/server.ts",
    "test": "vitest",
    "migrate": "node dist/migrate.js"
  },
  "dependencies": {
    "@anthropic-ai/sdk": "latest",
    "express": "^4.21.0",
    "better-sqlite3": "^11",
    "pino": "^9",
    "dotenv": "^16",
    "zod": "^3",
    "commander": "^12",
    "googleapis": "^140",
    "@slack/web-api": "^7"
  },
  "devDependencies": {
    "typescript": "^5",
    "tsx": "^4",
    "vitest": "^1",
    "@types/node": "^20",
    "@types/express": "^4",
    "@types/better-sqlite3": "^7"
  }
}

Notes: zod is for config schema validation, commander for CLI parsing, googleapis for Drive. package.json must be comment-free JSON, so these notes live outside the file. Replace "latest" with an exact @anthropic-ai/sdk version before deploying, per the SDK pinning mitigation in Section 20.

A2. SQLite schema

-- migrations/001_initial_schema.sql

CREATE TABLE runs (
  session_id      TEXT PRIMARY KEY,
  agent_name      TEXT NOT NULL,
  trigger_source  TEXT NOT NULL,     -- 'webhook' | 'cli' | 'slack' | 'cron'
  client_id       TEXT,
  payload_hash    TEXT,
  started_at      INTEGER NOT NULL,  -- unix ms
  ended_at        INTEGER,
  status          TEXT NOT NULL,     -- see state machine
  total_cost_usd  REAL,
  error_message   TEXT,
  metadata_json   TEXT
);

CREATE INDEX idx_runs_agent_started ON runs(agent_name, started_at DESC);
CREATE INDEX idx_runs_client ON runs(client_id);

CREATE TABLE idempotency_keys (
  key         TEXT PRIMARY KEY,
  session_id  TEXT,
  created_at  INTEGER NOT NULL
);

-- TTL cleanup: DELETE FROM idempotency_keys WHERE created_at < ? (24h ago)
-- Run daily via cron.

CREATE TABLE delivery_attempts (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id    TEXT NOT NULL,
  channel       TEXT NOT NULL,     -- 'slack' | 'clickup' | 'drive'
  status        TEXT NOT NULL,     -- 'success' | 'failed'
  response      TEXT,
  attempted_at  INTEGER NOT NULL,
  FOREIGN KEY (session_id) REFERENCES runs(session_id)
);

A3. Health endpoint

// src/server.ts (excerpt)
app.get('/health', (_, res) => {
  res.json({
    status: 'ok',
    uptime: process.uptime(),
    version: pkg.version,
    agents_loaded: Object.keys(loadedConfigs),
    db_ok: db.prepare('SELECT 1').get() !== undefined
  });
});