Nyx | Overview

Reliability Overhaul · 2026-06-10

Where the rebuild stands today.

A multi-phase overhaul to make Nyx reliable enough to safely build itself. Each phase is verified (typecheck + tests green) and committed on the overhaul branch. Full context: the llms.txt and the rebuild report.

Code gate (build + test before done)

The keystone. Runs the repo's build (typecheck) and tests in the worker's cwd and re-pends the item if either is red, before any browser verify. Catches the compile-error class of bug that used to ship broken code unnoticed.

Live

P5. Verify gate (per queue item)

A successful worker no longer closes its item on exit; it must clear verification first. A pass closes the item; a fail re-pends it for an informed retry. Default-on.

Live
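The two gates above compose into a single decision when a worker exits. The following is a minimal sketch of that ordering, not the Nyx implementation: `settleItem`, `GateChecks`, and the status strings are hypothetical names, and the three checks are injected as plain functions where the real system would shell out to the repo's build, tests, and verification.

```typescript
// Hypothetical shapes: none of these names come from the Nyx codebase.
type ItemStatus = "done" | "pending";

interface GateChecks {
  build: () => boolean;  // e.g. the typecheck exits 0
  tests: () => boolean;  // e.g. the test runner exits 0
  verify: () => boolean; // e.g. browser verification passes
}

interface GateResult {
  status: ItemStatus;
  reason: string; // kept with the item so the retry is informed
}

// The code gate runs first; verification only runs on green code.
function settleItem(checks: GateChecks): GateResult {
  if (!checks.build()) return { status: "pending", reason: "build failed" };
  if (!checks.tests()) return { status: "pending", reason: "tests failed" };
  if (!checks.verify()) return { status: "pending", reason: "verification failed" };
  return { status: "done", reason: "all gates passed" };
}
```

Ordering the checks this way means a compile error re-pends the item without paying for a browser run.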

P4. Eyes visual-testing MCP

Headless, deterministic screenshot / responsive / flow / audit / layout tools that return real images. Replaced the flaky stock Playwright MCP; the planner no longer recommends playwright.

Live

P3. Automatic concierge triage

The concierge classifies each message and routes it: answer, one worker, plan, or plan-on-crack, with no manual slash command.

Live

P2. Efficiency

Enabled prompt caching, killed the per-turn auto-memory worker, right-sized the planner off the priciest default, skipped the synthesizer on single-subtask plans, and added a fan-out cache stagger.

Live

P1 + P10. Deep Research + keychain

The Deep Research open-report button is fixed, and secrets moved off the macOS Keychain to an owner-only file vault, so the permission prompt never appears again.

Live

P6. Moderator supervisor layer

Watches workers in flight, detects drift (stall / livelock / blocked), and resolves it autonomously without ever blocking the operator. Module built + tested; not yet wired into the fanout.

Module ready
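To make the three drift kinds concrete, here is a rough sketch of how a supervisor might classify a worker from its recent events. The source names only stall, livelock, and blocked; the event shape, the thresholds, and the `detectDrift` function are all assumptions made for illustration.

```typescript
// Hypothetical event shape and thresholds; only the drift names come from the text.
type Drift = "stall" | "livelock" | "blocked" | "none";

interface WorkerEvent {
  atMs: number;          // timestamp of the event
  tool: string;          // tool the worker invoked
  input: string;         // serialized tool input
  awaitingGate: boolean; // true while the worker waits on an approval
}

function detectDrift(
  events: WorkerEvent[],
  nowMs: number,
  stallMs = 120000,
  repeatLimit = 5,
): Drift {
  if (events.length === 0) return "none";
  const last = events[events.length - 1];
  if (last.awaitingGate) return "blocked";
  if (nowMs - last.atMs > stallMs) return "stall";
  // Livelock: the same call repeated back-to-back without progress.
  const tail = events.slice(-repeatLimit);
  const looping =
    tail.length === repeatLimit &&
    tail.every((e) => e.tool === last.tool && e.input === last.input);
  return looping ? "livelock" : "none";
}
```

A real detector would likely weigh more signals (token spend, diff size, repeated errors); this version only shows that the three states can be told apart from the event stream alone.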

P7. Context contract

Structured briefs + a live context store so a worker deep in the chain keeps the goal and constraints. Module seeded; wiring (adding the worker prefix, replacing the boot-frozen snapshot, lifting the memory cap) is next.

Partial

Per-plan isolation + Company mode

Worktree-per-plan to stop contamination, then parallel moderators with role specialization (build / market / support / analyse), an objective decomposer, and goal-driven autonomy.

Memory Architecture

What we borrowed from Hermes, what we rebuilt for developers.

Nous Research's Hermes Agent shipped the best agent-memory design we have seen: bounded, frozen mid-session, prefix-cache friendly, and curated by the agent itself. It works beautifully for single-developer, single-machine, single-project use. We kept the parts that work, fixed the layers that break the moment you have more than one repo or more than one machine, and added two layers specific to coding agents.

Hermes strength we kept

Bounded core memory, frozen snapshot.

MEMORY.md (2,200 char cap) and USER.md (1,375 char cap) load into the system prompt at session start and stay immutable until the next session. Total persistent overhead under 1,300 tokens. This is what makes prefix caching work and what forces curation instead of context-stuffing.
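A small sketch of the frozen-snapshot idea, under stated assumptions: the two caps come from the text, while `takeSnapshot`, the prefix layout, and the 4-characters-per-token estimate are illustrative (the estimate is a common rule of thumb, not a tokenizer).

```typescript
// Character caps quoted in the text.
const CAPS = { memory: 2200, user: 1375 } as const;

interface CoreSnapshot {
  readonly prefix: string; // injected into the system prompt verbatim
  readonly chars: number;
}

// Build the prompt prefix once at session start and freeze it.
// Later memory writes go to disk and only show up in the next session.
function takeSnapshot(memoryMd: string, userMd: string): CoreSnapshot {
  if (memoryMd.length > CAPS.memory) throw new Error("MEMORY.md over cap");
  if (userMd.length > CAPS.user) throw new Error("USER.md over cap");
  const prefix = `# MEMORY\n${memoryMd}\n# USER\n${userMd}`;
  return Object.freeze({ prefix, chars: prefix.length });
}

// Rough budget check: 3,575 characters at about 4 characters per token.
const worstCaseTokens = Math.ceil((CAPS.memory + CAPS.user) / 4);
```

Because the prefix is byte-identical for the whole session, every turn can reuse the same cached prompt prefix; that is the property the caps and the freeze are protecting.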

Hermes weaknesses we fixed

Per-repo memory + multi-machine sync.

Hermes stores memory globally under ~/.hermes/, so two projects compete for the same 2,200-character budget and entries from one repo leak into another. We add PROJECT.md, scoped per repo (it interoperates with an existing CLAUDE.md). Multi-machine sync ships in Pro Cloud; team namespacing ships in the Team tier.

Five-layer memory stack

1. Core Memory (frozen snapshot)

MEMORY.md + USER.md, hard character caps, loaded once per session and never re-fetched. Prefix-cache friendly. Identical to the Hermes design, kept verbatim because it is correct.

Live

2. Project Memory (per-repo)

PROJECT.md plus any existing CLAUDE.md, auto-loaded when a worker enters a repo. Scoped to the project root so 20 projects can each have their own budget without competing for one global cap.

Live

3. Session Memory (durable, searchable)

Every conversation in SQLite. Not in the system prompt. Searched on demand. Long sessions never bloat context, but nothing is forgotten. FTS5 index is the next slice (item 23).

Partial

4. Skill Memory (self-authored procedures)

Skills Librarian ships with ~/.orchestrator/skills/<name>.md + mcp__orchestrator__skill (list/get). Workers see all skill names in their prompt prefix and pull bodies on demand. The next slice (item 19): a reflection step where successful workers propose skill candidates, promoted on operator approval.

Live

5. Multi-Machine Sync

Memory files sync across operator devices via an encrypted server. Solves the Hermes single-machine constraint. Atlas (the Pro hosting product) is the planned home. Team tier adds per-user / per-team namespacing on top.

Pro Cloud

+ Soul file (identity, not memory)

SOUL.md defines the Moderator's personality, risk tolerance, escalation rules, and communication style. Loaded at orchestrator boot, never modified per-session. Prevents behavioral drift across hundreds of runs.

Future

The tool surface (Hermes-shaped, dev-extended)

memory(action, scope, content) with actions add / replace / remove and scopes core / user / project. Substring matching for entries, same as Hermes. No read action because core memory is auto-injected.

session_search(query) for FTS over past sessions. Returns ranked snippets, not full transcripts. Cheap to call.

skill(action, name) for procedural memory. save captures a successful sequence; invoke replays it; list shows what is available.

Capacity warnings fire at 80% fullness on any bounded layer, prompting consolidation. Decay archives entries unused after 30 sessions. Per-repo memory archives with the repo when the project is removed, so old context never haunts new work.
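The memory tool's behavior (add / replace / remove, substring matching, hard caps, and the 80% warning) can be sketched as follows. This is an illustration, not the shipped tool: the text gives the signature as memory(action, scope, content), so the extra `match` argument used here to select an entry is an assumption, as is the cap for the project scope.

```typescript
type MemAction = "add" | "replace" | "remove";
type MemScope = "core" | "user" | "project";

// core and user caps come from the text; the project cap is a placeholder assumption.
const SCOPE_CAPS: Record<MemScope, number> = { core: 2200, user: 1375, project: 2200 };

interface MemResult {
  ok: boolean;
  warning?: string;
  error?: string;
}

class BoundedMemory {
  private entries: Record<MemScope, string[]> = { core: [], user: [], project: [] };

  // `match` (hypothetical) is a substring selecting the entry to replace or remove.
  apply(action: MemAction, scope: MemScope, content: string, match = ""): MemResult {
    const next = [...this.entries[scope]];
    if (action === "add") {
      next.push(content);
    } else {
      const idx = match === "" ? -1 : next.findIndex((e) => e.includes(match));
      if (idx === -1) return { ok: false, error: "no entry matches" };
      if (action === "remove") next.splice(idx, 1);
      else next[idx] = content;
    }
    const used = next.join("\n").length;
    const cap = SCOPE_CAPS[scope];
    // Reject the write outright rather than truncate: the cap forces curation.
    if (used > cap) return { ok: false, error: `over cap (${used}/${cap})` };
    this.entries[scope] = next;
    return used >= cap * 0.8
      ? { ok: true, warning: `${scope} at ${Math.round((used / cap) * 100)}%: consolidate` }
      : { ok: true };
  }
}
```

The write is computed on a copy and committed only if it fits, so a rejected add leaves the stored entries untouched.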

Architecture

How the pieces fit together.

A Hono HTTP+WebSocket server (the Moderator) owns a SQLite database and a pool of spawned Claude Code subprocesses (the Workers). Workers receive a permission MCP config that round-trips every sensitive tool call back to the Moderator for the Push Gate to evaluate. The operator watches everything live through any of three surfaces: a WebSocket-backed dashboard; a terminal TUI rebuilt in the Claude Code style with a grouped queue, a dynamic worker grid, and streamed markdown; or a phone, via the Discord bot and the ntfy notify daemon.

On top of that core, an autonomous conduction layer now runs continuously. Autopilot pulls work off the queue and dispatches it without an operator present; the Auto-Merge Daemon sweeps finished nyx-self and nyx-feat branches, runs the test suite, asks the Critic for a verdict, and merges only what passes; Sleep Mode lets the operator hand the orchestrator full autonomy for a fixed window. A Concierge backend answers operator questions with a persistent memory module over ~/.nyx/memory and full conversation history, and exposes that memory to workers as the nyx MCP server alongside a Codex MCP for image and fast-code tasks. A Telemetry surface reports tokens, shipped lines, mean time between failures, and worker success rate. Supervisor scripts (a launchd plist and a bash watchdog) keep the Moderator alive across crashes, and a one-command launcher starts, stops, and inspects the whole stack. Team-scoped audit tables are scaffolded for the multi-seat tier.

solid = shipped · dashed = planned · blue arrows = control flow between processes

Worked example: operator queues a feature

Trace a single request from the moment the operator types it through to a merged commit and a phone notification. Every step is live code today.

request trace

step 1 — operator

POST /queue { "title": "add rate-limit guard", "enrich": true }

Queue stores the item as pending. With enrich: true, a low-effort Sonnet pass expands the title into a full spec (goal, files to touch, test requirements, branch name) stored in notes. The item gets a priority and lands in the queue table in SQLite.

step 2 — autopilot tick (60s)

QueueAutopilot polls: concurrent < max, item priority < 9999, status = pending.

Picks the highest-priority pending item. Calls manager.start() with the enriched spec as the prompt. Updates queue row to in_progress. WebSocket hub broadcasts autopilot_dispatched to all connected dashboards. Operator's TUI shows the item move from WAITING to WORKING.
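The selection rule in this step can be sketched as a pure function. The three conditions (concurrency below the max, priority below 9999, status pending) come from the text; the row shape, the `pickNext` name, and the reading that a lower number means more urgent (with 9999 acting as a parked sentinel) are assumptions.

```typescript
interface QueueRow {
  id: number;
  title: string;
  priority: number; // assumption: lower number = more urgent; 9999 = parked
  status: "pending" | "in_progress" | "done";
}

const PARKED_PRIORITY = 9999;

// Returns the row the autopilot would dispatch on this tick, if any.
function pickNext(
  rows: QueueRow[],
  running: number,
  maxConcurrent: number,
): QueueRow | undefined {
  if (running >= maxConcurrent) return undefined;
  return rows
    .filter((r) => r.status === "pending" && r.priority < PARKED_PRIORITY)
    .sort((a, b) => a.priority - b.priority || a.id - b.id)[0]; // ties go to the older row
}
```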

step 3 — worker spawns

WorkerManager forks: claude --print --mcp-config /tmp/nyx-mcp.json --model claude-sonnet-4-6

The MCP config points the worker's permission requests back to POST /internal/can-use-tool. Worker creates branch nyx-feat/rate-limit-guard, reads affected files, writes tests first, implements, commits. Every tool call result streams to the event table. Watchdog resets the idle clock on each event.

step 4 — push gate fires

Worker calls git push origin nyx-feat/rate-limit-guard

Push is a Tier 2 tool. The matcher checks the command against blacklist.yml; it is not on the blacklist, so it proceeds. The hardcoded floor (floor.ts) does not trigger either, because git push isn't hardcoded-blocked. Push succeeds. If the operator had set a blacklist rule for "push to protected branch", the gate would post a gate_pending event and block until the operator approves or denies via the dashboard.

step 5 — worker exits, auto-merge daemon picks up

Session status flips to succeeded. Auto-Merge Daemon's 60s scan sees nyx-feat/rate-limit-guard ahead of main.

Daemon runs the test suite. If tests pass, it asks the Critic worker to review the diff. Critic responds with approve/reject + comments. If approved, Daemon fast-forward merges (or no-ff if ff is impossible). Never pushes; operator pushes manually after audit. Queue row flips to done. QueueAutoSync detects the merge and syncs the status.
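The daemon's decision in this step reduces to a small pure function. The sketch below is illustrative only: `planMerge`, `BranchState`, and the action names are hypothetical, and the Critic's verdict is collapsed to a boolean for brevity.

```typescript
type MergeAction = "merge-ff" | "merge-no-ff" | "hold";

interface BranchState {
  testsPassed: boolean;
  criticApproved: boolean;
  canFastForward: boolean;
}

interface MergePlan {
  action: MergeAction;
  push: false; // the daemon never pushes; the operator does that after audit
  reason: string;
}

function planMerge(s: BranchState): MergePlan {
  if (!s.testsPassed) return { action: "hold", push: false, reason: "tests failed" };
  if (!s.criticApproved) return { action: "hold", push: false, reason: "critic did not approve" };
  return {
    action: s.canFastForward ? "merge-ff" : "merge-no-ff",
    push: false,
    reason: "tests green and critic approved",
  };
}
```

Typing `push` as the literal `false` makes "never pushes" a compile-time property of the plan rather than a convention.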

step 6 — notify

ntfy daemon sees the succeeded event. POST to ntfy.sh/topic.

Operator's phone buzzes. Title: "rate-limit-guard succeeded". Body: tokens used, session id, merge status. If Sleep Mode is on, all of steps 2-6 happened without the operator touching anything.
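For reference, ntfy accepts a plain HTTP POST to the topic URL, with the message as the body and the title in a `Title` header. The sketch below builds such a request from a session summary; `SessionSummary`, `buildNtfyRequest`, and the body layout are assumptions, not the daemon's actual code.

```typescript
// Hypothetical summary shape; the fields mirror what the text says the body contains.
interface SessionSummary {
  slug: string;
  tokens: number;
  sessionId: string;
  merged: boolean;
}

interface NtfyRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildNtfyRequest(topic: string, s: SessionSummary): NtfyRequest {
  return {
    url: `https://ntfy.sh/${encodeURIComponent(topic)}`,
    method: "POST",
    headers: { Title: `${s.slug} succeeded` },
    body: [
      `tokens: ${s.tokens}`,
      `session: ${s.sessionId}`,
      `merge: ${s.merged ? "merged" : "not merged"}`,
    ].join("\n"),
  };
}

// Sending it (not executed here):
//   const r = buildNtfyRequest("my-topic", summary);
//   await fetch(r.url, { method: r.method, headers: r.headers, body: r.body });
```

Keeping request construction separate from the network call makes the notification format testable without hitting ntfy.sh.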

Final state, end to end

What a single request looks like when every phase is shipped. Left to right: a trigger fires (from any channel), the router classifies it, the Moderator clarifies and decomposes against live memory, workers execute in parallel across providers, reviewers verify, and the Memory Keeper persists everything that should outlive the session. Always-on processes (Watchdog, Scheduler, Autopilot) inject new triggers back into the flow without operator action.

all components active · flow proceeds left to right per request · always-on layer re-enters at the leftmost column

Workflow

What happens when you type "ship this fix" from your phone.

Operator sends intent

Vague natural-language request from dashboard chat, Discord DM, or HTTP API. No need to specify files, commands, or steps.

POST /sessions | ws: chat_send | discord webhook (planned)

Moderator clarifies (front-loaded), then decomposes

Reads the intent against repo context (CLAUDE.md, recent commits, memory layers). Asks any clarifying questions upfront, especially around UI scope, ambiguous goals, or unfamiliar repos. Once aligned, goes silent: produces a task list and decides one worker or several. No further interruptions until done or genuinely blocked.

opus-4-7 xhigh effort | intake -> aligned -> running

Workers spawn in parallel

Worker Manager spawns one Claude Code subprocess per task with a per-worker MCP config baking in the session and agent IDs. Output streams as JSON over stdout, parsed line-by-line.

child_process.spawn | stream-json | cap 8 concurrent
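Line-by-line parsing of a JSON stream needs a buffer, because stdout chunks do not respect line boundaries. A minimal sketch of that buffering, with a hypothetical `createLineParser` helper that is independent of how the subprocess is spawned:

```typescript
// Buffers stdout chunks and emits one parsed object per complete line.
function createLineParser(onMessage: (msg: unknown) => void): (chunk: string) => void {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        onMessage(JSON.parse(trimmed));
      } catch {
        // Skip lines that are not valid JSON rather than crash the stream.
      }
    }
  };
}
```

With Node's child_process, this would typically be wired as `child.stdout.setEncoding("utf8")` followed by `child.stdout.on("data", feed)`, where `feed` is the function returned above; setting the encoding avoids splitting multi-byte characters across chunks.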

Worker hits a sensitive operation

A worker tries to run Bash, Write, Edit, or WebFetch, or to Read a sensitive path. The permission MCP intercepts the call and POSTs it to /internal/can-use-tool on the Moderator.

mcp__orchestrator__can_use_tool | permission-prompt-tool flag

Push Gate matches against your blacklist

Matcher checks the operation against your blacklist.yml and the hardcoded floor. Most things pass through instantly. Only paths or commands that match a blacklist pattern are queued in push_gate_queue, which blocks the worker until a decision arrives. A WebSocket event fires to your dashboard or phone.

push-gate/matcher.ts | blacklist.yml | ws topic: gate

Operator approves from any device

Dashboard shows a toast notification with the exact command and rule that matched. Approve or deny with one tap. Decision is written to DB, signal sent back to the waiting worker.

ws: gate_decide | decideGate(db, queueId, "approved" | "denied", "user")
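The block-then-resume handshake can be sketched with a promise per queued request. This is an in-memory illustration only: the `GateQueue` class is hypothetical, whereas the text's decideGate writes the decision to the database and records who decided.

```typescript
type GateDecision = "approved" | "denied";

class GateQueue {
  private nextId = 1;
  private waiting = new Map<number, (d: GateDecision) => void>();

  // Called when a tool call matches the blacklist; the worker side awaits `decision`.
  enqueue(): { queueId: number; decision: Promise<GateDecision> } {
    const queueId = this.nextId++;
    const decision = new Promise<GateDecision>((resolve) => {
      this.waiting.set(queueId, resolve);
    });
    return { queueId, decision };
  }

  // Called from the operator's approve/deny tap.
  // Returns false for unknown or already-decided ids, so a double tap is harmless.
  decide(queueId: number, d: GateDecision): boolean {
    const resolve = this.waiting.get(queueId);
    if (!resolve) return false;
    this.waiting.delete(queueId);
    resolve(d);
    return true;
  }

  pendingCount(): number {
    return this.waiting.size;
  }
}
```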

Worker continues, Critic reviews (Phase 3+)

Approved tool call returns "allow" and the worker proceeds. When the worker finishes, a Critic agent reviews the diff (and for UI work, a Vision Critic screenshots the result and iterates).

phase 3 roadmap | opus-4-7 with vision input

Moderator reports back, Watchdog keeps an eye

Session status flips to succeeded. Dashboard updates live. Watchdog continues monitoring for runtime errors, CI failures, or drift signals that warrant a new task.

phase 6 roadmap | haiku-4-5 (cheap always-on)

Push Gate

Permissive by default. Blacklist what you don't want touched.

The Push Gate is not a deny-by-default trap door. The default is permissive: the agent can edit files, run shell commands, fetch the web, and use the system the way you would. You maintain a small blacklist of paths and command patterns that always require your approval, and there is a hardcoded floor of true-secret paths that nothing can override regardless of your config. Everything else runs at full speed without round-tripping for approval.

Tier 1 // Always free

Read and discovery tools

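No round trip, no config, no questions, except when the target falls under the Tier 3 hardcoded floor (reading ~/.ssh/id_rsa is still blocked). Outside that floor, these tools cannot modify state or exfiltrate secrets.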

Glob · Grep · Read · NotebookRead · WebSearch · WebFetch · Task

Tier 2 // Free unless blacklisted

Modify the system

Run by default. Round trips only when the target matches a pattern in your blacklist.yml.

Bash · Write · Edit · NotebookEdit · MCP tool calls

Tier 3 // Hardcoded block

True secrets and destructive ops

Always gated. Cannot be overridden by config. Protects you from your own typos.

~/.ssh/** · ~/.aws/credentials · ~/.gnupg/** · sudo * · rm -rf / · curl | sh

Your blacklist.yml (you own this)

Lives at ~/.orchestrator/blacklist.yml globally, with per-repo overrides at <repo>/.orchestrator/blacklist.yml. Pattern syntax is glob for paths, shell-style for bash. Anything not listed runs without prompting.

# ~/.orchestrator/blacklist.yml
# Tier 2 operations are free by default. List patterns here that
# should ALWAYS require your approval, even though the file or
# command would otherwise be permitted.

paths:
  # File writes / edits that need approval
  - "**/.env*"
  - "**/.git/config"
  - "**/package.json"            # prevent silent dep bumps
  - "~/personal-notes/**"      # private journal
  - "~/projects/fbm-sniper/pro/src/license/**"

bash:
  # Commands that need approval
  - "git push *"                 # memory:no_push_default
  - "gh release create *"        # memory:no_public_release
  - "gh pr create *"
  - "fly deploy *"
  - "npm publish *"
  - "pnpm publish *"
  - "docker push *"

mcp:
  # Specific MCP tool calls that need approval
  - "mcp__*__delete_*"
  - "mcp__github__merge_pull_request"

Tier 3 is enforced regardless. Even if you accidentally blank your blacklist.yml, the agent still cannot read ~/.ssh/id_rsa, run sudo, or pipe a downloaded script into a shell (curl | sh). The hardcoded floor exists so a misconfiguration cannot become a security incident.
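The layering described in this section (hardcoded floor first, then your blacklist, then allow by default) can be sketched as below. Several things here are assumptions rather than the shipped matcher: the glob translation is deliberately simplified, paths are assumed to be normalized so the home directory appears as `~`, the floor is treated as a hard block, and the `curl | sh` shorthand is expanded to the pattern `curl * | sh`.

```typescript
type GateOp = { kind: "path" | "bash"; target: string };

interface RuleSet {
  paths: string[];
  bash: string[];
}

interface GateVerdict {
  outcome: "allow" | "ask" | "block";
  layer: "floor" | "blacklist" | "default";
  rule?: string; // the pattern that matched, for the audit log
}

// Floor patterns adapted from the Tier 3 list above.
const FLOOR: RuleSet = {
  paths: ["~/.ssh/**", "~/.aws/credentials", "~/.gnupg/**"],
  bash: ["sudo *", "rm -rf /", "curl * | sh"],
};

// Simplified glob: `**` matches anything; `*` stops at "/" for paths only.
function toRegExp(glob: string, starCrossesSlash: boolean): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const single = starCrossesSlash ? ".*" : "[^/]*";
  const body = escaped
    .split("**")
    .map((part) => part.split("*").join(single).split("?").join("."))
    .join(".*");
  return new RegExp(`^${body}$`);
}

function firstMatch(op: GateOp, rules: RuleSet): string | undefined {
  const patterns = op.kind === "path" ? rules.paths : rules.bash;
  return patterns.find((p) => toRegExp(p, op.kind === "bash").test(op.target));
}

function evaluate(op: GateOp, blacklist: RuleSet): GateVerdict {
  const floorRule = firstMatch(op, FLOOR);
  if (floorRule) return { outcome: "block", layer: "floor", rule: floorRule };
  const listRule = firstMatch(op, blacklist);
  if (listRule) return { outcome: "ask", layer: "blacklist", rule: listRule };
  return { outcome: "allow", layer: "default" };
}
```

Because the floor is checked before the operator's config is even consulted, an empty or malformed blacklist can only widen the "allow" set down to the floor, never past it.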

Roles

One brain, many hands.

The system is a small society of specialized agents. Each has a different model, effort level, and tool budget. Cheap models do repetitive work. Opus 4.7 does the thinking. Vision-capable models do the seeing.

Moderator

Two-phase. First clarifies: asks any questions it needs upfront (especially around UI scope, ambiguous goals, or unfamiliar repos). Then runs: dispatches workers and goes autonomous until done or genuinely blocked. Owns the session lifecycle.

Live

Worker

Spawned Claude Code or Codex subprocess that performs the actual file edits, commands, and tests.

Live

Push Gate

Matcher plus human-approval queue. Sits between every worker and every destructive operation.

Live

Critic

Reviews completed worker output, requests revisions, and blocks the session from being marked done if quality fails.

Phase 3

Vision Critic

For UI work. Screenshots the dev server, compares against intent, asks the worker to iterate until visually correct.

Phase 3

Git Steward

Manages git worktrees so multiple workers can edit different branches simultaneously without stepping on each other.

Phase 4

Debugger

Triages errors from webhooks, CI runs, and runtime crashes. Decides whether to auto-fix, escalate, or ignore.

Phase 6

Memory Keeper

Owns the five-layer memory stack. Decides what gets written to MEMORY.md vs PROJECT.md, enforces character caps, triggers consolidation at 80% fullness, archives unused entries after 30 sessions.

Phase 6

Skills Librarian

Captures successful tool-call sequences as named, replayable skills. Inspired by Hermes Skills but auto-extracted from successful sessions instead of hand-authored.

Phase 6

Scheduler (Crons)

Cron-style proactive execution. "Every morning at 8am, check FBM Sniper CI status." "Every hour, scan error webhooks." Shifts the system from reactive to proactive. OpenClaw / Hermes pattern.

Phase 2.5

Channel Router

Routes operator input from any source (dashboard chat, Discord DM, email, webhook, voice) to the right session. One agent, many inboxes. OpenClaw pattern, adapted for developers.

Phase 5

Watchdog

Always-on monitor. Detects stuck workers, drift in agent behavior, and external events that warrant a new session.

Phase 6

Self-Editor

Reads the orchestrator's own codebase, proposes improvements, runs them through the same safety gates as any other worker. Includes Hermes-style self-improvement: learns from past corrections.

Phase 7

Roadmap

Eight phases shipped; routing layer now hybrid (LLM + deterministic).

Minimum Viable Moderator

Single worker, stream-JSON parsing, SQLite persistence, basic HTTP API and SSE. Foundation everything else builds on.

Shipped

02A

Backend: Push Gate + Multi-Agent + WebSocket

Permission MCP, matcher with sensitive-path detection, parallel workers with concurrency cap, WS hub with topic subscriptions, dev-server detection, per-worker MCP config files.

Shipped

02B

Dashboard

Vite + React 19 + Tailwind v4 + Sonner, mac chrome, multi-tab agents, collapsible chat, layout modes, settings modal, browser notifications, WS auto-reconnect, push-gate bar, audit-log drawer.

Shipped

Trust: Tier-1 floor + blacklist + audit

Hardcoded safety floor (cannot be overridden by any operator config). blacklist.yml hot-reload. Three-layer matcher with layer/rule reported on every decision. Universal audit log on every gate verdict.

Shipped

2.5

Error Ingestion + Scheduled Jobs

Webhook receiver for CI failures and runtime errors. Cron-style scheduled sessions with @every / @hourly / @daily patterns. Repo-specific priming (auto-load CLAUDE.md from any project root). FBM Sniper SRE use case unlocked.

Shipped

03a

Critic (code review agent)

POST /sessions/:id/review spawns a short-lived claude process with a strict reviewer prompt. Verdict: approve / revise / abstain, plus a critique of up to 300 words. 90-second timeout. Manual-trigger first; auto-trigger on session.succeeded is a config flag.

Shipped

03b

Vision Critic (screenshot loop)

Screenshots the dev server, compares against intent, asks worker to iterate until visually correct. Needs Playwright MCP installed. Earned its slot only after the dashboard work demonstrated the cost of typo-driven UI churn.

Planned

Multi-page Fan-out + Git Steward

Git worktrees per worker so parallel agents can edit independent branches. Steward role coordinates merges and resolves conflicts. Low priority for backend-bot work; high impact for frontend.

Planned

Remote Access + Discord

Cloudflare Tunnel for dashboard access from anywhere. Discord bot for phone-driven operator input. DM the bot anything: the message is forwarded to Nyx and the worker's reply comes back as a DM. One persistent agent per Discord user.

Shipped

Resilience: Memory + Skills + Watchdog

Hermes-style frozen-snapshot memory (MEMORY.md + USER.md + PROJECT.md). Skills librarian with mcp__orchestrator__skill tool. Always-on Watchdog detecting stuck workers. The "SRE for your bot" layer.

Shipped

Self-Editor v1 (narrow, signal-driven)

Auto-dispatches a fix worker scoped to the orchestrator repo on worker_failed or watchdog_stuck signals. Branch-isolated, no auto-merge, no auto-push, protected files enforced. First autonomous self-modification completed (a 104-line tuning note on a fresh nyx-self/ branch). Self-iter now runs on opus-4-8/medium for diagnostic quality after the overnight 499-fire dribble-livelock proved Sonnet/low was undercooked.

Shipped

Hybrid routing: deterministic code for shape-known work

Autopilot used to spawn a Claude subprocess for every dispatch. After a $325/hr spend spike caused by Opus sessions doing $5-worth of git-loop work, the system grew a routing layer. isFanoutItem() detects "cherry-pick" / "fanout" titled items and routes them to /ops/fanout (a pure-git loop) instead of spawning a worker. recordWorkerOutcome appends every exit's outcome to the originating queue notes so retries are informed. Boot-time queue reconciler closes orphan in_progress rows that the live wiring missed. Auto-merge daemon retries flaky test runs before blocking. Mobile-first /m dashboard with browser-side STT replaces the legacy claude --remote-control concierge as the default phone surface.

Shipped
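A sketch of the deterministic routing described in the hybrid-routing entry above. The function names isFanoutItem and recordWorkerOutcome and the title keywords come from the text; the signatures, the exact matching rule, the `RoutedItem` shape, and the note format are assumptions.

```typescript
// Hypothetical item shape; only `title` and `notes` matter for routing.
interface RoutedItem {
  title: string;
  notes: string;
}

type Route = "ops-fanout" | "worker";

// Keywords come from the text; whole-word, case-insensitive matching is an assumption.
function isFanoutItem(item: RoutedItem): boolean {
  return /\b(cherry-pick|fanout)\b/i.test(item.title);
}

// Shape-known work goes to the pure-git loop; everything else spawns a worker.
function routeItem(item: RoutedItem): Route {
  return isFanoutItem(item) ? "ops-fanout" : "worker";
}

// Appends an outcome line so the next attempt sees what happened before.
function recordWorkerOutcome(item: RoutedItem, outcome: string, atIso: string): RoutedItem {
  const line = `[${atIso}] outcome: ${outcome}`;
  return { ...item, notes: item.notes ? `${item.notes}\n${line}` : line };
}
```

The routing check is a plain string test, so it costs nothing per dispatch, which is the point of moving shape-known work off the LLM path.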

Comparison

How this differs from existing agentic tools.

Capability	This	Cursor	Devin	OpenHands	Aider	Hermes	OpenClaw
Multi-agent parallel	Yes	No	No	Limited	No	No	Sub-agents
Granular Push Gate	Yes	Per-tool	No	No	Per-edit	No	Plugin-level
Bounded frozen-snapshot memory	Yes	No	No	No	No	Yes (origin)	State store
Per-repo memory scoping	Yes	Project rules	No	Workspace	CONVENTIONS.md	Global only	Plugin state
Procedural skill capture	Yes	No	No	No	No	Hand-authored	Plugins
Scheduled / cron execution	Yes	No	No	No	No	Yes	Yes
Multi-channel input (Discord, email, webhook)	Planned	No	Slack	No	No	Limited	Yes (origin)
Always-on monitoring (Watchdog)	Yes	No	No	No	No	Crons only	Crons only
Phone / remote operator input	Yes	Cloud	Web	No	No	No	Telegram
BYO Claude Max / Pro (post-Apr 2026)	Yes (subprocess)	Yes	API only	Workarounds	Workarounds	API only	Blocked
Local-first (your machine)	Yes	Hybrid	Cloud	Yes	Yes	Yes	Yes
Source available	Yes (planned)	No	No	Yes	Yes	Yes	Yes
Per-role model selection	Yes	Per-mode	No	Per-session	Yes	Per-skill	Yes

"Origin" marks where an idea was pioneered. We borrow Hermes' frozen-snapshot memory and Skills design, and OpenClaw's multi-channel input and cron pattern, then extend each for the developer use case (per-repo scoping, multi-machine sync, official-CLI subprocess auth, push-gate safety).

Distribution Model

Tiered for teams. Free where it matters for adoption.

Community Edition runs single-machine with every role, the full Push Gate, Autopilot, Concierge, TUI, and the dashboard. Paid tiers add the infrastructure and collaboration features that engineering teams need.

Community Edition (free)

Everything that runs locally.

Moderator, all roles, multi-agent, Push Gate, dashboard, model-per-role configuration, Autopilot, Sleep Mode, Concierge, TUI, single-machine operation. Bring your own Claude Code subscription, no per-token billing.

Team + Enterprise (paid)

Infrastructure and control for teams.

Per-seat team licensing, custom-deployed for enterprise, multi-machine memory sync, encrypted audit log export, hosted Cloud Tunnel, priority support. BYO-Claude-Code-subscription is a first-class feature at every tier, no lock-in, no token markup.

Why we still work with Claude Pro / Max (post-April 2026)

In April 2026 Anthropic blocked Pro and Max OAuth tokens from working in third-party tools, breaking BYO-subscription auth for OpenClaw, custom harnesses, and most "Devin-but-mine" attempts. Those tools called the Anthropic API directly with the user's token, which Anthropic now refuses.

We do not call the API. We spawn the official claude CLI as a subprocess. The CLI is Anthropic's own client and retains full Pro / Max access. Our orchestrator never sees a token, never sends a request, never violates ToS. We just listen on the CLI's stdout and route its tool calls through the Push Gate.

Practical result: Claude Pro and Max users save real money. A subscription whose usage would cost hundreds in API-equivalent billing stays at a flat $20-200/mo. This is the single largest cost advantage we have over any post-April-2026 competitor that takes the API-key route.
