# Extension Debug Stream UI Plan (EE Live Debug Console for Runner-Based Extensions)

## Overview

Introduce a first-class "Extension Debug Console" in Alga PSA EE that enables extension authors and internal engineers to observe live stdout/stderr and structured logs for their extensions, scoped to specific extension installs and request flows.

The console will:

- Stream debug events (stdout, stderr, structured logs) in near real time via WebSockets or Server-Sent Events (SSE).
- Correlate events with:
  - `extension_id`, `tenant_id`, `install_id`
  - `request_id`
  - `content_hash` / `version_id`
- Respect multi-tenant boundaries, capabilities, and security requirements.
- Be fully enabled in dev/staging, and off by default in production (opt-in and time-boxed).

This plan builds on the Wasmtime/component-based runner and the existing extension metadata + capability model.

## Goals

- [x] Provide a dedicated EE UI page for extension debugging with live log streaming. *(UI at `server/src/app/msp/extensions/[id]/debug/page.tsx` consuming `/api/ext-debug/stream`.)*
- [x] Allow filtering by:
  - Specific request flow (`request_id`),
  - Extension/install,
  - Stream type (stdout, stderr, structured logs).

  *(Implemented via client-side filters in the debug page.)*
- [x] Implement a runner-side debug event pipeline that captures guest stdout/stderr and host logging events in a structured and correlatable way. *(Runner emits Redis Stream events when `RUNNER_DEBUG_REDIS_URL` is set; proxied by `server/src/app/api/ext-debug/stream/route.ts`.)*
- [ ] Enforce strong authorization and isolation: only appropriate users can see logs for a given tenant/extension.
- [ ] Gate the feature with environment flags and capabilities to avoid accidental leakage or resource abuse.

Status update (2025-11-21): Streaming path (Runner → Redis Streams → SSE endpoint → UI) works; authz/capability gating and production hardening remain to be delivered.
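
The client-side filtering named in the goals (request flow, extension/install, stream type) can be pictured as a pure predicate applied to each incoming event. The following is a minimal sketch, not the debug page's actual code: `DebugEventLike`, `DebugFilter`, and `matchesFilter` are hypothetical names, and the event shape is a reduced subset of the fields this plan describes.

```typescript
// Hypothetical sketch of client-side filtering; names are illustrative only.
type DebugStreamKind = 'stdout' | 'stderr' | 'log';

interface DebugEventLike {
  stream: DebugStreamKind;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  message: string;
}

interface DebugFilter {
  extensionId?: string;
  installId?: string;
  requestId?: string;
  // When set, only these stream types are shown.
  streams?: ReadonlySet<DebugStreamKind>;
}

// Returns true when the event passes every criterion that the filter sets.
// Unset criteria are ignored, so an empty filter matches everything.
function matchesFilter(event: DebugEventLike, filter: DebugFilter): boolean {
  if (filter.extensionId !== undefined && event.extensionId !== filter.extensionId) {
    return false;
  }
  if (filter.installId !== undefined && event.installId !== filter.installId) {
    return false;
  }
  if (filter.requestId !== undefined && event.requestId !== filter.requestId) {
    return false;
  }
  if (filter.streams !== undefined && !filter.streams.has(event.stream)) {
    return false;
  }
  return true;
}
```

Keeping the predicate free of UI state would make it easy to reuse if filtering later moves server-side, as the hardening phases suggest.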
## Non-Goals

- Full-blown distributed tracing across all platform components.
- Long-term persistent log storage or historical search UI.
- Arbitrary tailing of all runner logs for all tenants from EE.
- Overriding the structured provider-based logging model (this feature complements it).

## Architecture

### 1. Debug Event Model

Define a structured debug event that the runner produces for any debuggable signal (stdout/stderr lines, extension log calls, critical host events):

```ts
type ExtDebugEvent = {
  ts: string; // ISO 8601
  level: 'trace' | 'debug' | 'info' | 'warn' | 'error';
  stream: 'stdout' | 'stderr' | 'log';

  // Correlation metadata
  tenantId?: string;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  versionId?: string;
  contentHash?: string;

  // Raw or structured content
  message: string;
  fields?: Record<string, unknown>;

  // Safety/limits
  truncated?: boolean;
};
```

Key rules:

- Always include `extensionId` and `requestId` when available.
- Prefer including `tenantId` and `installId` for multi-tenant visibility and auth decisions.
- `message` is bounded in length; large payloads are truncated with `truncated=true`.
- No secrets: message content must not include decrypted secrets; rely on existing capabilities and filters.

### 2. Runner: Capturing stdout/stderr and Logs

Implement capture and routing inside the runner (Rust):

- Location:
  - [`ee/runner/src/engine/loader.rs`](ee/runner/src/engine/loader.rs)
  - [`ee/runner/src/engine/host_api.rs`](ee/runner/src/engine/host_api.rs) (for WIT logging interfaces)
  - New module: `ee/runner/src/engine/debug.rs` or `ee/runner/src/util/debug_stream.rs` for shared plumbing.

Core behaviors:

1. When instantiating a component for execution:
   - Initialize `HostExecutionContext` with:
     - `request_id`, `tenant_id`, `extension_id`, `install_id`, `version_id`, config, providers (already present conceptually).
   - Attach WASI stdout/stderr to custom sinks that:
     - Split by line or chunk.
     - Build `ExtDebugEvent` records with `stream: 'stdout' | 'stderr'`.
     - Dispatch to:
       - `tracing` (with target `ext.stdout` / `ext.stderr`),
       - The Redis publisher (see next section) when debug streaming is enabled.
2. For host-side WIT log functions (e.g. `alga.log` provider):
   - Generate `ExtDebugEvent` with `stream: 'log'` and appropriate `level`.
   - Dispatch similarly via `tracing` and the Redis publisher (when enabled).
3. Configuration:
   - Env flags:
     - `RUNNER_DEBUG_REDIS_URL`
     - `RUNNER_DEBUG_REDIS_STREAM_PREFIX`
     - `RUNNER_DEBUG_REDIS_MAXLEN`
     - `RUNNER_DEBUG_MAX_EVENT_BYTES` (per event cap)
   - Behavior when `RUNNER_DEBUG_REDIS_URL` is unset:
     - Continue emitting to `tracing` only (no debug stream fan-out).

### 3. Runner: Redis Debug Stream Publisher

Instead of an in-memory hub, the runner now serializes each `ExtDebugEvent` and appends it to a Redis Stream. Key points:

- Stream naming: `${RUNNER_DEBUG_REDIS_STREAM_PREFIX}{tenantId}:{extensionId}` (tenant falls back to `unknown` when unavailable).
- Command: `XADD <stream> MAXLEN ~ <maxlen> * field value ...` with a small bounded payload.
- Each message includes the fields consumed by EE (`ts`, `level`, `stream`, `tenant`, `extension`, `install`, `request`, `version`, `content_hash`, `message`, `truncated`).
- If Redis is down, we log and drop events (mirroring logs via `tracing` so operators can still inspect pod logs).
- Future back-pressure: consider a local ring buffer to avoid blocking extension execution if Redis is temporarily unavailable.

Security note:

- Redis credentials are provided via `RUNNER_DEBUG_REDIS_URL` (or a mounted secret). ACLs should scope the runner to `XADD` only for the debug keyspace.

### 4. EE Backend: WebSocket/SSE Proxy

Add an EE API endpoint that exposes a controlled live debug stream to authenticated users.

Suggested route:

- `ee/server/src/app/api/ext-debug/stream/route.ts` (Next.js App Router)
- URL example:
  - `/api/ext-debug/stream?extensionId=...&tenantId=...&installId=...&requestId=...`

Behavior:

1. Authentication:
   - Require standard session auth.
   - Confirm the user has one of:
     - Internal operator role, or
     - Tenant admin for `tenantId`, or
     - Extension owner / partner developer tied to the specified extension/install.
   - Deny if the user attempts to observe another tenant’s data.
2. Authorization:
   - Check:
     - The requested `extensionId` belongs to the caller’s accessible scope.
     - If `tenantId` is provided, it matches the caller’s tenant context (unless internal).
     - Optional: extension manifest/capabilities include something like `cap:debug.logs` or a server-side allowlist for debug streaming.
3. Subscription handshake:
   - On connection:
     - Build a subscription filter object:
       - Always include `extensionId`.
       - Include `tenantId` / `installId` if supplied.
       - Include `requestId` if provided for per-flow focus.
     - Call runner internal API or RPC:
       - e.g., `POST /internal/runner/debug/subscribe` with filter and a signed token,
       - Runner returns `debug_session_id`.
     - Start a streaming loop that:
       - Pulls `ExtDebugEvent` from the runner via one of the following, depending on infra:
         - a streaming HTTP endpoint,
         - a long-lived connection,
         - a broker / message bus.
       - Forwards events to the client via WebSockets or SSE.
4. Transport details:
   - Recommended for simplicity:
     - SSE for first implementation:
       - One-way stream, simple to proxy.
       - Events framed as `data: { ...ExtDebugEvent... }\n\n`.
     - WebSockets if bidirectional control is desired later:
       - e.g., changing filters, pausing, etc.
5. Limits and lifecycle:
   - Enforce:
     - Max session duration (e.g. 5–15 minutes; extendable).
   - Close stream when:
     - TTL exceeded,
     - User navigates away,
     - Runner cancels subscription.
   - Provide:
     - `x-debug-truncated: true` or event-level `truncated` when server-side limits are hit.
     - Clear documentation in UI when data may be incomplete.

### 5. EE UI: Extension Debug Console

Add a dedicated page that consumes the stream.

Suggested route:

- `/msp/extensions/[extensionId]/debug`
- For internal operators:
  - Additional entry: `/ee/extensions/[extensionId]/debug`

Features:

- Filters:
  - Extension (from URL).
  - Tenant/install (dropdown or inferred from context).
  - Request mode:
    - “All requests” for that extension/install.
    - “Specific request” by `requestId`.
- Stream viewer:
  - Connect/disconnect button.
  - Live log panel:
    - Color-coded:
      - stdout (neutral),
      - stderr (red),
      - structured logs (level-specific colors).
    - Shows timestamp and key metadata (tenant, install, request ID).
  - Controls:
    - Pause/resume auto-scroll.
    - Toggle stdout/stderr/log.
    - Clear buffer.
- DX helpers:
  - Show “How to correlate” help:
    - e.g., “Use `request_id` from extension errors or logs to narrow to a single flow.”
  - For dev:
    - Example snippet for extension authors:
      - `logInfo("debug marker: X")` usage,
      - explaining how it appears in the console.

### 6. Capabilities, Flags, and Safety

To avoid accidental misuse:

- Capability gating:
  - Optionally require a capability at install/manifest level:
    - `cap:debug.logs` or similar; when absent, EE refuses debug sessions for that extension except for privileged internal users.
- Environment flags (runner + EE):
  - `RUNNER_DEBUG_REDIS_URL`
  - `RUNNER_DEBUG_REDIS_STREAM_PREFIX`
  - EE-side:
    - `EXT_DEBUG_UI_ENABLED`
- Rate limiting:
  - EE API-level rate limits per user/tenant.
  - Runner-level caps on sessions and throughput.
- Data retention:
  - By design, this feature is for *live* debugging:
    - Buffers are short-lived.
    - Persistent historical logs remain in standard infra (e.g. Loki/ELK) under operator control.

### 7. Implementation Phases

#### Phase 1: Runner Event Capture

- [x] Implement `ExtDebugEvent` type and the Redis publisher in the runner.
  - Implemented in `ee/runner/src/engine/debug.rs` and `debug_redis.rs`.
- [x] Route:
  - WIT log provider calls to event producer.
    - Implemented in `ee/runner/src/engine/host_api.rs` to forward `log_info/log_warn/log_error` into Redis.
  - WASI stderr wired to event producer (initial implementation).
    - Implemented in `ee/runner/src/engine/loader.rs` via a custom `stderr` pipe that forwards guest stderr lines into Redis when enabled.
    - (Optional stdout mirroring remains off by default to avoid noise; can be added later if needed.)
- [ ] Add basic unit tests:
  - stdout/stderr captured and tagged with correct metadata.

#### Phase 2: Internal Streaming API (Legacy)

- [ ] (Deprecated) The original SSE endpoint at `/internal/ext-debug/stream` has been removed now that Redis fan-out is the canonical path.
- [x] Implement EE backend `/api/ext-debug/stream`:
  - Implemented at `server/src/app/api/ext-debug/stream/route.ts`:
    - AuthN + AuthZ via existing helpers.
    - Forwards `extensionId`/`tenantId`/`installId`/`requestId` filter to runner using `x-ext-debug-filter`.
    - Relays SSE stream response directly to clients.
- [ ] Add integration tests / local harness:
  - Fake extension emitting stdout/structured logs.
  - Confirm events appear via `/api/ext-debug/stream`.

#### Phase 3: EE Debug Console UI

- [x] Build `/msp/extensions/[extensionId]/debug` page:
  - Implemented at `server/src/app/msp/extensions/[extensionId]/debug/page.tsx`.
  - Connects to `/api/ext-debug/stream` using `EventSource`.
  - Supports filters for `tenantId`, `installId`, and `requestId`.
  - Renders a live console with:
    - stdout/stderr/log classification,
    - connection state,
    - auto-scroll toggle,
    - bounded history to cap memory use.
- [x] Add navigation entry points:
  - Implement by linking from the extensions settings UI at `/msp/settings?tab=extensions`:
    - For each extension row, add a "Debug Console" action targeting:
      - `/msp/extensions/{extensionId}/debug`
      - Optionally preserve `tenantId`/`installId` in query params.
  - This hooks the existing settings-based extensions screen (the canonical management surface) directly into the debug page for the selected extension.
- [x] Document how extension authors use the console:
  - Inline help on the debug page explains:
    - Required runner configuration (`RUNNER_DEBUG_REDIS_URL`, stream prefix, Redis ACL credentials).
    - Using structured logging helpers instead of printing secrets.
    - Using `x-request-id` / `context.request_id` and filters to follow specific request flows.

#### Phase 4: Hardening & Production Policy

- [ ] Add capability and tenant-scoped policy checks.
- [ ] Add robust truncation, redaction (optional regex-based guardrails), and audit logs:
  - Who opened debug sessions, for which extension/tenant, and when.
- [ ] Define environment policies:
  - Fully enabled in dev/staging.
  - In prod:
    - Off by default.
    - Can be enabled per tenant/extension with admin approval or for time-limited debugging.

#### Phase 5: Distributed Event Bus (Redis Streams)

_Motivation: In production, Knative fans requests across runner pods. A Redis-backed fan-out ensures the debug console aggregates logs across all pods and avoids the Kourier routing issues we hit with `runner.msp.svc.cluster.local`._

- [ ] Provision a Redis cluster/namespace dedicated to short-lived “debug events” with strong authentication and TTL defaults (e.g., 15 min retention).
- [ ] Define stream partitioning: e.g., `ext-debug:` or sharded by `tenant_id:extension_id`. Document key structure, retention policy, and serialization (JSON `ExtDebugEvent`).
- [ ] Extend the runner:
  - Add optional `RUNNER_DEBUG_REDIS_URL`, `RUNNER_DEBUG_REDIS_STREAM_PREFIX`, `RUNNER_DEBUG_REDIS_MAXLEN` (and future TLS/password flags) env vars so operators can point at the shared Redis cluster without changing code.
  - On each event, enqueue to Redis Streams (or Pub/Sub) with bounded async buffering; Redis replaces the in-memory `DebugHub` entirely.
  - Tag events with a monotonic sequence (`xadd` ID) to preserve ordering.
  - Include metrics + back-pressure handling (drop oldest events, emit warnings) when Redis is unavailable.
- [ ] Reuse existing Redis stream plumbing where possible:
  - `server/src/lib/eventBus/index.ts` already manages `XADD`/`XREADGROUP`, consumer groups, trimming, and retry logic.
  - `shared/workflow/streams/redisStreamClient.ts` shows how to wrap publish/read/ack helpers.
  - Mirror those patterns for debug streams (new `DebugStreamClient`) instead of reinventing connection management.
- [ ] Build a lightweight “debug-stream fan-out” worker (can live inside the EE server or as a sidecar) that tails Redis Streams via consumer groups, applies filters server-side, and relays to subscribers.
- [ ] Security: reuse existing `x-runner-auth` token for publishing auth, and create a dedicated Redis ACL role that only allows XADD/XLEN on the debug keys.

#### Phase 6: EE Proxy Migration to Redis Streams

- [ ] Update `/api/ext-debug/stream` so that, when the Redis-backed mode is enabled (`EXT_DEBUG_STREAM_MODE=redis`), it:
  - Validates the user/session as before.
  - Registers/updates a consumer group per `extensionId` (e.g., `ee-debug-ui`).
  - Issues `XREADGROUP` with filters (`tenantId`, `installId`, `requestId`) applied server-side before emitting SSE events.
  - Implements heartbeats + idempotent acking so abandoned sessions don’t stall the stream.
- [ ] Add multi-tenant scoping at the stream level by embedding tenant + install IDs in stream entries and filtering at the EE layer.
- [ ] Provide fallbacks: if Redis is unavailable, drop back to the legacy per-pod proxy with an explicit warning in the UI (“live stream limited to a single runner pod”).
- [ ] Update the debug console copy to explain that live events now aggregate across all runner replicas (when Redis mode is active).

#### Phase 7: Remove Per-Pod Dependency & Operability

- [ ] Once the Redis path is proven in production, disable direct `/internal/ext-debug/stream` access from EE (keep it only for diagnostics).
- [ ] Simplify runner configuration: require either Redis streaming _or_ a dedicated `runner-private` ClusterIP if Redis is disabled, so we do not rely on Kourier host matching.
- [ ] Add observability:
  - Metrics for stream lag, consumer group backlog, dropped events.
  - Alerting when Redis retention drops events because of sustained back-pressure.
- [ ] Document upgrade/rollback steps so operators can toggle between legacy and Redis-backed streaming without dropping all sessions.

## Dependencies & Coordination

- Runner team:
  - Implement `ExtDebugEvent`, stdout/stderr capture, DebugHub, and internal streaming API.
- EE server/gateway team:
  - Add `/api/ext-debug/stream` with proper auth.
  - Wire `request_id` propagation end-to-end.
- Platform/Infrastructure:
  - Operate the Redis cluster/streams (Phase 5+) with appropriate ACLs, backups, and monitoring.
  - Expose a stable internal DNS name (or service) for the runner if Redis is disabled, so EE does not depend on Kourier host headers for intra-cluster calls.
- DX/Docs:
  - Update `ee/docs/extension-system/development_guide.md` and related docs to include:
    - How to use the debug console.
    - Expected constraints and policies.
- Security/compliance:
  - Review exposure model, logging content policies, and retention defaults.

## Summary

This plan introduces a focused, auth-aware Extension Debug Console within EE that streams live debug events from the Wasmtime-based runner, scoped by extension/install/request. It is:

- Concrete enough to implement incrementally.
- Safe for multi-tenant environments when flags and capabilities are observed.
- Highly valuable for extension authors who need real-time visibility into their code without direct infrastructure access.
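
As a closing illustration of the event path this plan describes (stream naming and truncation from sections 1 and 3, SSE framing from section 4), here is a minimal TypeScript sketch. It is not the runner's implementation, which lives in Rust: the helper names `streamKey`, `truncateMessage`, and `toSseFrame` are hypothetical, the event type is a reduced subset of `ExtDebugEvent`, and truncating on UTF-8 code point boundaries is an assumption about how the per-event byte cap might be applied.

```typescript
// Sketch only: helper names and the truncation strategy are illustrative
// assumptions, not the runner's actual (Rust) implementation.

type DebugEventSubset = {
  ts: string;
  level: 'trace' | 'debug' | 'info' | 'warn' | 'error';
  stream: 'stdout' | 'stderr' | 'log';
  tenantId?: string;
  extensionId?: string;
  requestId?: string;
  message: string;
  truncated?: boolean;
};

// Stream naming: prefix + tenant (or "unknown" when unavailable) + ":" + extension.
function streamKey(prefix: string, tenantId: string | undefined, extensionId: string): string {
  return `${prefix}${tenantId || 'unknown'}:${extensionId}`;
}

// Cap a message at maxBytes of UTF-8 without splitting a code point,
// and report whether anything was cut off.
function truncateMessage(
  message: string,
  maxBytes: number,
): { message: string; truncated: boolean } {
  const encoder = new TextEncoder();
  if (encoder.encode(message).length <= maxBytes) {
    return { message, truncated: false };
  }
  let kept = '';
  let used = 0;
  for (const ch of Array.from(message)) {
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    kept += ch;
    used += size;
  }
  return { message: kept, truncated: true };
}

// SSE framing: one JSON payload on a single `data:` line, ended by a blank line.
function toSseFrame(event: DebugEventSubset): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

Because `JSON.stringify` escapes newlines inside strings, each event stays on one `data:` line, which keeps the frame valid for `EventSource` consumers even when a guest writes multi-line output.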