PSA/ee/docs/plans/2025-11-07-extension-debug-stream-ui-plan.md
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

# Extension Debug Stream UI Plan (EE Live Debug Console for Runner-Based Extensions)
## Overview
Introduce a first-class "Extension Debug Console" in Alga PSA EE that enables extension authors and internal engineers to observe live stdout/stderr and structured logs for their extensions, scoped to specific extension installs and request flows.
The console will:
- Stream debug events (stdout, stderr, structured logs) in near real time via WebSockets or Server-Sent Events (SSE).
- Correlate events with:
  - `extension_id`, `tenant_id`, `install_id`
  - `request_id`
  - `content_hash` / `version_id`
- Respect multi-tenant boundaries, capabilities, and security requirements.
- Be enabled in dev/staging, and heavily constrained (opt-in and time-boxed) in production.
This plan builds on the Wasmtime/component-based runner and the existing extension metadata + capability model.
## Goals
- [x] Provide a dedicated EE UI page for extension debugging with live log streaming. *(UI at `server/src/app/msp/extensions/[id]/debug/page.tsx` consuming `/api/ext-debug/stream`.)*
- [x] Allow filtering by:
  - Specific request flow (`request_id`),
  - Extension/install,
  - Stream type (stdout, stderr, structured logs). *(Implemented via client-side filters in the debug page.)*
- [x] Implement a runner-side debug event pipeline that captures guest stdout/stderr and host logging events in a structured and correlatable way. *(Runner emits Redis Stream events when `RUNNER_DEBUG_REDIS_URL` is set; proxied by `server/src/app/api/ext-debug/stream/route.ts`.)*
- [ ] Enforce strong authorization and isolation: only appropriate users can see logs for a given tenant/extension.
- [ ] Gate the feature with environment flags and capabilities to avoid accidental leakage or resource abuse.
Status update (2025-11-21): Streaming path (Runner → Redis Streams → SSE endpoint → UI) works; authz/capability gating and production hardening remain to be delivered.
## Non-Goals
- Full-blown distributed tracing across all platform components.
- Long-term persistent log storage or historical search UI.
- Arbitrary tailing of all runner logs for all tenants from EE.
- Overriding the structured provider-based logging model (this feature complements it).
## Architecture
### 1. Debug Event Model
Define a structured debug event that the runner produces for any debuggable signal (stdout/stderr lines, extension log calls, critical host events):
```ts
type ExtDebugEvent = {
  ts: string; // ISO 8601
  level: 'trace' | 'debug' | 'info' | 'warn' | 'error';
  stream: 'stdout' | 'stderr' | 'log';
  tenantId?: string;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  versionId?: string;
  contentHash?: string;
  // Raw or structured content
  message: string;
  fields?: Record<string, unknown>;
  // Safety/limits
  truncated?: boolean;
};
```
Key rules:
- Always include `extensionId` and `requestId` when available.
- Prefer including `tenantId` and `installId` for multi-tenant visibility and auth decisions.
- `message` is bounded in length; large payloads are truncated with `truncated=true`.
- No secrets: message content must not include decrypted secrets; rely on existing capabilities and filters.
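The bounded-`message` rule can be illustrated with a minimal sketch. The helper names `utf8Length` and `boundMessage` are hypothetical, and the assumption that the cap is measured in UTF-8 bytes follows from the `RUNNER_DEBUG_MAX_EVENT_BYTES` flag name rather than from a confirmed implementation. The runner itself is written in Rust; TypeScript is used here only to match the `ExtDebugEvent` definition above.

```typescript
// Hypothetical helper: UTF-8 byte length of a single Unicode code point.
function utf8Length(codePoint: number): number {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Hypothetical helper: cap `message` at `maxBytes` of UTF-8 without
// splitting a code point, and report whether anything was cut off.
function boundMessage(
  message: string,
  maxBytes: number,
): { message: string; truncated: boolean } {
  let used = 0;
  let out = "";
  // for...of iterates a string by code point, not by UTF-16 unit.
  for (const ch of message) {
    const size = utf8Length(ch.codePointAt(0) ?? 0);
    if (used + size > maxBytes) {
      return { message: out, truncated: true };
    }
    used += size;
    out += ch;
  }
  return { message: out, truncated: false };
}
```

The returned `truncated` flag maps directly onto the `truncated` field of `ExtDebugEvent`.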
### 2. Runner: Capturing stdout/stderr and Logs
Implement capture and routing inside the runner (Rust):
- Location:
  - [`ee/runner/src/engine/loader.rs`](ee/runner/src/engine/loader.rs)
  - [`ee/runner/src/engine/host_api.rs`](ee/runner/src/engine/host_api.rs) (for WIT logging interfaces)
  - New module: `ee/runner/src/engine/debug.rs` or `ee/runner/src/util/debug_stream.rs` for shared plumbing.
Core behaviors:
1. When instantiating a component for execution:
   - Initialize `HostExecutionContext` with:
     - `request_id`, `tenant_id`, `extension_id`, `install_id`, `version_id`, config, providers (already present conceptually).
   - Attach WASI stdout/stderr to custom sinks that:
     - Split by line or chunk.
     - Build `ExtDebugEvent` records with `stream: 'stdout' | 'stderr'`.
     - Dispatch to:
       - `tracing` (with target `ext.stdout` / `ext.stderr`),
       - The Redis publisher (see next section) when debug streaming is enabled.
2. For host-side WIT log functions (e.g. `alga.log` provider):
   - Generate `ExtDebugEvent` with `stream: 'log'` and appropriate `level`.
   - Dispatch similarly via `tracing` and the Redis publisher (when enabled).
3. Configuration:
   - Env flags:
     - `RUNNER_DEBUG_REDIS_URL`
     - `RUNNER_DEBUG_REDIS_STREAM_PREFIX`
     - `RUNNER_DEBUG_REDIS_MAXLEN`
     - `RUNNER_DEBUG_MAX_EVENT_BYTES` (per event cap)
   - Behavior when `RUNNER_DEBUG_REDIS_URL` is unset:
     - Continue emitting to `tracing` only (no debug stream fan-out).
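The "split by line" behavior of the stdout/stderr sinks could look roughly like the sketch below. The real sinks live in the Rust runner; this TypeScript version only illustrates the buffering logic, and the `LineSplitter` class name and its `push`/`flush` methods are assumptions, not existing code.

```typescript
// Hypothetical sketch: buffer arbitrary output chunks and emit only
// complete lines, keeping any trailing partial line for the next chunk.
class LineSplitter {
  private pending = "";

  // Feed one chunk; returns the complete lines it finished (no newline).
  push(chunk: string): string[] {
    this.pending += chunk;
    const parts = this.pending.split("\n");
    // The last element is the unfinished tail (empty if chunk ended in \n).
    this.pending = parts.pop() ?? "";
    // Tolerate CRLF output by trimming a trailing carriage return.
    return parts.map((line) => (line.endsWith("\r") ? line.slice(0, -1) : line));
  }

  // Call once when the guest exits so a final unterminated line is not lost.
  flush(): string[] {
    const rest = this.pending;
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each returned line would then become one `ExtDebugEvent` with `stream` set to `'stdout'` or `'stderr'` and the correlation IDs taken from the execution context.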
### 3. Runner: Redis Debug Stream Publisher
Instead of an in-memory hub, the runner now serializes each `ExtDebugEvent` and appends it to a Redis Stream. Key points:
- Stream naming: `${RUNNER_DEBUG_REDIS_STREAM_PREFIX}{tenantId}:{extensionId}` (tenant falls back to `unknown` when unavailable).
- Command: `XADD <stream> MAXLEN ~ <maxLen> * field value ...` with a small bounded payload (the `*` asks Redis to auto-generate the entry ID).
- Each message includes the fields consumed by EE (`ts`, `level`, `stream`, `tenant`, `extension`, `install`, `request`, `version`, `content_hash`, `message`, `truncated`).
- If Redis is down, we log and drop events (mirroring logs via `tracing` so operators can still inspect pod logs).
- Future back-pressure: consider local ring buffer to avoid blocking extension execution if Redis is temporarily unavailable.
Security note:
- Redis credentials are provided via `RUNNER_DEBUG_REDIS_URL` (or a mounted secret). ACLs should scope the runner to `XADD` only for the debug keyspace.
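As an illustration of the naming and command shape described in this section, the sketch below builds the stream key and the argument list for `XADD`. The function names `streamKey` and `xaddArgs` are hypothetical, as are the choices to omit absent optional fields and to encode `truncated` as the strings `"true"`/`"false"`; the field names themselves come from the list above. The runner's actual publisher is Rust, so this is only a model of the wire format.

```typescript
type ExtDebugEvent = {
  ts: string;
  level: "trace" | "debug" | "info" | "warn" | "error";
  stream: "stdout" | "stderr" | "log";
  tenantId?: string;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  versionId?: string;
  contentHash?: string;
  message: string;
  truncated?: boolean;
};

// `${prefix}{tenantId}:{extensionId}`, with tenant falling back to "unknown".
function streamKey(prefix: string, extensionId: string, tenantId?: string): string {
  return `${prefix}${tenantId ?? "unknown"}:${extensionId}`;
}

// Arguments that would follow the XADD command name.
function xaddArgs(key: string, maxLen: number, event: ExtDebugEvent): string[] {
  const pairs: Array<[string, string | undefined]> = [
    ["ts", event.ts],
    ["level", event.level],
    ["stream", event.stream],
    ["tenant", event.tenantId],
    ["extension", event.extensionId],
    ["install", event.installId],
    ["request", event.requestId],
    ["version", event.versionId],
    ["content_hash", event.contentHash],
    ["message", event.message],
    // Assumption: booleans are serialized as "true"/"false".
    ["truncated", event.truncated ? "true" : "false"],
  ];
  // "~" requests approximate trimming; "*" lets Redis assign the entry ID.
  const args = [key, "MAXLEN", "~", String(maxLen), "*"];
  for (const [name, value] of pairs) {
    // Assumption: optional fields that are absent are simply left out.
    if (value !== undefined) args.push(name, value);
  }
  return args;
}
```

With a hypothetical prefix of `ext-debug:`, an event for tenant `t-1` and extension `ext-1` would be appended to `ext-debug:t-1:ext-1`.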
### 4. EE Backend: WebSocket/SSE Proxy
Add an EE API endpoint that exposes a controlled live debug stream to authenticated users.
Suggested route:
- `ee/server/src/app/api/ext-debug/stream/route.ts` (Next.js App Router)
- URL example:
- `/api/ext-debug/stream?extensionId=...&tenantId=...&installId=...&requestId=...`
Behavior:
1. Authentication:
   - Require standard session auth.
   - Confirm user has one of:
     - Internal operator role, or
     - Tenant admin for `tenantId`, or
     - Extension owner / partner developer tied to the specified extension/install.
   - Deny if user attempts to observe another tenant's data.
2. Authorization:
   - Check:
     - The requested `extensionId` belongs to the caller's accessible scope.
     - If `tenantId` is provided, it matches the caller's tenant context (unless internal).
   - Optional: extension manifest/capabilities include something like `cap:debug.logs` or a server-side allowlist for debug streaming.
3. Subscription handshake:
   - On connection:
     - Build a subscription filter object:
       - Always include `extensionId`.
       - Include `tenantId` / `installId` if supplied.
       - Include `requestId` if provided for per-flow focus.
     - Call runner internal API or RPC:
       - e.g., `POST /internal/runner/debug/subscribe` with filter and a signed token,
       - Runner returns `debug_session_id`.
     - Start a streaming loop that:
       - Pulls `ExtDebugEvent` from runner (via:
         - a streaming HTTP endpoint,
         - or a long-lived connection,
         - or a broker / message bus, depending on infra),
       - Forwards events to the client via WebSockets or SSE.
4. Transport details:
   - Recommended for simplicity:
     - SSE for first implementation:
       - One-way stream, simple to proxy.
       - Events framed as `data: { ...ExtDebugEvent... }\n\n`.
     - WebSockets if bidirectional control desired later:
       - e.g., changing filters, pausing, etc.
5. Limits and lifecycle:
   - Enforce:
     - Max session duration (e.g. 5-15 minutes; extendable).
   - Close stream when:
     - TTL exceeded,
     - User navigates away,
     - Runner cancels subscription.
   - Provide:
     - `x-debug-truncated: true` or event-level `truncated` when server-side limits hit.
     - Clear documentation in UI when data may be incomplete.
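The subscription filter and SSE framing described in steps 3 and 4 can be sketched as two small pure functions. The names `DebugFilter`, `matchesFilter`, and `toSseFrame` are hypothetical, and the trimmed-down `DebugEvent` type is a stand-in for `ExtDebugEvent`; this is not the code in `route.ts`.

```typescript
// Reduced event shape for illustration (a subset of ExtDebugEvent).
type DebugEvent = {
  stream: "stdout" | "stderr" | "log";
  message: string;
  extensionId?: string;
  tenantId?: string;
  installId?: string;
  requestId?: string;
};

// extensionId is always required; the other fields narrow the stream.
type DebugFilter = {
  extensionId: string;
  tenantId?: string;
  installId?: string;
  requestId?: string;
};

function matchesFilter(event: DebugEvent, filter: DebugFilter): boolean {
  if (event.extensionId !== filter.extensionId) return false;
  if (filter.tenantId !== undefined && event.tenantId !== filter.tenantId) return false;
  if (filter.installId !== undefined && event.installId !== filter.installId) return false;
  if (filter.requestId !== undefined && event.requestId !== filter.requestId) return false;
  return true;
}

// One SSE message: a single "data:" line terminated by a blank line.
// JSON.stringify escapes newlines, so the payload stays on one line.
function toSseFrame(event: DebugEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

Filtering before framing keeps events for other tenants or requests from ever reaching the response body, which matters more for isolation than filtering in the browser.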
### 5. EE UI: Extension Debug Console
Add a dedicated page that consumes the stream:
Suggested route:
- `/msp/extensions/[extensionId]/debug`
- For internal operators:
  - Additional entry: `/ee/extensions/[extensionId]/debug`
Features:
- Filters:
  - Extension (from URL).
  - Tenant/install (dropdown or inferred from context).
  - Request mode:
    - “All requests” for that extension/install.
    - “Specific request” by `requestId`.
- Stream viewer:
  - Connect/disconnect button.
  - Live log panel:
    - Color-coded:
      - stdout (neutral),
      - stderr (red),
      - structured logs (level-specific colors).
    - Shows timestamp and key metadata (tenant, install, req id).
  - Controls:
    - Pause/resume auto-scroll.
    - Toggle stdout/stderr/log.
    - Clear buffer.
- DX helpers:
  - Show “How to correlate” help:
    - e.g., “Use `request_id` from extension errors or logs to narrow to a single flow.”
  - For dev:
    - Example snippet for extension authors:
      - `logInfo("debug marker: X")` usage,
      - explaining how it appears in the console.
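The stream toggles and color coding listed above could be modeled as below. This is a sketch only: the `Row` type, the function names, and the CSS class names (`log-neutral`, `log-red`, `log-level-*`) are invented for illustration and are not taken from the existing page.

```typescript
type StreamKind = "stdout" | "stderr" | "log";
type Level = "trace" | "debug" | "info" | "warn" | "error";
type Row = { stream: StreamKind; level: Level; message: string };

// One boolean per stream type, driven by the stdout/stderr/log toggles.
type Toggles = Record<StreamKind, boolean>;

// Rows stay in the buffer when hidden; toggling only changes what renders.
function visibleRows(rows: readonly Row[], toggles: Toggles): Row[] {
  return rows.filter((row) => toggles[row.stream]);
}

// stdout is neutral, stderr is red, structured logs are colored by level.
function rowClass(row: Row): string {
  if (row.stream === "stdout") return "log-neutral";
  if (row.stream === "stderr") return "log-red";
  return `log-level-${row.level}`;
}
```

Keeping hidden rows in the buffer means a user can re-enable a stream type and see what was emitted while it was toggled off.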
### 6. Capabilities, Flags, and Safety
To avoid accidental misuse:
- Capability gating:
  - Optionally require a capability at install/manifest level:
    - `cap:debug.logs` or similar; when absent, EE refuses debug sessions for that extension except for privileged internal users.
- Environment flags (runner + EE):
  - `RUNNER_DEBUG_REDIS_URL`
  - `RUNNER_DEBUG_REDIS_STREAM_PREFIX`
  - EE-side:
    - `EXT_DEBUG_UI_ENABLED`
- Rate limiting:
  - EE API-level rate limits per user/tenant.
  - Runner-level caps on sessions and throughput.
- Data retention:
  - By design, this feature is for *live* debugging:
    - Buffers are short-lived.
    - Persistent historical logs remain in standard infra (e.g. Loki/ELK) under operator control.
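One way to combine the flag, role, tenant, and capability checks into a single decision is sketched below. Everything here is an assumption made for illustration: the `Role` values, the `SessionRequest` shape, the order of the checks, and the treatment of `cap:debug.logs` as a hard requirement for non-internal users (the plan describes it as optional).

```typescript
type Role = "internal_operator" | "tenant_admin" | "extension_developer" | "none";

type SessionRequest = {
  uiEnabled: boolean; // e.g. derived from EXT_DEBUG_UI_ENABLED
  role: Role;
  callerTenantId?: string;
  requestedTenantId?: string;
  capabilities: readonly string[]; // from the extension manifest/install
};

type Decision = { allowed: boolean; reason: string };

function decideDebugSession(req: SessionRequest): Decision {
  // The environment flag gates everyone, including internal users.
  if (!req.uiEnabled) return { allowed: false, reason: "feature disabled" };
  // Privileged internal users bypass tenant and capability checks.
  if (req.role === "internal_operator") return { allowed: true, reason: "internal operator" };
  if (req.role === "none") return { allowed: false, reason: "role not permitted" };
  // A non-internal caller may only observe their own tenant.
  if (req.requestedTenantId !== undefined && req.requestedTenantId !== req.callerTenantId) {
    return { allowed: false, reason: "tenant mismatch" };
  }
  if (req.capabilities.indexOf("cap:debug.logs") === -1) {
    return { allowed: false, reason: "capability missing" };
  }
  return { allowed: true, reason: "ok" };
}
```

Returning a reason alongside the verdict makes it straightforward to feed the same decision into audit logs later.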
### 7. Implementation Phases
#### Phase 1 — Runner Event Capture
- [x] Implement `ExtDebugEvent` type and the Redis publisher in the runner.
  - Implemented in `ee/runner/src/engine/debug.rs` and `debug_redis.rs`.
- [x] Route:
  - WIT log provider calls to event producer.
    - Implemented in `ee/runner/src/engine/host_api.rs` to forward `log_info/log_warn/log_error` into Redis.
  - WASI stderr wired to event producer (initial implementation).
    - Implemented in `ee/runner/src/engine/loader.rs` via a custom `stderr` pipe that forwards guest stderr lines into Redis when enabled.
    - (Optional stdout mirroring remains off by default to avoid noise; can be added later if needed.)
- [ ] Add basic unit tests:
  - stdout/stderr captured and tagged with correct metadata.
#### Phase 2 — Internal Streaming API (Legacy)
- [ ] (Deprecated) The original SSE endpoint at `/internal/ext-debug/stream` has been removed now that Redis fan-out is the canonical path.
- [x] Implement EE backend `/api/ext-debug/stream`:
  - Implemented at `server/src/app/api/ext-debug/stream/route.ts`:
    - AuthN + AuthZ via existing helpers.
    - Forwards `extensionId`/`tenantId`/`installId`/`requestId` filter to runner using `x-ext-debug-filter`.
    - Relays SSE stream response directly to clients.
- [ ] Add integration tests / local harness:
  - Fake extension emitting stdout/structured logs.
  - Confirm events appear via `/api/ext-debug/stream`.
#### Phase 3 — EE Debug Console UI
- [x] Build `/msp/extensions/[extensionId]/debug` page:
  - Implemented at `server/src/app/msp/extensions/[extensionId]/debug/page.tsx`.
  - Connects to `/api/ext-debug/stream` using `EventSource`.
  - Supports filters for `tenantId`, `installId`, and `requestId`.
  - Renders a live console with:
    - stdout/stderr/log classification,
    - connection state,
    - auto-scroll toggle,
    - bounded history to cap memory use.
- [x] Add navigation entry points:
  - Implement by linking from the extensions settings UI at `/msp/settings?tab=extensions`:
    - For each extension row, add a "Debug Console" action targeting:
      - `/msp/extensions/{extensionId}/debug`
    - Optionally preserve `tenantId`/`installId` in query params.
  - This hooks the existing settings-based extensions screen (the canonical management surface) directly into the debug page for the selected extension.
- [x] Document how extension authors use the console:
  - Inline help on the debug page explains:
    - Required runner configuration (`RUNNER_DEBUG_REDIS_URL`, stream prefix, Redis ACL credentials).
    - Using structured logging helpers instead of printing secrets.
    - Using `x-request-id` / `context.request_id` and filters to follow specific request flows.
#### Phase 4 — Hardening & Production Policy
- [ ] Add capability and tenant-scoped policy checks.
- [ ] Add robust truncation, redaction (optional regex-based guardrails), and audit logs:
  - Who opened debug sessions, for which extension/tenant, and when.
- [ ] Define environment policies:
  - Fully enabled in dev/staging.
  - In prod:
    - Off by default.
    - Can be enabled per tenant/extension with admin approval or for time-limited debugging.
#### Phase 5 — Distributed Event Bus (Redis Streams)
_Motivation: In production, Knative fans requests across runner pods. A Redis-backed fan-out ensures the debug console aggregates logs across all pods and avoids the Kourier routing issues we hit with `runner.msp.svc.cluster.local`._
- [ ] Provision a Redis cluster/namespace dedicated to short-lived “debug events” with strong authentication and TTL defaults (e.g., 15min retention).
- [ ] Define stream partitioning: e.g., `ext-debug:<extension_id>` or sharded by `tenant_id:extension_id`. Document key structure, retention policy, and serialization (JSON `ExtDebugEvent`).
- [ ] Extend the runner:
  - Add optional `RUNNER_DEBUG_REDIS_URL`, `RUNNER_DEBUG_REDIS_STREAM_PREFIX`, `RUNNER_DEBUG_REDIS_MAXLEN` (and future TLS/password flags) env vars so operators can point at the shared Redis cluster without changing code.
  - On each event, enqueue to Redis Streams (or Pub/Sub) with bounded async buffering; Redis replaces the in-memory `DebugHub` entirely.
  - Tag events with a monotonic sequence (`xadd` ID) to preserve ordering.
  - Include metrics + back-pressure handling (drop oldest events, emit warnings) when Redis is unavailable.
- [ ] Reuse existing Redis stream plumbing where possible:
  - `server/src/lib/eventBus/index.ts` already manages `XADD`/`XREADGROUP`, consumer groups, trimming, and retry logic.
  - `shared/workflow/streams/redisStreamClient.ts` shows how to wrap publish/read/ack helpers.
  - Mirror those patterns for debug streams (new `DebugStreamClient`) instead of reinventing connection management.
- [ ] Build a lightweight “debug-stream fan-out” worker (can live inside the EE server or as a sidecar) that tails Redis Streams via consumer groups, applies filters server-side, and relays to subscribers.
- [ ] Security: reuse existing `x-runner-auth` token for publishing auth, and create a dedicated Redis ACL role that only allows XADD/XLEN on the debug keys.
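The "bounded async buffering" with drop-oldest back-pressure mentioned in this phase might be modeled as follows. The `DropOldestQueue` class is hypothetical and written in TypeScript for consistency with the rest of this plan; the runner would need an equivalent in Rust.

```typescript
// Hypothetical sketch: a fixed-capacity queue that evicts the oldest
// entry when full and counts evictions so they can be reported as a metric.
class DropOldestQueue<T> {
  private items: T[] = [];
  private droppedCount = 0;
  private readonly capacity: number;

  constructor(capacity: number) {
    // Guard against a zero or negative capacity.
    this.capacity = Math.max(1, capacity);
  }

  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // evict the oldest buffered event
      this.droppedCount += 1;
    }
    this.items.push(item);
  }

  // Hand everything buffered so far to the publisher and reset the buffer.
  drain(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }

  // Total number of events evicted since creation.
  get dropped(): number {
    return this.droppedCount;
  }
}
```

Dropping the oldest events favors recency, which suits live debugging; the `dropped` counter gives operators a signal to alert on when Redis has been unavailable for too long.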
#### Phase 6 — EE Proxy Migration to Redis Streams
- [ ] Update `/api/ext-debug/stream` so that, when the Redis-backed mode is enabled (`EXT_DEBUG_STREAM_MODE=redis`), it:
  - Validates the user/session as before.
  - Registers/updates a consumer group per `extensionId` (e.g., `ee-debug-ui`).
  - Issues `XREADGROUP` with filters (`tenantId`, `installId`, `requestId`) applied server-side before emitting SSE events.
  - Implements heartbeats + idempotent acking so abandoned sessions don't stall the stream.
- [ ] Add multi-tenant scoping at the stream level by embedding tenant + install IDs in stream entries and filtering at the EE layer.
- [ ] Provide fallbacks: if Redis is unavailable, drop back to the legacy per-pod proxy with an explicit warning in the UI (“live stream limited to a single runner pod”).
- [ ] Update the debug console copy to explain that live events now aggregate across all runner replicas (when Redis mode is active).
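On the consuming side, each Redis Stream entry arrives as an ID plus a flat list of alternating field names and values. A minimal sketch of mapping such an entry back to an event is shown below; the `ParsedEvent` type, the `parseEntry` name, and the fallback defaults for missing fields are assumptions, while the field names match those the runner is described as writing.

```typescript
type ParsedEvent = {
  id: string; // Redis entry ID, usable for ordering
  ts: string;
  level: string;
  stream: string;
  tenantId?: string;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  versionId?: string;
  contentHash?: string;
  message: string;
  truncated: boolean;
};

// `flat` is [field1, value1, field2, value2, ...] as stored in the entry.
function parseEntry(id: string, flat: readonly string[]): ParsedEvent {
  const fields = new Map<string, string>();
  for (let i = 0; i + 1 < flat.length; i += 2) {
    fields.set(flat[i], flat[i + 1]);
  }
  return {
    id,
    // Assumption: missing required fields fall back to benign defaults.
    ts: fields.get("ts") ?? "",
    level: fields.get("level") ?? "info",
    stream: fields.get("stream") ?? "log",
    tenantId: fields.get("tenant"),
    extensionId: fields.get("extension"),
    installId: fields.get("install"),
    requestId: fields.get("request"),
    versionId: fields.get("version"),
    contentHash: fields.get("content_hash"),
    message: fields.get("message") ?? "",
    truncated: fields.get("truncated") === "true",
  };
}
```

The parsed event can then be checked against the session's `tenantId`, `installId`, and `requestId` filters before it is written to the SSE response.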
#### Phase 7 — Remove Per-Pod Dependency & Operability
- [ ] Once the Redis path is proven in production, disable direct `/internal/ext-debug/stream` access from EE (keep it only for diagnostics).
- [ ] Simplify runner configuration: require either Redis streaming _or_ a dedicated `runner-private` ClusterIP if Redis is disabled, so we do not rely on Kourier host matching.
- [ ] Add observability:
  - Metrics for stream lag, consumer group backlog, dropped events.
  - Alerting when Redis retention drops events because of sustained back-pressure.
- [ ] Document upgrade/rollback steps so operators can toggle between legacy and Redis-backed streaming without dropping all sessions.
## Dependencies & Coordination
- Runner team:
  - Implement `ExtDebugEvent`, stdout/stderr capture, DebugHub, and internal streaming API.
- EE server/gateway team:
  - Add `/api/ext-debug/stream` with proper auth.
  - Wire request_id propagation end-to-end.
- Platform/Infrastructure:
  - Operate the Redis cluster/streams (Phase 5+) with appropriate ACLs, backups, and monitoring.
  - Expose a stable internal DNS name (or service) for the runner if Redis is disabled, so EE does not depend on Kourier host headers for intra-cluster calls.
- DX/Docs:
  - Update `ee/docs/extension-system/development_guide.md` and related docs to include:
    - How to use the debug console.
    - Expected constraints and policies.
- Security/compliance:
  - Review exposure model, logging content policies, and retention defaults.
## Summary
This plan introduces a focused, auth-aware Extension Debug Console within EE that streams live debug events from the Wasmtime-based runner, scoped by extension/install/request. It is:
- Concrete enough to implement incrementally.
- Safe for multi-tenant environments when flags and capabilities are observed.
- Highly valuable for extension authors who need real-time visibility into their code without direct infrastructure access.