Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
365 lines
18 KiB
Markdown
365 lines
18 KiB
Markdown
# Extension Debug Stream UI Plan (EE Live Debug Console for Runner-Based Extensions)
|
||
|
||
## Overview
|
||
|
||
Introduce a first-class "Extension Debug Console" in Alga PSA EE that enables extension authors and internal engineers to observe live stdout/stderr and structured logs for their extensions, scoped to specific extension installs and request flows.
|
||
|
||
The console will:
|
||
|
||
- Stream debug events (stdout, stderr, structured logs) in near real time via WebSockets or Server-Sent Events (SSE).
|
||
- Correlate events with:
|
||
- `extension_id`, `tenant_id`, `install_id`
|
||
- `request_id`
|
||
- `content_hash` / `version_id`
|
||
- Respect multi-tenant boundaries, capabilities, and security requirements.
|
||
- Be enabled and heavily constrained in dev/staging; opt-in and time-boxed in production.
|
||
|
||
This plan builds on the Wasmtime/component-based runner and the existing extension metadata + capability model.
|
||
|
||
## Goals
|
||
|
||
- [x] Provide a dedicated EE UI page for extension debugging with live log streaming. *(UI at `server/src/app/msp/extensions/[id]/debug/page.tsx` consuming `/api/ext-debug/stream`.)*
|
||
- [x] Allow filtering by:
|
||
- Specific request flow (`request_id`),
|
||
- Extension/install,
|
||
- Stream type (stdout, stderr, structured logs). *(Implemented via client-side filters in the debug page.)*
|
||
- [x] Implement a runner-side debug event pipeline that captures guest stdout/stderr and host logging events in a structured and correlatable way. *(Runner emits Redis Stream events when `RUNNER_DEBUG_REDIS_URL` is set; proxied by `server/src/app/api/ext-debug/stream/route.ts`.)*
|
||
- [ ] Enforce strong authorization and isolation: only appropriate users can see logs for a given tenant/extension.
|
||
- [ ] Gate the feature with environment flags and capabilities to avoid accidental leakage or resource abuse.
|
||
|
||
Status update (2025-11-21): Streaming path (Runner → Redis Streams → SSE endpoint → UI) works; authz/capability gating and production hardening remain to be delivered.
|
||
|
||
## Non-Goals
|
||
|
||
- Full-blown distributed tracing across all platform components.
|
||
- Long-term persistent log storage or historical search UI.
|
||
- Arbitrary tailing of all runner logs for all tenants from EE.
|
||
- Overriding the structured provider-based logging model (this feature complements it).
|
||
|
||
## Architecture
|
||
|
||
### 1. Debug Event Model
|
||
|
||
Define a structured debug event that the runner produces for any debuggable signal (stdout/stderr lines, extension log calls, critical host events):
|
||
|
||
```ts
|
||
type ExtDebugEvent = {
|
||
ts: string; // ISO 8601
|
||
level: 'trace' | 'debug' | 'info' | 'warn' | 'error';
|
||
stream: 'stdout' | 'stderr' | 'log';
|
||
tenantId?: string;
|
||
extensionId?: string;
|
||
installId?: string;
|
||
requestId?: string;
|
||
versionId?: string;
|
||
contentHash?: string;
|
||
|
||
// Raw or structured content
|
||
message: string;
|
||
fields?: Record<string, unknown>;
|
||
|
||
// Safety/limits
|
||
truncated?: boolean;
|
||
};
|
||
```
|
||
|
||
Key rules:
|
||
|
||
- Always include `extensionId` and `requestId` when available.
|
||
- Prefer including `tenantId` and `installId` for multi-tenant visibility and auth decisions.
|
||
- `message` is bounded in length; large payloads are truncated with `truncated=true`.
|
||
- No secrets: message content must not include decrypted secrets; rely on existing capabilities and filters.
|
||
|
||
### 2. Runner: Capturing stdout/stderr and Logs
|
||
|
||
Implement capture and routing inside the runner (Rust):
|
||
|
||
- Location:
|
||
- [`ee/runner/src/engine/loader.rs`](ee/runner/src/engine/loader.rs)
|
||
- [`ee/runner/src/engine/host_api.rs`](ee/runner/src/engine/host_api.rs) (for WIT logging interfaces)
|
||
- New module: `ee/runner/src/engine/debug.rs` or `ee/runner/src/util/debug_stream.rs` for shared plumbing.
|
||
|
||
Core behaviors:
|
||
|
||
1. When instantiating a component for execution:
|
||
- Initialize `HostExecutionContext` with:
|
||
- `request_id`, `tenant_id`, `extension_id`, `install_id`, `version_id`, config, providers (already present conceptually).
|
||
- Attach WASI stdout/stderr to custom sinks that:
|
||
- Split by line or chunk.
|
||
- Build `ExtDebugEvent` records with `stream: 'stdout' | 'stderr'`.
|
||
- Dispatch to:
|
||
- `tracing` (with target `ext.stdout` / `ext.stderr`),
|
||
- The Redis publisher (see next section) when debug streaming is enabled.
|
||
|
||
2. For host-side WIT log functions (e.g. `alga.log` provider):
|
||
- Generate `ExtDebugEvent` with `stream: 'log'` and appropriate `level`.
|
||
- Dispatch similarly via `tracing` and the Redis publisher (when enabled).
|
||
|
||
3. Configuration:
|
||
- Env flags:
|
||
- `RUNNER_DEBUG_REDIS_URL`
|
||
- `RUNNER_DEBUG_REDIS_STREAM_PREFIX`
|
||
- `RUNNER_DEBUG_REDIS_MAXLEN`
|
||
- `RUNNER_DEBUG_MAX_EVENT_BYTES` (per event cap)
|
||
- Behavior when `RUNNER_DEBUG_REDIS_URL` is unset:
|
||
- Continue emitting to `tracing` only (no debug stream fan-out).
|
||
|
||
### 3. Runner: Redis Debug Stream Publisher
|
||
|
||
Instead of an in-memory hub, the runner now serializes each `ExtDebugEvent` and appends it to a Redis Stream. Key points:
|
||
|
||
- Stream naming: `${RUNNER_DEBUG_REDIS_STREAM_PREFIX}{tenantId}:{extensionId}` (tenant falls back to `unknown` when unavailable).
|
||
- Command: `XADD <stream> MAXLEN ~ <maxLen> field value ...` with a small bounded payload.
|
||
- Each message includes the fields consumed by EE (`ts`, `level`, `stream`, `tenant`, `extension`, `install`, `request`, `version`, `content_hash`, `message`, `truncated`).
|
||
- If Redis is down, we log and drop events (mirroring logs via `tracing` so operators can still inspect pod logs).
|
||
- Future back-pressure: consider local ring buffer to avoid blocking extension execution if Redis is temporarily unavailable.
|
||
|
||
Security note:
|
||
|
||
- Redis credentials are provided via `RUNNER_DEBUG_REDIS_URL` (or a mounted secret). ACLs should scope the runner to `XADD` only for the debug keyspace.
|
||
|
||
### 4. EE Backend: WebSocket/SSE Proxy
|
||
|
||
Add an EE API endpoint that exposes a controlled live debug stream to authenticated users.
|
||
|
||
Suggested route:
|
||
|
||
- `ee/server/src/app/api/ext-debug/stream/route.ts` (Next.js App Router)
|
||
- URL example:
|
||
- `/api/ext-debug/stream?extensionId=...&tenantId=...&installId=...&requestId=...`
|
||
|
||
Behavior:
|
||
|
||
1. Authentication:
|
||
- Require standard session auth.
|
||
- Confirm user has one of:
|
||
- Internal operator role, or
|
||
- Tenant admin for `tenantId`, or
|
||
- Extension owner / partner developer tied to the specified extension/install.
|
||
- Deny if user attempts to observe another tenant’s data.
|
||
|
||
2. Authorization:
|
||
- Check:
|
||
- The requested `extensionId` belongs to the caller’s accessible scope.
|
||
- If `tenantId` is provided, it matches caller’s tenant context (unless internal).
|
||
- Optional: extension manifest/capabilities include something like `cap:debug.logs` or a server-side allowlist for debug streaming.
|
||
|
||
3. Subscription handshake:
|
||
- On connection:
|
||
- Build a subscription filter object:
|
||
- Always include `extensionId`.
|
||
- Include `tenantId` / `installId` if supplied.
|
||
- Include `requestId` if provided for per-flow focus.
|
||
- Call runner internal API or RPC:
|
||
- e.g., `POST /internal/runner/debug/subscribe` with filter and a signed token,
|
||
- Runner returns `debug_session_id`.
|
||
- Start a streaming loop that:
|
||
- Pulls `ExtDebugEvent` from runner (via:
|
||
- a streaming HTTP endpoint,
|
||
- or a long-lived connection,
|
||
- or a broker / message bus, depending on infra),
|
||
- Forwards events to the client via WebSockets or SSE.
|
||
|
||
4. Transport details:
|
||
|
||
- Recommended for simplicity:
|
||
- SSE for first implementation:
|
||
- One-way stream, simple to proxy.
|
||
- Events framed as `data: { ...ExtDebugEvent... }\n\n`.
|
||
- WebSockets if bidirectional control desired later:
|
||
- e.g., changing filters, pausing, etc.
|
||
|
||
5. Limits and lifecycle:
|
||
|
||
- Enforce:
|
||
- Max session duration (e.g. 5–15 minutes; extendable).
|
||
- Close stream when:
|
||
- TTL exceeded,
|
||
- User navigates away,
|
||
- Runner cancels subscription.
|
||
- Provide:
|
||
- `x-debug-truncated: true` or event-level `truncated` when server-side limits hit.
|
||
- Clear documentation in UI when data may be incomplete.
|
||
|
||
### 5. EE UI: Extension Debug Console
|
||
|
||
Add a dedicated page that consumes the stream:
|
||
|
||
Suggested route:
|
||
|
||
- `/msp/extensions/[extensionId]/debug`
|
||
- For internal operators:
|
||
- Additional entry: `/ee/extensions/[extensionId]/debug`
|
||
|
||
Features:
|
||
|
||
- Filters:
|
||
- Extension (from URL).
|
||
- Tenant/install (dropdown or inferred from context).
|
||
- Request mode:
|
||
- “All requests” for that extension/install.
|
||
- “Specific request” by `requestId`.
|
||
- Stream viewer:
|
||
- Connect/disconnect button.
|
||
- Live log panel:
|
||
- Color-coded:
|
||
- stdout (neutral),
|
||
- stderr (red),
|
||
- structured logs (level-specific colors).
|
||
- Shows timestamp and key metadata (tenant, install, req id).
|
||
- Controls:
|
||
- Pause/resume auto-scroll.
|
||
- Toggle stdout/stderr/log.
|
||
- Clear buffer.
|
||
- DX helpers:
|
||
- Show “How to correlate” help:
|
||
- e.g., “Use `request_id` from extension errors or logs to narrow to a single flow.”
|
||
- For dev:
|
||
- Example snippet for extension authors:
|
||
- `logInfo("debug marker: X")` usage,
|
||
- explaining how it appears in the console.
|
||
|
||
### 6. Capabilities, Flags, and Safety
|
||
|
||
To avoid accidental misuse:
|
||
|
||
- Capability gating:
|
||
- Optionally require a capability at install/manifest level:
|
||
- `cap:debug.logs` or similar; when absent, EE refuses debug sessions for that extension except for privileged internal users.
|
||
- Environment flags (runner + EE):
|
||
- `RUNNER_DEBUG_REDIS_URL`
|
||
- `RUNNER_DEBUG_REDIS_STREAM_PREFIX`
|
||
- EE-side:
|
||
- `EXT_DEBUG_UI_ENABLED`
|
||
- Rate limiting:
|
||
- EE API-level rate limits per user/tenant.
|
||
- Runner-level caps on sessions and throughput.
|
||
- Data retention:
|
||
- By design, this feature is for *live* debugging:
|
||
- Buffers are short-lived.
|
||
- Persistent historical logs remain in standard infra (e.g. Loki/ELK) under operator control.
|
||
|
||
### 7. Implementation Phases
|
||
|
||
#### Phase 1 — Runner Event Capture
|
||
|
||
- [x] Implement `ExtDebugEvent` type and the Redis publisher in the runner.
|
||
- Implemented in `ee/runner/src/engine/debug.rs` and `debug_redis.rs`.
|
||
- [x] Route:
|
||
- WIT log provider calls to event producer.
|
||
- Implemented in `ee/runner/src/engine/host_api.rs` to forward `log_info/log_warn/log_error` into Redis.
|
||
- WASI stderr wired to event producer (initial implementation).
|
||
- Implemented in `ee/runner/src/engine/loader.rs` via a custom `stderr` pipe that forwards guest stderr lines into Redis when enabled.
|
||
- (Optional stdout mirroring remains off by default to avoid noise; can be added later if needed.)
|
||
- [ ] Add basic unit tests:
|
||
- stdout/stderr captured and tagged with correct metadata.
|
||
|
||
#### Phase 2 — Internal Streaming API (Legacy)
|
||
|
||
- [ ] (Deprecated) The original SSE endpoint at `/internal/ext-debug/stream` has been removed now that Redis fan-out is the canonical path.
|
||
- [x] Implement EE backend `/api/ext-debug/stream`:
|
||
- Implemented at `server/src/app/api/ext-debug/stream/route.ts`:
|
||
- AuthN + AuthZ via existing helpers.
|
||
- Forwards `extensionId`/`tenantId`/`installId`/`requestId` filter to runner using `x-ext-debug-filter`.
|
||
- Relays SSE stream response directly to clients.
|
||
- [ ] Add integration tests / local harness:
|
||
- Fake extension emitting stdout/structured logs.
|
||
- Confirm events appear via `/api/ext-debug/stream`.
|
||
|
||
#### Phase 3 — EE Debug Console UI
|
||
|
||
- [x] Build `/msp/extensions/[extensionId]/debug` page:
|
||
- Implemented at `server/src/app/msp/extensions/[extensionId]/debug/page.tsx`.
|
||
- Connects to `/api/ext-debug/stream` using `EventSource`.
|
||
- Supports filters for `tenantId`, `installId`, and `requestId`.
|
||
- Renders a live console with:
|
||
- stdout/stderr/log classification,
|
||
- connection state,
|
||
- auto-scroll toggle,
|
||
- bounded history to avoid unbounded memory.
|
||
- [x] Add navigation entry points:
|
||
- Implement by linking from the extensions settings UI at `/msp/settings?tab=extensions`:
|
||
- For each extension row, add a "Debug Console" action targeting:
|
||
- `/msp/extensions/{extensionId}/debug`
|
||
- Optionally preserve `tenantId`/`installId` in query params.
|
||
- This hooks the existing settings-based extensions screen (the canonical management surface) directly into the debug page for the selected extension.
|
||
- [x] Document how extension authors:
|
||
- Inline help on the debug page explains:
|
||
- Required runner configuration (`RUNNER_DEBUG_REDIS_URL`, stream prefix, Redis ACL credentials).
|
||
- Using structured logging helpers instead of printing secrets.
|
||
- Using `x-request-id` / `context.request_id` and filters to follow specific request flows.
|
||
|
||
#### Phase 4 — Hardening & Production Policy
|
||
|
||
- [ ] Add capability and tenant-scoped policy checks.
|
||
- [ ] Add robust truncation, redaction (optional regex-based guardrails), and audit logs:
|
||
- Who opened debug sessions, for which extension/tenant, and when.
|
||
- [ ] Define environment policies:
|
||
- Fully enabled in dev/staging.
|
||
- In prod:
|
||
- Off by default.
|
||
- Can be enabled per tenant/extension with admin approval or for time-limited debugging.
|
||
|
||
#### Phase 5 — Distributed Event Bus (Redis Streams)
|
||
|
||
_Motivation: In production, Knative fans requests across runner pods. A Redis-backed fan-out ensures the debug console aggregates logs across all pods and avoids the Kourier routing issues we hit with `runner.msp.svc.cluster.local`._
|
||
|
||
- [ ] Provision a Redis cluster/namespace dedicated to short-lived “debug events” with strong authentication and TTL defaults (e.g., 15 min retention).
|
||
- [ ] Define stream partitioning: e.g., `ext-debug:<extension_id>` or sharded by `tenant_id:extension_id`. Document key structure, retention policy, and serialization (JSON `ExtDebugEvent`).
|
||
- [ ] Extend the runner:
|
||
- Add optional `RUNNER_DEBUG_REDIS_URL`, `RUNNER_DEBUG_REDIS_STREAM_PREFIX`, `RUNNER_DEBUG_REDIS_MAXLEN` (and future TLS/password flags) env vars so operators can point at the shared Redis cluster without changing code.
|
||
- On each event, enqueue to Redis Streams (or Pub/Sub) with bounded async buffering; Redis replaces the in-memory `DebugHub` entirely.
|
||
- Tag events with a monotonic sequence (`xadd` ID) to preserve ordering.
|
||
- Include metrics + back-pressure handling (drop oldest events, emit warnings) when Redis is unavailable.
|
||
- [ ] Reuse existing Redis stream plumbing where possible:
|
||
- `server/src/lib/eventBus/index.ts` already manages `XADD`/`XREADGROUP`, consumer groups, trimming, and retry logic.
|
||
- `shared/workflow/streams/redisStreamClient.ts` shows how to wrap publish/read/ack helpers.
|
||
- Mirror those patterns for debug streams (new `DebugStreamClient`) instead of reinventing connection management.
|
||
- [ ] Build a lightweight “debug-stream fan-out” worker (can live inside the EE server or as a sidecar) that tails Redis Streams via consumer groups, applies filters server-side, and relays to subscribers.
|
||
- [ ] Security: reuse existing `x-runner-auth` token for publishing auth, and create a dedicated Redis ACL role that only allows XADD/XLEN on the debug keys.
|
||
|
||
#### Phase 6 — EE Proxy Migration to Redis Streams
|
||
|
||
- [ ] Update `/api/ext-debug/stream` so that, when the Redis-backed mode is enabled (`EXT_DEBUG_STREAM_MODE=redis`), it:
|
||
- Validates the user/session as before.
|
||
- Registers/updates a consumer group per `extensionId` (e.g., `ee-debug-ui`).
|
||
- Issues `XREADGROUP` with filters (`tenantId`, `installId`, `requestId`) applied server-side before emitting SSE events.
|
||
- Implements heartbeats + idempotent acking so abandoned sessions don’t stall the stream.
|
||
- [ ] Add multi-tenant scoping at the stream level by embedding tenant + install IDs in stream entries and filtering at the EE layer.
|
||
- [ ] Provide fallbacks: if Redis is unavailable, drop back to the legacy per-pod proxy with an explicit warning in the UI (“live stream limited to a single runner pod”).
|
||
- [ ] Update the debug console copy to explain that live events now aggregate across all runner replicas (when Redis mode is active).
|
||
|
||
#### Phase 7 — Remove Per-Pod Dependency & Operability
|
||
|
||
- [ ] Once the Redis path is proven in production, disable direct `/internal/ext-debug/stream` access from EE (keep it only for diagnostics).
|
||
- [ ] Simplify runner configuration: require either Redis streaming _or_ a dedicated `runner-private` ClusterIP if Redis is disabled, so we do not rely on Kourier host matching.
|
||
- [ ] Add observability:
|
||
- Metrics for stream lag, consumer group backlog, dropped events.
|
||
- Alerting when Redis retention drops events because of sustained back-pressure.
|
||
- [ ] Document upgrade/rollback steps so operators can toggle between legacy and Redis-backed streaming without dropping all sessions.
|
||
|
||
## Dependencies & Coordination
|
||
|
||
- Runner team:
|
||
- Implement `ExtDebugEvent`, stdout/stderr capture, DebugHub, and internal streaming API.
|
||
- EE server/gateway team:
|
||
- Add `/api/ext-debug/stream` with proper auth.
|
||
- Wire request_id propagation end-to-end.
|
||
- Platform/Infrastructure:
|
||
- Operate the Redis cluster/streams (Phase 5+) with appropriate ACLs, backups, and monitoring.
|
||
- Expose a stable internal DNS name (or service) for the runner if Redis is disabled, so EE does not depend on Kourier host headers for intra-cluster calls.
|
||
- DX/Docs:
|
||
- Update `ee/docs/extension-system/development_guide.md` and related docs to include:
|
||
- How to use the debug console.
|
||
- Expected constraints and policies.
|
||
- Security/compliance:
|
||
- Review exposure model, logging content policies, and retention defaults.
|
||
|
||
## Summary
|
||
|
||
This plan introduces a focused, auth-aware Extension Debug Console within EE that streams live debug events from the Wasmtime-based runner, scoped by extension/install/request. It is:
|
||
|
||
- Concrete enough to implement incrementally.
|
||
- Safe for multi-tenant environments when flags and capabilities are observed.
|
||
- Highly valuable for extension authors who need real-time visibility into their code without direct infrastructure access.
|