Extension Debug Stream UI Plan (EE Live Debug Console for Runner-Based Extensions)
Overview
Introduce a first-class "Extension Debug Console" in Alga PSA EE that enables extension authors and internal engineers to observe live stdout/stderr and structured logs for their extensions, scoped to specific extension installs and request flows.
The console will:
- Stream debug events (stdout, stderr, structured logs) in near real time via WebSockets or Server-Sent Events (SSE).
- Correlate events with:
  - extension_id, tenant_id, install_id
  - request_id
  - content_hash / version_id
- Respect multi-tenant boundaries, capabilities, and security requirements.
- Be enabled and heavily constrained in dev/staging; opt-in and time-boxed in production.
This plan builds on the Wasmtime/component-based runner and the existing extension metadata + capability model.
Goals
- Provide a dedicated EE UI page for extension debugging with live log streaming. (UI at server/src/app/msp/extensions/[id]/debug/page.tsx consuming /api/ext-debug/stream.)
- Allow filtering by (implemented via client-side filters in the debug page):
  - Specific request flow (request_id),
  - Extension/install,
  - Stream type (stdout, stderr, structured logs).
- Implement a runner-side debug event pipeline that captures guest stdout/stderr and host logging events in a structured and correlatable way. (The runner emits Redis Stream events when RUNNER_DEBUG_REDIS_URL is set; these are proxied by server/src/app/api/ext-debug/stream/route.ts.)
- Enforce strong authorization and isolation: only appropriate users can see logs for a given tenant/extension.
- Gate the feature with environment flags and capabilities to avoid accidental leakage or resource abuse.
Status update (2025-11-21): Streaming path (Runner → Redis Streams → SSE endpoint → UI) works; authz/capability gating and production hardening remain to be delivered.
Non-Goals
- Full-blown distributed tracing across all platform components.
- Long-term persistent log storage or historical search UI.
- Arbitrary tailing of all runner logs for all tenants from EE.
- Overriding the structured provider-based logging model (this feature complements it).
Architecture
1. Debug Event Model
Define a structured debug event that the runner produces for any debuggable signal (stdout/stderr lines, extension log calls, critical host events):
type ExtDebugEvent = {
ts: string; // ISO 8601
level: 'trace' | 'debug' | 'info' | 'warn' | 'error';
stream: 'stdout' | 'stderr' | 'log';
tenantId?: string;
extensionId?: string;
installId?: string;
requestId?: string;
versionId?: string;
contentHash?: string;
// Raw or structured content
message: string;
fields?: Record<string, unknown>;
// Safety/limits
truncated?: boolean;
};
Key rules:
- Always include extensionId and requestId when available.
- Prefer including tenantId and installId for multi-tenant visibility and auth decisions.
- message is bounded in length; large payloads are truncated with truncated=true.
- No secrets: message content must not include decrypted secrets; rely on existing capabilities and filters.
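The truncation rule can be sketched in TypeScript as shown below. This is an illustration only. The runner is written in Rust, and boundMessage is a hypothetical name. The sketch assumes the cap (for example RUNNER_DEBUG_MAX_EVENT_BYTES) counts UTF-8 bytes and that a cut should never split a multi-byte character.

```typescript
// Hypothetical helper: bound an event message to maxBytes of UTF-8,
// cutting on code-point boundaries and reporting whether truncation happened.
const utf8 = new TextEncoder();

function boundMessage(
  message: string,
  maxBytes: number,
): { message: string; truncated: boolean } {
  if (utf8.encode(message).length <= maxBytes) {
    return { message, truncated: false };
  }
  let used = 0;
  let out = "";
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of message) {
    const size = utf8.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return { message: out, truncated: true };
}
```

The returned truncated flag maps directly onto the event's truncated field, so the UI can show when a line was cut.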
2. Runner: Capturing stdout/stderr and Logs
Implement capture and routing inside the runner (Rust):
- Location:
  - ee/runner/src/engine/loader.rs
  - ee/runner/src/engine/host_api.rs (for WIT logging interfaces)
  - New module: ee/runner/src/engine/debug.rs or ee/runner/src/util/debug_stream.rs for shared plumbing.
Core behaviors:
- When instantiating a component for execution:
  - Initialize HostExecutionContext with request_id, tenant_id, extension_id, install_id, version_id, config, and providers (already present conceptually).
  - Attach WASI stdout/stderr to custom sinks that:
    - Split by line or chunk.
    - Build ExtDebugEvent records with stream: 'stdout' | 'stderr'.
    - Dispatch to:
      - tracing (with target ext.stdout / ext.stderr),
      - the Redis publisher (see next section) when debug streaming is enabled.
- For host-side WIT log functions (e.g. the alga.log provider):
  - Generate an ExtDebugEvent with stream: 'log' and the appropriate level.
  - Dispatch similarly via tracing and the Redis publisher (when enabled).
- Configuration:
  - Env flags:
    - RUNNER_DEBUG_REDIS_URL
    - RUNNER_DEBUG_REDIS_STREAM_PREFIX
    - RUNNER_DEBUG_REDIS_MAXLEN
    - RUNNER_DEBUG_MAX_EVENT_BYTES (per-event cap)
  - Behavior when RUNNER_DEBUG_REDIS_URL is unset:
    - Continue emitting to tracing only (no debug stream fan-out).
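The following is a minimal sketch of a line-splitting sink. It is written in TypeScript for readability, although the real sinks live in the Rust runner. LineSink, CapturedLine, and the level mapping (stdout as info, stderr as warn) are assumptions. The point is that chunked WASI output must be buffered until a newline arrives, so that each event carries exactly one line.

```typescript
type OutputStream = "stdout" | "stderr";
type Level = "trace" | "debug" | "info" | "warn" | "error";

interface CapturedLine {
  ts: string;
  level: Level;
  stream: OutputStream;
  message: string;
  extensionId?: string;
  requestId?: string;
}

// Hypothetical sink: buffers partial chunks and emits one event per complete line.
class LineSink {
  private pending = "";

  constructor(
    private readonly stream: OutputStream,
    private readonly ctx: { extensionId?: string; requestId?: string },
    private readonly emit: (line: CapturedLine) => void,
  ) {}

  write(chunk: string): void {
    this.pending += chunk;
    let nl = this.pending.indexOf("\n");
    while (nl !== -1) {
      // Strip a trailing CR so CRLF output yields clean lines.
      const line = this.pending.slice(0, nl).replace(/\r$/, "");
      this.pending = this.pending.slice(nl + 1);
      this.emitLine(line);
      nl = this.pending.indexOf("\n");
    }
  }

  // Emit any trailing partial line, e.g. when the guest finishes executing.
  flush(): void {
    if (this.pending.length > 0) {
      this.emitLine(this.pending);
      this.pending = "";
    }
  }

  private emitLine(message: string): void {
    this.emit({
      ts: new Date().toISOString(),
      // Assumption: stderr lines are tagged "warn", stdout lines "info".
      level: this.stream === "stderr" ? "warn" : "info",
      stream: this.stream,
      message,
      ...this.ctx,
    });
  }
}
```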
3. Runner: Redis Debug Stream Publisher
Instead of an in-memory hub, the runner now serializes each ExtDebugEvent and appends it to a Redis Stream. Key points:
- Stream naming:
${RUNNER_DEBUG_REDIS_STREAM_PREFIX}{tenantId}:{extensionId}(tenant falls back tounknownwhen unavailable). - Command:
XADD <stream> MAXLEN ~ <maxLen> field value ...with a small bounded payload. - Each message includes the fields consumed by EE (
ts,level,stream,tenant,extension,install,request,version,content_hash,message,truncated). - If Redis is down, we log and drop events (mirroring logs via
tracingso operators can still inspect pod logs). - Future back-pressure: consider local ring buffer to avoid blocking extension execution if Redis is temporarily unavailable.
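A small TypeScript sketch can illustrate the key layout and the XADD argument list above. The function names are hypothetical, and the runner's actual Rust publisher may differ. Two details are assumptions: absent IDs are written as empty strings, and extensionId falls back to unknown. The resulting array is the argument list a Redis client would send for XADD.

```typescript
interface ExtDebugEvent {
  ts: string;
  level: "trace" | "debug" | "info" | "warn" | "error";
  stream: "stdout" | "stderr" | "log";
  tenantId?: string;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  versionId?: string;
  contentHash?: string;
  message: string;
  truncated?: boolean;
}

// Stream key: ${prefix}{tenantId}:{extensionId}, tenant falling back to "unknown".
function debugStreamKey(prefix: string, ev: ExtDebugEvent): string {
  // Assumption: extensionId also falls back to "unknown" if missing.
  return `${prefix}${ev.tenantId ?? "unknown"}:${ev.extensionId ?? "unknown"}`;
}

// Builds the arguments for: XADD <key> MAXLEN ~ <maxLen> * field value ...
function xaddArgs(prefix: string, maxLen: number, ev: ExtDebugEvent): string[] {
  // Assumption: absent optional IDs are written as empty strings.
  const fields: Array<[string, string]> = [
    ["ts", ev.ts],
    ["level", ev.level],
    ["stream", ev.stream],
    ["tenant", ev.tenantId ?? ""],
    ["extension", ev.extensionId ?? ""],
    ["install", ev.installId ?? ""],
    ["request", ev.requestId ?? ""],
    ["version", ev.versionId ?? ""],
    ["content_hash", ev.contentHash ?? ""],
    ["message", ev.message],
    ["truncated", ev.truncated ? "true" : "false"],
  ];
  return [debugStreamKey(prefix, ev), "MAXLEN", "~", String(maxLen), "*", ...fields.flat()];
}
```

The approximate trim (~) lets Redis trim lazily at macro-node boundaries, which is cheaper than an exact MAXLEN on every append.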
Security note:
- Redis credentials are provided via RUNNER_DEBUG_REDIS_URL (or a mounted secret). ACLs should scope the runner to XADD only for the debug keyspace.
4. EE Backend: WebSocket/SSE Proxy
Add an EE API endpoint that exposes a controlled live debug stream to authenticated users.
Suggested route:
- ee/server/src/app/api/ext-debug/stream/route.ts (Next.js App Router)
- URL example: /api/ext-debug/stream?extensionId=...&tenantId=...&installId=...&requestId=...
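The sketch below shows, in TypeScript, one way the URL's query parameters could map onto the subscription filter and the authentication/authorization rules listed under Behavior. Several names are hypothetical: DebugCaller, authorizeDebugStream, and the hasDebugCapability callback (which stands in for a cap:debug.logs check or a server-side allowlist). Real session and role lookups will look different. Pinning non-internal callers to their own tenantId is an added assumption that makes the cross-tenant rule hard to bypass.

```typescript
// Subscription filter: extensionId is mandatory, the rest narrow the stream.
interface DebugSubscriptionFilter {
  extensionId: string;
  tenantId?: string;
  installId?: string;
  requestId?: string;
}

// Hypothetical view of the authenticated caller; real session objects will differ.
interface DebugCaller {
  tenantId: string;
  isInternalOperator: boolean;
  isTenantAdmin: boolean;
  ownedExtensionIds: ReadonlySet<string>; // extension owner / partner developer
  accessibleExtensionIds: ReadonlySet<string>;
}

type DebugDecision =
  | { ok: true; filter: DebugSubscriptionFilter }
  | { ok: false; status: 400 | 403; reason: string };

function authorizeDebugStream(
  url: URL,
  caller: DebugCaller,
  hasDebugCapability: (extensionId: string) => boolean, // cap:debug.logs or allowlist
): DebugDecision {
  const extensionId = url.searchParams.get("extensionId");
  if (!extensionId) return { ok: false, status: 400, reason: "extensionId is required" };

  const filter: DebugSubscriptionFilter = { extensionId };
  for (const key of ["tenantId", "installId", "requestId"] as const) {
    const value = url.searchParams.get(key);
    if (value) filter[key] = value;
  }

  // Internal operators may observe any tenant and bypass capability gating.
  if (caller.isInternalOperator) return { ok: true, filter };

  if (filter.tenantId && filter.tenantId !== caller.tenantId) {
    return { ok: false, status: 403, reason: "cross-tenant access denied" };
  }
  if (!caller.accessibleExtensionIds.has(extensionId)) {
    return { ok: false, status: 403, reason: "extension outside caller scope" };
  }
  if (!caller.isTenantAdmin && !caller.ownedExtensionIds.has(extensionId)) {
    return { ok: false, status: 403, reason: "caller lacks a debug-eligible role" };
  }
  if (!hasDebugCapability(extensionId)) {
    return { ok: false, status: 403, reason: "debug streaming not enabled for extension" };
  }
  // Assumption: pin non-internal callers to their own tenant.
  return { ok: true, filter: { ...filter, tenantId: caller.tenantId } };
}
```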
Behavior:
- Authentication:
  - Require standard session auth.
  - Confirm the user has one of:
    - Internal operator role, or
    - Tenant admin for tenantId, or
    - Extension owner / partner developer tied to the specified extension/install.
  - Deny if the user attempts to observe another tenant’s data.
- Authorization:
  - Check:
    - The requested extensionId belongs to the caller’s accessible scope.
    - If tenantId is provided, it matches the caller’s tenant context (unless internal).
    - Optional: the extension manifest/capabilities include something like cap:debug.logs, or a server-side allowlist permits debug streaming.
- Subscription handshake:
  - On connection:
    - Build a subscription filter object:
      - Always include extensionId.
      - Include tenantId/installId if supplied.
      - Include requestId if provided, for per-flow focus.
    - Call a runner internal API or RPC:
      - e.g., POST /internal/runner/debug/subscribe with the filter and a signed token;
      - the runner returns a debug_session_id.
    - Start a streaming loop that:
      - Pulls ExtDebugEvent records from the runner (via a streaming HTTP endpoint, a long-lived connection, or a broker / message bus, depending on infra),
      - Forwards events to the client via WebSockets or SSE.
- Transport details:
  - Recommended for simplicity:
    - SSE for the first implementation:
      - One-way stream, simple to proxy.
      - Events framed as data: { ...ExtDebugEvent... }\n\n.
    - WebSockets if bidirectional control is desired later:
      - e.g., changing filters, pausing, etc.
- Limits and lifecycle:
  - Enforce:
    - Max session duration (e.g. 5–15 minutes; extendable).
  - Close the stream when:
    - TTL exceeded,
    - User navigates away,
    - Runner cancels subscription.
  - Provide:
    - x-debug-truncated: true or event-level truncated when server-side limits are hit.
    - Clear documentation in the UI when data may be incomplete.
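The following TypeScript sketch covers the SSE side using only web-standard APIs (ReadableStream, TextEncoder). sseFrame and debugEventStream are hypothetical names. The event source is assumed to be any async iterable of debug events. The stream closes when the TTL expires, when the upstream iterator ends, or when the client cancels.

```typescript
// Frame one SSE message. JSON.stringify without indentation never emits raw
// newlines, so a single "data:" line is always valid.
function sseFrame(payload: unknown, eventName?: string): string {
  const head = eventName ? `event: ${eventName}\n` : "";
  return `${head}data: ${JSON.stringify(payload)}\n\n`;
}

// Wrap an async source of debug events in a byte stream for a text/event-stream
// response, closing it once the maximum session duration has elapsed.
function debugEventStream(
  events: AsyncIterable<unknown>,
  maxSessionMs: number,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const deadline = Date.now() + maxSessionMs;
  const iterator = events[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      // The deadline is checked between events; a production version would
      // also race a timer so that an idle stream still closes on time.
      if (Date.now() >= deadline) {
        controller.enqueue(encoder.encode(sseFrame({ reason: "ttl_exceeded" }, "end")));
        controller.close();
        await iterator.return?.();
        return;
      }
      const next = await iterator.next();
      if (next.done) {
        controller.close(); // upstream ended the subscription
        return;
      }
      controller.enqueue(encoder.encode(sseFrame(next.value)));
    },
    async cancel() {
      // Client disconnected (e.g. navigated away): stop pulling upstream.
      await iterator.return?.();
    },
  });
}
```

In a Next.js App Router handler, the stream could be returned as new Response(debugEventStream(source, 10 * 60_000), { headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" } }).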
5. EE UI: Extension Debug Console
Add a dedicated page that consumes the stream:
Suggested route:
- /msp/extensions/[extensionId]/debug
- For internal operators:
  - Additional entry: /ee/extensions/[extensionId]/debug
Features:
- Filters:
  - Extension (from URL).
  - Tenant/install (dropdown or inferred from context).
  - Request mode:
    - “All requests” for that extension/install.
    - “Specific request” by requestId.
- Stream viewer:
  - Connect/disconnect button.
  - Live log panel:
    - Color-coded:
      - stdout (neutral),
      - stderr (red),
      - structured logs (level-specific colors).
    - Shows timestamp and key metadata (tenant, install, req id).
  - Controls:
    - Pause/resume auto-scroll.
    - Toggle stdout/stderr/log.
    - Clear buffer.
- DX helpers:
  - Show “How to correlate” help:
    - e.g., “Use request_id from extension errors or logs to narrow to a single flow.”
  - For dev:
    - Example snippet for extension authors:
      - logInfo("debug marker: X") usage,
      - an explanation of how it appears in the console.
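The filter, color-coding, and bounded-buffer behavior can be modeled with a few pure TypeScript functions. They are shown here as a hypothetical sketch rather than the page's actual code. The palette names and the buffer size are assumptions.

```typescript
type StreamKind = "stdout" | "stderr" | "log";
type Level = "trace" | "debug" | "info" | "warn" | "error";

interface ConsoleEvent {
  ts: string;
  level: Level;
  stream: StreamKind;
  message: string;
  tenantId?: string;
  installId?: string;
  requestId?: string;
}

interface ConsoleFilter {
  visibleStreams: ReadonlySet<StreamKind>; // stdout/stderr/log toggles
  requestId?: string; // set in "Specific request" mode
}

function isVisible(ev: ConsoleEvent, f: ConsoleFilter): boolean {
  if (!f.visibleStreams.has(ev.stream)) return false;
  return !f.requestId || ev.requestId === f.requestId;
}

// Hypothetical palette: stdout neutral, stderr red, logs colored by level.
function colorFor(ev: ConsoleEvent): string {
  if (ev.stream === "stdout") return "neutral";
  if (ev.stream === "stderr") return "red";
  const byLevel: Record<Level, string> = {
    trace: "gray", debug: "blue", info: "green", warn: "amber", error: "red",
  };
  return byLevel[ev.level];
}

// Append while keeping at most `max` (>= 1) entries, dropping the oldest first,
// so page memory stays bounded however long the stream runs.
function appendBounded<T>(buffer: readonly T[], item: T, max: number): T[] {
  const kept = buffer.length >= max ? buffer.slice(buffer.length - max + 1) : buffer.slice();
  kept.push(item);
  return kept;
}
```

On the page, these functions would sit behind an EventSource. Each message's data is parsed with JSON.parse, appended with appendBounded, and rendered only if isVisible returns true. Toggling a filter then re-filters the retained history without reconnecting.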
6. Capabilities, Flags, and Safety
To avoid accidental misuse:
- Capability gating:
  - Optionally require a capability at install/manifest level:
    - cap:debug.logs or similar; when it is absent, EE refuses debug sessions for that extension except for privileged internal users.
- Environment flags (runner + EE):
  - Runner-side: RUNNER_DEBUG_REDIS_URL, RUNNER_DEBUG_REDIS_STREAM_PREFIX
  - EE-side: EXT_DEBUG_UI_ENABLED
- Rate limiting:
  - EE API-level rate limits per user/tenant.
  - Runner-level caps on sessions and throughput.
- Data retention:
  - By design, this feature is for live debugging:
    - Buffers are short-lived.
    - Persistent historical logs remain in standard infra (e.g. Loki/ELK) under operator control.
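Per-user/tenant rate limiting at the EE API could use a token bucket, as sketched below in TypeScript. DebugSessionRateLimiter is hypothetical. It keeps state in process memory, so a multi-replica EE deployment would need shared state (e.g. in Redis). The capacity and refill numbers are placeholders.

```typescript
// Hypothetical token-bucket limiter for opening debug sessions, keyed per
// tenant + user. `now` is injectable so the behavior is easy to test.
class DebugSessionRateLimiter {
  private readonly buckets = new Map<string, { tokens: number; updatedMs: number }>();

  constructor(
    private readonly capacity: number, // burst size
    private readonly refillPerSec: number, // sustained sessions per second
    private readonly now: () => number = Date.now,
  ) {}

  tryAcquire(tenantId: string, userId: string): boolean {
    const key = `${tenantId}:${userId}`;
    const t = this.now();
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedMs: t };
    // Refill proportionally to elapsed time, never above capacity.
    const elapsedSec = Math.max(0, t - bucket.updatedMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.updatedMs = t;
    this.buckets.set(key, bucket);
    if (bucket.tokens < 1) return false;
    bucket.tokens -= 1;
    return true;
  }
}
```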
7. Implementation Phases
Phase 1 — Runner Event Capture
- Implement the ExtDebugEvent type and the Redis publisher in the runner.
  - Implemented in ee/runner/src/engine/debug.rs and debug_redis.rs.
- Route:
  - WIT log provider calls to the event producer.
    - Implemented in ee/runner/src/engine/host_api.rs to forward log_info/log_warn/log_error into Redis.
  - WASI stderr to the event producer (initial implementation).
    - Implemented in ee/runner/src/engine/loader.rs via a custom stderr pipe that forwards guest stderr lines into Redis when enabled.
  - (Optional stdout mirroring remains off by default to avoid noise; it can be added later if needed.)
- Add basic unit tests:
  - stdout/stderr captured and tagged with correct metadata.
Phase 2 — Internal Streaming API (Legacy)
- (Deprecated) The original SSE endpoint at /internal/ext-debug/stream has been removed now that Redis fan-out is the canonical path.
- Implement the EE backend /api/ext-debug/stream:
  - Implemented at server/src/app/api/ext-debug/stream/route.ts:
    - AuthN + AuthZ via existing helpers.
    - Forwards the extensionId/tenantId/installId/requestId filter to the runner using x-ext-debug-filter.
    - Relays the SSE stream response directly to clients.
- Add integration tests / local harness:
  - Fake extension emitting stdout/structured logs.
  - Confirm events appear via /api/ext-debug/stream.
Phase 3 — EE Debug Console UI
- Build the /msp/extensions/[extensionId]/debug page:
  - Implemented at server/src/app/msp/extensions/[extensionId]/debug/page.tsx.
  - Connects to /api/ext-debug/stream using EventSource.
  - Supports filters for tenantId, installId, and requestId.
  - Renders a live console with:
    - stdout/stderr/log classification,
    - connection state,
    - auto-scroll toggle,
    - bounded history to avoid unbounded memory growth.
- Add navigation entry points:
  - Link from the extensions settings UI at /msp/settings?tab=extensions:
    - For each extension row, add a "Debug Console" action targeting /msp/extensions/{extensionId}/debug.
    - Optionally preserve tenantId/installId in query params.
  - This hooks the existing settings-based extensions screen (the canonical management surface) directly into the debug page for the selected extension.
- Document the debugging workflow for extension authors:
  - Inline help on the debug page explains:
    - Required runner configuration (RUNNER_DEBUG_REDIS_URL, stream prefix, Redis ACL credentials).
    - Using structured logging helpers instead of printing secrets.
    - Using x-request-id/context.request_id and filters to follow specific request flows.
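The navigation link and the page's stream URL can share one query-building helper. The TypeScript sketch below is hypothetical: debugConsoleHref and debugStreamUrl are illustrative names. It only adds optional filters that are actually set, matching the note about optionally preserving tenantId/installId.

```typescript
// Optional filters carried in query params (a type alias, so it is assignable
// to a string-keyed record).
type DebugLinkParams = {
  tenantId?: string;
  installId?: string;
  requestId?: string;
};

function withQuery(path: string, params: Record<string, string | undefined>): string {
  const q = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    if (value) q.set(key, value); // skip unset or empty filters
  }
  const qs = q.toString();
  return qs ? `${path}?${qs}` : path;
}

// Hypothetical: target of the "Debug Console" action in the settings UI.
function debugConsoleHref(extensionId: string, params: DebugLinkParams = {}): string {
  return withQuery(`/msp/extensions/${encodeURIComponent(extensionId)}/debug`, params);
}

// Hypothetical: URL the debug page hands to EventSource.
function debugStreamUrl(extensionId: string, params: DebugLinkParams = {}): string {
  return withQuery("/api/ext-debug/stream", { extensionId, ...params });
}
```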
Phase 4 — Hardening & Production Policy
- Add capability and tenant-scoped policy checks.
- Add robust truncation, redaction (optional regex-based guardrails), and audit logs:
- Who opened debug sessions, for which extension/tenant, and when.
- Define environment policies:
- Fully enabled in dev/staging.
- In prod:
- Off by default.
- Can be enabled per tenant/extension with admin approval or for time-limited debugging.
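The optional regex-based guardrails could look like the following TypeScript sketch. The rule set is a hypothetical starting point, not a complete secret detector. The regexes catch common shapes such as Bearer tokens and password=... pairs. They complement, rather than replace, the rule that secrets never reach log output in the first place.

```typescript
// Hypothetical guardrail rules; real deployments would tune and extend these.
interface RedactionRule {
  name: string;
  pattern: RegExp; // must use the g flag so every occurrence is replaced
  replacement: string;
}

const REDACTION_RULES: readonly RedactionRule[] = [
  // "Bearer <token>" in echoed headers
  { name: "bearer-token", pattern: /\b(Bearer\s+)[A-Za-z0-9._~+\/-]+=*/gi, replacement: "$1[REDACTED]" },
  // key=value or key: value pairs whose key looks secret
  {
    name: "secret-kv",
    pattern: /\b((?:password|secret|api[_-]?key|token)\s*[=:]\s*)[^\s,;&]+/gi,
    replacement: "$1[REDACTED]",
  },
];

function redactMessage(
  message: string,
  rules: readonly RedactionRule[] = REDACTION_RULES,
): { message: string; redacted: boolean } {
  let out = message;
  for (const rule of rules) out = out.replace(rule.pattern, rule.replacement);
  return { message: out, redacted: out !== message };
}
```

The redacted flag could feed the audit log, recording that a guardrail fired without storing the original text.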
Phase 5 — Distributed Event Bus (Redis Streams)
Motivation: In production, Knative fans requests across runner pods. A Redis-backed fan-out ensures the debug console aggregates logs across all pods and avoids the Kourier routing issues we hit with runner.msp.svc.cluster.local.
- Provision a Redis cluster/namespace dedicated to short-lived “debug events” with strong authentication and TTL defaults (e.g., 15 min retention).
- Define stream partitioning: e.g., ext-debug:<extension_id> or sharded by tenant_id:extension_id. Document the key structure, retention policy, and serialization (JSON ExtDebugEvent).
- Extend the runner:
  - Add optional RUNNER_DEBUG_REDIS_URL, RUNNER_DEBUG_REDIS_STREAM_PREFIX, and RUNNER_DEBUG_REDIS_MAXLEN env vars (plus future TLS/password flags) so operators can point at the shared Redis cluster without changing code.
  - On each event, enqueue to Redis Streams (or Pub/Sub) with bounded async buffering; Redis replaces the in-memory DebugHub entirely.
  - Tag events with a monotonic sequence (the XADD entry ID) to preserve ordering.
  - Include metrics and back-pressure handling (drop oldest events, emit warnings) when Redis is unavailable.
- Reuse existing Redis stream plumbing where possible:
  - server/src/lib/eventBus/index.ts already manages XADD/XREADGROUP, consumer groups, trimming, and retry logic.
  - shared/workflow/streams/redisStreamClient.ts shows how to wrap publish/read/ack helpers.
  - Mirror those patterns for debug streams (a new DebugStreamClient) instead of reinventing connection management.
- Build a lightweight “debug-stream fan-out” worker (it can live inside the EE server or as a sidecar) that tails Redis Streams via consumer groups, applies filters server-side, and relays events to subscribers.
- Security: reuse the existing x-runner-auth token for publishing auth, and create a dedicated Redis ACL role that only allows XADD/XLEN on the debug keys.
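The “bounded async buffering” with drop-oldest back-pressure can be sketched as a fixed-size ring buffer. It is shown in TypeScript for consistency with the rest of this plan; the runner would implement it in Rust. DropOldestBuffer is a hypothetical name, and the sketch leaves out the publisher loop that drains it into XADD calls.

```typescript
// Hypothetical fixed-capacity ring buffer that overwrites the oldest entry
// when full and counts how many events were dropped.
class DropOldestBuffer<T> {
  private readonly slots: (T | undefined)[];
  private head = 0; // index of the oldest entry
  private count = 0;
  private dropped = 0;

  constructor(private readonly capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.slots = new Array(capacity);
  }

  push(item: T): void {
    const tail = (this.head + this.count) % this.capacity;
    this.slots[tail] = item;
    if (this.count === this.capacity) {
      // Buffer was full: tail === head, so the oldest entry was overwritten.
      this.head = (this.head + 1) % this.capacity;
      this.dropped += 1;
    } else {
      this.count += 1;
    }
  }

  // Remove and return up to `max` entries, oldest first.
  drain(max: number): T[] {
    const n = Math.min(max, this.count);
    const out: T[] = [];
    for (let i = 0; i < n; i++) {
      const idx = (this.head + i) % this.capacity;
      out.push(this.slots[idx] as T);
      this.slots[idx] = undefined;
    }
    this.head = (this.head + n) % this.capacity;
    this.count -= n;
    return out;
  }

  // Read and reset the drop counter (e.g. to emit a metric or warning).
  takeDroppedCount(): number {
    const n = this.dropped;
    this.dropped = 0;
    return n;
  }

  get size(): number {
    return this.count;
  }
}
```

A publisher would push every event and periodically drain a batch for XADD. It would report takeDroppedCount() as a metric or warning whenever the count is non-zero, so extension execution never blocks on Redis.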
Phase 6 — EE Proxy Migration to Redis Streams
- Update /api/ext-debug/stream so that, when Redis-backed mode is enabled (EXT_DEBUG_STREAM_MODE=redis), it:
  - Validates the user/session as before.
  - Registers/updates a consumer group per extensionId (e.g., ee-debug-ui).
  - Reads entries with XREADGROUP and applies the filters (tenantId, installId, requestId) server-side before emitting SSE events.
  - Implements heartbeats and idempotent acking so abandoned sessions don’t stall the stream.
- Add multi-tenant scoping at the stream level by embedding tenant + install IDs in stream entries and filtering at the EE layer.
- Provide fallbacks: if Redis is unavailable, drop back to the legacy per-pod proxy with an explicit warning in the UI (“live stream limited to a single runner pod”).
- Update the debug console copy to explain that live events now aggregate across all runner replicas (when Redis mode is active).
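XREADGROUP cannot filter on field values, so the EE layer has to decode each entry and drop non-matching ones. The TypeScript sketch below assumes the field names written by the runner (tenant, install, request, and so on) and the usual XREAD/XREADGROUP reply shape. filterReadReply and the convention that an empty string means "absent" are hypothetical.

```typescript
type Level = "trace" | "debug" | "info" | "warn" | "error";

interface ExtDebugEvent {
  ts: string;
  level: Level;
  stream: "stdout" | "stderr" | "log";
  tenantId?: string;
  extensionId?: string;
  installId?: string;
  requestId?: string;
  versionId?: string;
  contentHash?: string;
  message: string;
  truncated?: boolean;
}

// Reply shape: [[streamKey, [[entryId, [field, value, ...]], ...]], ...] or null.
type StreamEntry = [id: string, fields: string[]];
type StreamReadReply = Array<[stream: string, entries: StreamEntry[]]> | null;

interface ServerSideFilter {
  tenantId?: string;
  installId?: string;
  requestId?: string;
}

function entryToEvent(fields: string[]): ExtDebugEvent {
  const rec: Record<string, string> = {};
  for (let i = 0; i + 1 < fields.length; i += 2) rec[fields[i]] = fields[i + 1];
  const opt = (k: string) => (rec[k] ? rec[k] : undefined); // empty string => absent
  return {
    ts: rec.ts ?? "",
    level: (rec.level ?? "info") as Level, // assumption: runner writes valid levels
    stream: (rec.stream ?? "log") as ExtDebugEvent["stream"],
    tenantId: opt("tenant"),
    extensionId: opt("extension"),
    installId: opt("install"),
    requestId: opt("request"),
    versionId: opt("version"),
    contentHash: opt("content_hash"),
    message: rec.message ?? "",
    truncated: rec.truncated === "true",
  };
}

function filterReadReply(
  reply: StreamReadReply,
  f: ServerSideFilter,
): Array<{ id: string; event: ExtDebugEvent }> {
  const out: Array<{ id: string; event: ExtDebugEvent }> = [];
  for (const [, entries] of reply ?? []) {
    for (const [id, fields] of entries) {
      const event = entryToEvent(fields);
      if (f.tenantId && event.tenantId !== f.tenantId) continue;
      if (f.installId && event.installId !== f.installId) continue;
      if (f.requestId && event.requestId !== f.requestId) continue;
      out.push({ id, event });
    }
  }
  return out;
}
```

Filtered-out entries should still be acknowledged with XACK. Otherwise they remain in the consumer group's pending entries list.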
Phase 7 — Remove Per-Pod Dependency & Operability
- Once the Redis path is proven in production, disable direct /internal/ext-debug/stream access from EE (keep it only for diagnostics).
- Simplify runner configuration: require either Redis streaming or a dedicated runner-private ClusterIP if Redis is disabled, so we do not rely on Kourier host matching.
- Add observability:
- Metrics for stream lag, consumer group backlog, dropped events.
- Alerting when Redis retention drops events because of sustained back-pressure.
- Document upgrade/rollback steps so operators can toggle between legacy and Redis-backed streaming without dropping all sessions.
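One simple approximation of the stream-lag metric relies on the format of Redis stream entry IDs, which is <milliseconds>-<sequence> when IDs are auto-generated with *. The approach compares the stream's last-generated-id (from XINFO STREAM) with a group's last-delivered-id (from XINFO GROUPS). The TypeScript helpers below are a hypothetical sketch. Recent Redis versions also expose an entry-count lag field in XINFO GROUPS, which could complement this time-based view.

```typescript
// Parse a Redis stream entry ID ("<ms>-<seq>") into its parts.
function parseStreamId(id: string): { ms: number; seq: number } {
  const match = /^(\d+)-(\d+)$/.exec(id);
  if (!match) throw new Error(`not a stream entry ID: ${id}`);
  return { ms: Number(match[1]), seq: Number(match[2]) };
}

// Approximate consumer lag in milliseconds: how far the group's last delivered
// entry trails the newest entry written to the stream. 0 means caught up.
function streamLagMs(lastGeneratedId: string, lastDeliveredId: string): number {
  const newest = parseStreamId(lastGeneratedId);
  const delivered = parseStreamId(lastDeliveredId);
  return Math.max(0, newest.ms - delivered.ms);
}
```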
Dependencies & Coordination
- Runner team:
  - Implement ExtDebugEvent, stdout/stderr capture, DebugHub, and the internal streaming API.
- EE server/gateway team:
  - Add /api/ext-debug/stream with proper auth.
  - Wire request_id propagation end-to-end.
- Platform/Infrastructure:
  - Operate the Redis cluster/streams (Phase 5+) with appropriate ACLs, backups, and monitoring.
  - Expose a stable internal DNS name (or service) for the runner if Redis is disabled, so EE does not depend on Kourier host headers for intra-cluster calls.
- DX/Docs:
  - Update ee/docs/extension-system/development_guide.md and related docs to include:
    - How to use the debug console.
    - Expected constraints and policies.
- Security/compliance:
  - Review the exposure model, logging content policies, and retention defaults.
Summary
This plan introduces a focused, auth-aware Extension Debug Console within EE that streams live debug events from the Wasmtime-based runner, scoped by extension/install/request. It is:
- Concrete enough to implement incrementally.
- Safe for multi-tenant environments when flags and capabilities are observed.
- Highly valuable for extension authors who need real-time visibility into their code without direct infrastructure access.