# Client Extension Multi-Tenancy Overhaul Plan
Last updated: 2025-08-09
Status update (2025-11-21):
- v2 extension system is live with out-of-process Runner + signed content-addressed bundles; legacy in-process/dynamic import path removed (see `extension-system-v2-migration.md`).
- UI delivery now uses Runner ext-ui host with iframe sandbox; gateway proxies all API calls to Runner `/v1/execute`.
- Remaining multi-tenant hardening tracks to the alignment plan (install_id propagation, RBAC, manifest enforcement).
## Context & Findings
- Current behavior: user-supplied extension code is uploaded into the running application environment and dynamically loaded. This violates multi-tenant isolation and increases operational risk (code execution in app context, shared process memory, filesystem access, and unrestricted egress).
- Repo state: Community Edition (CE) contains stubs; Enterprise Edition (EE) code is present under `ee/server`. The CE app dynamically imports EE initialization (`ee/server/src/lib/extensions/initialize`) when enterprise mode is enabled.
- Risk summary:
- Cross-tenant impact via shared process or host resources.
- In-process arbitrary code execution elevates the blast radius to the entire cluster.
- Unbounded capabilities: filesystem, network, and secrets likely not capability-scoped.
- Weak provenance: uploaded files lack signed, reproducible artifacts and verified dependency graphs.
## Goals
- Strong tenant isolation for compute, storage, cache, and network.
- No direct execution of tenant-supplied code in the application process.
- Capability-based, least-privilege runtime with explicit allowlists.
- Deterministic, reproducible, and signed extension artifacts.
- Auditable execution with traceability, quotas, and rate limits per tenant.
- Backwards-compatible migration path, with clear deprecation of unsafe paths.
## Overarching Phases
Phase 1 — Static Rendering via Rust Host (MinIO proxy)
- Scope: Serve prebuilt UI bundles (iframe apps) as immutable static assets via a Rust host that proxies reads from MinIO/S3, with strict path sanitization, tenant/contentHash validation, ETag/Cache-Control headers, and optional pod-local caching.
- Purpose: Quickly replace any dynamic module loading in the app with safe, static delivery. No guest code execution. Focus on asset integrity and isolation.
- Deliverables:
- Rust static asset service (MinIO/S3 proxy) with SPA fallback and CSP guidance for iframes
- URL model: /ext-ui/{extensionId}/{content_hash}/... mapped to object storage layout (sha256/<hash>/ui/...)
- Basic registry/install wiring to resolve content_hash per tenant (read-only for UI)
- Signing/hash verification for assets at fetch time (optional signature; hash required)
- Docs + Client SDK usage for iframe embedding
Phase 2 — Dynamic WASM Features
- Scope: Out-of-process Runner (Rust + Wasmtime), Host API v1 (capability-based), Next.js API gateway to Runner, event-driven execution, quotas/limits, and per-tenant auditability.
- Purpose: Safely execute extension logic outside the app process with strong isolation and provenance.
- Deliverables:
- Runner service with Wasmtime limits, host imports, and signature verification
- Registry + bundle signing/publishing, versioning, and warmup/prefetch
- API gateway for /api/ext/... to invoke handlers in Runner
- Event subscriptions, logs/metrics, idempotency, and quota enforcement
Mapping to detailed sections
- Phase 1 aligns with: "Client UI Delivery (iframe-only)", "Client Asset Serving via Gateway", and parts of "Bundle Storage Integration" focused on static ui assets and integrity.
- Phase 2 aligns with: "Runner Service Design", "HTTP Routing for Plugin Endpoints", "Next.js API Router/Proxy", "Runtime Decision: Wasmtime", and remaining bundle signing/execute paths.
## Non-Goals (for this overhaul)
- Supporting all languages. Start with JS/TS to WASM or isolate; consider additional languages later.
- Full “bring-your-own container” marketplace. We will support a controlled out-of-process path, but not arbitrary images at first.
## Upfront Decisions (Simplifications)
- EE-only: Extensions ship only with Enterprise Edition; no feature flag toggle needed in CE. Remove extension initialization paths in non-EE builds.
- Runtime: Standardize on Wasmtime-based wasm_runner only; no alternate runtimes.
- Storage: Use S3-compatible storage via our existing S3StorageProvider against local MinIO only. No alternative providers. Canonical bucket and prefix are defined via env.
- UI: Iframe-only Client SDK approach. React-based example and docs only for SDK; no descriptor renderer.
- Fetch/serve model: Object storage is source of truth. Pods fetch bundles/UI on-demand into a pod-local cache and serve directly via Next.js/Knative.
- Framework: Use Axum 0.7 + tower-http for the unified Rust application server. Static asset routes (/ext-ui/...) and execute routes (/v1/execute) live in the same binary. This keeps Phase 1 minimal and allows Wasmtime to be bolted in for Phase 2 without changing frameworks. See [ee/runner/src/http/server.rs](ee/runner/src/http/server.rs:1) and dependency updates in [ee/runner/Cargo.toml](ee/runner/Cargo.toml).
## Executive Summary
We are splitting the extension overhaul into two phases: Phase 1 focuses on safe, static UI delivery via a Rust host proxying MinIO/S3 (no dynamic module loading, no guest code execution), and Phase 2 delivers dynamic WASM execution with a Rust Runner (Wasmtime), a capability-based Host API, and a Next.js API gateway. This preserves security and isolation while enabling a clear migration path.
## Server Actions-First Contract
- Principle: Business logic lives in server actions under `server/src/lib/actions` (EE overlays may live under `ee/server/src/lib/actions`). HTTP API routes exist only as thin wrappers that call these actions to support external/infra consumers (Runner, automation).
- Actions (conceptual names) and wrappers:
- `extensions.publishVersion(bundle)` → verifies, computes `content_hash`, writes to `sha256/<hash>/bundle.tar.zst`, records `extension_bundle`. Wrapper: `POST /api/extensions/:id/versions`.
- `installs.createOrEnable(tenant, extension, version)` → persists install, computes `runner_domain`, sets `runner_status='pending'`, enqueues provisioning workflow. Wrapper: `POST /api/installs` or server-initiated only.
- `installs.lookupByHost(host)` → returns `{ tenant_id, extension_id, content_hash }`. Wrapper: `GET /api/installs/lookup-by-host` (used by Runner).
- `installs.validate(tenant, extension, hash)` → returns `{ valid: boolean }`. Wrapper: `GET /api/installs/validate` (used by Runner `ext-ui` gate).
- `installs.reprovision(installId)` → retries provisioning (Temporal). Wrapper: `POST /api/installs/:id/reprovision`.
- Testing guidance: unit/integration tests target server actions; API tests cover parameter parsing and delegation only.
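
The split between an action and its thin wrapper can be sketched as follows. This is a minimal illustration, not the actual implementation: the `InstallRow` shape, the in-memory `installRows` table, and the `handleValidateRequest` helper are hypothetical stand-ins for the registry tables and the Next.js route handler.

```typescript
// Hypothetical install record; the real shape lives in the registry tables.
interface InstallRow {
  tenantId: string;
  extensionId: string;
  contentHash: string;
  enabled: boolean;
}

// Stand-in for the database (assumption for illustration only).
const installRows: InstallRow[] = [
  { tenantId: "t1", extensionId: "e1", contentHash: "abc123", enabled: true },
];

// Server action: owns the business rule.
function validateInstall(tenant: string, extension: string, hash: string): { valid: boolean } {
  const valid = installRows.some(
    (row) =>
      row.enabled &&
      row.tenantId === tenant &&
      row.extensionId === extension &&
      row.contentHash === hash,
  );
  return { valid };
}

// Thin wrapper: parses parameters and delegates. No business logic lives here.
function handleValidateRequest(requestUrl: string): { status: number; body: unknown } {
  const params = new URL(requestUrl).searchParams;
  const tenant = params.get("tenant");
  const extension = params.get("extension");
  const hash = params.get("hash");
  if (!tenant || !extension || !hash) {
    return { status: 400, body: { error: "missing_parameter" } };
  }
  return { status: 200, body: validateInstall(tenant, extension, hash) };
}
```

With this shape, tests for `validateInstall` cover the rule itself, while wrapper tests only need to cover parameter parsing and delegation.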
## Proposed Document Map
Unified service approach
- We will deploy a single Rust application server that serves both static assets (/ext-ui/...) and the execute API (/v1/execute). CDN fronts /ext-ui with immutable caching by contentHash. Route-level isolation and config separation keep static and execute concerns safe within one binary.
- Phase 1 — Static Rendering via Rust Host (MinIO proxy)
- See: Phase 1 section below. Consolidates: "Client UI Delivery (iframe-only)", "Client Asset Serving via Gateway", and the UI-asset portions of "Distributed Bundles, Assets, and Caching".
- Phase 2 — Dynamic WASM Features
- See: Phase 2 section below. Consolidates: "Runner Service Design (Rust + Wasmtime)", "HTTP Routing for Plugin Endpoints", "Next.js API Router/Proxy", "Runtime Decision: Wasmtime", and WASM/precompiled portions of caching.
- Shared Foundations
- See: Data Model and Registry section. Consolidates: "Data Model (initial)" and "Public APIs (EE)".
## Phase 1 — Static Rendering via Rust Host (MinIO proxy)
Scope & Objectives
- Serve prebuilt iframe UI bundles as immutable static assets from MinIO/S3 via a Rust host. Validate tenant/contentHash; sanitize paths; set strong caching and security headers. No dynamic JS import into host app.
Architecture
- Implementation: Served by the unified Rust application server within a dedicated route group (/ext-ui/...)
- URL model: /ext-ui/{extensionId}/{contentHash}/[...path]
- Object storage layout: sha256/<hash>/ui/**/*, either already extracted from the bundle or extracted from the archive's ui/ subtree on first touch; integrity is anchored by contentHash
- Caching: CDN as primary (immutable by contentHash); pod-local cache optional/minimal for origin efficiency; SPA fallback to index.html
Security
- Tenant/contentHash validation with registry lookups
- Path sanitization, file size caps, immutable caching, ETag/If-None-Match
- CSP for iframes (summary; full guidance in Appendix A)
Deployment & Operations
- Env: EXT_BUNDLE_STORE_URL, STORAGE_S3_*, EXT_CACHE_*, EXT_STATIC_STRICT_VALIDATION; health checks; metrics; autoscaling profile
- CDN: front /ext-ui with long-lived immutable caching keyed by full path; origin shielding to reduce S3 reads
Test Plan
- Unit/integration for sanitization, 404/304/200 paths, cache eviction, large file handling; load tests for warm/cold cache; S3 failure modes
References to detailed content in this doc
- Client UI Delivery (iframe-only with SDK)
- Client Asset Serving via Gateway (pod-local cache)
- Distributed Bundles, Assets, and Caching (UI aspects)
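
The URL model and object storage layout described under Architecture can be related with a small mapping function. A minimal sketch, assuming the `contentHash` URL segment is a bare 64-character lowercase hex digest (the plan does not fix the encoding) and using a hypothetical `parseExtUiPath` helper; path sanitization and tenant validation are out of scope here.

```typescript
interface UiAssetRef {
  extensionId: string;
  contentHash: string;
  assetPath: string;
  objectKey: string;
}

// Parses /ext-ui/{extensionId}/{contentHash}/[...path].
// Returns null when the URL does not match the expected model.
function parseExtUiPath(pathname: string): UiAssetRef | null {
  const match = /^\/ext-ui\/([^/]+)\/([0-9a-f]{64})(?:\/(.*))?$/.exec(pathname);
  if (!match) return null;
  const extensionId = match[1];
  const contentHash = match[2];
  // An empty trailing path resolves to the SPA entry point.
  const assetPath = match[3] ? match[3] : "index.html";
  return {
    extensionId,
    contentHash,
    assetPath,
    // Object storage layout from the plan: sha256/<hash>/ui/...
    objectKey: `sha256/${contentHash}/ui/${assetPath}`,
  };
}
```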
### Phase 1 — TODOs (Status)
1.a Client Asset Fetch-and-Serve (Pod-Local Cache)
- [x] Route: `server/src/app/ext-ui/[extensionId]/[contentHash]/[...path]/route.ts` (GET).
- [x] Cache manager: `server/src/lib/extensions/assets/cache.ts` (ensure and basic index write).
- [x] Static serve: `server/src/lib/extensions/assets/serve.ts` (SPA fallback; sanitize; caching headers).
- [x] Mime map: `server/src/lib/extensions/assets/mime.ts`.
- Details
- [x] Tar/zip extraction for `ui/**/*`.
- [x] LRU index file structure recorded; [x] eviction policy and GC.
- [x] ETag generation and conditional GET support.
- [x] Locking/concurrency control for first-touch extraction.
- [x] Enforce tenant/contentHash match (404 on mismatch) in route handler.
- [ ] CSP guidance for iframe pages.
1.b Client SDK (Iframe)
- [x] Packages created: `ee/server/packages/extension-iframe-sdk/`, `ee/server/packages/ui-kit/`.
- SDK files
- [x] `src/index.ts`, [x] `src/bridge.ts`, [x] `src/auth.ts`, [x] `src/navigation.ts`, [x] `src/theme.ts`, [x] `src/types.ts`, [x] React hooks (`src/hooks.ts`), [x] README with React example and security guidance.
- UI Kit
- [x] `src/index.ts`, [x] theme tokens CSS and theming entry, [x] MVP components, [x] hooks, [x] README (tokens + usage updated).
- Example app
- [x] Vite + TS example (under `ee/server/packages/extension-iframe-sdk/examples/vite-react/`) with README and static build output.
- Host bridge bootstrap
- [x] `ee/server/src/lib/extensions/ui/iframeBridge.ts` to inject theme tokens and session.
- Protocol & security
- [x] Origin validation and sandbox attributes; author docs.
- [x] Message types include `version`.
- Ergonomics
- [x] React hooks: `useBridge`, `useTheme`, `useAuthToken`, `useResize`.
1.c Bundle Storage Integration (UI integrity)
- Details
- [x] Hash verification on fetch and before use.
- Archive integrity: archive sha256 is verified against the URL content-address (sha256/<hex>/bundle.tar.zst) during download. On mismatch, the request returns 502 (code: archive_hash_mismatch) and nothing is cached.
- Per-file integrity: on every GET, a strong ETag is computed from the served file bytes using SHA-256 and returned as a quoted value: "sha256-<hex>". If the client supplies If-None-Match with this exact value, the server returns 304.
- Operational note: URLs include the contentHash making CDN caching safe and immutable; origin fails closed on integrity mismatches and never serves partially extracted assets.
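
The two integrity checks above can be sketched in a few lines. This is an illustrative sketch using Node's `crypto` module; the helper names (`strongEtag`, `conditionalGet`, `verifyArchiveHash`) are hypothetical and do not refer to functions in the codebase.

```typescript
import { createHash } from "node:crypto";

function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Strong ETag in the quoted form described above: "sha256-<hex>".
function strongEtag(bytes: Uint8Array): string {
  return `"sha256-${sha256Hex(bytes)}"`;
}

// Conditional GET: 304 only when If-None-Match equals the ETag exactly.
function conditionalGet(
  bytes: Uint8Array,
  ifNoneMatch: string | undefined,
): { status: 200 | 304; etag: string; body?: Uint8Array } {
  const etag = strongEtag(bytes);
  if (ifNoneMatch === etag) return { status: 304, etag };
  return { status: 200, etag, body: bytes };
}

// Archive check: the digest must equal the <hex> in sha256/<hex>/bundle.tar.zst.
// On mismatch the caller returns 502 and caches nothing.
function verifyArchiveHash(
  archive: Uint8Array,
  expectedHex: string,
): { ok: true } | { ok: false; status: 502; code: "archive_hash_mismatch" } {
  if (sha256Hex(archive) === expectedHex.toLowerCase()) return { ok: true };
  return { ok: false, status: 502, code: "archive_hash_mismatch" };
}
```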
1.d Unified Rust Static Asset Host (MinIO/S3 proxy)
- Routing
- [ ] Add GET route group in [ee/runner/src/http/server.rs](ee/runner/src/http/server.rs:1): `/ext-ui/{extensionId}/{contentHash}/*path`
- [ ] Implement SPA fallback: serve `index.html` when file missing or path is a directory; honor `?path=/...` for client router hydration
- [ ] Strict path sanitization: reject `..`, absolute paths, and illegal chars; normalize and ensure access remains within cache root
- Framework and dependencies
- [ ] Framework: continue with Axum 0.7; add tower-http layers/services to simplify static hosting
- [ ] Use `tower_http::services::ServeDir` for on-disk cache under `${EXT_CACHE_ROOT}/{hash}/ui/`; wrap with a custom handler for tenant/contentHash validation and SPA fallback
- [ ] Add `mime_guess` for content-type mapping
- [ ] Keep `reqwest` S3-compatible HTTP via `BUNDLE_STORE_BASE`; optionally switch to `aws-sdk-s3` if Range/HEAD origin features are required
- [ ] Update [ee/runner/Cargo.toml](ee/runner/Cargo.toml:1) with:
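- `tower-http = "0.5"` features ["fs","compression-gzip","set-header","trace"] (tower-http names its compression features per algorithm; `compression-full` enables all of them)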
- `mime_guess = "2"`
- `tar = "0.4"` and `zstd = "0.13"` (or `async-compression` with zstd feature)
- optional `aws-sdk-s3 = { version = "1", features = ["rustls"] }`
- Registry/contentHash validation
- [ ] Add lightweight registry validation client (HTTP or DB per deployment) to confirm tenant install → version → `content_hash` before serving
- [ ] On mismatch or missing install/version, return 404 and never serve from cache
- [ ] Short TTL (30-60s) cache for registry lookups keyed by `{tenant_id, extension_id, content_hash}`
- Object storage integration
- [ ] Extend [ee/runner/src/engine/loader.rs](ee/runner/src/engine/loader.rs) with `fetch_object_range()` and `fetch_to_file()` helpers for large reads
- [ ] Fetch bundle archive and extract only `ui/**/*` into cache on first touch
- [ ] Enforce layout `sha256/<hash>/ui/**/*` and verify `sha256` during extract (per-file or archive-level validation)
- Pod-local cache
- [ ] Introduce [ee/runner/src/cache/fs.rs](ee/runner/src/cache/fs.rs) with helpers to:
- compute cache paths under `${EXT_CACHE_ROOT}/<hash>/ui/...`
- write files atomically (temp + rename)
- set read-only permissions after write
- [-] Implement capacity-based LRU eviction (bytes and/or file-count) reusing [ee/runner/src/cache/lru.rs](ee/runner/src/cache/lru.rs) -- DELAY
- [-] Background GC task and on-demand eviction on put; record cache index with last-access timestamps -- DELAY
- Headers and correctness
- [ ] Content-Type mapping by extension (fallback `application/octet-stream`)
- [ ] `Cache-Control: public, max-age=31536000, immutable` (URLs are content-hash addressed)
- [ ] ETag generation from file content; support `If-None-Match` → 304
- [ ] Optional range requests: `Accept-Ranges`, 206 `Content-Range` for large assets -- DELAY
- [ ] File size caps and response size caps; return 413/416 as appropriate
- Security
- [ ] Enforce tenant/contentHash validation before any serve; never trust URL alone
- [ ] Disallow directory traversal and hidden files; consider allowlist of extensions (html, js, css, json, map, svg, png, jpg, webp, woff, woff2)
- [ ] CSP guidance for iframe pages; document default CSP and sandbox attributes
- Configuration and ops
- [ ] Env: `BUNDLE_STORE_BASE`, `STORAGE_S3_*`, `EXT_CACHE_ROOT`, `EXT_CACHE_MAX_BYTES`, `EXT_STATIC_STRICT_VALIDATION`, `EXT_STATIC_MAX_FILE_BYTES`
- [ ] Enhance `/healthz` in [ee/runner/src/http/server.rs](ee/runner/src/http/server.rs:1) to check cache dir writable and object store reachable (HEAD on bucket/prefix)
- [ ] `/warmup` supports prefetch of `{contentHash}` UI subtree into cache
- [ ] Structured tracing fields on serve: `request_id`, `tenant`, `extension`, `content_hash`, `file_path`, `status`, `duration_ms`, `cache_status` (hit/miss)
- Tests
- [ ] Unit: path sanitizer; content-type mapper; ETag calc; cache LRU; extract-only-UI correctness
- [ ] Integration: cold fetch → extract → 200; repeat with `If-None-Match` → 304; tenant/contentHash mismatch → 404; large file → 413; traversal attempts → 400/404
- Docs
- [ ] Update Client SDK README to reference iframe `src="/ext-ui/{extensionId}/{content_hash}/index.html?path=/..."` and CSP/sandbox guidance
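
The path rules in the Routing and Security items above (no `..`, no absolute paths, no hidden files, extension allowlist) could look roughly like the sketch below. It is written in TypeScript for illustration even though the target host is Rust; `sanitizeAssetPath` is a hypothetical name, and rejecting backslashes and NUL bytes is an added assumption about what counts as "illegal chars".

```typescript
// Extension allowlist taken from the Security item above.
const ALLOWED_EXTENSIONS = new Set([
  "html", "js", "css", "json", "map", "svg", "png", "jpg", "webp", "woff", "woff2",
]);

// Returns a normalized relative path, or null when the request must be rejected.
function sanitizeAssetPath(raw: string): string | null {
  let decoded: string;
  try {
    decoded = decodeURIComponent(raw); // catch encoded traversal such as %2e%2e
  } catch {
    return null; // malformed percent-encoding
  }
  if (decoded.includes("\0") || decoded.includes("\\")) return null;
  if (decoded.startsWith("/")) return null; // absolute path

  // Collapse empty and "." segments.
  const segments = decoded.split("/").filter((s) => s !== "" && s !== ".");
  if (segments.length === 0) return "index.html";

  for (const segment of segments) {
    // Covers ".." as well as hidden files and directories.
    if (segment.startsWith(".")) return null;
  }

  const last = segments[segments.length - 1];
  const dot = last.lastIndexOf(".");
  const ext = dot >= 0 ? last.slice(dot + 1).toLowerCase() : "";
  if (!ALLOWED_EXTENSIONS.has(ext)) return null;

  return segments.join("/");
}
```

Because the result never contains `..` or a leading slash, joining it onto the cache root cannot escape that root. Extensionless client routes return null here; the SPA fallback would handle them separately.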
1.e Bundle Format Alignment (zstd)
- Rationale
- Uploader/finalizer and authoring tooling standardize on `bundle.tar.zst` (zstd-compressed tar).
- Runner must align on the same artifact name and compression to avoid format mismatches.
- Tasks
- [x] Runner: change bundle URL to `sha256/<hex>/bundle.tar.zst` in `ee/runner/src/engine/loader.rs::bundle_url()` and any hard-coded paths.
- [x] Runner: replace gzip decoding with zstd decoding in `ee/runner/src/http/ext_ui.rs` (use `zstd::stream::read::Decoder` or `async-compression` zstd reader) for UI extraction.
- [x] Runner: update temporary file naming in `verify_archive_sha256()` to `.tar.zst` for clarity (no functional change required).
- [x] Tests: update `ee/runner/tests/ext_ui_integration.rs` to generate `.tar.zst` bundles and serve `/sha256/:hex/bundle.tar.zst` in the in-memory server.
- [x] Cargo: add `zstd = "^0.13"` (or enable zstd in `async-compression`) and remove the `flate2` dependency if no longer needed.
- [x] Docs: ensure all references in this plan and related docs use `bundle.tar.zst` consistently.
1.f Per-Extension App Domains (Knative)
- Rationale
- Assign a dedicated app domain for each tenant's extension install so Knative can autoscale the Runner on host hits and we have clean, predictable URLs.
- Keep a single Runner KService; provision a DomainMapping per extension install that targets that KService.
- Data model
- [x] Add columns to `tenant_extension_install`:
- `runner_domain` (text, unique, indexed)
- `runner_status` (jsonb; { state: 'pending'|'provisioning'|'ready'|'error', message?, last_updated? })
- `runner_ref` (jsonb; optional: KService/DomainMapping identifiers for troubleshooting)
- [x] Config: `EXT_DOMAIN_ROOT` (e.g., `ext.example.com`) and domain pattern `<t8>--<e8>.<EXT_DOMAIN_ROOT>` where:
- `t8` = first 8 hex chars if `tenantId` is UUID-like, else first 12 slug chars
- `e8` = first 8 hex chars if `extensionId` is UUID-like, else first 12 slug chars
- Rationale: ensures DomainMapping `metadata.name` stays within 63-char limit.
- Provisioning (Option B: Temporal worker)
- [x] Create provisioning workflow in Temporal (ee/temporal-workflows/src/worker.ts task queue):
- Activity: `computeDomain(tenantId, extensionId, EXT_DOMAIN_ROOT)` returns domain string.
- Activity: `ensureDomainMapping({ domain, kservice, namespace })` uses Kubernetes API to create DomainMapping:
- `apiVersion: serving.knative.dev/v1beta1`, `kind: DomainMapping`, `metadata.name: <domain>`
- `spec.ref: { apiVersion: 'serving.knative.dev/v1', kind: 'Service', name: <runner-kservice> }`
- Update DB status: set `runner_status.state` to `ready` or `error` with message.
- [x] Trigger workflow on install.
- [ ] Trigger workflow on enable.
- [x] Expose a “reprovision domain” action to retry.
- [ ] RBAC/secret: ServiceAccount with permission to manage DomainMappings in the Runner namespace.
- Server (Next.js)
- [x] Server actions-first:
- `installs.createOrEnable(...)` computes `runner_domain`, persists `runner_status='pending'`, enqueues Temporal provisioning.
- `installs.lookupByHost(host)` → `{ tenant_id, extension_id, content_hash }` (resolves latest bundle by domain).
- `installs.validate(tenant, extension, hash)` → `{ valid: boolean }` (strict ext-ui gating).
- [x] Expose thin API wrappers that delegate to actions:
- `GET /api/installs/lookup-by-host?host=...`
- `GET /api/installs/validate?tenant=...&extension=...&hash=...`
- `POST /api/installs/:id/reprovision` (calls `installs.reprovision`).
- Runner changes
- [x] GET `/` host entry: read Host header, call `REGISTRY_BASE_URL/api/installs/lookup-by-host?host=...` (with short TTL cache), 302 → `/ext-ui/{extensionId}/{content_hash}/index.html`.
- [x] Keep ext-ui strict validation as-is (host lookup is just a dispatcher).
- UI updates
- [x] Extensions list/details: display `runner_domain`, status (pending/provisioning/ready/error), copy/open links.
- [x] Add action to reprovision if status=error.
- Ops
- [ ] Wildcard DNS `*.${EXT_DOMAIN_ROOT}` → Knative ingress (or automate DNS records per domain).
- [x] KService env/secrets documented: `BUNDLE_STORE_BASE`, `REGISTRY_BASE_URL`, `EXT_CACHE_MAX_BYTES`, `EXT_STATIC_STRICT_VALIDATION`, `EXT_EGRESS_ALLOWLIST`, S3 creds. See `ee/docs/extension-system/knative-app-domains.md`.
- Failure modes & handling
- [ ] On provisioning failure: persist error in `runner_status`, surface in UI, provide retry.
- [x] On lookup miss: Runner returns 404.
- [ ] Audit install-to-domain mapping (log/metrics on lookup miss).
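
The domain pattern and DomainMapping shape from this section can be sketched as below. This is a sketch under stated assumptions: the slug rule for non-UUID IDs (lowercase, non-alphanumerics collapsed to `-`) is not specified in the plan, and `shortLabel` and `domainMappingManifest` are hypothetical helper names.

```typescript
const UUID_LIKE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// First 8 hex chars for UUID-like IDs, else first 12 slug chars.
function shortLabel(id: string): string {
  if (UUID_LIKE.test(id)) return id.slice(0, 8).toLowerCase();
  // Slug rule is an assumption for illustration.
  return id
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, 12)
    .replace(/-+$/, "");
}

// Pattern from the plan: <t8>--<e8>.<EXT_DOMAIN_ROOT>
function computeDomain(tenantId: string, extensionId: string, domainRoot: string): string {
  return `${shortLabel(tenantId)}--${shortLabel(extensionId)}.${domainRoot}`;
}

// DomainMapping object targeting the single Runner KService.
function domainMappingManifest(domain: string, kservice: string, namespace: string) {
  return {
    apiVersion: "serving.knative.dev/v1beta1",
    kind: "DomainMapping",
    metadata: { name: domain, namespace },
    spec: {
      ref: { apiVersion: "serving.knative.dev/v1", kind: "Service", name: kservice },
    },
  };
}
```

For example, a UUID tenant and an extension slug of `my-extension-name` under `ext.example.com` would yield `123e4567--my-extension.ext.example.com` when the tenant UUID starts with `123e4567`.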
### Install Provisioning — State Diagram
```mermaid
stateDiagram-v2
[*] --> Pending: Install created/enabled
Pending --> Provisioning: Enqueue Temporal workflow\nensureDomainMapping
Provisioning --> Ready: DomainMapping applied\nupdate runner_status=ready
Provisioning --> Error: Provisioning failure\nupdate runner_status=error
Error --> Provisioning: Reprovision action\nretry workflow
Ready --> Ready: New version published\ncontent_hash updates via lookup
Ready --> Provisioning: Reprovision action
note right of Ready: Host traffic → Runner\nGET / → lookup-by-host → 302 /ext-ui/.../index.html
```
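
The diagram's transitions can also be expressed as a lookup table, which makes invalid transitions easy to reject before writing `runner_status`. A minimal sketch; the event names (`enqueue`, `applied`, `failed`, `reprovision`, `version_published`) are hypothetical labels for the diagram's edges.

```typescript
type RunnerState = "pending" | "provisioning" | "ready" | "error";
type ProvisionEvent = "enqueue" | "applied" | "failed" | "reprovision" | "version_published";

// One entry per edge in the state diagram.
const TRANSITIONS: Record<RunnerState, Partial<Record<ProvisionEvent, RunnerState>>> = {
  pending: { enqueue: "provisioning" },
  provisioning: { applied: "ready", failed: "error" },
  ready: { version_published: "ready", reprovision: "provisioning" },
  error: { reprovision: "provisioning" },
};

// Returns the next state, or throws when the diagram has no such edge.
function nextRunnerState(current: RunnerState, event: ProvisionEvent): RunnerState {
  const next = TRANSITIONS[current][event];
  if (next === undefined) {
    throw new Error(`invalid transition: ${current} + ${event}`);
  }
  return next;
}
```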
## Phase 2 — Dynamic WASM Features
Implementation note
- Phase 2 routes (/v1/execute) are served by the same unified Rust application server. The Wasmtime engine, egress allowlists, and secrets are only wired into the execute route group; static routes remain read-only and do not mount runner secrets.
Scope & Objectives
- Out-of-process execution with Rust Runner (Wasmtime), capability-based Host API, Next.js API gateway, events, quotas, provenance (signed bundles).
Architecture
- Runner Service Design (Rust + Wasmtime)
- HTTP Routing for Plugin Endpoints and API gateway
- Runtime Decision: Wasmtime (WASM-only)
- Distributed Bundles and Caching (WASM/precompiled aspects)
Security & Isolation
- Resource limits, egress allowlists, secrets brokering, audit logs, idempotency
Deployment & Operations
- Knative Serving profile, autoscaling, warmup/precompile
Test Plan
- Execute API behavior, policy enforcement, quotas, error codes, telemetry
References to detailed content in this doc
- Runner Service Design (Rust + Wasmtime)
- HTTP Routing for Plugin Endpoints
- Next.js API Router/Proxy (design)
### Phase 2 — TODOs (Status)
2.a Database Schema and Registry Services
- [x] Migrations (EE): create base tables
- [x] `extension_registry`
- [x] `extension_version`
- [x] `extension_bundle` (includes `precompiled` map)
- [x] `tenant_extension_install`
- [x] `extension_event_subscription`
- [x] `extension_execution_log`
- [x] `extension_quota_usage`
- [ ] RLS plan and enforcement for tenant-scoped tables
- [x] Registry service scaffold (`ee/server/src/lib/extensions/registry-v2.ts`).
- [x] Tenant install service scaffold (`ee/server/src/lib/extensions/install-v2.ts`).
- [x] Signature verification util (stub) in `server/src/lib/extensions/signing.ts`.
- [ ] Admin CLI for publish/deprecate/install flows.
- Details
- [x] PK/FK relationships and cascade deletes confirmed in migrations.
- [x] Indexes: `execution_log (tenant_id, created_at)`, `event_subscription (tenant_id, topic)`, `tenant_install (tenant_id)`.
- [ ] Consider `extension_id` normalization vs. `registry_id` lookups.
2.b Bundle Storage Integration (signing and precompiled)
- [x] EE S3 provider implemented against MinIO (scaffold).
- [x] CE bundle helpers added in `server/src/lib/extensions/bundles.ts` (placeholders for EE wiring).
- [x] Precompiled cwasm support in schema (DB) and manifest; [ ] runtime selection logic in loader.
- Details
- [x] Canonical content-address layout documented.
- [ ] Signature format decision and trust bundle format.
- [ ] Signature verification: runner mandatory; gateway optional.
2.c Runner Service (Rust + Wasmtime)
- [x] Runner crate scaffolding: `Cargo.toml`, `src/main.rs`, `src/http/server.rs` (`POST /v1/execute`), `src/models.rs`.
- [x] Engine/loader/cache modules created (placeholders).
- Wasmtime configuration
- [x] Engine/Config: async enabled, epoch_interruption on
- [x] PoolingAllocationConfig with conservative caps
- [x] Static/dynamic guard sizes; static max size set
- [x] Store limits: custom ResourceLimiter and Store.limiter installed
- [x] Timeouts: epoch-based deadline mapped from timeout_ms with background engine.increment_epoch
- [ ] Fuel: optional fuel metering toggle and budgeting (currently disabled)
- Host imports (alga.*)
- Logging
- [x] alga.log_info(ptr,len)
- [x] alga.log_error(ptr,len)
- HTTP
- [x] alga.http.fetch(req_ptr,req_len,out_ptr) async via reqwest
- [x] EXT_EGRESS_ALLOWLIST enforcement (exact/subdomain host match)
- [ ] Limits/policy: size/time caps; header allowlist; method/body policy
- Storage (KV/doc)
- [ ] alga.storage.* (API design + stubs)
- Secrets
- [ ] alga.secrets.get (API design + stubs)
- Metrics/observability
- [ ] alga.metrics.* (counters/timers) or host-collected hooks
- Module fetch/cache from S3
- Source
- [x] Fetch via BUNDLE_STORE_BASE + content-addressed key
- Caching
- [x] In-memory per-process cache (HashMap)
- [ ] Pod-local LRU with capacity limits (disk/mem)
- Integrity
- [x] SHA-256 verification against key path (sha256/<hash>/…)
- [ ] Signature verification using SIGNING_TRUST_BUNDLE (deferred)
- Precompiled
- [ ] Precompiled module fetch/use (optional), keyed by hash+target
- Execute flow
- Input handling
- [x] Normalize ExecuteRequest → guest input JSON (context + http)
- [x] Idempotency cache (in-memory) based on x-idempotency-key
- [ ] Additional validation of method/path/header/body limits
- Instantiate
- [x] Engine/Store with limits + linker imports
- ABI call
- [x] Require guest exports: memory, alloc, handler(req_ptr, req_len, out_ptr)
- [x] Optional dealloc support
- [x] Read resp tuple (ptr,len) → bytes
- Response
- [x] Parse as normalized response JSON {status, headers, body_b64}
- [x] Fallback: if not JSON, base64 opaque bytes
- Logging/metrics
- [x] Start/end logging with request_id, tenant, extension, status
- [x] duration_ms, resp_b64_len, configured timeout/mem
- [ ] Counters/histograms (egress bytes, status code buckets), per-tenant metrics
- [ ] Structured error codes mapping
- [ ] Errors/tests: standardized error codes + unit/integration tests.
- [x] Containerization: `ee/runner/Dockerfile` and KService YAML with `/healthz` and `/warmup`.
- Details
- [ ] Observability: tracing fields and metrics; persist execution logs.
- [x] Idempotency handling with gateway-provided key.
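
The exact/subdomain host match used for `EXT_EGRESS_ALLOWLIST` has one easy-to-miss detail: the subdomain check must include the leading dot so that `evilexample.com` does not match `example.com`. A sketch in TypeScript for illustration (the Runner itself is Rust); the comma-separated format and the function names are assumptions.

```typescript
// Assumes a comma-separated list of hostnames, e.g. "example.com, api.vendor.io".
function parseEgressAllowlist(value: string): string[] {
  return value
    .split(",")
    .map((host) => host.trim().toLowerCase())
    .filter((host) => host.length > 0);
}

// Allows an exact host match or any subdomain of an allowlisted host.
function isEgressAllowed(targetUrl: string, allowlist: string[]): boolean {
  let host: string;
  try {
    host = new URL(targetUrl).hostname.toLowerCase();
  } catch {
    return false; // unparseable URLs are denied
  }
  return allowlist.some(
    (allowed) => host === allowed || host.endsWith(`.${allowed}`),
  );
}
```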
2.d Next.js API Gateway for Server-Side Handlers
- [x] Route added: `server/src/app/api/ext/[extensionId]/[...path]/route.ts` (GET/POST/PUT/PATCH/DELETE).
- [x] Helpers: `auth.ts`, `registry.ts`, `endpoints.ts`, `headers.ts` (scaffolds).
- [ ] Request policy
- [x] Header allowlist (strip `authorization`).
- [x] Body size caps.
- [x] Timeout via `EXT_GATEWAY_TIMEOUT_MS`.
- [ ] Proxy and telemetry
- [x] Proxy to Runner `/v1/execute` with normalized payload.
- [x] Map response back to client.
- [ ] Emit telemetry (tracing/metrics).
- Details
- [ ] AuthN/Z: derive tenant from session/API key; enforce RBAC. (Scaffolding present in `server/src/lib/extensions/gateway/auth.ts`; production wiring pending.)
- [x] Idempotency key for non-GET; [ ] retry policy (502/503/504 with jitter).
- [x] Propagate `x-request-id`; record correlation IDs.
- [ ] Normalize `user-agent`.
- [x] Resolve `version_id → content_hash` via `extension_bundle` join in gateway helpers (`registry.ts`).
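
The request-policy items above can be sketched as three small pure functions. The retry policy is still open in this plan, so the backoff shown is only one possible shape (full jitter, three attempts, 100 ms base, 2 s cap), and the contents of `FORWARDED_HEADERS` are an assumed allowlist, not the gateway's actual one.

```typescript
// Assumed allowlist; anything not listed (including authorization) is dropped.
const FORWARDED_HEADERS = new Set(["accept", "content-type", "x-request-id", "x-idempotency-key"]);

function filterGatewayHeaders(incoming: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(incoming)) {
    const lower = key.toLowerCase();
    if (FORWARDED_HEADERS.has(lower)) out[lower] = value;
  }
  return out;
}

// Idempotency keys are attached to every non-GET request.
function needsIdempotencyKey(method: string): boolean {
  return method.toUpperCase() !== "GET";
}

const RETRYABLE_STATUSES = new Set([502, 503, 504]);

// Returns a jittered delay in ms, or null when the call should not be retried.
// `attempt` is zero-based; `random` is injectable for deterministic tests.
function retryDelayMs(
  statusCode: number,
  attempt: number,
  random: () => number = Math.random,
): number | null {
  if (!RETRYABLE_STATUSES.has(statusCode) || attempt >= 3) return null;
  const cap = Math.min(2000, 100 * 2 ** attempt);
  return Math.floor(random() * cap);
}
```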
2.e Knative Serving (Runner)
- [x] KService manifest with autoscaling annotations.
- [x] `/healthz` and `/warmup` endpoints implemented.
- [ ] CI/CD step to build/publish runner and smoke-test `/v1/execute`.
- Details
- [ ] Autoscale tuning; resource requests/limits aligned to memory caps.
- [ ] Warmup prefetch strategy for hot bundles.
- [ ] Rollout notes for revision updates.
- Runtime Decision: Wasmtime (WASM-only)
## Data Model and Registry (Shared Foundations)
- Consolidates: Data Model (initial) and Public APIs (EE)
- Used by Phase 1 for read-only UI delivery (install → version → content_hash)
- Used by Phase 2 for full execution, logging, and quotas
## Proposed Architecture
WASM-only runner model:
1) Out-of-Process Runner (single runtime path)
- Execute all extensions in an external Runner Service using a WASM runtime with a strict, capability-based Host API.
- No direct filesystem access; no raw network access. All I/O occurs through brokered host functions that enforce tenant- and capability-scoped policies.
- Deterministic execution with configurable timeouts, memory limits, and concurrency controls per tenant/extension.
2) Signed, Reproducible Bundles
- Extensions are packaged as immutable bundles (content-addressed by SHA256) with a manifest and lockfile.
- Build pipeline compiles/transpiles and freezes dependencies; no dynamic require/import at runtime.
- Bundles stored in object storage (e.g., S3/GCS) and verified by signature on install and on load.
3) Capability-Based Host API (stable, versioned)
- Minimal surface: events, HTTP fetch via broker, key-value/doc store, scheduled tasks, secrets, and logging/metrics.
- Explicit grants recorded per tenant install (manifest + admin approvals). All calls carry `tenant_id` and `extension_id`.
- Timeouts, memory/cpu quotas, and concurrency limits enforced by the runner.
4) Event-Driven Execution
- Core app publishes events (domain, data changes, schedules) to an event bus.
- Registry maps tenant subscriptions to installed extension entrypoints.
- Runner pulls events, resolves bundle, executes handler in isolated sandbox, and reports result/metrics.
5) UI Extension Sandboxing
- UI integrates exclusively via sandboxed iframes powered by the Alga Extension Client SDK.
- Enforce strict CSP, postMessage bridge, and explicit allowlists for APIs and assets.
- UI assets are served from signed bundles or CDN; no runtime code injection into the host app.
### Components
- Extension Registry: catalogs extensions, versions, capabilities, and maintainers.
- Tenant Install Store: per-tenant install with granted capabilities, secrets, and config.
- Bundle Storage: object storage for signed, content-addressed bundles.
- Build Service: validates, compiles, and signs bundles (CI-integrated and/or hosted).
- Runner Service: isolated execution engine with quotas, metrics, and audit logs (implemented with Wasmtime).
- Host API Broker: mediates storage, network egress, secrets, and queues; enforces policy.
- Event Bus: routes events and schedules executions.
- UI Host: renders UI extensions using sandbox constraints.
### Distributed Bundles, Assets, and Caching (multi-pod safe)
- Object storage as source of truth: All extension bundles and UI assets live in object storage using content-addressed paths (`sha256/<hash>`). No persistent host volumes across pods.
- Pod-local caches: Runner and API pods maintain small ephemeral LRU caches on local disk/memory. On first request for a given `content_hash`, the pod pulls only the needed artifacts (WASM and/or `ui/**/*`) into its local cache.
- Optional prefetch: On pod startup or install/upgrade events, selectively prefetch hot bundles/UI to reduce first-request latency.
- No app-managed CDN or signed URLs: Assets are served directly from the pod over Knative Serving once cached locally.
- Precompiled module cache: Store optional precompiled Wasmtime artifacts in object storage; pods fetch on demand and keep an ephemeral cache per target triple. Validate hash on use.
- GC policy: Capacity-based eviction (e.g., max N GB or file count) with background GC to remove least-recently-used artifacts.
- Consistency & integrity: Content-hash directory layout ensures deterministic assets. Verify signatures for bundles before use; verify file hashes when extracting.
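The capacity-based eviction described above can be sketched as a byte-budgeted LRU keyed by content hash. This is only an in-memory illustration: `ArtifactCache`, `maxBytes`, and the method names are hypothetical, and a real pod-local cache would also track files on disk and run eviction in the background.

```typescript
// Hypothetical sketch: byte-budgeted LRU for pod-local artifacts.
// A Map preserves insertion order, so the first key is the least recently used.
class ArtifactCache {
  private entries = new Map<string, Uint8Array>();
  private usedBytes = 0;
  private maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  get(contentHash: string): Uint8Array | undefined {
    const value = this.entries.get(contentHash);
    if (value === undefined) return undefined;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(contentHash);
    this.entries.set(contentHash, value);
    return value;
  }

  put(contentHash: string, bytes: Uint8Array): void {
    const existing = this.entries.get(contentHash);
    if (existing !== undefined) {
      this.usedBytes -= existing.byteLength;
      this.entries.delete(contentHash);
    }
    // Artifacts larger than the whole budget are never cached.
    if (bytes.byteLength > this.maxBytes) return;
    this.entries.set(contentHash, bytes);
    this.usedBytes += bytes.byteLength;
    this.evict();
  }

  has(contentHash: string): boolean {
    return this.entries.has(contentHash);
  }

  get size(): number {
    return this.usedBytes;
  }

  // Drop least-recently-used entries until the budget is respected.
  private evict(): void {
    for (const [key, value] of this.entries) {
      if (this.usedBytes <= this.maxBytes) break;
      this.entries.delete(key);
      this.usedBytes -= value.byteLength;
    }
  }
}
```

Because entries are keyed by content hash, an evicted artifact can always be re-fetched from object storage and re-verified, so eviction never affects correctness, only first-request latency.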
### Runner Service Design (Rust + Wasmtime)
- Embedding: Rust service embedding Wasmtime with PoolingAllocator; Store limits configured for memory/tables.
- Invocation API: Internal gRPC/HTTP accepting `tenant_id`, `extension_id`, `version_id`, `content_hash`, `entry`, `input`, and idempotency key. Runner fetches module artifacts, verifies signature, instantiates, and executes.
- Host imports (capabilities): Namespaced imports `alga.*` for storage, http, secrets, events, logging. All calls scope to tenant/extension and enforce quotas and egress policy. No preopened FS; no ambient WASI.
- Resource controls: Per-invocation memory caps, epoch timeouts, optional fuel metering; concurrency throttles per tenant/extension. Hard stop on policy violations with structured errors.
- Event integration: Pull from event bus/queue with per-tenant partitions; support push-based execution for admin test-runs.
- Observability: Structured logs with correlation IDs, metrics (duration, mem, fuel, egress), and tracing.
- Failure handling: Retries via idempotency; quarantine misbehaving extensions; circuit breakers for upstream/broker failures.
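The per-tenant/extension concurrency throttle mentioned under resource controls can be illustrated with a small counter-based guard. The Runner itself is planned in Rust; the TypeScript below is only a language-neutral sketch, and `ConcurrencyLimiter`, `tryAcquire`, and `release` are hypothetical names.

```typescript
// Hypothetical sketch: concurrency guard keyed by (tenant, extension).
class ConcurrencyLimiter {
  private inFlight = new Map<string, number>();
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  private key(tenantId: string, extensionId: string): string {
    // JSON encoding avoids collisions that a plain delimiter could cause.
    return JSON.stringify([tenantId, extensionId]);
  }

  // Returns false instead of queueing so the caller can fail fast
  // with a structured "too many concurrent executions" error.
  tryAcquire(tenantId: string, extensionId: string): boolean {
    const k = this.key(tenantId, extensionId);
    const current = this.inFlight.get(k) ?? 0;
    if (current >= this.limit) return false;
    this.inFlight.set(k, current + 1);
    return true;
  }

  release(tenantId: string, extensionId: string): void {
    const k = this.key(tenantId, extensionId);
    const current = this.inFlight.get(k) ?? 0;
    if (current <= 1) this.inFlight.delete(k);
    else this.inFlight.set(k, current - 1);
  }
}
```

A caller would pair every successful `tryAcquire` with a `release` in a `finally` block so that a failed or timed-out invocation does not leak a slot.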
### Client UI Delivery (iframe-only with SDK)
- Iframe-only UI: Extensions ship prebuilt static apps (e.g., React/Vite build). On first request, the API pod pulls the `ui/**/*` subtree for the installed `content_hash` into a pod-local cache and serves assets directly.
- Client SDK: Provide `@alga/ui-kit` and `@alga/extension-iframe-sdk` for consistent components, theming, a11y, and a postMessage bridge (auth, navigation, theme tokens, telemetry, viewport sizing).
- Theming: Host propagates design tokens to the iframe via the bridge; UI Kit consumes CSS variables for live theme updates.
- Security: Sandbox iframes (`allow-scripts` by default; add `allow-same-origin` only if needed by SDK). All API calls go through `/api/ext/...` gateway. Prevent directory traversal in asset serving.
### Client Asset Serving via Gateway (pod-local cache)
- Entry route: `server/src/app/ext-ui/[extensionId]/[contentHash]/[...path]/route.ts` (GET)
- Resolves tenant install → `content_hash` (the URL's `[contentHash]` must match; otherwise 404) to avoid serving stale assets.
- Ensures `ui/**/*` for `[contentHash]` exists in the pod-local cache directory, otherwise pulls and extracts just the `ui` subtree from the bundle archive.
- Serves files from `<CACHE_ROOT>/<contentHash>/ui/` with SPA fallback to `index.html` when `path` is missing or not found.
- Sets headers: `Cache-Control: public, max-age=31536000, immutable` because `contentHash` makes URLs immutable; adds `ETag` based on file hash; sets content-type by extension.
- Iframe src: Host pages set iframe `src="/ext-ui/{extensionId}/{content_hash}/index.html?path=/desired/route"`.
- Safety: Sanitize path, disallow `..` segments, and restrict to the cached directory. Limit individual file size and total cache size.
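The path-safety rule above can be sketched as a pure function that turns the `[...path]` segments into a cache-relative path or rejects the request. `resolveAssetPath` is a hypothetical helper, and the sketch assumes the segments arrive still percent-encoded; it covers only the "path is missing" half of the SPA fallback, since the "not found" half needs a filesystem lookup.

```typescript
// Hypothetical helper: map requested path segments to a path relative to
// <CACHE_ROOT>/<contentHash>/ui/, or return null when the request must be rejected.
function resolveAssetPath(segments: string[]): string | null {
  const clean: string[] = [];
  for (const raw of segments) {
    let segment: string;
    try {
      segment = decodeURIComponent(raw);
    } catch {
      return null; // malformed percent-encoding
    }
    if (segment === "" || segment === ".") continue;
    // Reject traversal, embedded path separators, and NUL bytes outright.
    if (segment === ".." || /[\\\/\0]/.test(segment)) return null;
    clean.push(segment);
  }
  // An empty path falls back to the SPA entry point.
  return clean.length === 0 ? "index.html" : clean.join("/");
}
```

Rejecting suspicious segments, rather than normalizing them, keeps the function easy to audit; the caller should still confirm that the final absolute path stays inside the cached directory before reading the file.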
### Knative Serving Profile (initial)
- Serving only (no Eventing initially). The unified Rust application server ships as a Knative Service (KService) to leverage revisioning and concurrency-based autoscaling. It exposes both /ext-ui (static) and /v1/execute (execute) routes.
- Autoscaling metric: concurrency. Configure `containerConcurrency` (e.g., 4-16 depending on per-invocation memory) and use the Knative Pod Autoscaler (KPA) with a simple target concurrency (e.g., 10) as a starting point. Final SLOs/policies to be tuned later.
- Scale policy: keep `minScale` configurable (0 for non-critical, 1+ for production to reduce cold starts). Set `maxScale` to cap cost. Revisions roll out code safely; extension versions are handled at the bundle layer, not via Knative revisions. Prefer CDN to absorb /ext-ui traffic so autoscaling is driven by execute workloads.
- Probes and warmup: add a warmup endpoint to prefetch common bundles and initialize Wasmtime; use readiness probes that succeed only after caches are primed if needed.
- Security: run under a restricted ServiceAccount with egress policies; use Kubernetes secrets for broker credentials and object store credentials. Static routes do not require runner secrets; ensure secret mounts are scoped to execute path usage.
Example KService (abridged):
```
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: alga-ext-runner
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: "10"
        # Optional, tune later
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "50"
    spec:
      containerConcurrency: 8
      containers:
        - image: ghcr.io/alga/runner:sha-<image>
          env:
            - name: BUNDLE_STORE_BASE
              value: https://s3.example.com/alga-ext/
            - name: SIGNING_TRUST_BUNDLE
              valueFrom:
                secretKeyRef: { name: runner-secrets, key: trust.pem }
            - name: RUNTIME_LIMITS
              value: '{"memory_mb":512,"timeout_ms":5000,"fuel":null}'
          ports:
            - containerPort: 8080
```
### On-Demand Loading, Versioning, and Hot Swap
- Lazy load: Resolve the tenant's installed extension version on each request; fetch the bundle by `content_hash` from object storage if not cached; verify signature; instantiate per-invocation.
- Caching: Maintain in-pod LRU caches for raw WASM and precompiled artifacts keyed by `content_hash+target`. Validate hashes on every use. Optionally cache resolved handler maps per extension version.
- Version updates: Tenant install updates change the `version_id → content_hash` mapping in the registry. Subsequent requests pick up the new `content_hash` automatically (cache miss → fetch new). In-flight requests continue on the old version; no pod restarts required.
- Warmup: On install/upgrade, optionally push a warmup signal to prefetch and precompile hot bundles on a subset of Runner pods.
- Consistency: Use strong consistency on registry lookups or include `content_hash` in the gateway's dispatch token so the Runner executes the intended version even amid concurrent upgrades.
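The install → version → `content_hash` resolution and the `content_hash+target` cache key can be sketched with plain in-memory rows. The row shapes reuse column names from the data model in this plan; the function names and the `"enabled"` status value are assumptions made for illustration.

```typescript
// Row shapes follow the data model columns; only the fields needed here are shown.
interface InstallRow { tenant_id: string; registry_id: string; version_id: string; status: string; }
interface VersionRow { id: string; registry_id: string; content_hash: string; }

// Hypothetical resolver: returns the content hash the tenant should currently run,
// or null when the extension is not installed (or not enabled) for that tenant.
function resolveContentHash(
  installs: InstallRow[],
  versions: VersionRow[],
  tenantId: string,
  registryId: string,
): string | null {
  const install = installs.find(
    (i) => i.tenant_id === tenantId && i.registry_id === registryId && i.status === "enabled",
  );
  if (!install) return null;
  const version = versions.find((v) => v.id === install.version_id && v.registry_id === registryId);
  return version ? version.content_hash : null;
}

// Cache key for raw or precompiled artifacts, as described under "Caching".
function artifactCacheKey(contentHash: string, target: string): string {
  return `${contentHash}+${target}`;
}
```

Because the cache key embeds the content hash, pointing an install at a new `version_id` naturally produces a cache miss on the next request, which is what lets upgrades take effect without pod restarts.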
### HTTP Routing for Plugin Endpoints
- Gateway pattern: The core app exposes stable API paths and forwards plugin requests to the Runner. Proposed pattern: `/api/ext/{extensionId}/{...path}` with tenant context inferred from auth/session.
- Manifest mapping: Manifest v2 defines API endpoints (method, path template, handler). The gateway resolves `{extensionId, method, path}` to a handler name within the bundle and calls Runner Execute with the request payload and headers.
- AuthZ and quotas: The gateway enforces user authN/RBAC and per-tenant rate limits before invoking Runner. The Runner still enforces capability-level checks and per-tenant execution quotas.
- Contract: Runner HTTP execute endpoint accepts `method`, `path`, `query`, `headers`, and `body` plus context (tenant_id, extension_id, content_hash), returning `status`, `headers`, and `body`. Inside WASM, the handler receives a normalized request object and returns a normalized response.
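The manifest mapping step can be sketched as a small matcher over `api.endpoints`. The plan does not fix a path-template syntax, so the `:name` parameter style used here is an assumption, as are the `matchEndpoint` name and the shape of its result.

```typescript
interface Endpoint { method: string; path: string; handler: string; }
interface EndpointMatch { handler: string; params: Record<string, string>; }

// Hypothetical matcher: the first endpoint whose method and path template match wins.
// Template segments starting with ":" capture the corresponding request segment.
function matchEndpoint(endpoints: Endpoint[], method: string, pathname: string): EndpointMatch | null {
  const parts = pathname.split("/").filter(Boolean);
  for (const ep of endpoints) {
    if (ep.method.toUpperCase() !== method.toUpperCase()) continue;
    const template = ep.path.split("/").filter(Boolean);
    if (template.length !== parts.length) continue;
    const params: Record<string, string> = {};
    let matched = true;
    for (let i = 0; i < template.length; i++) {
      if (template[i].startsWith(":")) {
        params[template[i].slice(1)] = parts[i];
      } else if (template[i] !== parts[i]) {
        matched = false;
        break;
      }
    }
    if (matched) return { handler: ep.handler, params };
  }
  return null;
}
```

With first-match semantics, literal routes such as `/agreements/sync` should be listed before parameterized ones such as `/agreements/:id` for the same method, otherwise the parameterized route would capture them.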
### Next.js API Router/Proxy (design)
- Route structure: `server/src/app/api/ext/[extensionId]/[...path]/route.ts`
- Methods: Support GET, POST, PUT, PATCH, DELETE. All methods follow the same pipeline.
- Env/config: `RUNNER_BASE_URL`, `BUNDLE_STORE_BASE`, `SIGNING_TRUST_BUNDLE`, `EXT_GATEWAY_TIMEOUT_MS`.
Request pipeline (per request):
- Resolve tenant: derive `tenant_id` from session/auth; attach to context and rate-limit bucket.
- Resolve install/version: query registry for the tenant's install of `extensionId`; get `version_id` and `content_hash`.
- Resolve endpoint: load manifest for that version (from registry/bundle manifest cache) and match `{method, path}` against `api.endpoints` (support path params). If not found, return 404.
- Build Execute call: construct a request for Runner with context and normalized HTTP payload. Generate an idempotency key for non-GET from `request_id || hash(method+url+body)`.
- Forward to Runner: call `POST {RUNNER_BASE_URL}/v1/execute` with a short-lived service token. Propagate an allowlist of headers (e.g., `x-request-id`, `accept`, `content-type`) and strip end-user `authorization`.
- Timeout & retries: apply `EXT_GATEWAY_TIMEOUT_MS` (default 5s). Retries only on 502/503/504 with jitter and idempotency for safe methods.
- Return response: map the Runner's `{status, headers, body}` to `NextResponse`. Enforce response header allowlist and size limits.
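The idempotency key rule in the pipeline above can be sketched as follows. The sketch reads `request_id || hash(method+url+body)` as a fallback (use the request ID when present, otherwise a content hash); that reading, the `idempotencyKey` name, and the newline separators are assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive an idempotency key for non-GET requests.
function idempotencyKey(
  method: string,
  url: string,
  body: Uint8Array | undefined,
  requestId?: string,
): string | undefined {
  const upper = method.toUpperCase();
  if (upper === "GET") return undefined; // GET requests carry no key
  if (requestId) return requestId;       // prefer the caller-supplied request ID
  const hash = createHash("sha256");
  // Separators keep ("PO", "ST/x") and ("POST", "/x") from hashing identically.
  hash.update(upper);
  hash.update("\n");
  hash.update(url);
  hash.update("\n");
  if (body) hash.update(body);
  return hash.digest("hex");
}
```

A hash-derived key lets a retried request with identical method, URL, and body be recognized as a duplicate even when the client supplied no request ID.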
Execute API (Runner)
- Request JSON (abridged):
```
{
  "context": {
    "request_id": "uuid",
    "tenant_id": "t_123",
    "extension_id": "com.alga.softwareone",
    "content_hash": "sha256:...",
    "version_id": "ver_abc"
  },
  "http": {
    "method": "POST",
    "path": "/agreements/sync",
    "query": { "force": "true" },
    "headers": { "content-type": "application/json" },
    "body_b64": "eyJwYXlsb2FkIjoiLi4uIn0="
  },
  "limits": { "timeout_ms": 5000, "memory_mb": 256 }
}
```
- Response JSON (abridged):
```
{
  "status": 200,
  "headers": { "content-type": "application/json" },
  "body_b64": "eyJyZXN1bHQiOiJPSyJ9"
}
```
Header policy (allowlist / strip):
- Forward: `x-request-id`, `accept`, `content-type`, `accept-encoding`, `user-agent` (normalized), `x-alga-tenant` (added by gateway), `x-alga-extension` (added), `x-idempotency-key` (generated for non-GET).
- Strip: `authorization` from end-user; gateway authenticates user and injects a service credential to Runner.
- Response: allow `content-type`, `cache-control` (if safe), custom `x-` headers under `x-ext-*`. Disallow `set-cookie` and hop-by-hop headers.
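The header policy above can be sketched as two allowlist filters. The names match the `filterHeaders` and `filterResponseHeaders` calls in the handler example, but their bodies are not given in this plan, so everything below is an assumed implementation; in particular the optional `added` parameter, the omission of `user-agent` normalization, and the unconditional pass-through of `cache-control` are simplifications.

```typescript
const FORWARDED_REQUEST_HEADERS = new Set([
  "x-request-id", "accept", "content-type", "accept-encoding", "user-agent",
]);
const ALLOWED_RESPONSE_HEADERS = new Set(["content-type", "cache-control"]);

// Keep only allowlisted request headers, then add gateway-generated ones
// (e.g. x-alga-tenant, x-alga-extension, x-idempotency-key).
// Anything not allowlisted, including end-user `authorization`, is dropped.
function filterHeaders(
  incoming: Iterable<[string, string]>,
  added: Record<string, string> = {},
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of incoming) {
    const lower = name.toLowerCase();
    if (FORWARDED_REQUEST_HEADERS.has(lower)) out[lower] = value;
  }
  for (const [name, value] of Object.entries(added)) {
    out[name.toLowerCase()] = value;
  }
  return out;
}

// Keep only allowlisted response headers plus custom headers under x-ext-*.
// set-cookie and hop-by-hop headers are excluded because they are not listed.
function filterResponseHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (ALLOWED_RESPONSE_HEADERS.has(lower) || lower.startsWith("x-ext-")) {
      out[lower] = value;
    }
  }
  return out;
}
```

Using allowlists in both directions means a header that nobody thought about is dropped by default, which matches the deny-by-default stance taken elsewhere in this design.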
Security and limits:
- RBAC: verify user can access the extension/endpoint before proxying.
- Quotas: apply per-tenant rate limit and concurrency caps at the gateway; Runner enforces execution quotas.
- Size: cap request/response body (e.g., 5-10 MB) with clear 413/502 handling.
- Timeouts: default 5s; allow per-endpoint overrides with safe maximums (e.g., 30s).
Example Next.js handler (abridged):
```
// server/src/app/api/ext/[extensionId]/[...path]/route.ts
import { NextRequest, NextResponse } from 'next/server';

// Helpers such as getTenantFromAuth, assertAccess, getTenantInstall, resolveVersion,
// resolveEndpoint, filterHeaders, filterResponseHeaders, and getRunnerServiceToken
// are elided in this abridged example.

// Not exported under its own name: route modules should only export HTTP-method handlers.
async function handler(req: NextRequest, ctx: { params: { extensionId: string; path: string[] } }) {
  const requestId = req.headers.get('x-request-id') || crypto.randomUUID();
  const method = req.method;
  const { extensionId, path } = ctx.params;
  const pathname = '/' + (path || []).join('/');
  const url = new URL(req.url);
  const timeoutMs = Number(process.env.EXT_GATEWAY_TIMEOUT_MS) || 5000;

  const tenantId = await getTenantFromAuth(req);
  await assertAccess(tenantId, extensionId, method, pathname);

  const install = await getTenantInstall(tenantId, extensionId);
  if (!install) return NextResponse.json({ error: 'Not installed' }, { status: 404 });
  const { version_id, content_hash } = await resolveVersion(install);

  const endpoint = await resolveEndpoint(version_id, method, pathname);
  if (!endpoint) return NextResponse.json({ error: 'Not found' }, { status: 404 });

  const bodyBuf = method === 'GET' ? undefined : Buffer.from(await req.arrayBuffer());
  const execReq = {
    context: { request_id: requestId, tenant_id: tenantId, extension_id: extensionId, content_hash, version_id },
    http: {
      method,
      path: pathname,
      query: Object.fromEntries(url.searchParams.entries()),
      headers: filterHeaders(req.headers),
      body_b64: bodyBuf ? bodyBuf.toString('base64') : undefined
    },
    limits: { timeout_ms: timeoutMs }
  };

  let runnerResp: Response;
  try {
    runnerResp = await fetch(`${process.env.RUNNER_BASE_URL}/v1/execute`, {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-request-id': requestId,
        'authorization': await getRunnerServiceToken()
      },
      body: JSON.stringify(execReq),
      signal: AbortSignal.timeout(timeoutMs)
    });
  } catch (err) {
    // AbortSignal.timeout rejects with a "TimeoutError"; anything else is a network failure.
    const timedOut = (err as { name?: string } | null)?.name === 'TimeoutError';
    return NextResponse.json(
      { error: timedOut ? 'Runner timeout' : 'Runner unreachable' },
      { status: timedOut ? 504 : 502 }
    );
  }
  if (!runnerResp.ok) {
    return NextResponse.json({ error: 'Runner error' }, { status: 502 });
  }

  const { status, headers, body_b64 } = await runnerResp.json();
  const resHeaders = filterResponseHeaders(headers);
  const body = body_b64 ? Buffer.from(body_b64, 'base64') : undefined;
  return new NextResponse(body, { status, headers: resHeaders });
}

export { handler as GET, handler as POST, handler as PUT, handler as PATCH, handler as DELETE };
```
## Runtime Decision: Wasmtime (WASM-only)
- Choice: Use Wasmtime as the sole runtime for executing extensions as WebAssembly modules.
- Rationale (enterprise maturity):
- Backed by the Bytecode Alliance with a strong track record, multiple independent security audits, and responsive CVE handling.
- Production adoption across vendors; frequent releases; stable WASI Preview 1 support and growing Preview 2/component-model support.
- Rich security controls: memory limits, epoch-based interruption/timeouts, fuel metering, pooling allocator for predictable resource usage.
- Precompilation/caching: supports ahead-of-time compilation and serialized modules to reduce cold starts.
- Well-documented embedding API (Rust first-class, C API for other languages). We will implement the Runner as a Rust service embedding Wasmtime.
Implementation notes:
- Language targets: prioritize AssemblyScript and Rust for authoring extensions that compile to WASI-compatible WASM; consider TinyGo where appropriate. Provide a TypeScript SDK for descriptor-driven UIs and for authoring AssemblyScript-based handlers.
- Host API binding: expose capability-scoped functions as WASI-like imports via Wasmtime's Linker (e.g., `alga.storage.get/set`, `alga.http.fetch`, `alga.secrets.get`, `alga.log.info`). No filesystem preopens; no ambient authority.
- Resource controls: enforce per-invocation memory limits, timeouts via epoch interruption, and optional fuel metering for CPU budgeting. Configure pooling allocator to cap concurrent memory usage.
- Provenance: require signed bundles; verify content hash and signature before loading modules. Cache precompiled modules by hash.
- Isolation: one module instance per invocation (or per short-lived execution window). No shared mutable state beyond brokered APIs.
- Multi-pod safety: Raw and precompiled artifacts stored in object storage keyed by content hash + target. Runners use only ephemeral local caches; no node-local persistent volumes required.
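The content-hash half of the provenance check can be sketched as below, assuming hashes are written as `sha256:<hex>` like the `content_hash` field in the Execute request example. `verifyContentHash` is a hypothetical helper; signature verification against the trust bundle is a separate step and is not shown.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: check that fetched bytes match the expected content hash.
// Malformed or unsupported hash strings are treated as a verification failure.
function verifyContentHash(bytes: Uint8Array, expected: string): boolean {
  const separator = expected.indexOf(":");
  if (separator < 0) return false;
  const algorithm = expected.slice(0, separator);
  const hex = expected.slice(separator + 1).toLowerCase();
  if (algorithm !== "sha256" || !/^[0-9a-f]{64}$/.test(hex)) return false;
  const actual = createHash("sha256").update(bytes).digest("hex");
  return actual === hex;
}
```

Running this check on every load, including loads served from the pod-local cache, is what makes the ephemeral caches safe to treat as untrusted.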
### Execution Lifecycle
1. Authoring: Devs build against SDK + Host API types; `alga-ext` CLI validates locally.
2. Package: CLI produces a bundle (manifest, lockfile, compiled WASM) and signs it; optional AOT precompile for target architectures.
3. Publish: Push to registry; bundle stored in object storage by content hash.
4. Install: Tenant admin approves capabilities; per-tenant install record created with RLS.
5. Run: Event triggers runner → verify signature → load raw or precompiled module → instantiate with restricted Store/Linker → execute handler with brokered I/O only.
6. Observe: Logs, metrics, and traces recorded with per-tenant attribution; failures are quarantined.
### Security Controls
- Code provenance: signature verification, content-addressed storage, SBOM capture.
- Sandboxing: Wasmtime isolates; no in-process eval/import of tenant JS; no preopened FS; no raw sockets; capability-scoped host imports only.
- Resource limits: Wasmtime memory limits, epoch-based timeouts, optional fuel metering, and concurrency guards via worker pools.
- Egress policy: deny by default; allowlist per tenant/extension with optional TLS pinning.
- Secrets: mounted via broker with fine-grained tokens; never exposed wholesale.
- Audit: structured logs, event->execution correlation IDs, immutable execution logs with retention.
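The deny-by-default egress policy can be sketched as a host allowlist check that the broker would run before every outbound `http.fetch`. `isEgressAllowed`, the `*.` wildcard syntax, and the HTTPS-only rule are assumptions for illustration; TLS pinning is not shown.

```typescript
// Hypothetical check: allow an outbound URL only if its host is on the
// per-tenant/extension allowlist. An empty allowlist denies everything.
function isEgressAllowed(rawUrl: string, allowlist: string[]): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparseable URLs are denied
  }
  if (url.protocol !== "https:") return false; // assumption: HTTPS only
  const host = url.hostname.toLowerCase();
  return allowlist.some((entry) => {
    const rule = entry.toLowerCase();
    if (rule.startsWith("*.")) {
      // "*.example.com" matches subdomains but not the bare apex.
      const suffix = rule.slice(1);
      return host.length > suffix.length && host.endsWith(suffix);
    }
    return host === rule;
  });
}
```

Matching on the parsed hostname rather than on the raw string avoids bypasses such as `https://api.example.com.evil.io/`.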
### Data Model (initial)
- `extension_registry(id, name, publisher, latest_version, deprecation, created_at)`
- `extension_version(id, registry_id, semver, content_hash, signature, sbom_ref, created_at)`
- `extension_bundle(id, content_hash, storage_url, size, runtime, sdk_version)`
- `tenant_extension_install(id, tenant_id, registry_id, version_id, status, granted_caps, config, created_at)`
- `extension_secret(id, tenant_install_id, key, created_at)` (values in secret manager; reference only)
- `extension_event_subscription(id, tenant_install_id, event, filter, created_at)`
- `extension_kv_store(tenant_id, extension_id, namespace, key, value, updated_at)` with RLS
- `extension_execution_log(id, tenant_id, extension_id, event_id, started_at, finished_at, status, metrics, error)`
- `extension_quota_usage(tenant_id, extension_id, window_start, cpu_ms, mem_mb_ms, invocations, egress_bytes)`
### Public APIs (EE)
- Registry: list/get/publish/deprecate versions (publisher-scoped, admin-only operations).
- Installation: install/uninstall/update; grant/revoke capabilities; manage secrets; validate config.
- Execution Admin: test-run, health, metrics, and logs (scoped to tenant).
- Event Subscriptions: list/update per tenant install.
## Current Implementation
- Initialization: No filesystem scanning. Extensions are managed via the v2 registry and per-tenant installs.
- Registry: Stores v2 manifest JSON and versioned bundle metadata. Tenant installs select a version and granted capabilities.
- UI delivery: Iframe-only via the Runner at `${RUNNER_PUBLIC_BASE}/ext-ui/{extensionId}/{content_hash}/[...]`, bootstrapped with the iframe bridge.
- Gateway: All server calls go through `/api/ext/[extensionId]/[...]` (Gateway → Runner `/v1/execute`).
- Storage/security: Tenant-scoped storage services with capability-scoped Host APIs. Bundles are signed and content-addressed.
## Bundle & Manifest v2 (draft)
- Manifest keys: `name`, `publisher`, `version`, `runtime` (e.g., `wasm-js@1`), `capabilities` (explicit list), `ui` (iframe app definition), `events` (subscriptions), `entry` (runner entrypoint), `assets` (UI/static files), `sbom`.
- Artifact: tarball with deterministic layout; top-level `manifest.json`, `entry.wasm` or isolated JS, `descriptors/`, and `SIGNATURE`.
- Signing: compute SHA256 over canonical bundle; sign with developer certificate; store signature and public cert in registry.
Example (abridged):
```
{
  "name": "com.alga.softwareone",
  "publisher": "SoftwareOne",
  "version": "1.2.3",
  "runtime": "wasm-js@1",
  "capabilities": ["http.fetch", "storage.kv", "secrets.get"],
  "ui": {
    "type": "iframe",
    "entry": "ui/index.html",
    "routes": [
      { "path": "/agreements", "iframePath": "ui/agreements.html" },
      { "path": "/statements", "iframePath": "ui/statements.html" }
    ]
  },
  "events": [{ "topic": "billing.statement.created", "handler": "dist/handlers/statement.js" }],
  "entry": "dist/main.wasm",
  "precompiled": {
    "x86_64-linux-gnu": "artifacts/cwasm/x86_64-linux-gnu/main.cwasm",
    "aarch64-linux-gnu": "artifacts/cwasm/aarch64-linux-gnu/main.cwasm"
  },
  "api": {
    "endpoints": [
      { "method": "GET", "path": "/agreements", "handler": "dist/handlers/http/list_agreements" },
      { "method": "POST", "path": "/agreements/sync", "handler": "dist/handlers/http/sync" }
    ]
  },
  "assets": ["ui/**/*"],
  "sbom": "sbom.spdx.json"
}
```
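A publish-time shape check for a manifest like the one above might look like the following. The draft does not say which keys are mandatory, so the required set used here (`name`, `publisher`, `version`, `runtime`, `entry`) and the `validateManifest` name are assumptions.

```typescript
const HTTP_METHODS = new Set(["GET", "POST", "PUT", "PATCH", "DELETE"]);

// Hypothetical validator: returns a list of problems; an empty list means the
// manifest passed these basic shape checks (not a full schema validation).
function validateManifest(manifest: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const key of ["name", "publisher", "version", "runtime", "entry"]) {
    const value = manifest[key];
    if (typeof value !== "string" || value === "") errors.push(`${key}: required string`);
  }
  const caps = manifest["capabilities"];
  if (!Array.isArray(caps) || caps.some((c) => typeof c !== "string")) {
    errors.push("capabilities: must be an array of strings");
  }
  const api = manifest["api"] as { endpoints?: unknown } | null | undefined;
  const endpoints = api?.endpoints ?? [];
  if (!Array.isArray(endpoints)) {
    errors.push("api.endpoints: must be an array");
    return errors;
  }
  endpoints.forEach((ep: unknown, i: number) => {
    if (typeof ep !== "object" || ep === null) {
      errors.push(`api.endpoints[${i}]: must be an object`);
      return;
    }
    const e = ep as Record<string, unknown>;
    if (typeof e.method !== "string" || !HTTP_METHODS.has(e.method)) {
      errors.push(`api.endpoints[${i}].method: unsupported method`);
    }
    if (typeof e.path !== "string" || !e.path.startsWith("/")) {
      errors.push(`api.endpoints[${i}].path: must start with "/"`);
    }
    if (typeof e.handler !== "string" || e.handler === "") {
      errors.push(`api.endpoints[${i}].handler: required string`);
    }
  });
  return errors;
}
```

Returning every problem at once, instead of failing on the first, gives extension authors a complete report from a single CLI or CI run.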
## Host API v1 (draft surface)
- Core: `context.extension()`, `context.tenant()`, `context.user()`
- Storage: `storage.get/set/delete/list`, namespaces; per-tenant/per-extension isolation
- HTTP: `http.fetch(url, opts)` via egress broker with allowlists
- Secrets: `secrets.get(key)` returning scoped secret handles
- Events: `events.emit(topic, payload)`, `events.subscribe(topic)` via manifest
- Schedules: `schedules.register(id, cron, handler)` (phase 2/3)
- Logging/Metrics: `log.info/warn/error`, `metrics.counter/gauge/histogram`
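The per-tenant/per-extension isolation of the storage surface can be illustrated with an in-memory stand-in for the broker. The interface mirrors `storage.get/set/delete/list` from the list above; `InMemoryKvBroker`, `scoped`, and the synchronous signatures are hypothetical (real host calls would cross the WASM boundary and likely be asynchronous).

```typescript
interface StorageApi {
  get(namespace: string, key: string): string | undefined;
  set(namespace: string, key: string, value: string): void;
  delete(namespace: string, key: string): boolean;
  list(namespace: string): string[];
}

// Hypothetical broker: one shared map, but every handle is bound to a
// (tenant, extension) pair, so callers can never name another tenant's rows.
class InMemoryKvBroker {
  private rows = new Map<string, string>();

  scoped(tenantId: string, extensionId: string): StorageApi {
    const rows = this.rows;
    // JSON-encoded composite key avoids delimiter collisions between fields.
    const id = (ns: string, key: string): string =>
      JSON.stringify([tenantId, extensionId, ns, key]);
    return {
      get: (ns, key) => rows.get(id(ns, key)),
      set: (ns, key, value) => {
        rows.set(id(ns, key), value);
      },
      delete: (ns, key) => rows.delete(id(ns, key)),
      list: (ns) => {
        const keys: string[] = [];
        for (const composite of rows.keys()) {
          const [t, e, n, k] = JSON.parse(composite) as string[];
          if (t === tenantId && e === extensionId && n === ns) keys.push(k);
        }
        return keys;
      },
    };
  }
}
```

The point of the design is that scope is fixed when the handle is created by the host, never supplied by extension code, which is the same capability discipline the `alga.*` imports are meant to enforce.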
## Milestones & Acceptance
- M1: Registry + Bundle Store + Signing
- Publish/Install flows working; schema migrations in place; signatures verified on install
- M2: Runner Service + Host API v1
- Execute a hello-world WASM extension via Wasmtime with quotas/timeouts and audit logs
- M3: Client SDK (iframe)
- Render UI via iframe apps using the Alga Client SDK; CSP enforced; no raw dynamic import of tenant JS
- M4: E2E for first partner
- One extension fully migrated; per-tenant install/config on prod-like env
Phase 1: Foundations
- Ship SDK v1, Host API v1 (capabilities: events, storage.kv, http.fetch via broker, secrets.get, log/metrics).
- Implement Registry, Bundle Storage, and Build validation path; enable signed bundle install.
Phase 2: Runner Service
- Add WASM/isolate runner with quotas, timeouts, and signature verification.
- Integrate Event Bus; implement execution logs and basic metrics.
Phase 3: UI Extensions
- Iframe-based UI host with CSP sandbox and postMessage bridge; asset signing pipeline.
Phase 4: Migration & Deprecation
- Provide migration guides; wrap legacy extensions via out-of-process adapters where feasible.
- Hard deprecate in-process uploads/imports; remove code paths.
## Backwards Compatibility
- Legacy extensions can be proxied through the runner as external HTTP endpoints temporarily.
- Provide an adapter library to help repackage common patterns into bundles.
## Operational Considerations
- Horizontal scale runner workers; shard by tenant to localize impact.
- Warm cache frequently used bundles; prefetch on event bursts.
- Circuit breakers and quarantine for crash loops or policy violations.
## Success Metrics
- 0 in-process executions of tenant code in app.
- P99 execution latency under target with sandboxing enabled.
- No cross-tenant data access in penetration tests.
- All bundles signed and verified; 100% execution logs correlated to events.
## Open Questions
- Which sandbox runtime to standardize on first: WASM (Wasmtime/WASI) vs V8 isolates? Preference: WASM for stronger capability discipline; allow a container tier for heavy/legacy cases.
- Initial capability set scope: finalize MVP host APIs.
- Pricing/billing alignment with quotas and egress costs.
## Near-term Implementation Tasks (Progress Tracker)
The following concrete tasks align the current codebase with this plan and track progress.
- [x] Replace browser→S3 direct upload with server-proxied streaming
- [x] Add server action `extUploadProxy(FormData)` to stream file to S3 staging (write-once)
- [x] Convert Web ReadableStream → Node Readable before S3 PutObject
- [x] Pass `ContentLength` to S3 to satisfy chunked signing
- [x] Update `InstallerPanel.tsx` to use server action, then call `extFinalizeUpload`
- [x] Remove presigned initiate flow and delete `initiate-upload` API route
- [x] Logging and diagnostics
- [x] Structured logs + request IDs for upload path
- [x] Admin-only DB registry introspection endpoint (`/api/extensions/registry-db-check`)
- [ ] Add request IDs and structured logs to finalize and abort paths
- [x] Registry v2 repository wiring
- [x] Implement Knex-backed `RegistryV2Repository` (extensions + versions)
- [x] Register via `setRegistryV2Repository(...)` at server startup (lazy init before finalize)
- [x] Verify finalize writes registry/version/bundle rows end-to-end
- [x] Extensions UI uses Registry v2
- [x] List tenant installs via v2 actions (joins on `tenant_extension_install`)
- [x] Toggle/uninstall operate on `tenant_extension_install`
- [x] After finalize, auto-create tenant install for current tenant
- [ ] Align UI with “Install from Registry” flow [FUTURE -- DELAY]
- [ ] Restrict or hide direct upload UI for general users (admin/publisher only if retained)
- [ ] Replace “upload bundle” with “select version” from registry listing
- [ ] Update docs to emphasize CI publish + install-from-registry
- [ ] Cleanup and tests
- [ ] Remove unused upload API route and legacy code paths once fully migrated
- [ ] Add targeted tests for upload server action and finalize happy-path
## Retirement of Legacy Paths (Brand New System)
- Legacy tables and services to avoid for EE extensions:
- `extensions`, `extension_permissions`, file-based component serving, and dynamic module import mechanisms.
- `ExtensionRegistry` (legacy) and actions that operate on the `extensions` table in management UI.
- Canonical tables for EE extensions (Registry v2):
- `extension_registry`, `extension_version`, `extension_bundle`, `tenant_extension_install`.
- UI and actions must exclusively use Registry v2:
- Listing, enable/disable, and uninstall operate on `tenant_extension_install`.
- Version metadata read from `extension_version`; registry identity from `extension_registry`.
- Bundle metadata resolved from object storage keyed by content hash.
- Operational note: This system is brand new; no data migration is required. Do not write or read from legacy tables as part of EE extensions.