Extension Storage API Plan
Overview
- Deliver a durable, host-managed storage API backed by our existing Citus (Postgres) deployment for the EE extension system.
- Provide extensions with structured, multi-tenant storage primitives (namespaced key/value, optional structured collections, blob handles) while the host enforces quotas, schema validation, and tenancy.
- Establish the operational, observability, and rollout guardrails needed to evolve the storage surface without exposing raw database access.
Goals
- Ship an initial storage API surface that lets extensions persist and retrieve JSON payloads with transactional guarantees.
- Enforce per-tenant, per-extension quotas, size limits, and optimistic concurrency.
- Integrate Runner capability checks so only extensions granted alga.storage can access the API.
- Deliver documentation and SDK updates that make the storage API consumable from both WASM handlers and iframe UIs.
Non-Goals
- Building a general-purpose relational modeling layer for extensions (future consideration once demand is proven).
- Exposing raw SQL/Redis interfaces or direct database credentials to extensions.
- Implementing durability upgrades for Redis (tracked separately; only revisit if Phase 3 indicates a gap).
Current State (as of 2025-10-08)
- Runner exposes limited host APIs (http, secrets, logging, metrics); storage capability is scoped for v2 but not yet implemented.
- Persistent data for the EE platform relies on Citus, which provides HA, backups, and tenant sharding. Redis operates as a non-durable cache/stream substrate.
- Extension manifests can request alga.storage, but capability enforcement currently rejects all calls.
- No shared schema or tables exist for extension-owned data.
Status update (2025-11-21):
- Runner now exposes the alga.storage capability, backed by the internal API POST /api/internal/ext-storage/install/{installId} (see ee/runner/src/engine/host_api.rs and ee/server/src/app/api/internal/ext-storage/install/[installId]/route.ts).
- Manifest/runtime code paths accept the storage.kv capability; the gateway execute payload includes install_id? (still missing) but passes config/providers/secretEnvelope and uses install-scoped token headers.
- Quotas/version headers and RBAC beyond token gating remain open. The docs still refer to a tenant storage service and need reconciliation with the live capability implementation.
Risks and Mitigations
- Unbounded growth / noisy neighbors → enforce quotas, TTL, and cardinality limits per tenant+extension namespace; surface metrics.
- Schema drift and breaking changes → version storage contracts per namespace with JSON Schema validation and change review.
- Hot partitions → align Citus distribution key with tenant/extension; add secondary indexes on frequently queried attributes.
- Abuse or sensitive data exfiltration → tie access to capability checks, RBAC, and audit logging, and inherit existing egress allowlists.
- Operational load on primary database → stage load testing and monitor Citus shards before rollout; add connection pooling and caching where appropriate.
Design Summary
- Back storage collections with Citus tables using JSONB columns (value, metadata) and typed primitives for keys, namespaces, version, timestamps.
- Namespace records by tenant_id, extension_install_id, logical_namespace, and key.
- Provide base operations: put (with optional conditional version), get, list, delete, and bulkPut.
- Introduce optional collection types (append-only log, blob references) gated by manifests and quotas, but start with key/value.
- Access via Runner host API alga.storage.* (gRPC/JSON over host bridge). API Gateway proxies REST requests from iframe UI to Runner when the extension SDK calls storage endpoints.
- Observability includes structured audit logs, Prometheus metrics (ops, latency, bytes), and dashboards per tenant/extension.
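The base operations and the four-part record identity can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class, type, and field names are hypothetical, the "ifRevision = 0 means create-only" convention is an assumption, and the real surface is backed by Citus rather than a Map.

```typescript
// Hypothetical in-memory model of the key/value surface. All type and
// method names are illustrative, not the shipped SDK or Runner API.
interface Scope {
  tenantId: string;
  extensionInstallId: string;
  namespace: string;
}

interface StoredRecord {
  key: string;
  value: unknown;
  metadata: Record<string, unknown>;
  revision: number;
}

type PutResult =
  | { ok: true; record: StoredRecord }
  | { ok: false; reason: "REVISION_MISMATCH"; currentRevision: number };

class InMemoryExtensionStore {
  private rows = new Map<string, { scope: Scope; record: StoredRecord }>();

  // Encode tenant, install, namespace, and key so the parts cannot collide.
  private rowId(scope: Scope, key: string): string {
    return JSON.stringify([scope.tenantId, scope.extensionInstallId, scope.namespace, key]);
  }

  // Unconditional write that bumps the revision by one.
  private write(scope: Scope, key: string, value: unknown): StoredRecord {
    const current = this.get(scope, key);
    const revision = (current ? current.revision : 0) + 1;
    const record: StoredRecord = { key, value, metadata: {}, revision };
    this.rows.set(this.rowId(scope, key), { scope, record });
    return record;
  }

  get(scope: Scope, key: string): StoredRecord | undefined {
    const row = this.rows.get(this.rowId(scope, key));
    return row ? row.record : undefined;
  }

  // Assumed convention: ifRevision = 0 means "create only"; omitted means unconditional.
  put(scope: Scope, key: string, value: unknown, ifRevision?: number): PutResult {
    const current = this.get(scope, key);
    const currentRevision = current ? current.revision : 0;
    if (ifRevision !== undefined && ifRevision !== currentRevision) {
      return { ok: false, reason: "REVISION_MISMATCH", currentRevision };
    }
    return { ok: true, record: this.write(scope, key, value) };
  }

  list(scope: Scope): StoredRecord[] {
    const out: StoredRecord[] = [];
    this.rows.forEach((row) => {
      const s = row.scope;
      if (
        s.tenantId === scope.tenantId &&
        s.extensionInstallId === scope.extensionInstallId &&
        s.namespace === scope.namespace
      ) {
        out.push(row.record);
      }
    });
    return out;
  }

  delete(scope: Scope, key: string): boolean {
    return this.rows.delete(this.rowId(scope, key));
  }

  bulkPut(scope: Scope, items: { key: string; value: unknown }[]): StoredRecord[] {
    return items.map((item) => this.write(scope, item.key, item.value));
  }
}
```

Because every lookup goes through the full scope, a caller holding a different tenant_id or extension_install_id never sees another install's records, which is the isolation property the namespacing bullet is after.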
Phases and TODOs
Phase 1 — Product & Contract Definition
- Finalize storage API contract (operations, error codes, optimistic concurrency model) with DX stakeholders. See storage-api-contract.md.
- Define resource hierarchy: tenant → extension install → namespace → key/value records (documented in storage-api-contract.md).
- Produce JSON Schema validation strategy: per-namespace schema registry, version negotiation, and validation failure responses. See storage-api-validation.md.
- Specify quotas and limits (per extension install): max namespaces, keys per namespace, value size, total storage (documented in storage-api-validation.md).
- Draft API reference docs and manifest capability requirements (captured in storage-api-access-control.md).
- Align security review on capability scopes, RBAC, and audit requirements (see storage-api-access-control.md for baseline).
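The four limits named above (namespaces, keys per namespace, value size, total storage) could be checked before a write roughly as follows. This is a sketch only: the type names, violation codes, and the way usage counters are passed in are assumptions, and the actual limits are the ones documented in storage-api-validation.md.

```typescript
// Hypothetical quota model; field names and violation codes are illustrative.
interface QuotaLimits {
  maxNamespaces: number;
  maxKeysPerNamespace: number;
  maxValueBytes: number;
  maxTotalBytes: number;
}

interface InstallUsage {
  namespaces: number;      // namespaces currently used by this install
  keysInNamespace: number; // keys currently in the target namespace
  totalBytes: number;      // bytes stored across the whole install
}

interface PendingWrite {
  valueBytes: number;        // size of the new value
  replacedBytes: number;     // size of the value being overwritten, 0 if none
  createsNamespace: boolean;
  createsKey: boolean;
}

type QuotaViolation =
  | "VALUE_TOO_LARGE"
  | "NAMESPACE_LIMIT"
  | "KEY_LIMIT"
  | "TOTAL_STORAGE_LIMIT";

// UTF-8 size of a string, computed without platform-specific encoders.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

// Returns the first violated limit, or null when the write fits.
function checkQuota(
  limits: QuotaLimits,
  usage: InstallUsage,
  write: PendingWrite,
): QuotaViolation | null {
  if (write.valueBytes > limits.maxValueBytes) return "VALUE_TOO_LARGE";
  if (write.createsNamespace && usage.namespaces + 1 > limits.maxNamespaces) return "NAMESPACE_LIMIT";
  if (write.createsKey && usage.keysInNamespace + 1 > limits.maxKeysPerNamespace) return "KEY_LIMIT";
  const projected = usage.totalBytes - write.replacedBytes + write.valueBytes;
  if (projected > limits.maxTotalBytes) return "TOTAL_STORAGE_LIMIT";
  return null;
}
```

A caller would typically measure a payload with utf8ByteLength(JSON.stringify(value)) and pass the result as valueBytes. Subtracting replacedBytes keeps an overwrite of an existing key from being counted twice against the total.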
Phase 2 — Data Modeling & Infrastructure
- Design Citus schema (see storage-api-schema.md):
  - Create partitioned table ext_storage_records with distribution key tenant.
    - Columns: tenant_id, extension_install_id, namespace, key, value (JSONB), metadata (JSONB), revision (BIGINT), ttl_expires_at, timestamps.
    - Unique constraint on (tenant_id, extension_install_id, namespace, key).
    - Supporting indexes for namespace scans and TTL sweeps.
- Implement schema migrations (BiggerBoat) with down migrations and rollout notes (see storage-api-rollout.md).
- Add opportunistic TTL cleanup that piggybacks on read/write requests to delete expired records without background jobs (documented in storage-api-rollout.md).
- Prepare load testing harness to simulate extension workloads (insert, list, update) (outlined in storage-api-rollout.md).
- Validate shard distribution and index plans in staging; tune connection pool settings (see storage-api-operations.md).
- Update backup/restore playbooks to include extension storage tables (guidance in storage-api-operations.md).
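The opportunistic TTL cleanup item can be sketched in memory: every read and write first removes a bounded batch of expired records, so no background job is needed. The class name, the batch-size parameter, and the explicit `now` argument are assumptions made for illustration; the real implementation is described in storage-api-rollout.md and operates on ttl_expires_at in the database.

```typescript
// Hypothetical model of piggybacked TTL cleanup; names are illustrative.
interface TtlRecord {
  value: unknown;
  ttlExpiresAt: number | null; // epoch milliseconds; null means no expiry
}

class OpportunisticTtlStore {
  private rows = new Map<string, TtlRecord>();

  // sweepBatchSize bounds the cleanup work added to any single request.
  constructor(private sweepBatchSize: number = 10) {}

  private isExpired(row: TtlRecord, now: number): boolean {
    return row.ttlExpiresAt !== null && row.ttlExpiresAt <= now;
  }

  // Delete at most sweepBatchSize expired rows; returns how many were removed.
  private sweep(now: number): number {
    const expired: string[] = [];
    this.rows.forEach((row, key) => {
      if (expired.length < this.sweepBatchSize && this.isExpired(row, now)) {
        expired.push(key);
      }
    });
    expired.forEach((key) => this.rows.delete(key));
    return expired.length;
  }

  put(key: string, value: unknown, now: number, ttlMs?: number): void {
    this.sweep(now);
    this.rows.set(key, { value, ttlExpiresAt: ttlMs === undefined ? null : now + ttlMs });
  }

  get(key: string, now: number): unknown {
    this.sweep(now);
    const row = this.rows.get(key);
    // The batch limit can leave expired rows behind, so reads still filter by expiry.
    if (!row || this.isExpired(row, now)) return undefined;
    return row.value;
  }

  size(): number {
    return this.rows.size;
  }
}
```

Two properties matter in this sketch: the sweep is bounded, so a request never pays for an unbounded delete, and reads filter on expiry independently of the sweep, so a record that has expired but not yet been swept is never returned.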
Phase 3 — Service Implementation
- Runner host API:
  - Implement alga.storage.put/get/delete/list in Runner (Rust) backed by new storage service client.
  - Enforce capability checks and quotas before dispatching queries.
  - Add optimistic concurrency via ifRevision header and revision increments.
  - Emit structured logs and metrics (operation, latency, bytes).
- Storage service layer (TypeScript/Node):
  - Create module interfacing with Citus via existing pool (ee/server/src/lib/db).
  - Implement transactional operations, schema validation hooks, and quota enforcement.
  - Introduce caching for schema definitions and quota counters where necessary.
- API Gateway & SDK:
  - Expose REST endpoints for iframe clients (e.g., POST /api/ext-storage/[namespace]).
  - Update iframe SDK and WASM client to call the new host API methods.
  - Add integration tests covering storage flows (Runner ↔ storage ↔ DB roundtrip).
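The ordering in Phase 3 (capability check, then quota check, then dispatch, with a structured log entry for every outcome) might look like the wrapper below. It is a sketch under assumptions: the Runner side is planned in Rust, so this TypeScript version only models the control flow, and the function name, outcome codes, and log fields are hypothetical. The capability string alga.storage is taken from this plan.

```typescript
// Hypothetical dispatch wrapper; names, outcome codes, and log fields are illustrative.
type StorageOp = "put" | "get" | "delete" | "list";

interface InstallContext {
  tenantId: string;
  extensionInstallId: string;
  grantedCapabilities: string[];
}

type DenyReason = "CAPABILITY_MISSING" | "QUOTA_EXCEEDED";

interface OpLogEntry {
  operation: StorageOp;
  outcome: "ok" | DenyReason;
  latencyMs: number;
}

type DispatchResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: DenyReason };

const STORAGE_CAPABILITY = "alga.storage";

function dispatchStorageCall<T>(
  ctx: InstallContext,
  op: StorageOp,
  withinQuota: boolean,
  run: () => T,
  log: (entry: OpLogEntry) => void,
): DispatchResult<T> {
  const startedAt = Date.now();
  const finish = (outcome: OpLogEntry["outcome"]): void =>
    log({ operation: op, outcome, latencyMs: Date.now() - startedAt });

  // Gate 1: the install must hold the storage capability.
  if (!ctx.grantedCapabilities.includes(STORAGE_CAPABILITY)) {
    finish("CAPABILITY_MISSING");
    return { ok: false, reason: "CAPABILITY_MISSING" };
  }
  // Gate 2: only put can grow usage, so only put is quota-checked in this sketch.
  if (op === "put" && !withinQuota) {
    finish("QUOTA_EXCEEDED");
    return { ok: false, reason: "QUOTA_EXCEEDED" };
  }
  // The query runs only after both gates pass.
  const value = run();
  finish("ok");
  return { ok: true, value };
}
```

Keeping both gates ahead of `run` means a denied call never reaches the database, and logging on every exit path gives the metrics pipeline a denominator for error rates.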
Phase 4 — Observability, Security, and Rollout
- Add Prometheus dashboards and alerts for operation throughput, error rates, quota near-exhaustion, and latency.
- Wire audit logs to central pipeline (tenant id, extension id, namespace, operation, actor).
- Pen-test and threat model the new surface; ensure no cross-tenant leakage in queries.
- Document runbooks: quota breach, shard saturation, schema update process.
- Stage rollout:
  - Enable capability for selected internal extensions.
  - Validate load tests and real usage metrics.
  - Gradually enable for beta partners, then GA.
- Post-GA cleanup: finalize docs, sunset temporary feature flags, log final status.
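As an illustration of the audit fields listed above (tenant id, extension id, namespace, operation, actor), an audit event could be shaped and serialized like this. The interface, the function names, and the JSON-lines encoding are assumptions; the real audit log schema is subject to the security/compliance review.

```typescript
// Hypothetical audit event shape; the approved schema may differ.
type MutatingOp = "put" | "delete" | "bulkPut";

interface StorageAuditInput {
  tenantId: string;
  extensionId: string;
  namespace: string;
  operation: MutatingOp;
  actor: string; // for example a user or service identifier
  key?: string;
}

interface StorageAuditEvent extends StorageAuditInput {
  timestamp: string; // ISO 8601, UTC
}

// Stamp the event with a time supplied by the caller, which keeps it testable.
function buildAuditEvent(input: StorageAuditInput, now: Date): StorageAuditEvent {
  return { ...input, timestamp: now.toISOString() };
}

// One JSON object per line is a common shape for log pipelines.
function toAuditLine(event: StorageAuditEvent): string {
  return JSON.stringify(event);
}
```

The event deliberately carries the key but not the stored value, an assumption made here so that extension payloads, which may be sensitive, do not leak into the log pipeline.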
Dependencies & Coordination
- Runner team for host API implementation and capability enforcement.
- Database platform team for Citus schema review, migration scheduling, and capacity planning.
- Security/compliance for data handling approvals and audit log schema.
- DX docs & SDK teams for developer documentation and client library updates.
Acceptance Criteria
- Extensions with alga.storage capability can perform CRUD operations with consistent results across Runner and iframe SDK.
- Storage tables exhibit expected performance under simulated production load (p95 latency < defined SLO).
- Quotas prevent unbounded growth and surface actionable alerts when near limits.
- Audit logs trace all storage mutations with tenant/extension attribution.
- Documentation (API reference, examples) published in ee/docs/extension-system.
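The p95 latency criterion can be checked against load-test samples with a nearest-rank percentile. This sketch assumes latencies are collected in milliseconds; the function names are hypothetical, and the SLO threshold is left as a parameter because the plan does not fix a value.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile requires at least one sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in the range (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing so integer inputs avoid floating-point rounding.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// True when the observed p95 is strictly below the SLO, as in the acceptance criterion.
function meetsLatencySlo(latenciesMs: number[], sloMs: number): boolean {
  return percentile(latenciesMs, 95) < sloMs;
}
```

Nearest-rank always returns an observed sample rather than an interpolated value, which keeps the result easy to explain in a load-test report; Prometheus histogram quantiles are estimated differently and may not match exactly.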
Rollback Plan
- Disable alga.storage capability flag to stop extension access while keeping data intact.
- Revert Runner host API deployment if regressions surface.
- Roll back database migrations via BiggerBoat down migrations if schema changes must be undone (requires maintenance window).
- Restore from Citus backups if data corruption occurs; coordinate with DB team for tenant-scoped restores.
Future Enhancements
- Add specialized collections (append-only logs, counters, queues) based on extension demand.
- Explore Redis-backed accelerators for high-throughput patterns once HA Redis is available.
- Introduce fine-grained access policies and per-record ACLs for multi-actor extensions.
- Provide analytics snapshots and export tooling for extension data portability.