Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

13 KiB

PRD — Workflow Step Quota Accounting

  • Slug: 2026-04-28-workflow-step-quota-accounting
  • Date: 2026-04-28
  • Status: Draft

Summary

Add tenant-level accounting and enforcement for workflow step executions. Each tenant receives a workflow step allotment per payment period. Every workflow step attempt consumes one unit when the step starts. When a tenant exhausts its allotment, workflows pause at the current step instead of failing. Paused workflows can resume automatically when quota becomes available or manually after the same eligibility check passes.

The payment period is the tenant's active Stripe subscription period. If no valid active Stripe subscription period exists, the system falls back to the current UTC calendar month and tier default limits.

Problem

Workflow steps are a licensed resource, but the workflow runtime currently has no tenant-level accounting or enforcement for step executions. Existing workflow_run_steps rows provide an execution audit trail, but they are not designed for fast, concurrency-safe quota enforcement across both DB and Temporal workflow runtimes.

Without quota accounting, tenants can exceed plan allotments, runaway loops or retries can burn runtime capacity without licensing protection, and support/product teams lack a clear per-period usage object.

Goals

  • Count every workflow step attempt for a tenant when the step starts.
  • Enforce tenant plan allotments per payment period.
  • Use Stripe subscription current_period_start and current_period_end as the primary quota window.
  • Fall back to the current UTC calendar month with tier defaults when no active Stripe period exists.
  • Support tier defaults of:
    • solo: 150 steps per period
    • pro: 750 steps per period
    • premium: 10,000 steps per period
  • Support Stripe metadata override via workflow_step_limit, including numeric values and unlimited.
  • Pause workflows on quota exhaustion instead of failing them.
  • Resume quota-paused workflows automatically via scheduled job and manually via user/admin action.
  • Use one shared enforcement path for DB runtime, Temporal runtime, automatic resume, and manual resume.
  • Preserve workflow_run_steps as the execution audit ledger while adding a dedicated counter table for enforcement.

Non-goals

  • Do not use contract-system recurring_service_periods, cadence ownership, invoice windows, or contract line periods for workflow quota windows.
  • Do not bill customers directly from workflow step usage in this phase.
  • Do not build a full customer-facing usage dashboard in this phase unless needed for manual resume messaging.
  • Do not auto-repair counter drift initially; report drift for diagnosis first.
  • Do not treat quota exhaustion as workflow failure or trigger normal retry policies.
  • Do not change the semantics of action idempotency or workflow step retry policies except that retry attempts consume step quota when they start.

Users and Primary Flows

Tenant workflow executor

A workflow run starts or resumes. Before each step attempt, the runtime reserves one step unit. If quota is available, the step executes normally. If quota is exhausted, the run pauses at the current node_path and records a quota wait.

MSP admin / tenant user

A user viewing a quota-paused run sees that the workflow is waiting because the tenant exhausted its workflow step allotment. The user can retry/resume manually. If quota is still exhausted, the system returns a clear message with current usage, limit, and reset time. If quota is available, the run resumes through normal runtime execution.

Background scheduler

A recurring job scans quota waits. When a tenant's next payment period starts or the tenant's effective limit increases, the job resumes eligible quota-paused runs. Actual quota consumption still happens when the runtime re-enters step start.

Support / operations

Support can inspect the tenant's current usage period, effective limit, source, and drift report comparing the enforcement counter to workflow_run_steps audit rows.

UX / UI Notes

  • Quota-paused runs should be displayed as waiting/paused, not failed.
  • The run detail or inspector should surface a clear reason such as: "Workflow step quota exceeded for current billing period."
  • Manual resume should be available only for quota-paused runs and should not bypass quota.
  • If quota remains exhausted, the response should include:
    • used count
    • effective limit, or unlimited
    • reset/payment period end time
    • quota source, if useful for support
  • UI changes can be minimal in the first implementation if existing run logs/wait records are visible enough for operators.

Requirements

Functional Requirements

  1. Count every step attempt at step start.
  2. Do not count quota-blocked steps because they did not start.
  3. Count retry attempts as new step attempts.
  4. Count forEach body attempts per item/attempt.
  5. Count event.wait, time.wait, and human.task steps when first entered.
  6. Resolve quota windows from active Stripe subscription periods when available.
  7. Fall back to current UTC calendar month when no valid active Stripe period is available.
  8. Resolve effective limits using Stripe price metadata, then Stripe product metadata, then tier defaults.
  9. Support workflow_step_limit=unlimited as an unlimited cap while still recording usage.
  10. Atomically reserve quota across concurrent workers.
  11. Pause runs at the current step when quota is exhausted.
  12. Create or reuse a workflow_run_waits row with wait_type = 'quota' on quota exhaustion.
  13. Preserve the current workflow_runs.node_path while quota-paused.
  14. Resume eligible quota-paused runs via scheduled job.
  15. Resume eligible quota-paused runs via manual user/admin action.
  16. Prevent manual resume from bypassing quota.
  17. Provide a reconciliation/reporting path comparing enforcement counters to workflow_run_steps audit rows.

Non-functional Requirements

  • Quota reservation must be concurrency-safe across DB runtime workers and Temporal activity workers.
  • Runtime enforcement must avoid expensive aggregate counts on every step.
  • Missing or invalid Stripe metadata must not crash workflow execution.
  • Missing Stripe period data must not immediately disable workflows; use fallback calendar periods with tier defaults.
  • Quota pause must not be treated as a workflow failure for retry or auto-pause failure-rate logic.
  • The design must use tenant as the column name in new schema, not tenant_id.

Data / API / Integrations

New enforcement table

Create workflow_step_usage_periods:

- tenant                  not null
- period_start            timestamptz not null
- period_end              timestamptz not null
- period_source           text not null -- stripe_subscription | fallback_calendar
- stripe_subscription_id  nullable
- effective_limit         integer nullable -- null means unlimited
- used_count              integer not null default 0
- limit_source            text not null -- stripe_price_metadata | stripe_product_metadata | tier_default | unlimited_metadata
- tier                    text not null
- metadata_json           jsonb nullable
- created_at              timestamptz not null
- updated_at              timestamptz not null

Use a composite primary key or unique key on (tenant, period_start, period_end). Add indexes for (tenant, period_end) and (period_end).

Quota resolver

The resolver returns a normalized quota summary:

{
  tenant: string;
  periodStart: string;
  periodEnd: string;
  periodSource: 'stripe_subscription' | 'fallback_calendar';
  stripeSubscriptionId?: string | null;
  effectiveLimit: number | null; // null = unlimited
  usedCount: number;
  remaining: number | null; // null = unlimited
  tier: 'solo' | 'pro' | 'premium';
  limitSource:
    | 'stripe_price_metadata'
    | 'stripe_product_metadata'
    | 'tier_default'
    | 'unlimited_metadata';
}

Stripe integration

Preferred active subscription selection:

  • status IN ('trialing', 'active', 'past_due', 'unpaid')
  • valid current_period_start and current_period_end
  • if multiple subscriptions exist, prefer trialing, then active, then past_due, then unpaid

Metadata key:

workflow_step_limit

Valid values:

  • positive integer string/number, e.g. 750
  • unlimited

Precedence:

  1. Stripe price metadata
  2. Stripe product metadata
  3. Tier default

Runtime integration points

DB runtime:

  • shared/workflow/runtime/runtime/workflowRuntimeV2.ts
  • Enforcement point: immediately after resolveStepAtPath() and before WorkflowRunStepModelV2.create().

Temporal runtime:

  • ee/temporal-workflows/src/activities/workflow-runtime-v2-activities.ts
  • Enforcement point: projectWorkflowRuntimeV2StepStart() before WorkflowRunStepModelV2.create().

Quota wait payload

Use existing workflow_run_waits table with:

wait_type = 'quota'
status = 'WAITING'
timeout_at = period_end
payload = {
  reason: 'workflow_step_quota_exceeded',
  tenant,
  periodStart,
  periodEnd,
  usedCount,
  effectiveLimit,
  periodSource,
  limitSource
}

Scheduled job

Add a recurring job handler such as workflow-quota-resume-scan that scans quota waits and resumes eligible runs. Use the existing job system and batch/locking safeguards.

Manual resume API/action

Add a user/admin action such as resumeWorkflowRunFromQuotaPause(runId) that verifies permissions, confirms the run is quota-paused, checks quota eligibility, and resumes the run only when allowed.

Security / Permissions

  • Quota counters and wait records must remain tenant-isolated.
  • Manual resume must verify the user has permission to manage or operate the workflow run for the tenant.
  • Manual resume must not allow cross-tenant run access.
  • Manual resume must not bypass the runtime reservation path.
  • If RLS policies apply to workflow runtime tables in the target environment, add matching policies for workflow_step_usage_periods.

Observability

Use workflow_run_logs and service logs for:

  • quota reservation success
  • quota exceeded / run quota-paused
  • quota-paused run resumed by scheduled job
  • quota-paused run resumed by manual action
  • resume skipped with reason
  • invalid Stripe metadata fallback
  • fallback-calendar period usage
  • reconciliation drift findings

The counter table should store enough metadata to answer why a limit was applied, including period source, limit source, tier, Stripe subscription id, and fallback reason when applicable.

Rollout / Migration

  1. Add the enforcement table with indexes and tenant isolation/RLS if needed.
  2. Add shared quota resolver/reservation service behind the runtime integration.
  3. Integrate DB runtime step-start enforcement.
  4. Integrate Temporal activity step-start enforcement.
  5. Add quota wait/resume handling.
  6. Add scheduled resume scan job.
  7. Add manual resume action and minimal UI/API surfacing.
  8. Add reconciliation/reporting helper.
  9. Backfill is not required because enforcement starts from newly created usage period rows. Historical step-row reconciliation can report prior usage if needed.

Rollout should be staged so that quota resolution and counter creation can be tested before hard enforcement is enabled if a feature flag or environment toggle is desired during deployment.

Risks

  • Counter drift could occur if any runtime path creates step rows without using the quota service.
  • Concurrent resume scans could resume more runs than remaining quota, but runtime reservation still protects final enforcement.
  • Stripe sync gaps could cause fallback-calendar periods to be used unexpectedly.
  • Temporal workflow behavior must treat quota as a controlled wait/pause, not an activity failure that causes unintended retries.
  • Existing UI may not clearly distinguish quota wait from other waits until minimal UI surfacing is added.

Open Questions

  • Should zero-limit metadata ever be valid, or should it remain invalid and fall back to tier default? Current design treats zero as invalid.
  • What exact permission should gate manual quota resume if no dedicated workflow-run operation permission exists?
  • Should rollout include an explicit enforcement feature flag, or is the plan to enforce immediately once shipped?

Acceptance Criteria (Definition of Done)

  • A tenant's workflow step usage is counted in workflow_step_usage_periods per resolved payment period.
  • The runtime atomically reserves one unit before each step attempt across DB and Temporal engines.
  • A tenant at the finite limit is quota-paused before a new step starts.
  • Quota-paused runs have workflow_runs.status = 'WAITING', preserve node_path, and have a workflow_run_waits.wait_type = 'quota' record.
  • Quota-blocked steps do not create workflow_run_steps rows and do not increment usage.
  • Unlimited tenants continue executing while usage is still recorded.
  • Stripe metadata overrides and tier defaults resolve as specified.
  • Missing Stripe periods use the UTC calendar month fallback with tier defaults.
  • Scheduled resume scan resumes eligible quota-paused runs without bypassing runtime quota checks.
  • Manual resume resumes eligible quota-paused runs and returns a helpful exhausted-quota response otherwise.
  • Tests cover quota resolution, atomic reservation, DB runtime enforcement, Temporal runtime enforcement, resume behavior, and reconciliation drift detection.