PRD — Workflow Step Quota Accounting
- Slug: 2026-04-28-workflow-step-quota-accounting
- Date: 2026-04-28
- Status: Draft
Summary
Add tenant-level accounting and enforcement for workflow step executions. Each tenant receives a workflow step allotment per payment period. Every workflow step attempt consumes one unit when the step starts. When a tenant exhausts its allotment, workflows pause at the current step instead of failing. Paused workflows can resume automatically when quota becomes available or manually after the same eligibility check passes.
The payment period is the tenant's active Stripe subscription period. If no valid active Stripe subscription period exists, the system falls back to the current UTC calendar month and tier default limits.
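The window selection described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `StripePeriod` and `QuotaWindow` shapes and both function names are hypothetical, and the sketch assumes a "valid" Stripe period means both timestamps are present with start before end, and that the fallback month is the half-open interval from the first instant of the month to the first instant of the next.

```typescript
// Hypothetical shapes for illustration only.
interface StripePeriod {
  currentPeriodStart: Date | null;
  currentPeriodEnd: Date | null;
}

interface QuotaWindow {
  periodStart: Date;
  periodEnd: Date;
  periodSource: 'stripe_subscription' | 'fallback_calendar';
}

// UTC calendar month containing `now`, as [first of month, first of next month).
function fallbackCalendarMonth(now: Date): { periodStart: Date; periodEnd: Date } {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  return {
    periodStart: new Date(Date.UTC(year, month, 1)),
    // Date.UTC normalizes month 12 to January of the following year.
    periodEnd: new Date(Date.UTC(year, month + 1, 1)),
  };
}

// Prefer the Stripe period when it is usable; otherwise use the calendar month.
function resolveQuotaWindow(stripe: StripePeriod | null, now: Date): QuotaWindow {
  const start = stripe ? stripe.currentPeriodStart : null;
  const end = stripe ? stripe.currentPeriodEnd : null;
  if (
    start && end &&
    !Number.isNaN(start.getTime()) &&
    !Number.isNaN(end.getTime()) &&
    start.getTime() < end.getTime()
  ) {
    return { periodStart: start, periodEnd: end, periodSource: 'stripe_subscription' };
  }
  return { ...fallbackCalendarMonth(now), periodSource: 'fallback_calendar' };
}
```

Keeping the fallback as a pure function of `now` makes the month boundaries easy to test without any Stripe data.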
Problem
Workflow steps are a licensed resource, but the workflow runtime currently has no tenant-level accounting or enforcement for step executions. Existing workflow_run_steps rows provide an execution audit trail, but they are not designed for fast, concurrency-safe quota enforcement across both DB and Temporal workflow runtimes.
Without quota accounting, tenants can exceed plan allotments, runaway loops or retries can burn runtime capacity without licensing protection, and support/product teams lack a clear per-period usage object.
Goals
- Count every workflow step attempt for a tenant when the step starts.
- Enforce tenant plan allotments per payment period.
- Use Stripe subscription current_period_start and current_period_end as the primary quota window.
- Fall back to the current UTC calendar month with tier defaults when no active Stripe period exists.
- Support tier defaults of:
  - solo: 150 steps per period
  - pro: 750 steps per period
  - premium: 10,000 steps per period
- Support Stripe metadata override via workflow_step_limit, including numeric values and unlimited.
- Pause workflows on quota exhaustion instead of failing them.
- Resume quota-paused workflows automatically via scheduled job and manually via user/admin action.
- Use one shared enforcement path for DB runtime, Temporal runtime, automatic resume, and manual resume.
- Preserve workflow_run_steps as the execution audit ledger while adding a dedicated counter table for enforcement.
Non-goals
- Do not use contract-system recurring_service_periods, cadence ownership, invoice windows, or contract line periods for workflow quota windows.
- Do not bill customers directly from workflow step usage in this phase.
- Do not build a full customer-facing usage dashboard in this phase unless needed for manual resume messaging.
- Do not auto-repair counter drift initially; report drift for diagnosis first.
- Do not treat quota exhaustion as workflow failure or trigger normal retry policies.
- Do not change the semantics of action idempotency or workflow step retry policies except that retry attempts consume step quota when they start.
Users and Primary Flows
Tenant workflow executor
A workflow run starts or resumes. Before each step attempt, the runtime reserves one step unit. If quota is available, the step executes normally. If quota is exhausted, the run pauses at the current node_path and records a quota wait.
MSP admin / tenant user
A user viewing a quota-paused run sees that the workflow is waiting because the tenant exhausted its workflow step allotment. The user can retry/resume manually. If quota is still exhausted, the system returns a clear message with current usage, limit, and reset time. If quota is available, the run resumes through normal runtime execution.
Background scheduler
A recurring job scans quota waits. When a tenant's next payment period starts or the tenant's effective limit increases, the job resumes eligible quota-paused runs. Actual quota consumption still happens when the runtime re-enters step start.
Support / operations
Support can inspect the tenant's current usage period, effective limit, source, and drift report comparing the enforcement counter to workflow_run_steps audit rows.
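A drift report of this kind could be computed roughly as below. This is a sketch under assumptions: the `AuditStepRow` projection and its `startedAt` field are hypothetical stand-ins for whatever workflow_run_steps actually exposes, and the period is treated as half-open (start inclusive, end exclusive).

```typescript
// Hypothetical projections; real column names may differ.
interface UsageCounter {
  tenant: string;
  periodStart: string; // ISO timestamp
  periodEnd: string;   // ISO timestamp
  usedCount: number;
}

interface AuditStepRow {
  tenant: string;
  startedAt: string; // ISO timestamp of the step attempt (assumed field)
}

interface DriftFinding {
  tenant: string;
  counterUsed: number;
  auditCount: number;
  drift: number; // positive: counter is ahead of the audit ledger
}

// Compare the enforcement counter with audit rows that fall inside its period.
function computeDrift(counter: UsageCounter, auditRows: AuditStepRow[]): DriftFinding {
  const start = Date.parse(counter.periodStart);
  const end = Date.parse(counter.periodEnd);
  const auditCount = auditRows.filter((row) => {
    const t = Date.parse(row.startedAt);
    return row.tenant === counter.tenant && t >= start && t < end;
  }).length;
  return {
    tenant: counter.tenant,
    counterUsed: counter.usedCount,
    auditCount,
    drift: counter.usedCount - auditCount,
  };
}
```

Consistent with the non-goals, the function only reports the difference and leaves the counter untouched.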
UX / UI Notes
- Quota-paused runs should be displayed as waiting/paused, not failed.
- The run detail or inspector should surface a clear reason such as: "Workflow step quota exceeded for current billing period."
- Manual resume should be available only for quota-paused runs and should not bypass quota.
- If quota remains exhausted, the response should include:
  - used count
  - effective limit, or unlimited
  - reset/payment period end time
  - quota source, if useful for support
- UI changes can be minimal in the first implementation if existing run logs/wait records are visible enough for operators.
Requirements
Functional Requirements
- Count every step attempt at step start.
- Do not count quota-blocked steps because they did not start.
- Count retry attempts as new step attempts.
- Count forEach body attempts per item/attempt.
- Count event.wait, time.wait, and human.task steps when first entered.
- Resolve quota windows from active Stripe subscription periods when available.
- Fall back to current UTC calendar month when no valid active Stripe period is available.
- Resolve effective limits using Stripe price metadata, then Stripe product metadata, then tier defaults.
- Support workflow_step_limit=unlimited as an unlimited cap while still recording usage.
- Atomically reserve quota across concurrent workers.
- Pause runs at the current step when quota is exhausted.
- Create or reuse a workflow_run_waits row with wait_type = 'quota' on quota exhaustion.
- Preserve the current workflow_runs.node_path while quota-paused.
- Resume eligible quota-paused runs via scheduled job.
- Resume eligible quota-paused runs via manual user/admin action.
- Prevent manual resume from bypassing quota.
- Provide a reconciliation/reporting path comparing enforcement counters to workflow_run_steps audit rows.
Non-functional Requirements
- Quota reservation must be concurrency-safe across DB runtime workers and Temporal activity workers.
- Runtime enforcement must avoid expensive aggregate counts on every step.
- Missing or invalid Stripe metadata must not crash workflow execution.
- Missing Stripe period data must not immediately disable workflows; use fallback calendar periods with tier defaults.
- Quota pause must not be treated as a workflow failure for retry or auto-pause failure-rate logic.
- The design must use tenant as the column name in new schema, not tenant_id.
Data / API / Integrations
New enforcement table
Create workflow_step_usage_periods:
- tenant not null
- period_start timestamptz not null
- period_end timestamptz not null
- period_source text not null -- stripe_subscription | fallback_calendar
- stripe_subscription_id nullable
- effective_limit integer nullable -- null means unlimited
- used_count integer not null default 0
- limit_source text not null -- stripe_price_metadata | stripe_product_metadata | tier_default | unlimited_metadata
- tier text not null
- metadata_json jsonb nullable
- created_at timestamptz not null
- updated_at timestamptz not null
Use a composite primary key or unique key on (tenant, period_start, period_end). Add indexes for (tenant, period_end) and (period_end).
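One way to get a concurrency-safe reservation from this table is a single conditional UPDATE, so that the check and the increment happen in one statement. The sketch below assumes PostgreSQL (suggested by the timestamptz and jsonb column types) and that the period row already exists; the constant and function names are hypothetical. The in-memory function models the same rule so the semantics can be exercised without a database.

```typescript
// Assumed PostgreSQL statement: increments only while under the limit.
// Zero rows returned means the quota is exhausted (or the period row is missing).
const RESERVE_STEP_SQL = `
UPDATE workflow_step_usage_periods
   SET used_count = used_count + 1,
       updated_at = now()
 WHERE tenant = $1
   AND period_start = $2
   AND period_end = $3
   AND (effective_limit IS NULL OR used_count < effective_limit)
RETURNING used_count, effective_limit`;

// In-memory model of one usage-period row.
interface UsageRow {
  effectiveLimit: number | null; // null means unlimited
  usedCount: number;
}

// Mirrors the WHERE clause above: reserve one unit if allowed, else refuse.
function reserveOneStep(row: UsageRow): boolean {
  if (row.effectiveLimit !== null && row.usedCount >= row.effectiveLimit) {
    return false; // exhausted: nothing is consumed
  }
  row.usedCount += 1; // unlimited rows still record usage
  return true;
}
```

Because the limit check lives in the WHERE clause, two workers racing for the last unit cannot both succeed, and no aggregate count over workflow_run_steps is needed on the hot path.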
Quota resolver
The resolver returns a normalized quota summary:
{
  tenant: string;
  periodStart: string;
  periodEnd: string;
  periodSource: 'stripe_subscription' | 'fallback_calendar';
  stripeSubscriptionId?: string | null;
  effectiveLimit: number | null; // null = unlimited
  usedCount: number;
  remaining: number | null; // null = unlimited
  tier: 'solo' | 'pro' | 'premium';
  limitSource:
    | 'stripe_price_metadata'
    | 'stripe_product_metadata'
    | 'tier_default'
    | 'unlimited_metadata';
}
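The remaining field follows from effectiveLimit and usedCount. A small sketch, assuming remaining is clamped at zero if the counter ever exceeds the limit (for example after a limit decrease); the function name is hypothetical.

```typescript
// null in, null out: an unlimited tenant has no finite remaining value.
function remainingSteps(effectiveLimit: number | null, usedCount: number): number | null {
  if (effectiveLimit === null) {
    return null;
  }
  // Clamp so callers never see a negative remaining count (assumption).
  return Math.max(0, effectiveLimit - usedCount);
}
```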
Stripe integration
Preferred active subscription selection:
- status IN ('trialing', 'active', 'past_due', 'unpaid')
- valid current_period_start and current_period_end
- if multiple subscriptions exist, prefer trialing, then active, then past_due, then unpaid
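The selection rules above can be sketched as a filter followed by a priority sort. The `SubscriptionRow` shape is a hypothetical local projection (period bounds as Unix seconds), not the Stripe SDK type, and the tie-break between two subscriptions with the same status is an assumption: the sketch keeps input order.

```typescript
// Hypothetical local projection of a synced Stripe subscription.
interface SubscriptionRow {
  id: string;
  status: string;
  currentPeriodStart: number | null; // Unix seconds
  currentPeriodEnd: number | null;   // Unix seconds
}

// Earlier entries win when several subscriptions qualify.
const QUOTA_STATUS_PRIORITY: readonly string[] = ['trialing', 'active', 'past_due', 'unpaid'];

function pickQuotaSubscription(subs: SubscriptionRow[]): SubscriptionRow | null {
  const eligible = subs.filter(
    (s) =>
      QUOTA_STATUS_PRIORITY.indexOf(s.status) !== -1 &&
      s.currentPeriodStart !== null &&
      s.currentPeriodEnd !== null &&
      s.currentPeriodStart < s.currentPeriodEnd
  );
  // Sort a copy by status priority; equal statuses keep their input order.
  const sorted = eligible
    .slice()
    .sort(
      (a, b) =>
        QUOTA_STATUS_PRIORITY.indexOf(a.status) - QUOTA_STATUS_PRIORITY.indexOf(b.status)
    );
  return sorted.length > 0 ? sorted[0] : null;
}
```

When this returns null, the caller would fall back to the UTC calendar month with tier defaults.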
Metadata key:
workflow_step_limit
Valid values:
- positive integer string/number, e.g. 750
- unlimited
Precedence:
- Stripe price metadata
- Stripe product metadata
- Tier default
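The metadata parsing and precedence can be sketched as below. Two points are assumptions rather than stated requirements: an invalid value at the price level falls through to product metadata before reaching the tier default, and the unlimited keyword is matched case-insensitively after trimming. All function and type names are hypothetical.

```typescript
type Tier = 'solo' | 'pro' | 'premium';
type LimitSource =
  | 'stripe_price_metadata'
  | 'stripe_product_metadata'
  | 'tier_default'
  | 'unlimited_metadata';

const TIER_DEFAULT_LIMITS: Record<Tier, number> = { solo: 150, pro: 750, premium: 10000 };

type ParsedLimit = { kind: 'limit'; value: number } | { kind: 'unlimited' } | { kind: 'invalid' };

// Accepts a positive integer (number or string) or "unlimited"; zero is invalid.
function parseWorkflowStepLimit(raw: unknown): ParsedLimit {
  if (typeof raw === 'number') {
    return Number.isSafeInteger(raw) && raw > 0 ? { kind: 'limit', value: raw } : { kind: 'invalid' };
  }
  if (typeof raw !== 'string') return { kind: 'invalid' };
  const text = raw.trim().toLowerCase();
  if (text === 'unlimited') return { kind: 'unlimited' };
  if (/^[1-9]\d*$/.test(text) && Number.isSafeInteger(Number(text))) {
    return { kind: 'limit', value: Number(text) };
  }
  return { kind: 'invalid' };
}

type Metadata = Record<string, unknown> | null | undefined;

// Price metadata first, then product metadata, then the tier default.
function resolveEffectiveLimit(
  tier: Tier,
  priceMetadata: Metadata,
  productMetadata: Metadata
): { effectiveLimit: number | null; limitSource: LimitSource } {
  const levels: Array<[Metadata, LimitSource]> = [
    [priceMetadata, 'stripe_price_metadata'],
    [productMetadata, 'stripe_product_metadata'],
  ];
  for (const [metadata, source] of levels) {
    const parsed = parseWorkflowStepLimit(metadata ? metadata['workflow_step_limit'] : undefined);
    if (parsed.kind === 'unlimited') return { effectiveLimit: null, limitSource: 'unlimited_metadata' };
    if (parsed.kind === 'limit') return { effectiveLimit: parsed.value, limitSource: source };
    // Missing or invalid: fall through without throwing.
  }
  return { effectiveLimit: TIER_DEFAULT_LIMITS[tier], limitSource: 'tier_default' };
}
```

The parser never throws, which matches the requirement that missing or invalid Stripe metadata must not crash workflow execution.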
Runtime integration points
DB runtime:
- shared/workflow/runtime/runtime/workflowRuntimeV2.ts
- Enforcement point: immediately after resolveStepAtPath() and before WorkflowRunStepModelV2.create().
Temporal runtime:
- ee/temporal-workflows/src/activities/workflow-runtime-v2-activities.ts
- Enforcement point: projectWorkflowRuntimeV2StepStart() before WorkflowRunStepModelV2.create().
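Both integration points share the same ordering: reserve first, and only create the step row if the reservation succeeds. The sketch below illustrates that ordering with injected callbacks. Everything in it is hypothetical: `StepStartDeps` and its members are stand-ins for the shared quota service, for WorkflowRunStepModelV2.create(), and for the pause path, and real code would be asynchronous.

```typescript
interface StepReservation {
  reserved: boolean;
  usedCount: number;
  effectiveLimit: number | null; // null means unlimited
}

// Hypothetical seams; the real runtime functions are async and take more context.
interface StepStartDeps {
  reserveStep(tenant: string): StepReservation;
  createStepRow(runId: string, nodePath: string): void;
  pauseForQuota(runId: string, nodePath: string, reservation: StepReservation): void;
}

// Shared step-start path for both runtimes.
function startStepWithQuota(
  deps: StepStartDeps,
  tenant: string,
  runId: string,
  nodePath: string
): 'started' | 'quota_paused' {
  const reservation = deps.reserveStep(tenant);
  if (!reservation.reserved) {
    // No step row is created and nothing is consumed; node_path is preserved.
    deps.pauseForQuota(runId, nodePath, reservation);
    return 'quota_paused';
  }
  deps.createStepRow(runId, nodePath);
  return 'started';
}
```

Returning a value instead of throwing keeps quota exhaustion out of the failure and retry paths, which matters most on the Temporal side where a thrown activity error could trigger retries.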
Quota wait payload
Use existing workflow_run_waits table with:
wait_type = 'quota'
status = 'WAITING'
timeout_at = period_end
payload = {
  reason: 'workflow_step_quota_exceeded',
  tenant,
  periodStart,
  periodEnd,
  usedCount,
  effectiveLimit,
  periodSource,
  limitSource
}
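The "create or reuse" behavior for the quota wait could look roughly like this. The sketch works on an in-memory array in place of the workflow_run_waits table, and the `run_id` field name is an assumption about that table; only wait_type, status, timeout_at, and payload come from the description above.

```typescript
interface QuotaSnapshot {
  tenant: string;
  periodStart: string;
  periodEnd: string;
  usedCount: number;
  effectiveLimit: number | null;
  periodSource: 'stripe_subscription' | 'fallback_calendar';
  limitSource: string;
}

interface WaitRow {
  run_id: string; // assumed column name
  wait_type: string;
  status: string;
  timeout_at: string;
  payload: Record<string, unknown>;
}

// Reuse an open quota wait for the run if one exists; otherwise add a new one.
function createOrReuseQuotaWait(
  waits: WaitRow[],
  runId: string,
  quota: QuotaSnapshot
): { wait: WaitRow; created: boolean } {
  for (const existing of waits) {
    if (existing.run_id === runId && existing.wait_type === 'quota' && existing.status === 'WAITING') {
      return { wait: existing, created: false };
    }
  }
  const wait: WaitRow = {
    run_id: runId,
    wait_type: 'quota',
    status: 'WAITING',
    timeout_at: quota.periodEnd, // wake at the end of the current period
    payload: { reason: 'workflow_step_quota_exceeded', ...quota },
  };
  waits.push(wait);
  return { wait, created: true };
}
```

Reusing the open wait keeps repeated blocked attempts on the same run from piling up duplicate rows.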
Scheduled job
Add a recurring job handler such as workflow-quota-resume-scan that scans quota waits and resumes eligible runs. Use the existing job system and batch/locking safeguards.
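The eligibility part of such a scan might be sketched as follows. It assumes the scan looks up each tenant's current quota summary, so that a new payment period (fresh counter) and a raised limit both appear simply as "usage below limit". The names are hypothetical, and locking and the actual resume call are left out.

```typescript
interface PendingQuotaWait {
  runId: string;
  tenant: string;
}

interface CurrentQuota {
  effectiveLimit: number | null; // null means unlimited
  usedCount: number;
}

// A tenant is eligible when its current window has at least one unit left.
function hasQuotaAvailable(quota: CurrentQuota): boolean {
  return quota.effectiveLimit === null || quota.usedCount < quota.effectiveLimit;
}

// Pick up to `batchSize` run ids whose tenants currently have quota available.
function selectRunsToResume(
  waits: PendingQuotaWait[],
  quotaByTenant: Map<string, CurrentQuota>,
  batchSize: number
): string[] {
  const selected: string[] = [];
  for (const wait of waits) {
    if (selected.length >= batchSize) break;
    const quota = quotaByTenant.get(wait.tenant);
    if (quota !== undefined && hasQuotaAvailable(quota)) {
      selected.push(wait.runId);
    }
  }
  return selected;
}
```

The scan does not consume quota itself. It may select more runs than a tenant has units left, and the step-start reservation then pauses the excess again.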
Manual resume API/action
Add a user/admin action such as resumeWorkflowRunFromQuotaPause(runId) that verifies permissions, confirms the run is quota-paused, checks quota eligibility, and resumes the run only when allowed.
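The decision logic behind such an action can be sketched as a pure function over already-loaded state. The input shape, result variants, and function name are hypothetical; the checks follow the order described above (permission and tenant isolation, quota-paused state, then quota eligibility), and the message text reuses the wording suggested in the UX notes.

```typescript
interface ManualResumeInput {
  userTenant: string;
  canOperateRun: boolean; // outcome of whatever permission check is chosen
  run: { tenant: string; status: string; waitType: string | null };
  quota: { usedCount: number; effectiveLimit: number | null; periodEnd: string; limitSource: string };
}

type ManualResumeResult =
  | { outcome: 'resume' }
  | { outcome: 'forbidden' }
  | { outcome: 'not_quota_paused' }
  | {
      outcome: 'quota_exhausted';
      message: string;
      usedCount: number;
      effectiveLimit: number;
      resetsAt: string;
      limitSource: string;
    };

function decideManualResume(input: ManualResumeInput): ManualResumeResult {
  // Tenant isolation and permission come first.
  if (!input.canOperateRun || input.run.tenant !== input.userTenant) {
    return { outcome: 'forbidden' };
  }
  if (input.run.status !== 'WAITING' || input.run.waitType !== 'quota') {
    return { outcome: 'not_quota_paused' };
  }
  const limit = input.quota.effectiveLimit;
  if (limit !== null && input.quota.usedCount >= limit) {
    return {
      outcome: 'quota_exhausted',
      message: 'Workflow step quota exceeded for current billing period.',
      usedCount: input.quota.usedCount,
      effectiveLimit: limit,
      resetsAt: input.quota.periodEnd,
      limitSource: input.quota.limitSource,
    };
  }
  // Eligible: hand the run back to the runtime, which still reserves at step start.
  return { outcome: 'resume' };
}
```

A 'resume' outcome only re-enters the runtime. The unit is consumed by the normal step-start reservation, so the manual path cannot bypass quota.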
Security / Permissions
- Quota counters and wait records must remain tenant-isolated.
- Manual resume must verify the user has permission to manage or operate the workflow run for the tenant.
- Manual resume must not allow cross-tenant run access.
- Manual resume must not bypass the runtime reservation path.
- If RLS policies apply to workflow runtime tables in the target environment, add matching policies for workflow_step_usage_periods.
Observability
Use workflow_run_logs and service logs for:
- quota reservation success
- quota exceeded / run quota-paused
- quota-paused run resumed by scheduled job
- quota-paused run resumed by manual action
- resume skipped with reason
- invalid Stripe metadata fallback
- fallback-calendar period usage
- reconciliation drift findings
The counter table should store enough metadata to answer why a limit was applied, including period source, limit source, tier, Stripe subscription id, and fallback reason when applicable.
Rollout / Migration
- Add the enforcement table with indexes and tenant isolation/RLS if needed.
- Add shared quota resolver/reservation service behind the runtime integration.
- Integrate DB runtime step-start enforcement.
- Integrate Temporal activity step-start enforcement.
- Add quota wait/resume handling.
- Add scheduled resume scan job.
- Add manual resume action and minimal UI/API surfacing.
- Add reconciliation/reporting helper.
- Backfill is not required because enforcement starts from newly created usage period rows. Historical step-row reconciliation can report prior usage if needed.
If a feature flag or environment toggle is desired during deployment, stage the rollout so that quota resolution and counter creation can be tested before hard enforcement is enabled.
Risks
- Counter drift could occur if any runtime path creates step rows without using the quota service.
- Concurrent resume scans could resume more runs than remaining quota, but runtime reservation still protects final enforcement.
- Stripe sync gaps could cause fallback-calendar periods to be used unexpectedly.
- Temporal workflow behavior must treat quota as a controlled wait/pause, not an activity failure that causes unintended retries.
- Existing UI may not clearly distinguish quota wait from other waits until minimal UI surfacing is added.
Open Questions
- Should zero-limit metadata ever be valid, or should it remain invalid and fall back to tier default? Current design treats zero as invalid.
- What exact permission should gate manual quota resume if no dedicated workflow-run operation permission exists?
- Should rollout include an explicit enforcement feature flag, or is the plan to enforce immediately once shipped?
Acceptance Criteria (Definition of Done)
- A tenant's workflow step usage is counted in workflow_step_usage_periods per resolved payment period.
- The runtime atomically reserves one unit before each step attempt across DB and Temporal engines.
- A tenant at the finite limit is quota-paused before a new step starts.
- Quota-paused runs have workflow_runs.status = 'WAITING', preserve node_path, and have a workflow_run_waits.wait_type = 'quota' record.
- Quota-blocked steps do not create workflow_run_steps rows and do not increment usage.
- Unlimited tenants continue executing while usage is still recorded.
- Stripe metadata overrides and tier defaults resolve as specified.
- Missing Stripe periods use the UTC calendar month fallback with tier defaults.
- Scheduled resume scan resumes eligible quota-paused runs without bypassing runtime quota checks.
- Manual resume resumes eligible quota-paused runs and returns a helpful exhausted-quota response otherwise.
- Tests cover quota resolution, atomic reservation, DB runtime enforcement, Temporal runtime enforcement, resume behavior, and reconciliation drift detection.