PSA/docs/plans/2026-06-04-build-mem-experiments.md
Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

118 lines
6.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Build memory optimization campaign — 2026-06-04
Measuring peak memory of `npm run build` (community, cold `.next`) via
`scripts/build-mem.sh` in a `node:24-bookworm` container; headline =
container cgroup `memory.peak`. Host: 32 CPU / 60 GB → 31 static-gen workers.
**Rules:** must NOT decrease worker count; must NOT increase build time beyond
baseline (within variance); MAY make structural app changes. 9 rounds.
## Baseline (S0)
The config already carries prior memory work (`productionBrowserSourceMaps:false`,
`serverSourceMaps:false`, large `serverExternalPackages`, `optimizePackageImports`
tried+reverted for +20s/+480s regressions). Measured fresh:
| run | peak (MB) | dur (s) |
|---|---|---|
| base1 | 14175 | 72.5 |
| base2 | 14501 | 68.3 |
| base3 | 13619 | 68.1 |
| **median** | **14175** | **68.3** |
Noise: range ~882 MB, σ≈364 MB. **Acceptance per round: median peak (≥3 runs)
drops > ~400 MB vs prior accepted state, dur not regressed, all builds pass.**
## Rounds
| # | change | peak runs (MB) | median | Δ vs prior | dur median | verdict |
|---|---|---|---|---|---|---|
| 1 | extend serverExternalPackages (+11 server-only deps) | 13324/13314/14112 | 13324 | **851** | 68.8s | ✅ accept |
| 2a | modularizeImports lucide-react | build FAILED | — | — | — | ❌ revert |
| 2b | staticGenerationMaxConcurrency: 4 | 14163/13473/14034 | 14034 | +710 (worse) | 67.9s | ❌ revert |
**State after R1 (S1):** median peak 13324 MB, dur 68.8s.
> **Rule change (user, mid-campaign):** worker-count reduction is now ALLOWED,
> with a new rule: wall-clock must not regress beyond normal variance
> (warm-cache builds: 66.772.5s, median ~68s, σ≈1.6s → bar ≲72s).
| 2 | cap static-gen/page-data workers: `experimental.cpus` default `min(4,cores)` (was host CPU count → 31 workers) | 10538/10937/11678 | **10937** | **2387** | **58.1s** (10.7) | ✅ accept |
**State after R2 (S2):** median peak 10937 MB, dur 58.1s. The 31-worker
static-gen pool (each loading the ~290 MB app bundle) was the dominant peak;
capping it collapses onto a **fixed ~9 GB turbopack compilation floor**
(`next build ×4`, ~2.2 GB each, independent of worker count). Build is *faster*
because over-provisioned workers added spawn/load overhead. Worker-count sweep
(1 run each): cpu16=10201, cpu8=10651, cpu4=9888, cpu2=9775 — all ~compile-floor
bound; cap=4 vs cap=8 is within noise at 3 runs.
### The post-R2 floor: a single turbopack process (~910 GB)
Harness cmdline capture at peak shows the floor is **one process —
`next build --turbo` at ~9.1 GB** (others are tiny postcss/npm wrappers). Its RSS
exceeds the 8 GB `--max-old-space-size`, so most of it is turbopack's **native
(Rust) compile memory**, off the V8 heap. ~2.4 GB of the cgroup peak is
reclaimable page cache (reading node_modules/source) — part of the run-to-run
noise. Levers tried against this floor — **all sub-noise or worse:**
| attempt | result vs S2 (10937 MB) | verdict |
|---|---|---|
| `NODE_OPTIONS=--max-old-space-size` 6144/4096/3072 (no OOM even at 3 GB) | 10162/10609/10221 (~500 MB) | sub-noise; lowering prod heap headroom trades OOM safety — not baked |
| `experimental.turbopackMemoryLimit` 4/3/2 GB (no failures) | 11116/10383/10537 (~400 MB) | sub-noise; turbopack doesn't actually shrink |
| combo (cap4 + heap6G + turbo3G), 3 runs | median 10330 (607 MB), 55.3s | sub-noise (overlaps S2), gain is mostly the unsafe heap cap |
| **webpack** instead of turbopack | **13338 MB, 316 s** | **worse on both axes** (×5.5 slower) |
| editor/client-lib ssr:false code-split | ≤~200 MB potential (core stays server-side) | sub-noise; not attempted |
**Conclusion:** after R1+R2 (14175 → 10937 MB, 3.2 GB / 23%, *and* faster:
68→58s), the build sits on turbopack's native compilation floor for this large
app. No further **sound** change beats the ~1 GB run-to-run variance without
removing app features. Clean rounds 39 are not available within the rules.
### Feature-level refactors investigated (per user direction)
1. **client-lib ssr:false** (blocknote/tiptap/reactflow): the peak is turbopack
compiling client+server bundles; these libs are compiled for the **client
regardless** of ssr:false, which removes only the ~0.51% server-side compile
(blocknote ≈ 4.7 MB of 402 MB server output). Tens of MB → unmeasurable
against ~1 GB noise. Not pursued.
2. **dist-aliasing workspace packages** (use nx-built dist instead of turbopack
recompiling src — the lever that *does* target the compile floor): attempted,
hit hard blockers. All src-aliased packages have **empty dist**; `@alga-psa/ui`'s
`tsup` build currently **fails**; its tsup config emits only enumerated index
entries, not the per-subpath files (`dist/components/Button.mjs`) the 3488
`@alga-psa/ui/components/*` imports require; `exports` has gaps (`presence`).
Making it work needs: fix each package build, reconfigure tsup to emit hundreds
of per-subpath entries (which **adds nx build time**, risking the time rule),
fill exports gaps, and **runtime-validate** all imports (the build-only harness
verifies compilation, not runtime resolution). A large, risky, separately-
validated effort with uncertain turbopack-memory payoff — out of scope for a
measure-and-iterate loop.
### Final result
| | peak (median) | dur | vs S0 |
|---|---|---|---|
| S0 baseline | 14175 MB | 68.3s | — |
| **R1+R2 (shipped)** | **10937 MB** | **58.1s** | **3238 MB (23%), faster** |
Two committed rounds; the build is at turbopack's compile floor.
### Learnings (narrow the search space)
- **Per-worker memory is module-graph-dominated, not render-state.** Capping
`staticGenerationMaxConcurrency` (concurrent renders/worker) did nothing →
the only lever is shrinking the module graph each worker loads.
- **Barrels already handled:** lucide-react/date-fns/lodash/recharts/react-icons
are in Next 16's *default* `optimizePackageImports`. Manual `modularizeImports`
on lucide broke on its `*Icon` aliases and was redundant anyway.
`optimizePackageImports` itself was tried by prior work and regressed builds.
- **Common graph is already lean:** root-layout providers are light; `@alga-psa/ui`
barrel is 16 exports and 966/1062 importers already use subpaths.
- **Worker-count knobs are off-limits or backfire:** more workers (lower
`staticGenerationMinPagesPerWorker`) would *raise* peak (each pays the shared
baseline); fewer is forbidden.
- Remaining real lever = code-split heavy SSR'd client libs (blocknote/tiptap/
prosemirror editor stack — scattered ~25 community files, core stays server-side;
reactflow/calendar are EE-heavy or few-page).