Build memory measurement harness — design
Date: 2026-06-04
Branch: improve/build-memory-consumption
Goal: A repeatable tool that runs npm run build, verifies the build works, and
measures peak memory consumption of the whole build — so we can drive a
build-memory optimization loop with before/after numbers.
Background / what the build is
npm run build (from repo root) is a three-stage chain:
1. build:assemblyscript, which runs node scripts/build-assemblyscript-if-needed.mjs
2. npx nx build-deps server: builds shared/dependent workspace packages
3. cd server && next build --turbo: the heavy stage (NODE_OPTIONS=--max-old-space-size=8192, Next.js 16, community edition)
The build is a process tree (npm → nx → next → worker processes), so a meaningful peak must cover the whole tree, not a single process.
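Covering the whole tree means enumerating every descendant of the build's root process. A minimal sketch of that walk, assuming the caller has already read /proc/<pid>/stat for each live process; the function names parsePpid and descendantsOf are hypothetical, not taken from the harness:

```javascript
// Extract the parent PID from the text of /proc/<pid>/stat.
// The comm field (2nd) is wrapped in parentheses and may itself contain
// spaces or ')', so we split after the LAST ')'.
function parsePpid(statText) {
  const rest = statText.slice(statText.lastIndexOf(')') + 2).split(' ');
  // rest[0] is the process state, rest[1] is the parent PID.
  return Number(rest[1]);
}

// Breadth-first walk. ppidByPid maps pid -> ppid for every live process.
// Returns all descendants of rootPid (the root itself is excluded).
function descendantsOf(rootPid, ppidByPid) {
  const children = new Map();
  for (const [pid, ppid] of ppidByPid) {
    if (!children.has(ppid)) children.set(ppid, []);
    children.get(ppid).push(pid);
  }
  const found = [];
  const queue = [rootPid];
  while (queue.length > 0) {
    const pid = queue.shift();
    for (const child of children.get(pid) ?? []) {
      found.push(child);
      queue.push(child);
    }
  }
  return found;
}
```

Processes can exit between listing /proc and reading their stat file, so a real walker would need to tolerate missing entries.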
Key findings that shaped the design
- node on this host is a snap (/snap/bin/node → snap run). snap relocates every node process into its own snap.node.node-*.scope cgroup, escaping any systemd-run --user --scope wrapper. So the clean "wrap the build in one scope, read its memory.peak" approach does not work on the host: the build's node processes scatter across snap-managed cgroups.
- Running the snap-internal node ELF directly (/snap/node/current/bin/node) avoids relocation but does not run node correctly (needs snap's runtime env).
- Docker fixes this cleanly. Inside a container, node is a normal ELF (no snap), and the entire container runs in one cgroup that exposes memory.peak on cgroup v2. Verified on this box: a 300 MB allocation in a container registered as memory.peak ≈ 313 MB even after memory.current fell back, i.e. memory.peak captures the true whole-tree high-water mark with no sampling. This is also the representative number: CI builds images in containers, so the container peak is what OOMs under a memory limit.
- Host node is v24; project pins node 20 for runtime. The host node_modules (≈3.8 GB) has native addons built for node 24's ABI, so they will not load under node 20. Decision: use a node:24-bookworm container and reuse the host node_modules as-is (zero install, fast loop). This reproduces the host build exactly, isolated in a container for clean cgroup measurement. Verified the host node_modules load in node:24-bookworm (container glibc 2.36 < host 2.43, but the prebuilt addons target old glibc): next/dist/build/swc requires OK, next --version → 16.2.6, esbuild works. (CI uses node:20; absolute numbers may differ slightly from CI, which is acceptable for a relative before/after optimization loop.)
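The snap relocation described in the first finding can be checked per process by looking at /proc/<pid>/cgroup. A minimal sketch, assuming a cgroup v2 host where that file carries a single "0::<path>" line; the helper names are hypothetical and the scope pattern is taken from the finding above:

```javascript
// On cgroup v2, /proc/<pid>/cgroup contains one line of the form "0::<path>".
function cgroupV2Path(procCgroupText) {
  for (const line of procCgroupText.split('\n')) {
    if (line.startsWith('0::')) return line.slice(3);
  }
  return null; // no unified-hierarchy entry (for example a cgroup v1-only host)
}

// True when the process sits in a snap-managed scope such as
// snap.node.node-<id>.scope.
function isInSnapNodeScope(procCgroupText) {
  const path = cgroupV2Path(procCgroupText);
  return path !== null && /\/snap\.node\.node-[^/]*\.scope$/.test(path);
}
```

Inside a container with a private cgroup namespace the path is typically just "/", so the check returns false there.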
Architecture
Two files, siblings of the existing scripts/build-perf-harness.mjs:
scripts/build-mem.sh — host wrapper (bash)
Host node is snap, so the wrapper is bash and only shells out to docker:
docker run --rm -v <repo>:/work -w /work [--memory <limit>] <image> \
node scripts/build-mem-harness.mjs <flags>
- Default image node:24-bookworm; --image to override.
- --memory (optional) passes through to docker to test a memory ceiling (e.g. --memory 8g → "does the build fit in 8 GB?"). Unset = all host RAM.
- All other flags pass through to the harness.
- cgroupns is docker's default (private), so the container's /sys/fs/cgroup is its own cgroup root and memory.peak is the whole-container high-water mark.
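The flag handling above amounts to pulling out the two wrapper-only flags and forwarding the rest. The wrapper itself is bash; the sketch below expresses the same argument assembly in JavaScript purely as an illustration, and buildDockerArgs is a hypothetical name:

```javascript
// Split wrapper-only flags (--image, --memory) from harness flags and
// assemble the argv for "docker ..." as described above.
function buildDockerArgs(repoDir, argv) {
  let image = 'node:24-bookworm'; // documented default image
  let memory = null;              // unset = no docker memory limit
  const harnessFlags = [];
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--image') image = argv[++i];
    else if (argv[i] === '--memory') memory = argv[++i];
    else harnessFlags.push(argv[i]); // everything else goes to the harness
  }
  return [
    'run', '--rm',
    '-v', `${repoDir}:/work`, '-w', '/work',
    ...(memory ? ['--memory', memory] : []),
    image,
    'node', 'scripts/build-mem-harness.mjs', ...harnessFlags,
  ];
}
```

For example, passing --memory 8g --label before yields a docker run with --memory 8g placed before the image name and --label before placed after the harness script path.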
scripts/build-mem-harness.mjs — runs inside the container
- Clear (default; --skip-clear): remove server/.next and server/tsconfig.tsbuildinfo for a representative cold build.
- Build: spawn bash -lc '<build-cmd>' (default npm run build) from /work, tee stdout/stderr to .build-mem/build-<label>.log.
- Sampler (~150 ms; --interval-ms): BFS the build's /proc descendant tree, sum PSS (/proc/<pid>/smaps_rollup, avoids double-counting shared pages), tag each sample by stage (precedence next-build > build-deps > assemblyscript, detected from cmdlines). Tracks per-stage peak, the global-peak sample's per-process snapshot, and a timeline.
- Headline: on build exit, read /sys/fs/cgroup/memory.peak (bytes), the authoritative whole-container peak. Container is fresh per run, so it reflects only this build (the harness/clear steps are negligible vs an 8 GB build).
- Verify (exit 0 + artifacts): build must exit 0 and produce server/.next/BUILD_ID (+ best-effort manifest checks). The harness exits non-zero on any failure so a loop driver detects regressions.
- Output: human summary (cgroup peak headline + per-stage PSS breakdown + top processes at peak + duration + verify table), a single [BUILD-MEM RESULT] {json} line, and .build-mem/result-<label>.json + .build-mem/timeline-<label>.csv for before/after diffing.
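One sampler tick can be sketched as a pure function over text already read from /proc. This is an illustration under stated assumptions, not the harness's actual code: the stage marker substrings, the 'unknown' fallback, and the names parsePssBytes, detectStage, and makeSample are all hypothetical, and cmdlines are assumed to have their NUL separators replaced by spaces.

```javascript
// Pull the "Pss:" total out of /proc/<pid>/smaps_rollup text (values are in kB).
// Lines such as "Pss_Anon:" are not matched because the colon must follow "Pss".
function parsePssBytes(smapsRollupText) {
  const m = /^Pss:\s+(\d+)\s+kB$/m.exec(smapsRollupText);
  return m ? Number(m[1]) * 1024 : 0;
}

// Ordered by the documented precedence: next-build > build-deps > assemblyscript.
// The marker substrings are assumptions about what the cmdlines contain.
const STAGE_MARKERS = [
  ['next-build', 'next build'],
  ['build-deps', 'build-deps'],
  ['assemblyscript', 'build-assemblyscript'],
];

// Tag a sample with the highest-precedence stage seen in any cmdline.
function detectStage(cmdlines) {
  for (const [stage, marker] of STAGE_MARKERS) {
    if (cmdlines.some((c) => c.includes(marker))) return stage;
  }
  return 'unknown';
}

// procs: [{ pid, cmdline, smapsRollup }] for every process in the build tree.
function makeSample(tMs, procs) {
  const perProcess = procs.map((p) => ({
    pid: p.pid,
    cmdline: p.cmdline,
    pssBytes: parsePssBytes(p.smapsRollup),
  }));
  return {
    tMs,
    stage: detectStage(procs.map((p) => p.cmdline)),
    totalPssBytes: perProcess.reduce((sum, p) => sum + p.pssBytes, 0),
    perProcess,
  };
}
```

Keeping the per-process list on each sample is what lets the harness later report the top processes at the global peak.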
Division of labor
- cgroup memory.peak (container) = rock-solid headline number.
- PSS sampler = attribution only (which stage/process drives the peak, i.e. what you actually optimize). Sampling can miss sub-150 ms spikes, but build peaks are sustained over seconds, and because the sampler only provides attribution (not the headline), a missed spike is immaterial.
Flags
--build-cmd <cmd>, --label <name>, --skip-clear, --interval-ms <n>,
--json-only (harness); --image <ref>, --memory <limit> (wrapper).
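A minimal sketch of how the harness-side flags might be parsed. The default label 'run' is an assumption (the design does not state one), as is the name parseHarnessFlags; the other defaults come from the sections above.

```javascript
// Parse harness flags from an argv array such as process.argv.slice(2).
function parseHarnessFlags(argv) {
  const opts = {
    buildCmd: 'npm run build', // documented default build command
    label: 'run',              // ASSUMED default; not specified in the design
    skipClear: false,
    intervalMs: 150,           // documented sampler interval (~150 ms)
    jsonOnly: false,
  };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === '--skip-clear') opts.skipClear = true;
    else if (flag === '--json-only') opts.jsonOnly = true;
    else if (flag === '--build-cmd') opts.buildCmd = argv[++i];
    else if (flag === '--label') opts.label = argv[++i];
    else if (flag === '--interval-ms') opts.intervalMs = Number(argv[++i]);
    else throw new Error(`unknown flag: ${flag}`);
  }
  if (!Number.isFinite(opts.intervalMs) || opts.intervalMs <= 0) {
    throw new Error('--interval-ms must be a positive number');
  }
  return opts;
}
```

Node's built-in parseArgs from node:util could replace the hand-rolled loop; the loop is used here only to keep the sketch free of imports.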
Out of scope (covered elsewhere / YAGNI)
- Booting the server / route smoke test: the existing build-perf-harness.mjs already does that (needs postgres/redis).
- Per-stage cgroup-isolated peaks (running each stage as its own container): the PSS sampler covers per-stage attribution; revisit only if sampler attribution proves too coarse.
Artifacts
.build-mem/ (gitignored): build-<label>.log, result-<label>.json,
timeline-<label>.csv.
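A sketch of how the artifact names and the before/after comparison could fit together. The CSV columns and the cgroupPeakBytes field in the result JSON are assumptions, since the design does not fix a schema, and all function names are hypothetical:

```javascript
// Artifact paths for a given run label, following the naming above.
function artifactPaths(label) {
  return {
    log: `.build-mem/build-${label}.log`,
    result: `.build-mem/result-${label}.json`,
    timeline: `.build-mem/timeline-${label}.csv`,
  };
}

// The single machine-readable line a loop driver can grep for.
function resultLine(result) {
  return `[BUILD-MEM RESULT] ${JSON.stringify(result)}`;
}

// Timeline CSV; the column set is an assumption.
function timelineCsv(samples) {
  const header = 't_ms,stage,total_pss_bytes';
  const rows = samples.map((s) => `${s.tMs},${s.stage},${s.totalPssBytes}`);
  return [header, ...rows].join('\n') + '\n';
}

// Compare two parsed result objects (assumed to carry cgroupPeakBytes).
function peakDelta(before, after) {
  const deltaBytes = after.cgroupPeakBytes - before.cgroupPeakBytes;
  return { deltaBytes, deltaPct: (deltaBytes / before.cgroupPeakBytes) * 100 };
}
```

A loop driver would read result-before.json and result-after.json, pass the parsed objects to peakDelta, and treat a negative deltaPct as an improvement.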