Many production failures don’t announce themselves with red alerts and 3 AM pages. They whisper. They hide inside a build script that reports success while the system serves degraded performance, and by the time anyone notices, it has been running that way for days.
This is the story of what happened when a team using Remotion for server-side video rendering discovered that their entire snapshot caching strategy had been silently broken in production—and how the fix exposed something most developers get wrong about build automation.
The Setup: Why This Architecture Matters
Remote video rendering at scale is hard. Bundling a Remotion project from scratch on every render request is slower than molasses in January, so this team made a smart architectural choice: pre-bundle everything at deploy time, upload it to a Vercel Sandbox, take a snapshot, and reuse that snapshot for all subsequent renders. Fast cold starts. Less compute. Fewer timeouts.
Beautiful on paper. Fragile in practice.
Bug #1: The Path That Trusted the Wrong Thing
Here’s the thing about relative paths in build scripts—they’re a trap that feels safe until your environment changes. The code looked reasonable enough:
const BUNDLE_DIR = ".remotion";
await addBundleToSandbox({ sandbox, bundleDir: BUNDLE_DIR });
Works locally? Check. Works in most CI systems? Absolutely. Works in a random Vercel build environment where the Node.js current working directory has drifted from what you’d expect? Nope.
“The upload silently succeeds with zero or wrong files, and the snapshot is empty.”
That’s the killer detail. No error. No warning. Just an empty snapshot that nobody noticed for weeks while the system degraded to slow, full re-bundles on every request. The fix is genuinely trivial: anchor every path to the script’s own location (__dirname in CommonJS, import.meta.url in ESM). But the lesson is unforgiving: never trust that your script’s current working directory is where you think it is. It’s process state set by whatever launched the script, not a constant.
One line of code. Two weeks of silent production failure.
Bug #2: The Nested Folder Surprise
Once the path issue was squashed, a second problem surfaced: addBundleToSandbox was exploding when it encountered the public/ subdirectory.
Remotion’s bundler copies your project’s public/ folder into the output by default. Sensible, right? Except the Remotion Vercel sandbox API’s mkDir doesn’t create parent directories recursively. It expects a flat structure. Any nested path like public/fonts/Inter.woff2 triggers an error because public/ wasn’t created first.
Two ways to fix it exist. The team went with a belt-and-suspenders approach: tell the bundler not to copy public/ at all (publicDir: null), and defensively delete it afterward in case future versions change the default. Paranoid? Maybe. But paranoia about build systems is justified.
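The defensive half of that approach could look roughly like the sketch below. It uses only Node's fs module and leaves the bundler call out entirely; the function names are hypothetical, and the idea that any leftover subdirectory is a problem is an assumption drawn from the flat-structure behavior described above.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Delete <bundleDir>/public if it exists. `force: true` makes this a no-op
// when the folder is already absent, so it is safe to run unconditionally.
function removePublicDir(bundleDir) {
  fs.rmSync(path.join(bundleDir, "public"), { recursive: true, force: true });
}

// Return the names of subdirectories directly inside the bundle, so the
// script can fail loudly before handing a nested tree to the upload step.
function listSubdirectories(bundleDir) {
  return fs
    .readdirSync(bundleDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name);
}
```

A build script might call removePublicDir(bundleDir) after bundling and throw if listSubdirectories(bundleDir) is still non-empty, which would surface a changed bundler default at deploy time instead of at render time.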
Bug #3 (The Real One): Silencing Failure
Here’s where it gets interesting. Both bugs were causing the snapshot script to error out. Both. But nobody noticed, because the build step had error suppression baked in:
"vercel-build": "next build && node scripts/create-snapshot.mjs || echo '[create-snapshot] Skipped (non-fatal)'"
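That || echo fallback was added during initial development to avoid breaking staging deploys when a token was missing. But in production, it meant snapshot failures were completely silent. It also covered more than intended: the shell gives && and || equal precedence and evaluates them left to right, so the echo would have run, and the script would have exited 0, even if next build itself had failed. The build reported success. Workers tried to restore a snapshot that didn’t exist. Renders fell back to full re-bundles (slow, occasionally running out of memory). Everything worked, just very, very poorly.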
The fix: remove the fallback. Let the build fail when the snapshot step fails. If it’s required for production correctness, it shouldn’t be optional.
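If staging deploys still need to survive a missing token, one alternative to a blanket || echo is to move the skip decision into the script, where it can be narrow and explicit. The sketch below assumes a credential in an environment variable named SNAPSHOT_TOKEN (a made-up name) and uses Vercel's VERCEL_ENV system variable to tell production from preview builds. It is an illustration of the idea, not the team's actual script.

```javascript
// Hypothetical name of the environment variable holding the sandbox credential.
const SNAPSHOT_TOKEN_VAR = "SNAPSHOT_TOKEN";

// Decide whether the snapshot step should run, be skipped, or fail outright.
function planSnapshotStep(env) {
  if (env[SNAPSHOT_TOKEN_VAR]) {
    return { action: "run" };
  }
  if (env.VERCEL_ENV === "production") {
    // A missing token in production is a hard error, never a skip.
    throw new Error(`${SNAPSHOT_TOKEN_VAR} is required for production builds`);
  }
  return {
    action: "skip",
    reason: `${SNAPSHOT_TOKEN_VAR} not set (VERCEL_ENV=${env.VERCEL_ENV ?? "unset"})`,
  };
}

// Run the step. Only the explicit "skip" plan is tolerated; any error thrown
// by `createSnapshot` propagates to the caller and should fail the build.
async function runSnapshotStep(createSnapshot, env = process.env) {
  const plan = planSnapshotStep(env);
  if (plan.action === "skip") {
    console.warn(`[create-snapshot] Skipped: ${plan.reason}`);
    return false;
  }
  await createSnapshot();
  return true;
}
```

With this in place the package.json entry can shrink to next build && node scripts/create-snapshot.mjs, and the only way to get a green build without a snapshot is the one case that was deliberately allowed.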
Why This Matters Beyond This One Team
Most developers think about build failures as loud, obvious things. But the most dangerous failures are the ones that don’t fail—the ones that succeed while being silently wrong. A snapshot script that runs successfully but produces garbage. An upload that returns 200 OK while uploading the wrong files. A deployment that goes green while your system slowly degrades.
This is why build automation requires a different mindset than normal application code. Build scripts run in weird environments. CWDs drift. APIs have undocumented flat-file assumptions. Error suppression patterns that make sense in development become foot-guns in production.
The real lesson here isn’t “use absolute paths” or “know your dependencies.” It’s this: treat your build pipeline like the critical infrastructure it is. If a step matters for correctness, let it fail visibly. If it’s truly optional, make that optionality explicit and monitored. Don’t hide failures behind fallback messages that make everything look fine.
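For the "explicit and monitored" half of that advice, here is one possible shape on the render side: a wrapper that still falls back, but reports every fallback instead of swallowing it. All names are hypothetical, and the report callback stands in for whatever logging or metrics system is already in use.

```javascript
// Try `primary`; if it throws, report the failure and then run `fallback`.
// The result records which path was taken so callers can count fallbacks.
async function withMonitoredFallback({ name, primary, fallback, report }) {
  try {
    return { value: await primary(), usedFallback: false };
  } catch (error) {
    // The fallback is allowed, but never invisible.
    report({ step: name, message: error.message });
    return { value: await fallback(), usedFallback: true };
  }
}
```

A worker could wrap its snapshot restore as primary and the full re-bundle as fallback. A sudden run of reported fallbacks would then show up in monitoring long before anyone has to infer the problem from drifting render times.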
The Aftermath
All three fixes shipped in one commit. The snapshot build is now fast, reliable, and loud when something goes wrong. But it took weeks of production degradation to discover them, not because the bugs were hard to find, but because the failure modes were invisible by design.
That’s the sneaky part. In most systems, silent partial failures are worse than loud catastrophic ones. Loud failures get fixed immediately. Silent ones accumulate in production until someone notices the performance metrics have drifted.
Frequently Asked Questions
What is Remotion Vercel Sandbox?
It’s a feature that lets you pre-bundle Remotion video projects at deploy time and cache them in a sandbox, so rendering workers can reuse the snapshot instead of re-bundling from scratch on every request. It’s designed for fast cold starts.
Why do relative paths fail in build scripts?
Build scripts execute in different environments with different working directories. A relative path like .remotion might resolve to the project root locally, but somewhere else (or nowhere) in CI/CD. Always anchor paths to __dirname or import.meta.url to make them independent of environment.
Should all build failures cause deployments to fail?
Yes, unless the step is genuinely optional. Build steps that affect production correctness must fail the deployment if they fail. Use error suppression only for truly non-critical tasks, and make sure that choice is intentional and documented.