February 3, 2026

Why PDF pipelines always break after you ship

You shipped a PDF feature and it worked in staging — until it didn't. PDF pipelines fail in predictable, boring ways; this post explains the real failure modes and the practical fixes teams skip.

Sir Quackalot, DuckSlide

The problem you already recognize

You pushed a feature that generates invoices, reports, or contracts as PDFs. Staging looked fine. Production starts returning corrupted files, blank pages, or slow responses two weeks later. The usual mental checklist—did we change the template? Did the UI team ship a CSS tweak?—only gets you so far. Engineers spend days rotating through containers, reinstalling fonts, and tweaking headless Chrome flags. The real problem isn't that PDFs are magical; it's that the pipeline you built is fragile in predictable ways. If you recognize this scenario, this post is for you.

Common failure modes (with a concrete example)

There are a handful of failure modes that recur across teams. They are boring and avoidable, but they compound when you treat your PDF pipeline like a feature rather than a small, independent service.

  1. Environment drift and missing assets. Containers in CI and staging often have fonts, system libraries, or CLDR data that production images lack. A missing font can change page breaks, push content off pages, or make the layout engine fall back to fonts that aren't subset, inflating file size.

  2. Headless renderer instability. Headless renderers (Chromium, wkhtmltopdf, Prince) are stateful and sensitive to concurrency, memory pressure, and timing. Under load you see random render failures, partial pages, or zombie processes consuming memory.

  3. Content edge cases. Large embedded images, long unbroken strings, or unexpected HTML from users produce very long render times or out-of-memory kills. Signature images and base64 blobs multiply memory usage inside the renderer.

  4. Silent failures and retries. Worker loops that swallow exceptions or implement aggressive retries can flood queues or produce duplicate PDFs. Swallowed DOM errors look like success until a customer opens a blank document.

Concrete example: architecture and a failure

Common architecture:

  • API receives PDF request (template id, data)
  • Request enqueued to message queue (SQS, Kafka)
  • Worker pulls job, renders HTML from template, calls headless renderer, stores PDF to object store (S3), updates DB

Example failure scenario:

  • Team deploys a CSS change that introduces a font-family relying on a new system font.
  • Production image didn't include that font. Renderer falls back to a variable-width default.
  • The fallback font's different metrics cause an extra page on many invoices, pushing the footer off-page.
  • The worker's error handling expects a nonzero exit code on failure, but the renderer exits zero with a malformed PDF, so the worker marks the job complete.

Pseudocode that illustrates a fragile worker loop:

    while (job = queue.pop()) {
      html = renderTemplate(job.template, job.data)
      // synchronous call to headless renderer
      result = runRenderer(html)
      if (result.exitCode == 0) {
        store(result.pdf)
        db.markDone(job.id)
      } else {
        // retry with backoff
        queue.retry(job)
      }
    }

Why this breaks: runRenderer can return exitCode 0 while producing an incomplete PDF, or it can succeed but run out of memory intermittently when input images are huge. The loop marks the job done and there's no downstream alerting for malformed PDFs. By the time anyone notices, a large batch of customers received bad documents.
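One cheap guard against the exit-code-0-with-garbage case is to sanity-check the bytes before marking the job done. The sketch below does not parse the PDF; it only rejects obvious garbage, and the size threshold and checks are illustrative:

```javascript
// Reject obviously broken renderer output before db.markDone.
// Checks: minimum size, "%PDF-" header, "%%EOF" near the end,
// and at least one page object. Not a real PDF parser.
function looksLikeValidPdf(buf, minBytes = 1024) {
  if (buf.length < minBytes) return false;
  if (!buf.subarray(0, 5).equals(Buffer.from("%PDF-"))) return false;
  // the end-of-file marker should appear in the last kilobyte
  const tail = buf.subarray(-1024).toString("latin1");
  if (!tail.includes("%%EOF")) return false;
  // crude structural check: at least one /Type /Page object
  return /\/Type\s*\/Page\b/.test(buf.toString("latin1"));
}
```

A worker that runs this check, and dead-letters failures with the source HTML attached, turns a silent two-week degradation into an immediate alert.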

Why tests and monitoring miss the point

Tests and monitoring are the usual safeguards teams lean on, but they often miss the real failure modes.

Unit tests for template rendering catch syntax errors but not broken fonts or memory spikes. Integration tests that spin up a single renderer process won't reveal concurrency-induced crashes. Load tests can show high latency but rarely reproduce the exact mix of content that triggers a renderer OOM.

Monitoring usually tracks request latency and error rate. Those metrics are useless if the renderer returns exit code 0 but produces garbage. File integrity checks are rare because teams assume S3 uploads are atomic and correct.

Here are the blind spots:

  • No content fuzzing: tests don't include the malformed or extreme data that shows up in production.
  • No PDF validations: uploaded PDFs are assumed good; nobody checks page count, text extraction, or PDF structure.
  • No resource isolation: renderer runs with unlimited memory, so one large job affects others.

When these blind spots combine—stateful renderer, mixed content, permissive retry logic—you get slow, silent degradation instead of loud failures. Degradation is much costlier because it erodes confidence and forces firefighting sprints.
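Closing the content-fuzzing blind spot can start small: a fixed corpus of hostile inputs rendered through every template in CI. The entries below are illustrative, each mirroring a production failure mode from earlier; they are a starting point, not an exhaustive fuzzer:

```javascript
// A starter corpus of edge-case data for template rendering tests.
function fuzzCorpus() {
  return {
    longUnbrokenString: "x".repeat(100_000),    // defeats line wrapping
    hugeDataUri:
      "data:image/png;base64," + "A".repeat(1_000_000), // inflates renderer memory
    userMarkup: "<script>while(1){}</script>",  // untrusted HTML in user data
    mixedScripts: "مرحبا 你好 שלום",              // exercises fonts the image may lack
    emptyPayload: { name: "", items: [] },      // missing-data rendering path
  };
}

// In CI: render each template against every corpus entry, then assert on
// render time, output size, and basic PDF validity.
```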

Checklist

  • Pin renderer and OS packages; build immutable renderer images
  • Include all required fonts in the image
  • Run renderers with memory and CPU limits
  • Route large or image-heavy jobs to a separate queue
  • Perform quick PDF sanity checks before marking job success
  • Store failing HTML and renderer logs for postmortem
  • Fuzz templates and data in CI for edge cases
  • Use idempotent job keys and atomic uploads to avoid duplicates

Closing

PDF failures are almost never mysterious. They come from environment drift, stateful renderers, and silent success paths. Make your renderer predictable, validate outputs, and isolate resources. If you want a low-friction way to adopt these patterns, evaluate an API-first renderer like DuckSlide.
