Designing a Document Automation Architecture That Scales

January 20, 2026 · 15 min read

Every growing company eventually faces the same document challenge. It starts small: a developer writes a script to generate invoices. Then someone asks for receipts. Then contracts. Then compliance reports. Before long, PDF generation code is scattered across a dozen services, each with its own approach, its own bugs, and its own maintenance burden.

This article is about the architecture that prevents that mess. Whether you're building a new document system or refactoring a legacy one, these patterns will help you design something that scales in volume and in template count without the maintenance burden growing alongside.

The Anatomy of a Document System

Every document automation system, from simple to enterprise-grade, has the same fundamental components:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Trigger    │───▶│   Template   │───▶│   Renderer   │───▶│   Storage    │
│   (Event)    │    │   (Layout)   │    │   (Engine)   │    │   (Output)   │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │                   │
  Payment made      Invoice HTML +       HTML → PDF          S3, local,
  Order shipped     data merged          via Puppeteer       or email
  User completed    into final           or WeasyPrint       attachment
  a course          document

Let's look at each component in detail.

1. Triggers: When Documents Are Generated

Documents are generated in response to events. The most common patterns:

Synchronous (Blocking): A user clicks "Download Invoice" and waits for the PDF to be generated. The API endpoint blocks until the PDF is ready and streams it back.

User clicks → API generates PDF → Returns PDF → User sees download
Latency: 500ms - 5s

Asynchronous (Non-Blocking): An event triggers PDF generation in the background. The user is notified when it's ready, or the PDF is attached to an email.

Order confirmed → Job queued → Worker generates PDF → Email sent with attachment
Latency: 5s - 60s (but non-blocking)

Batch: Monthly statements, quarterly reports, or year-end tax documents — thousands of documents generated at once, often overnight.

Cron job triggers → 10,000 jobs queued → Workers process in parallel → All stored in S3
Latency: minutes to hours

Rule: Default to async. Synchronous generation only makes sense for on-demand user downloads. For everything else — emails, webhooks, batch processing — async is more reliable, more scalable, and gives you retry capabilities.
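
The rule above can be sketched as a small routing function. This is a minimal illustration, not a prescribed design: the trigger names and the `chooseMode` helper are hypothetical, and a real system would derive them from its own events.

```typescript
// Hypothetical trigger kinds, chosen to mirror the examples in this section.
type Trigger = "user_download" | "order_confirmed" | "webhook" | "monthly_statement";
type GenerationMode = "sync" | "async" | "batch";

// Only on-demand downloads block the request; everything else goes to a queue.
function chooseMode(trigger: Trigger): GenerationMode {
    switch (trigger) {
        case "user_download":
            return "sync";
        case "monthly_statement":
            return "batch";
        default:
            return "async";
    }
}
```

Keeping this decision in one place makes it easy to see which code paths are allowed to block a request.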

2. Templates: The Layout Layer

Templates are the hardest component to get right, because they sit at the intersection of design, data, and business logic.

Template Storage Strategies

Option A: Code-based templates (HTML/CSS in your codebase)

resources/
  templates/
    invoice/
      layout.html
      styles.css
      partials/
        header.html
        line-items.html
        footer.html

Pros: Version controlled, code-reviewed, deployable
Cons: Requires a developer for every change

Option B: Database-stored templates

Templates stored as records in a database, editable through an admin UI.

Pros: Non-developers can edit, changes are instant
Cons: Harder to version, harder to test, potential for broken templates

Option C: Hybrid (or use a dedicated service)

Base templates in code. Customizable elements (logos, colors, boilerplate text) in the database — or managed through a visual editor provided by a dedicated service like PDF-API.io.

class InvoiceTemplate {
    // Structure is in code (version-controlled)
    public function render(Invoice $invoice, TemplateConfig $config): string {
        return view('templates.invoice', [
            'invoice' => $invoice,
            'logo' => $config->logo_url,           // From database
            'accent_color' => $config->accent_color, // From database
            'footer_text' => $config->footer_text,   // From database
            'terms' => $config->payment_terms,       // From database
        ])->render();
    }
}

This hybrid approach is what most successful companies converge on. The structure is stable and tested; the customizable parts are safely constrained.
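
One way to keep the database-driven parts "safely constrained" is to validate them before they are saved. The sketch below assumes a `TemplateConfig` shape that mirrors the fields in the PHP example; the https-only rule and the 500-character limit are illustrative assumptions, not requirements from any particular system.

```typescript
// Hypothetical shape mirroring the $config fields used in the example above.
interface TemplateConfig {
    logo_url: string;
    accent_color: string;
    footer_text: string;
}

// Returns a list of problems; an empty list means the config is safe to store.
function validateTemplateConfig(config: TemplateConfig): string[] {
    const errors: string[] = [];
    if (!/^https:\/\/\S+$/.test(config.logo_url)) {
        errors.push("logo_url must be an https URL");
    }
    if (!/^#[0-9a-fA-F]{6}$/.test(config.accent_color)) {
        errors.push("accent_color must be a 6-digit hex color such as #1a2b3c");
    }
    if (config.footer_text.length > 500) {
        errors.push("footer_text must be 500 characters or fewer");
    }
    return errors;
}
```

Because only a few typed fields are editable, a bad edit can produce an ugly invoice but not a structurally broken one.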

Template Versioning

If your documents have legal or compliance significance, you need template versioning. The invoice generated on January 1st should always look the same, even if you updated the template on February 1st.

interface TemplateVersion {
    id: string;
    template_id: string;
    version: number;
    content: string;
    created_at: Date;
    is_current: boolean;
}

When generating a document, record which template version was used:

interface GeneratedDocument {
    id: string;
    template_version_id: string;   // Which template was used
    data_snapshot: object;          // The data at generation time
    file_path: string;             // Where the PDF is stored
    generated_at: Date;
}

This gives you full traceability: for any document, you can answer "what template was used?" and "what data was provided?"
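
A minimal sketch of how publishing could work with the `TemplateVersion` shape above, assuming versions are append-only. The `publishVersion` helper and its ID scheme are hypothetical.

```typescript
interface TemplateVersion {
    id: string;
    template_id: string;
    version: number;
    content: string;
    created_at: Date;
    is_current: boolean;
}

// Publishing never edits an existing version. It appends a new one and moves
// the is_current flag, so old documents keep pointing at the old content.
function publishVersion(
    history: TemplateVersion[],
    templateId: string,
    content: string,
    now: Date
): TemplateVersion[] {
    let latest = 0;
    for (const v of history) {
        if (v.template_id === templateId && v.version > latest) latest = v.version;
    }
    const next: TemplateVersion = {
        id: templateId + ":v" + (latest + 1), // illustrative ID scheme
        template_id: templateId,
        version: latest + 1,
        content: content,
        created_at: now,
        is_current: true,
    };
    const retired = history.map((v) =>
        v.template_id === templateId ? { ...v, is_current: false } : v
    );
    return retired.concat([next]);
}
```

A document generated in January stores the ID of version 1, so re-rendering it after a February update still uses the January layout.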

Data Contracts

Define explicit data contracts for each template. Don't pass raw Eloquent models or database rows — create dedicated data objects:

// ❌ Don't pass raw models
$html = view('templates.invoice', ['invoice' => $invoice])->render();

// ✅ Do define a data contract
class InvoiceTemplateData {
    public function __construct(
        public string $invoiceNumber,
        public string $issueDate,
        public string $dueDate,
        public CompanyData $seller,
        public CompanyData $buyer,
        /** @var LineItemData[] */
        public array $lineItems,
        public MoneyData $subtotal,
        public MoneyData $tax,
        public MoneyData $total,
        public ?string $notes = null,
        public ?string $purchaseOrderNumber = null,
    ) {}

    public static function fromInvoice(Invoice $invoice): self {
        return new self(
            invoiceNumber: $invoice->formatted_number,
            issueDate: $invoice->issue_date->format('F j, Y'),
            dueDate: $invoice->due_date->format('F j, Y'),
            seller: CompanyData::fromCompany($invoice->company),
            buyer: CompanyData::fromClient($invoice->client),
            lineItems: $invoice->items->map(
                fn ($item) => LineItemData::fromItem($item)
            )->all(),
            subtotal: MoneyData::from($invoice->subtotal, $invoice->currency),
            tax: MoneyData::from($invoice->tax_amount, $invoice->currency),
            total: MoneyData::from($invoice->total, $invoice->currency),
            notes: $invoice->notes,
            purchaseOrderNumber: $invoice->po_number,
        );
    }
}

This approach has several benefits:

  1. Templates can't accidentally access sensitive data (passwords, API keys, internal IDs)
  2. Formatting is done once in the data layer, not scattered across templates
  3. Testing is straightforward — create a data object and render
  4. Template designers see a clear interface — they know exactly what variables are available
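
The same idea in TypeScript, as a rough sketch: the `InvoiceRow` shape and its `internal_id` field are hypothetical stand-ins for a raw database record, and the `en-US` locale is an assumption.

```typescript
// Hypothetical raw record, including a field the template must never see.
interface InvoiceRow {
    number: string;
    total_cents: number;
    currency: string; // ISO 4217 code
    internal_id: number;
}

interface MoneyData {
    amount: string; // already formatted for display
    currency: string;
}

interface InvoiceTemplateData {
    invoiceNumber: string;
    total: MoneyData;
}

// Formatting happens once, here, rather than in each template.
function toMoney(cents: number, currency: string): MoneyData {
    const formatter = new Intl.NumberFormat("en-US", { style: "currency", currency: currency });
    return { amount: formatter.format(cents / 100), currency: currency };
}

// Only whitelisted fields cross the boundary into the template layer.
function toTemplateData(row: InvoiceRow): InvoiceTemplateData {
    return { invoiceNumber: row.number, total: toMoney(row.total_cents, row.currency) };
}
```

The template receives a display-ready string for the total, and `internal_id` never leaves the data layer.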

3. The Rendering Pipeline

The renderer is where data meets template and a PDF comes out. Here's a clean pipeline pattern:

Input: TemplateData + TemplateVersion
                    │
                    ▼
          ┌─────────────────┐
          │   Merge Data    │  Inject data into template
          │   into Template │  (Blade, Twig, Mustache...)
          └────────┬────────┘
                   │
                   ▼ (HTML string)
          ┌─────────────────┐
          │   Pre-process   │  Resolve image URLs,
          │                 │  inline CSS, sanitize
          └────────┬────────┘
                   │
                   ▼ (Clean HTML)
          ┌─────────────────┐
          │   PDF Engine    │  Convert HTML → PDF
          │   (Puppeteer,   │  (or render from coords)
          │    WeasyPrint)  │
          └────────┬────────┘
                   │
                   ▼ (PDF bytes)
          ┌─────────────────┐
          │   Post-process  │  Add metadata, watermarks,
          │                 │  digital signatures
          └────────┬────────┘
                   │
                   ▼ (Final PDF)

Pre-processing

Before sending HTML to the PDF engine, you often need to:

  1. Inline CSS: External stylesheets referenced by <link> might not load in all engines. Inlining CSS guarantees it's applied.

  2. Resolve image URLs: Relative URLs need to be converted to absolute URLs or base64-encoded data URIs. Using file:// paths for locally stored images avoids network requests.

  3. Sanitize user content: If any part of the template includes user-provided content (notes, descriptions), sanitize it to prevent HTML injection.

  4. Apply conditional logic: Show/hide sections based on data (e.g., hide the discount row if there's no discount).

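class HtmlPreprocessor {
    public function process(string $html): string
    {
        // inlineCss() and sanitizeUserContent() are omitted here for brevity.
        $html = $this->inlineCss($html);
        $html = $this->resolveImages($html);
        $html = $this->sanitizeUserContent($html);

        return $html;
    }

    private function resolveImages(string $html): string
    {
        // Replace root-relative image URLs with base64 data URIs so the
        // HTML is self-contained and renders without network access.
        return preg_replace_callback(
            '/src="(\/[^"]+)"/',
            function ($matches) {
                $root = realpath(public_path());
                $path = realpath(public_path($matches[1]));

                // realpath() returns false for missing files. The prefix check
                // rejects paths like /../.env that escape the public directory.
                if ($root !== false && $path !== false
                    && str_starts_with($path, $root . DIRECTORY_SEPARATOR)) {
                    $mime = mime_content_type($path);
                    $data = base64_encode(file_get_contents($path));
                    return 'src="data:' . $mime . ';base64,' . $data . '"';
                }

                // Leave the attribute untouched if the file cannot be resolved
                return $matches[0];
            },
            $html
        );
    }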
}
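
For the sanitizing step, one common approach is to escape user-provided text at the moment it is merged into the template, rather than cleaning the whole HTML string afterwards. The sketch below shows that swapped-in technique; `renderNotes` is a hypothetical helper, and templating engines such as Blade or Twig typically escape output by default.

```typescript
// Escapes the five characters that can change HTML structure.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeHtml(text: string): string {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

// Hypothetical merge helper: user-provided notes are escaped before insertion.
function renderNotes(notes: string): string {
    return '<p class="notes">' + escapeHtml(notes) + "</p>";
}
```

With this in place, a note containing a script tag shows up in the PDF as literal text instead of being interpreted by the rendering engine.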

Post-processing

After generating the PDF, you might need to:

  1. Add PDF metadata: Title, author, creation date, keywords
  2. Set security flags: Restrict printing, copying, or editing (these permissions are advisory and depend on the viewer honoring them)
  3. Add watermarks: "DRAFT", "CONFIDENTIAL", "COPY"
  4. Apply digital signatures: For legal validity
  5. Compress: Optimize file size for email attachments

4. Storage and Delivery

Once generated, PDFs need to go somewhere. The storage strategy depends on your use case:

Ephemeral (generate on demand, don't store): Used for previews or when real-time data accuracy is more important than performance. No storage cost, but can't audit or reference later.

Temporary storage (keep for N days, then delete): Used for downloads and email attachments. Store in S3 with a lifecycle policy:

{
    "Rules": [
        {
            "Status": "Enabled",
            "Expiration": { "Days": 30 },
            "Filter": { "Prefix": "generated-pdfs/temporary/" }
        }
    ]
}

Permanent storage (keep forever): Used for invoices, contracts, compliance documents. These are legal records.

s3://documents/
  invoices/
    2026/
      01/
        INV-2026-0001.pdf
        INV-2026-0002.pdf
      02/
        ...
  contracts/
    ...
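
A layout like this can be derived from the document itself. The following is a small sketch, assuming keys are built from the issue date in UTC; the `storageKey` helper is hypothetical.

```typescript
// Builds a key such as "invoices/2026/01/INV-2026-0001.pdf".
function storageKey(kind: string, documentNumber: string, issuedAt: Date): string {
    const year = issuedAt.getUTCFullYear();
    const month = issuedAt.getUTCMonth() + 1; // getUTCMonth() is zero-based
    const paddedMonth = (month < 10 ? "0" : "") + month;
    return kind + "/" + year + "/" + paddedMonth + "/" + documentNumber + ".pdf";
}
```

Using UTC keeps a document issued near midnight in the same folder regardless of which server generated it.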

Delivery Patterns

Direct download: Streaming the PDF to the user's browser.

return response()->streamDownload(function () use ($pdf) {
    echo $pdf;
}, 'invoice.pdf', [
    'Content-Type' => 'application/pdf',
]);

Signed URL: Generate a temporary, secure URL for download. The user can access the PDF for a limited time without authentication.

$url = Storage::temporaryUrl(
    'invoices/INV-2026-0001.pdf',
    now()->addMinutes(30)
);

Email attachment: Attach the PDF directly to an email. Note: Most email providers limit attachment size to 10-25MB.

Webhook notification: Notify external systems that a document is ready:

{
    "event": "document.generated",
    "document_id": "doc_abc123",
    "download_url": "https://...",
    "expires_at": "2026-02-11T12:00:00Z"
}

Scaling Patterns

Pattern 1: Queue-Based Processing

For any volume beyond a handful of PDFs per minute, use a job queue:

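// Dispatch the job (non-blocking)
GenerateInvoicePdf::dispatch($invoice);

// The job (in its own file)
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Filesystem\Filesystem;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class GenerateInvoicePdf implements ShouldQueue
{
    // Dispatchable provides the static dispatch() used above;
    // SerializesModels stores only the model identifier on the queue.
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 3;    // Total attempts before the job is marked as failed
    public int $backoff = 30; // Seconds to wait between attempts

    public function __construct(private Invoice $invoice) {}

    // Inject the filesystem contract; the Storage facade cannot be used as an instance.
    public function handle(PdfRenderer $renderer, Filesystem $storage): void
    {
        $data = InvoiceTemplateData::fromInvoice($this->invoice);
        $pdf = $renderer->render('invoice', $data);

        $path = "invoices/{$this->invoice->formatted_number}.pdf";
        $storage->put($path, $pdf);

        $this->invoice->update(['pdf_path' => $path]);

        // Notify or email
        SendInvoiceEmail::dispatch($this->invoice);
    }
}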

Benefits:

  • Retry on failure: If the PDF engine crashes, the job retries automatically
  • Rate limiting: Control how many PDFs are generated simultaneously
  • Priority queues: Urgent documents (user downloads) get higher priority than batch jobs
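
The priority idea can be illustrated with a small in-memory queue. This is a sketch only: production systems would normally rely on their queue backend's own priority or multi-queue support, and the `PdfJob` shape here is hypothetical.

```typescript
type Priority = "high" | "low";

interface PdfJob {
    id: string;
    priority: Priority;
}

class PriorityJobQueue {
    private high: PdfJob[] = [];
    private low: PdfJob[] = [];

    enqueue(job: PdfJob): void {
        (job.priority === "high" ? this.high : this.low).push(job);
    }

    // Returns the next job, draining high-priority work first (FIFO within a lane).
    dequeue(): PdfJob | undefined {
        return this.high.length > 0 ? this.high.shift() : this.low.shift();
    }

    // Total number of waiting jobs across both lanes.
    get depth(): number {
        return this.high.length + this.low.length;
    }
}
```

A strict two-lane scheme like this can starve batch jobs under sustained high-priority load, so real deployments often reserve some workers for the low-priority lane.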

Pattern 2: Worker Pool for Headless Browsers

If using Puppeteer/Playwright, managing browser instances is critical:

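import puppeteer from 'puppeteer';

class BrowserPool {
    constructor(maxInstances = 4) {
        this.maxInstances = maxInstances;
        this.available = [];
        this.inUse = new Set();
        this.launching = 0; // Launches in flight, counted against the limit
        this.waiting = [];
    }

    async acquire() {
        // Reuse an idle browser if one exists
        if (this.available.length > 0) {
            const browser = this.available.pop();
            this.inUse.add(browser);
            return browser;
        }

        // Count in-flight launches too. Otherwise concurrent callers all pass
        // this check before the first launch resolves and exceed the limit.
        if (this.inUse.size + this.launching < this.maxInstances) {
            this.launching++;
            try {
                const browser = await puppeteer.launch({
                    // 'new' was a transitional value; recent versions accept true
                    headless: true,
                    // --no-sandbox disables Chromium's sandbox. Use it only when
                    // the container itself is the isolation boundary.
                    args: ['--no-sandbox', '--disable-dev-shm-usage'],
                });
                this.inUse.add(browser);
                return browser;
            } finally {
                this.launching--;
            }
        }

        // All instances busy: wait until release() hands one over
        return new Promise((resolve) => {
            this.waiting.push(resolve);
        });
    }

    release(browser) {
        this.inUse.delete(browser);
        if (this.waiting.length > 0) {
            // Hand the browser straight to the longest-waiting caller
            const resolve = this.waiting.shift();
            this.inUse.add(browser);
            resolve(browser);
        } else {
            this.available.push(browser);
        }
    }
}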

Pattern 3: Caching Compiled Templates

If the same template is used repeatedly (which it almost always is), cache the compiled version:

class TemplateCache {
    public function getCompiled(string $templateId, int $version): string
    {
        $key = "template:{$templateId}:v{$version}:compiled";

        return Cache::remember($key, 86400, function () use ($templateId, $version) {
            $template = Template::where('id', $templateId)
                ->where('version', $version)
                ->firstOrFail();

            return $this->compile($template->content);
        });
    }

    private function compile(string $rawTemplate): string
    {
        // Inline CSS, resolve static assets, optimize HTML
        return (new HtmlPreprocessor())->process($rawTemplate);
    }
}

This avoids re-reading, re-parsing, and re-inlining CSS for every single generation.

Pattern 4: Dedicated Generation Service

As your system grows, extract PDF generation into a standalone service:

┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│  Main App   │────▶│  Document Service │────▶│   Storage   │
│  (Laravel)  │     │  (Go/Node/Python) │     │   (S3)      │
└─────────────┘     └──────────────────┘     └─────────────┘
      │                      │
      │  POST /generate      │  Internal API
      │  {template, data}    │

Benefits:

  • Scale the document service independently from your main application
  • Use a language/runtime optimized for PDF generation (Go with native PDF libraries is very fast)
  • Isolate browser dependencies — your main application doesn't need Chromium installed
  • Different teams can own different services

If you don't want to build and maintain this service yourself, this is essentially what managed PDF APIs provide out of the box. Services like PDF-API.io give you the API endpoint, template management, and rendering engine without the operational overhead.

Error Handling and Resilience

PDF generation fails more often than you'd expect. Common failure modes:

  1. Rendering timeouts: Complex layouts or large images can exceed time limits
  2. Memory exhaustion: Headless browsers with large documents can OOM
  3. Font loading failures: External font CDNs being slow or down
  4. Invalid data: Null values, unexpected types, empty arrays causing template errors
  5. External service downtime: If using an API-based generator
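
Invalid data is the cheapest of these failures to catch, because it can be detected before the renderer starts. A minimal sketch, assuming a hypothetical `InvoiceInput` shape:

```typescript
interface LineItem {
    description: string;
    quantity: number;
    unitPriceCents: number;
}

interface InvoiceInput {
    invoiceNumber: string | null;
    lineItems: LineItem[] | null;
}

// Returns human-readable problems; an empty list means the data can be rendered.
function validateInvoiceInput(input: InvoiceInput): string[] {
    const errors: string[] = [];
    if (!input.invoiceNumber) {
        errors.push("invoiceNumber is missing");
    }
    const items = input.lineItems;
    if (!items || items.length === 0) {
        errors.push("lineItems is empty");
        return errors;
    }
    items.forEach((item, index) => {
        if (!isFinite(item.quantity) || item.quantity <= 0) {
            errors.push("lineItems[" + index + "].quantity must be a positive number");
        }
        if (!isFinite(item.unitPriceCents) || item.unitPriceCents < 0) {
            errors.push("lineItems[" + index + "].unitPriceCents must not be negative");
        }
    });
    return errors;
}
```

Rejecting bad input here produces a specific error message instead of a template exception halfway through a render.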

Build Resilient Generation

class ResilientPdfGenerator
{
    public function generate(string $template, array $data): string
    {
        try {
            return $this->attemptGeneration($template, $data);
        } catch (TimeoutException $e) {
            // Retry with simplified template (no images, basic fonts)
            Log::warning("PDF generation timed out, retrying with simplified template", [
                'template' => $template,
            ]);
            return $this->attemptGeneration($template . '-simplified', $data);
        } catch (RenderException $e) {
            // Log the error with enough context to debug
            Log::error("PDF generation failed", [
                'template' => $template,
                'error' => $e->getMessage(),
                'data_keys' => array_keys($data),
            ]);
            throw $e;
        }
    }
}

Fallback Strategies

  1. Simplified templates: If the full template fails, render a text-only version
  2. Cached PDFs: If the document hasn't changed, serve the previously generated version
  3. Retry with exponential backoff: For transient failures (network issues, service restarts)
  4. Alert and skip: For batch processing, log the failure and continue with the next document
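
Retry with exponential backoff can be sketched as follows. The function names are hypothetical, and the `sleep` function is passed in as a parameter so the timing can be replaced in tests; in production it would typically wrap a timer.

```typescript
// Delay before the retry that follows attempt number `attempt` (1-based):
// base, 2x base, 4x base, and so on, never exceeding the cap.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
    return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

async function retryWithBackoff<T>(
    operation: () => Promise<T>,
    maxAttempts: number,
    baseMs: number,
    capMs: number,
    sleep: (ms: number) => Promise<void>
): Promise<T> {
    let lastError: unknown = new Error("maxAttempts must be at least 1");
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            // Wait only if another attempt will follow
            if (attempt < maxAttempts) {
                await sleep(backoffDelayMs(attempt, baseMs, capMs));
            }
        }
    }
    // Every attempt failed: surface the most recent error to the caller
    throw lastError;
}
```

This sketch retries on every error. In practice you would retry only failures known to be transient and let data errors fail immediately.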

Monitoring and Observability

In production, you need visibility into your document pipeline:

Key Metrics

Metric                       What it tells you            Alert threshold
pdf.generation.duration      How long each PDF takes      p95 > 10s
pdf.generation.errors        How often generation fails   > 1% error rate
pdf.generation.queue_depth   How many PDFs are waiting    > 1000
pdf.output.file_size         How large the PDFs are       > 10MB
pdf.generation.memory        Peak memory during generation  > 80% of limit
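
To make the duration and error-rate thresholds concrete, here is a rough sketch of how they could be evaluated over a window of samples. It uses the nearest-rank definition of a percentile, which is one of several in common use; monitoring tools may interpolate differently. The function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) throw new Error("no samples");
    const sorted = samples.slice().sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
}

// Applies the duration and error-rate thresholds from the metrics above.
function checkAlerts(durationsMs: number[], failures: number, total: number): string[] {
    const alerts: string[] = [];
    if (durationsMs.length > 0 && percentile(durationsMs, 95) > 10000) {
        alerts.push("p95 duration above 10s");
    }
    if (total > 0 && failures / total > 0.01) {
        alerts.push("error rate above 1%");
    }
    return alerts;
}
```

Alerting on p95 rather than the mean means a single slow render does not page anyone, while a sustained slowdown does.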

Structured Logging

{
    "event": "pdf.generated",
    "template": "invoice",
    "template_version": 3,
    "duration_ms": 1250,
    "file_size_bytes": 245000,
    "page_count": 2,
    "data_items": 15,
    "renderer": "weasyprint",
    "success": true
}

This makes it easy to query: "Show me all invoices that took more than 5 seconds to generate last week" or "What's the average file size for contract PDFs?"
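
As an illustration, both questions reduce to simple filters once the events are structured. The sketch below runs over an in-memory array and assumes a `PdfLogEvent` shape matching a subset of the fields above; a real deployment would run the equivalent query in its log platform.

```typescript
interface PdfLogEvent {
    event: string;
    template: string;
    duration_ms: number;
    file_size_bytes: number;
    success: boolean;
}

// "Show me all invoices that took more than 5 seconds to generate"
function slowGenerations(events: PdfLogEvent[], template: string, thresholdMs: number): PdfLogEvent[] {
    return events.filter((e) => e.template === template && e.duration_ms > thresholdMs);
}

// "What's the average file size for contract PDFs?" Returns null when there is no data.
function averageFileSize(events: PdfLogEvent[], template: string): number | null {
    const matching = events.filter((e) => e.template === template && e.success);
    if (matching.length === 0) return null;
    const totalBytes = matching.reduce((sum, e) => sum + e.file_size_bytes, 0);
    return totalBytes / matching.length;
}
```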

Real-World Architecture Examples

E-Commerce: Shopify's Approach

Shopify generates millions of order documents daily. Their approach:

  • Templates are stored per-merchant in a template store
  • Rendering is done asynchronously via a dedicated service
  • PDFs are cached and served via CDN
  • Merchants can customize templates through a Liquid-based editor
  • Documents are generated on first access, not at order time

Financial: Stripe's Invoice Generation

Stripe generates invoices for millions of businesses:

  • Invoices are generated lazily (only when accessed for the first time)
  • HTML templates with data binding (no user-editable templates)
  • Multi-currency and multi-language support
  • PDF/A compliance for certain regions
  • Stored permanently with full audit trail

The Common Pattern

Both architectures share the same fundamentals:

  1. Clear separation between data, templates, and rendering
  2. Asynchronous generation for anything not user-initiated
  3. Caching at multiple levels (template compilation, generated PDF)
  4. Template versioning for audit trails
  5. Structured monitoring for operational visibility
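
The "generate on first access, then cache" behavior described in both examples can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the `LazyDocumentStore` class is hypothetical, the generator is synchronous for brevity, and a real system would persist the result to object storage rather than a `Map`.

```typescript
class LazyDocumentStore {
    private cache = new Map<string, string>();
    private generate: (documentId: string) => string;

    constructor(generate: (documentId: string) => string) {
        this.generate = generate;
    }

    // Generates the document on first access and serves the stored copy afterwards.
    get(documentId: string): string {
        const cached = this.cache.get(documentId);
        if (cached !== undefined) {
            return cached;
        }
        const pdf = this.generate(documentId);
        this.cache.set(documentId, pdf);
        return pdf;
    }
}
```

Documents that nobody opens are never rendered, which is where the savings come from at large volumes.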

Conclusion

Document automation is one of those domains that's deceptively complex. The first PDF is easy. The thousandth PDF, with its edge cases, compliance requirements, and scaling needs, is where architecture matters.

The key principles:

  1. Separate concerns: Data, templates, rendering, and storage should be independent, swappable layers
  2. Default to async: Only use synchronous generation for on-demand downloads
  3. Define data contracts: Don't pass raw database objects to templates
  4. Version everything: Templates, data snapshots, and generated outputs
  5. Plan for failure: Retries, fallbacks, and monitoring from day one
  6. Start simple, evolve intentionally: A well-structured monolithic approach is better than a premature microservice architecture

Whether you're building for 100 documents a month or 100,000 a day, these patterns provide a foundation you can grow into without rebuilding from scratch.


Want the architecture without the infrastructure? PDF-API.io gives you template management, a visual editor, versioning, and a rendering pipeline — all behind a simple API. Start building for free.
