How We Eliminated Puppeteer Memory Issues by Switching to an API
We spent four months fighting a puppeteer memory leak in our screenshot pipeline before we admitted the problem wasn't fixable. Not in the "we're bad at code" sense. In the "Chrome wasn't designed to run as a server" sense. We switched to SnapRender, cut our monthly cost from $240 to $99, and gave two engineers their weekends back. Here's the full story, with actual numbers.
The Setup
We run a SaaS that generates weekly reports for e-commerce stores. Each report includes 8-12 screenshots of competitor websites, product pages, and search results. At peak, we process about 50,000 screenshots per month across 4,000 customer reports.
The original architecture looked like this:
```
┌─────────────┐     ┌──────────────────┐     ┌────────────┐
│  Job Queue  │────▶│   Worker (x3)    │────▶│  S3 Bucket │
│  (BullMQ)   │     │   Node.js +      │     │  (output)  │
│             │     │   Puppeteer +    │     │            │
│             │     │   Chrome         │     │            │
└─────────────┘     └──────────────────┘     └────────────┘
                            │
                    Docker containers
                     on 3x 4GB VMs
```
Three worker containers, each running Puppeteer with headless Chrome. Each container had a 4GB memory limit. The math seemed fine: 50K screenshots / 30 days / 3 workers = ~550 screenshots per worker per day. Should be easy.
The Problems Started at Month Two
First month was fine. Month two, we started seeing OOM kills. A container would start at 500MB RSS, process screenshots for 8-12 hours, and eventually hit the 4GB limit. Docker would kill it, Kubernetes would restart it, and we'd lose whatever screenshots were in progress.
Our monitoring showed the classic puppeteer memory leak pattern:
```
Hour 0:    RSS 480MB   (Chrome baseline)
Hour 2:    RSS 890MB   (+410MB)
Hour 4:    RSS 1.3GB   (+410MB)
Hour 6:    RSS 1.7GB   (+400MB)
Hour 8:    RSS 2.2GB   (+500MB)  ← GC getting less effective
Hour 10:   RSS 2.9GB   (+700MB)  ← fragmentation accelerating
Hour 12:   RSS 3.8GB   (+900MB)
Hour 12.5: OOM KILLED
```
Memory grew at roughly 200MB per hour of continuous processing, accelerating as V8 heap fragmentation got worse.
The Band-Aids
We tried every fix the internet recommends. Here's what we did and what actually happened:
Band-Aid 1: Aggressive page.close()
The most common advice for a puppeteer memory leak. We audited every code path and made sure page.close() was called in every finally block. We even added a cleanup sweep that closed any pages left open after 30 seconds.
```javascript
// We added this sweeper
setInterval(async () => {
  const pages = await browser.pages();
  for (const page of pages) {
    if (page.url() !== 'about:blank') {
      await page.close().catch(() => {});
    }
  }
}, 30000);
```
Result: Memory growth slowed from 200MB/hour to 150MB/hour. Still hit OOM, just took 14 hours instead of 12.
Band-Aid 2: Browser Recycling
We restarted Chrome every 100 requests.
```javascript
let requestCount = 0;
const MAX_REQUESTS = 100;

async function getPage() {
  if (requestCount >= MAX_REQUESTS) {
    await browser.close();
    browser = await puppeteer.launch(launchOptions);
    requestCount = 0;
  }
  requestCount++;
  return browser.newPage();
}
```
Result: This actually helped. Memory dropped back to ~500MB every 100 requests. But it introduced a new problem: the 3-5 second gap during browser restart where requests would fail. We added queuing to handle it, which meant more code to maintain.
Memory still grew between recycles (500MB to 1.2GB over 100 requests), so we dropped to recycling every 50 requests. Then 30. Each restart discarded warm caches, making screenshots slower.
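The queuing we added around restarts can be sketched as a shared browser promise. This isn't our exact code, just the shape of the fix: every caller awaits the same promise, so during a recycle, in-flight requests queue on the restart instead of failing. The `launch` parameter stands in for `() => puppeteer.launch(launchOptions)`.

```javascript
// Sketch of a restart guard, generalized: `launch` is any async browser
// factory. All callers share one browser promise; when a recycle triggers,
// they attach to the same in-flight close + relaunch instead of failing.
function createRecycler(launch, maxRequests = 50) {
  let browserPromise = launch();
  let requestCount = 0;

  return function getBrowser() {
    if (requestCount >= maxRequests) {
      requestCount = 0;
      // Chain the restart onto the shared promise so every concurrent
      // caller waits for the same close + relaunch.
      browserPromise = browserPromise.then(async (browser) => {
        await browser.close();
        return launch();
      });
    }
    requestCount++;
    return browserPromise;
  };
}
```

The downside is visible right in the sketch: every recycle throws away the warm browser, which is exactly why our screenshot times got worse as the threshold dropped.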
Band-Aid 3: Reduced Concurrency
We dropped from 5 concurrent pages per worker to 2. Less memory pressure, but throughput tanked. We had to add a fourth worker to compensate, which meant more infrastructure cost.
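For anyone reproducing the concurrency cap: a minimal promise semaphore does it without a dependency. This is a sketch (the `limit` helper and its names are ours, not from any library); work beyond the cap queues instead of opening more Chrome pages.

```javascript
// Minimal promise semaphore: at most `max` tasks run at once; the rest
// wait in a FIFO queue and start as slots free up.
function limit(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };
  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}
```

Callers wrap the page work: `const withPage = limit(2); await withPage(() => captureScreenshot(url, opts));`. Dropping `max` from 5 to 2 is a one-character change here, which is roughly how easy it was to trade throughput for memory headroom.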
Band-Aid 4: Blocked Resources
We blocked fonts, media, and stylesheets we didn't need:
```javascript
await page.setRequestInterception(true);
page.on('request', req => {
  if (['font', 'media', 'stylesheet'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});
```
Result: Each page used less memory, but screenshots looked wrong without stylesheets. We had to bring CSS back. Font blocking saved maybe 10-20MB per page, which helped but didn't solve the underlying fragmentation.
The Real State After Band-Aids
After implementing everything:
| Metric | Before Fixes | After All Band-Aids |
|---|---|---|
| Time to OOM | 12 hours | Never (recycling prevents it) |
| Restart frequency | N/A | Every 50 requests |
| Workers needed | 3 | 4 (reduced concurrency) |
| Avg screenshot time | 3.2s | 4.8s (cold caches from recycling) |
| Failed screenshots | 2-3% (OOM) | 1-2% (restart gaps) |
| Monthly infra cost | $180 | $240 (extra worker) |
We stopped the OOM kills, but at the cost of slower screenshots, more infrastructure, and a fragile system. Any change to the codebase risked breaking the carefully tuned recycle thresholds.
The Human Cost
Here's what doesn't show up in monitoring dashboards: engineering time.
Two of us spent roughly 10-15 hours per month on Puppeteer-related work:
- Tuning recycle thresholds when we onboarded customers with heavier pages
- Debugging screenshot failures caused by restart timing
- Updating Chrome/Puppeteer versions (which sometimes changed memory behavior)
- Investigating "why are screenshots slow today" (answer: always memory pressure)
- Waking up to OOM alerts when a customer added pages that used more memory than expected
At our loaded engineering cost, that's $3,000-4,000/month of salary going to babysitting a screenshot pipeline. For a deeper breakdown of these hidden costs, see The Real Cost of Self-Hosting Screenshots.
Evaluating the Alternatives
We looked at three screenshot APIs. Our requirements: handle 50K screenshots/month, support PNG and PDF output, provide reasonable response times, don't cost more than our current infrastructure.
ScreenshotOne
Pricing at our volume: their Growth plan covers 25K screenshots for $79/month, but we needed 50K. That meant the Business plan at $259/month. API was solid, documentation was good. But the price was hard to justify.
Urlbox
Pricing at our volume: their plans start at $49/month for 5,000 screenshots. At 50K, we'd need a custom plan. Based on their per-screenshot pricing, the estimate was $200-250/month. Good feature set, but we'd be paying for features (thumbnails, retina) we didn't need.
SnapRender
Pricing at our volume: the Business plan at $79/month covers 50,000 screenshots. That's exactly our volume. No feature gating, so every plan gets the full API: device emulation, ad blocking, cookie banner removal, dark mode, full-page capture. For a detailed pricing comparison across providers, check Screenshot API Pricing Compared.
| API | Monthly Cost (50K) | Per-Screenshot |
|---|---|---|
| ScreenshotOne | $259 | $0.0052 |
| Urlbox | ~$225 | ~$0.0045 |
| SnapRender | $79 | $0.0016 |
| Self-hosted (infra only) | $240 | $0.0048 |
| Self-hosted (incl. eng time) | $3,500+ | $0.070 |
SnapRender was cheaper than our self-hosted infrastructure cost, and dramatically cheaper when you factor in engineering time.
The Migration
I expected the migration to take a week. It took one afternoon.
Our Puppeteer calls looked like this:
```javascript
async function captureScreenshot(url, options) {
  const page = await getPage();
  try {
    await page.setViewport({
      width: options.width || 1280,
      height: options.height || 720,
    });
    await page.goto(url, {
      waitUntil: 'networkidle0',
      timeout: 30000,
    });
    const buffer = await page.screenshot({
      type: options.format || 'png',
      fullPage: options.fullPage || false,
    });
    return buffer;
  } finally {
    await page.close().catch(() => {});
  }
}
```
The SnapRender replacement using their npm SDK:
```javascript
const { SnapRender } = require('snaprender');

const client = new SnapRender({ apiKey: process.env.SNAPRENDER_API_KEY });

async function captureScreenshot(url, options) {
  const response = await client.screenshot(url, {
    format: options.format || 'png',
    width: options.width || 1280,
    height: options.height || 720,
    full_page: options.fullPage || false,
    block_ads: true,
    remove_cookie_banners: true,
  });
  return response.buffer();
}
```
The function signature stayed the same. The callers didn't need to change. We dropped the browser pool, the recycle logic, the memory monitoring, the cleanup sweeper, and the page timeout watcher. About 400 lines of infrastructure code deleted.
We also got block_ads and remove_cookie_banners for free. Those were features we'd wanted but never had time to implement in Puppeteer (they require maintaining filter lists and injecting CSS/JS).
The After Architecture
```
┌─────────────┐     ┌──────────────────┐     ┌────────────┐
│  Job Queue  │────▶│   Worker (x1)    │────▶│  S3 Bucket │
│  (BullMQ)   │     │   Node.js        │     │  (output)  │
│             │     │  (HTTP calls     │     │            │
│             │     │  to SnapRender)  │     │            │
└─────────────┘     └──────────────────┘     └────────────┘
                            │
                     Single container
                    512MB memory limit
```
One worker. 512MB memory limit. No Chrome. No Puppeteer. Just HTTP requests to SnapRender's API.
Results After Three Months
| Metric | Self-Hosted Puppeteer | SnapRender API |
|---|---|---|
| Workers | 4 containers, 4GB each | 1 container, 512MB |
| Monthly infra cost | $240 | $20 (tiny VM) |
| API cost | $0 | $79 |
| Total monthly cost | $240 + eng time | $99 |
| Avg screenshot time (fresh) | 4.8s | 2-4s |
| Avg screenshot time (cached) | N/A | <200ms |
| Failed screenshots | 1-2% | <0.1% |
| OOM incidents/month | 0 (with recycling) | 0 (no browser) |
| Eng hours on screenshots | 10-15/month | <1/month |
| Memory monitoring code | 400+ lines | 0 lines |
The cached response time was a pleasant surprise. SnapRender caches screenshots with a configurable TTL. Since our competitor analysis often screenshots the same pages across multiple customer reports, about 40% of our requests hit the cache and return in under 200ms. Our Puppeteer setup never had caching because building a reliable cache layer was another project we never got to. (For more on how caching changes the performance equation, see From Timeout Hell to 200ms.)
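SnapRender does this caching server-side, but the idea is simple enough to sketch client-side for anyone who wants a picture of it: memoize results by URL plus options, and expire entries after a TTL. The `withCache` helper and its names are ours for illustration, not part of any SDK.

```javascript
// Illustration of the caching idea: memoize an async screenshot function
// by URL + options, expiring entries after `ttlMs`. Repeated requests for
// the same page (our competitor-analysis case) skip the capture entirely.
function withCache(capture, ttlMs = 60 * 60 * 1000) {
  const cache = new Map(); // key -> { value, expires }
  return async (url, options = {}) => {
    const key = url + JSON.stringify(options);
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.value; // cache hit
    const value = await capture(url, options);
    cache.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}
```

With roughly 40% of our requests repeating, this kind of layer would have been a big win even self-hosted; we just never got to it.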
What We Kept Puppeteer For
We didn't eliminate Puppeteer entirely. We still use it for:
- E2E testing: Puppeteer runs our integration tests in CI. It processes maybe 50 pages per test run, so the puppeteer memory leak never becomes an issue in short-lived processes.
- One-off scraping: When we need to extract structured data from a page (not just a screenshot), Puppeteer's evaluate() is still the right tool.
The key distinction: Puppeteer is fine for short-lived, low-volume tasks. It becomes a problem when you run it as a long-lived service processing thousands of requests.
The Math That Made the Decision
If you're evaluating whether to keep fighting your own puppeteer memory leak or switch to an API, here's the calculation:
Self-hosted cost per month:
- Infrastructure: VMs, containers, load balancers
- Engineering time: hours/month * loaded hourly cost
- Incident cost: downtime events * impact
API cost per month:
- Plan price at your volume
- Minimal infrastructure for the API client
For us:
- Self-hosted: $240 infra + ~$3,500 eng time = $3,740/month
- SnapRender: $79 API + $20 infra + ~$300 eng time = $399/month
Even ignoring engineering time completely, the all-in monthly spend was $240 self-hosted versus $99 with the API. The API was cheaper before we even counted the hours we got back.
Migration Checklist
If you're considering the same move, here's what to plan for:
- Audit your Puppeteer usage. Separate screenshots from browser automation (scraping, testing, form filling). Only screenshots move to an API.
- Check your volume. SnapRender's pricing tiers: Free (500/mo), Starter $9 (2K/mo), Growth $29 (10K/mo), Business $79 (50K/mo), Scale $199 (200K/mo).
- Test the API with your actual URLs. Some pages behave differently across screenshot services. Run your top 50 URLs through the SnapRender API and compare output quality. The complete API guide walks through every parameter.
- Swap the implementation. If you have a clean abstraction over Puppeteer (a single captureScreenshot function), you can swap in one afternoon. If Puppeteer calls are scattered across your codebase, budget a day to create the abstraction first.
- Remove infrastructure. Scale down workers, remove memory monitoring, delete the browser pool code. This is the satisfying part.
- Monitor for a week. Watch error rates and screenshot quality. We found two edge cases in the first week (a site that required a specific user agent, and a page that needed a click before the content loaded). Both were fixable with SnapRender's API parameters.
The whole process took us about 6 hours of actual work spread over two days. The hardest part wasn't technical. It was admitting that the problem we'd spent four months on was never going to be fully solved by writing better code. Sometimes the right fix is to make it someone else's problem.