How We Eliminated Puppeteer Memory Issues by Switching to an API
We spent four months fighting a puppeteer memory leak in our screenshot pipeline before we admitted the problem wasn't fixable. Not in the "we're bad at code" sense. In the "Chrome wasn't designed to run as a server" sense. We switched to SnapRender, cut our monthly cost from $240 to $99, and gave two engineers their weekends back. Here's the full story, with actual numbers.
The Setup
We run a SaaS that generates weekly reports for e-commerce stores. Each report includes 8-12 screenshots of competitor websites, product pages, and search results. At peak, we process about 50,000 screenshots per month across 4,000 customer reports.
The original architecture looked like this:
```
┌─────────────┐     ┌──────────────────┐     ┌────────────┐
│  Job Queue  │────▶│   Worker (x3)    │────▶│  S3 Bucket │
│  (BullMQ)   │     │   Node.js +      │     │  (output)  │
│             │     │   Puppeteer +    │     │            │
│             │     │   Chrome         │     │            │
└─────────────┘     └──────────────────┘     └────────────┘
                            │
                    Docker containers
                     on 3x 4GB VMs
```
Three worker containers, each running Puppeteer with headless Chrome. Each container had a 4GB memory limit. The math seemed fine: 50K screenshots / 30 days / 3 workers = ~550 screenshots per worker per day. Should be easy.
The Problems Started at Month Two
First month was fine. Month two, we started seeing OOM kills. A container would start at 500MB RSS, process screenshots for 8-12 hours, and eventually hit the 4GB limit. Docker would kill it, Kubernetes would restart it, and we'd lose whatever screenshots were in progress.
Our monitoring showed the classic puppeteer memory leak pattern:
```
Hour 0:    RSS 480MB   (Chrome baseline)
Hour 2:    RSS 890MB   (+410MB)
Hour 4:    RSS 1.3GB   (+410MB)
Hour 6:    RSS 1.7GB   (+400MB)
Hour 8:    RSS 2.2GB   (+500MB)  ← GC getting less effective
Hour 10:   RSS 2.9GB   (+700MB)  ← fragmentation accelerating
Hour 12:   RSS 3.8GB   (+900MB)
Hour 12.5: OOM KILLED
```
Memory grew at roughly 200MB per hour of continuous processing, accelerating as V8 heap fragmentation got worse.
The Band-Aids
We tried every fix the internet recommends. Here's what we did and what actually happened:
Band-Aid 1: Aggressive page.close()
The most common advice for a puppeteer memory leak. We audited every code path and made sure page.close() was called in every finally block. We even added a cleanup sweep that closed any pages left open after 30 seconds.
```javascript
// We added this sweeper
setInterval(async () => {
  const pages = await browser.pages();
  for (const page of pages) {
    if (page.url() !== 'about:blank') {
      await page.close().catch(() => {});
    }
  }
}, 30000);
```
Result: Memory growth slowed from 200MB/hour to 150MB/hour. Still hit OOM, just took 14 hours instead of 12.
Band-Aid 2: Browser Recycling
We restarted Chrome every 100 requests.
```javascript
let requestCount = 0;
const MAX_REQUESTS = 100;

async function getPage() {
  if (requestCount >= MAX_REQUESTS) {
    await browser.close();
    browser = await puppeteer.launch(launchOptions);
    requestCount = 0;
  }
  requestCount++;
  return browser.newPage();
}
```
Result: This actually helped. Memory dropped back to ~500MB every 100 requests. But it introduced a new problem: the 3-5 second gap during browser restart where requests would fail. We added queuing to handle it, which meant more code to maintain.
Memory still grew between recycles (500MB to 1.2GB over 100 requests), so we dropped to recycling every 50 requests. Then 30. Each restart discarded warm caches, making screenshots slower.
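The queuing we added around restarts can be sketched as a shared browser promise. This isn't our exact code, just the shape of the fix: every caller awaits the same promise, so during a recycle, in-flight requests queue on the restart instead of failing. The `launch` parameter stands in for `() => puppeteer.launch(launchOptions)`.

```javascript
// Sketch of a restart guard, generalized: `launch` is any async browser
// factory. All callers share one browser promise; when a recycle triggers,
// they attach to the same in-flight close + relaunch instead of failing.
function createRecycler(launch, maxRequests = 50) {
  let browserPromise = launch();
  let requestCount = 0;

  return function getBrowser() {
    if (requestCount >= maxRequests) {
      requestCount = 0;
      // Chain the restart onto the shared promise so every concurrent
      // caller waits for the same close + relaunch.
      browserPromise = browserPromise.then(async (browser) => {
        await browser.close();
        return launch();
      });
    }
    requestCount++;
    return browserPromise;
  };
}
```

The downside is visible right in the sketch: every recycle throws away the warm browser, which is exactly why our screenshot times got worse as the threshold dropped.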
Band-Aid 3: Reduced Concurrency
We dropped from 5 concurrent pages per worker to 2. Less memory pressure, but throughput tanked. We had to add a fourth worker to compensate, which meant more infrastructure cost.
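For anyone reproducing the concurrency cap: a minimal promise semaphore does it without a dependency. This is a sketch (the `limit` helper and its names are ours, not from any library); work beyond the cap queues instead of opening more Chrome pages.

```javascript
// Minimal promise semaphore: at most `max` tasks run at once; the rest
// wait in a FIFO queue and start as slots free up.
function limit(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };
  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}
```

Callers wrap the page work: `const withPage = limit(2); await withPage(() => captureScreenshot(url, opts));`. Dropping `max` from 5 to 2 is a one-character change here, which is roughly how easy it was to trade throughput for memory headroom.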
Band-Aid 4: Blocked Resources
We blocked fonts, media, and stylesheets we didn't need:
```javascript
await page.setRequestInterception(true);
page.on('request', req => {
  if (['font', 'media', 'stylesheet'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});
```
Result: Each page used less memory, but screenshots looked wrong without stylesheets. We had to bring CSS back. Font blocking saved maybe 10-20MB per page, which helped but didn't solve the underlying fragmentation.
The Real State After Band-Aids
After implementing everything:
| Metric | Before Fixes | After All Band-Aids |
|---|---|---|
| Time to OOM | 12 hours | Never (recycling prevents it) |
| Restart frequency | N/A | Every 50 requests |
| Workers needed | 3 | 4 (reduced concurrency) |
| Avg screenshot time | 3.2s | 4.8s (cold caches from recycling) |
| Failed screenshots | 2-3% (OOM) | 1-2% (restart gaps) |
| Monthly infra cost | $180 | $240 (extra worker) |
We stopped the OOM kills, but at the cost of slower screenshots, more infrastructure, and a fragile system. Any change to the codebase risked breaking the carefully tuned recycle thresholds.
The Human Cost
Here's what doesn't show up in monitoring dashboards: engineering time.
Two of us spent roughly 10-15 hours per month on Puppeteer-related work:
- Tuning recycle thresholds when we onboarded customers with heavier pages
- Debugging screenshot failures caused by restart timing
- Updating Chrome/Puppeteer versions (which sometimes changed memory behavior)
- Investigating "why are screenshots slow today" (answer: always memory pressure)
- Waking up to OOM alerts when a customer added pages that used more memory than expected
At our loaded engineering cost, that's $3,000-4,000/month of salary going to babysitting a screenshot pipeline. For a deeper breakdown of these hidden costs, see The Real Cost of Self-Hosting Screenshots.
Evaluating the Alternatives
We looked at three screenshot APIs. Our requirements: handle 50K screenshots/month, support PNG and PDF output, provide reasonable response times, don't cost more than our current infrastructure.
ScreenshotOne
Pricing at our volume: their Growth plan covers 25K screenshots for $79/month, but we needed 50K. That meant the Business plan at $259/month. API was solid, documentation was good. But the price was hard to justify.
Urlbox
Pricing at our volume: their plans start at $49/month for 5,000 screenshots. At 50K, we'd need a custom plan. Based on their per-screenshot pricing, the estimate was $200-250/month. Good feature set, but we'd be paying for features (thumbnails, retina) we didn't need.
SnapRender
Pricing at our volume: the Business plan at $79/month covers 50,000 screenshots. That's exactly our volume. No feature gating, so every plan gets the full API: device emulation, ad blocking, cookie banner removal, dark mode, full-page capture. For a detailed pricing comparison across providers, check Screenshot API Pricing Compared.
| API | Monthly Cost (50K) | Per-Screenshot |
|---|---|---|
| ScreenshotOne | $259 | $0.0052 |
| Urlbox | ~$225 | ~$0.0045 |
| SnapRender | $79 | $0.0016 |
| Self-hosted (infra only) | $240 | $0.0048 |
| Self-hosted (incl. eng time) | $3,500+ | $0.070 |
SnapRender was cheaper than our self-hosted infrastructure cost, and dramatically cheaper when you factor in engineering time.
The Migration
I expected the migration to take a week. It took one afternoon.
Our Puppeteer calls looked like this:
```javascript
async function captureScreenshot(url, options) {
  const page = await getPage();
  try {
    await page.setViewport({
      width: options.width || 1280,
      height: options.height || 720,
    });
    await page.goto(url, {
      waitUntil: 'networkidle0',
      timeout: 30000,
    });
    const buffer = await page.screenshot({
      type: options.format || 'png',
      fullPage: options.fullPage || false,
    });
    return buffer;
  } finally {
    await page.close().catch(() => {});
  }
}
```
The SnapRender replacement using their npm SDK:
```javascript
const { SnapRender } = require('snaprender');

const client = new SnapRender({ apiKey: process.env.SNAPRENDER_API_KEY });

async function captureScreenshot(url, options) {
  const response = await client.screenshot(url, {
    format: options.format || 'png',
    width: options.width || 1280,
    height: options.height || 720,
    full_page: options.fullPage || false,
    block_ads: true,
    remove_cookie_banners: true,
  });
  return response.buffer();
}
```
The function signature stayed the same. The callers didn't need to change. We dropped the browser pool, the recycle logic, the memory monitoring, the cleanup sweeper, and the page timeout watcher. About 400 lines of infrastructure code deleted.
We also got block_ads and remove_cookie_banners for free. Those were features we'd wanted but never had time to implement in Puppeteer (they require maintaining filter lists and injecting CSS/JS).
The After Architecture
```
┌─────────────┐     ┌──────────────────┐     ┌────────────┐
│  Job Queue  │────▶│   Worker (x1)    │────▶│  S3 Bucket │
│  (BullMQ)   │     │   Node.js        │     │  (output)  │
│             │     │  (HTTP calls     │     │            │
│             │     │  to SnapRender)  │     │            │
└─────────────┘     └──────────────────┘     └────────────┘
                            │
                     Single container
                    512MB memory limit
```
One worker. 512MB memory limit. No Chrome. No Puppeteer. Just HTTP requests to SnapRender's API.
Results After Three Months
| Metric | Self-Hosted Puppeteer | SnapRender API |
|---|---|---|
| Workers | 4 containers, 4GB each | 1 container, 512MB |
| Monthly infra cost | $240 | $20 (tiny VM) |
| API cost | $0 | $79 |
| Total monthly cost | $240 + eng time | $99 |
| Avg screenshot time (fresh) | 4.8s | 2-4s |
| Avg screenshot time (cached) | N/A | <200ms |
| Failed screenshots | 1-2% | <0.1% |
| OOM incidents/month | 0 (with recycling) | 0 (no browser) |
| Eng hours on screenshots | 10-15/month | <1/month |
| Memory monitoring code | 400+ lines | 0 lines |
The cached response time was a pleasant surprise. SnapRender caches screenshots with a configurable TTL. Since our competitor analysis often screenshots the same pages across multiple customer reports, about 40% of our requests hit the cache and return in under 200ms. Our Puppeteer setup never had caching because building a reliable cache layer was another project we never got to. (For more on how caching changes the performance equation, see From Timeout Hell to 200ms.)
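SnapRender does this caching server-side, but the idea is simple enough to sketch client-side for anyone who wants a picture of it: memoize results by URL plus options, and expire entries after a TTL. The `withCache` helper and its names are ours for illustration, not part of any SDK.

```javascript
// Illustration of the caching idea: memoize an async screenshot function
// by URL + options, expiring entries after `ttlMs`. Repeated requests for
// the same page (our competitor-analysis case) skip the capture entirely.
function withCache(capture, ttlMs = 60 * 60 * 1000) {
  const cache = new Map(); // key -> { value, expires }
  return async (url, options = {}) => {
    const key = url + JSON.stringify(options);
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.value; // cache hit
    const value = await capture(url, options);
    cache.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}
```

With roughly 40% of our requests repeating, this kind of layer would have been a big win even self-hosted; we just never got to it.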
What We Kept Puppeteer For
We didn't eliminate Puppeteer entirely. We still use it for:
- E2E testing: Puppeteer runs our integration tests in CI. It processes maybe 50 pages per test run, so the puppeteer memory leak never becomes an issue in short-lived processes.
- One-off scraping: When we need to extract structured data from a page (not just a screenshot), Puppeteer's evaluate() is still the right tool.
The key distinction: Puppeteer is fine for short-lived, low-volume tasks. It becomes a problem when you run it as a long-lived service processing thousands of requests.
The Math That Made the Decision
If you're evaluating whether to keep fighting your own puppeteer memory leak or switch to an API, here's the calculation:
Self-hosted cost per month:
- Infrastructure: VMs, containers, load balancers
- Engineering time: hours/month * loaded hourly cost
- Incident cost: downtime events * impact
API cost per month:
- Plan price at your volume
- Minimal infrastructure for the API client
For us:
- Self-hosted: $240 infra + ~$3,500 eng time = $3,740/month
- SnapRender: $79 API + $20 infra + ~$300 eng time = $399/month
Even ignoring engineering time completely, the all-in monthly spend was $240 self-hosted versus $99 with the API. The API was cheaper before we even counted the hours we got back.
Migration Checklist
If you're considering the same move, here's what to plan for:
- Audit your Puppeteer usage. Separate screenshots from browser automation (scraping, testing, form filling). Only screenshots move to an API.
- Check your volume. SnapRender's pricing tiers: Free (500/mo), Starter $9 (2K/mo), Growth $29 (10K/mo), Business $79 (50K/mo), Scale $199 (200K/mo).
- Test the API with your actual URLs. Some pages behave differently across screenshot services. Run your top 50 URLs through the SnapRender API and compare output quality. The complete API guide walks through every parameter.
- Swap the implementation. If you have a clean abstraction over Puppeteer (a single captureScreenshot function), you can swap in one afternoon. If Puppeteer calls are scattered across your codebase, budget a day to create the abstraction first.
- Remove infrastructure. Scale down workers, remove memory monitoring, delete the browser pool code. This is the satisfying part.
- Monitor for a week. Watch error rates and screenshot quality. We found two edge cases in the first week (a site that required a specific user agent, and a page that needed a click before the content loaded). Both were fixable with SnapRender's API parameters.
The whole process took us about 6 hours of actual work spread over two days. The hardest part wasn't technical. It was admitting that the problem we'd spent four months on was never going to be fully solved by writing better code. Sometimes the right fix is to make it someone else's problem.