The TCO of "Free" Cloud PII Redaction: AWS Comprehend, Google DLP, vs Self-Hosted at Scale

Pricing disclaimer: the dollar figures, per-unit rates, and tier discounts below reflect the published prices for AWS Comprehend, Google Cloud DLP, and the AWS Marketplace at the time of writing. Cloud providers update their pricing frequently — sometimes monthly, sometimes via opaque enterprise discount programs. Always verify the current rates on each provider's pricing page before using these numbers for a budgeting decision. The relative shape of the comparison (per-volume vs. per-instance billing) is stable; the specific dollar amounts may not be.

"Pay only for what you use" is the universal promise of cloud APIs, and for low-volume workloads it's genuinely true. The catch shows up when you scale: per-character or per-record pricing that looked like rounding error at 10,000 documents/day becomes a six-figure line item at 10 million.

This post does the math. We'll model a realistic enterprise workload, run the published pricing for AWS Comprehend PII and Google Cloud DLP through it, and compare the results with Philter running as a cloud marketplace instance. The goal isn't to argue that one option is always cheaper; it's to find the break-even point so you can make the call deliberately rather than discover it on the next invoice.

The pricing models

The three options price along very different axes:

AWS Comprehend PII bills per character (technically per 100-character "unit"). At list price, PII detection is $0.0001 per unit, with a discount on the first 10M units per month. Containment (whole-document classification) and entity recognition are billed separately.
Google Cloud DLP bills per byte processed. Standard infoType detection is $1.00 per GB, with tiered discounts above 1 TB/month. Custom infoTypes, image redaction, and de-identification transforms each have their own line.
Philter on a cloud marketplace bills per instance-hour. On the AWS Marketplace at standard list, a Philter instance runs at a flat hourly rate regardless of how many documents you push through it. Throughput is bounded by the instance's CPU/RAM rather than your wallet.

The shape of this difference matters more than the headline numbers. Per-volume billing scales linearly with workload; per-instance billing is a step function that stays flat until you need another instance. Below the volume where the linear meter crosses that flat cost, SaaS wins; above it, self-hosted wins. The interesting question is where that crossover sits.
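The two billing shapes can be sketched in a few lines of Python. This is a minimal illustration rather than a pricing calculator: the function names are invented for this example, the rates are the approximate list figures used in this post, and `docs_per_instance_per_day` is a purely hypothetical throughput ceiling that you would have to measure on your own instance type.

```python
import math

HOURS_PER_MONTH = 730   # assumption: average month length in hours
DAYS_PER_MONTH = 30     # assumption: 30-day billing month


def per_volume_monthly(docs_per_day: int, units_per_doc: int,
                       rate_per_unit: float) -> float:
    """Linear meter: every extra document adds the same amount to the bill."""
    return docs_per_day * DAYS_PER_MONTH * units_per_doc * rate_per_unit


def per_instance_monthly(docs_per_day: int, docs_per_instance_per_day: int,
                         hourly_rate: float, min_instances: int = 1) -> float:
    """Step function: flat until volume outgrows the current instance count."""
    needed = math.ceil(docs_per_day / docs_per_instance_per_day)
    instances = max(min_instances, needed)
    return instances * hourly_rate * HOURS_PER_MONTH


if __name__ == "__main__":
    for docs in (10_000, 100_000, 1_000_000, 10_000_000):
        linear = per_volume_monthly(docs, units_per_doc=20, rate_per_unit=0.0001)
        # 5,000,000 docs/day per instance is a placeholder, not a benchmark.
        stepped = per_instance_monthly(docs, 5_000_000, hourly_rate=2.00)
        print(f"{docs:>10,} docs/day  per-volume ${linear:>10,.0f}  "
              f"per-instance ${stepped:>8,.0f}")
```

The per-volume column grows tenfold with each row, while the per-instance column only moves when the instance count changes.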

A worked example

Let's model a realistic mid-sized enterprise workload: a customer-support pipeline processing transcripts, tickets, and chat logs.

Volume: 5 million documents per day.
Average document size: 2 KB of text (~2,000 characters, ~20 "units" in Comprehend's billing).
Daily characters: 5M × 2,000 = 10B characters = 100M Comprehend units.
Daily bytes: 5M × 2 KB = 10 GB.
Monthly (30 days): ~300 billion characters (3B Comprehend units), ~300 GB, ~150M documents.
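The unit conversions are easy to get wrong by a factor of ten, so here is the same arithmetic as a short Python sketch. It assumes one byte per character, decimal gigabytes, and a 30-day month; the constant names are illustrative.

```python
DOCS_PER_DAY = 5_000_000
CHARS_PER_DOC = 2_000     # ~2 KB of text, assuming 1 byte per character
CHARS_PER_UNIT = 100      # Comprehend billing unit as described in this post
DAYS_PER_MONTH = 30       # assumption: 30-day month

daily_chars = DOCS_PER_DAY * CHARS_PER_DOC       # characters per day
daily_units = daily_chars // CHARS_PER_UNIT      # Comprehend units per day
daily_gb = daily_chars / 1e9                     # decimal GB per day

monthly_chars = daily_chars * DAYS_PER_MONTH
monthly_units = daily_units * DAYS_PER_MONTH
monthly_gb = daily_gb * DAYS_PER_MONTH
monthly_docs = DOCS_PER_DAY * DAYS_PER_MONTH

print(f"Daily:   {daily_chars:,} chars, {daily_units:,} units, {daily_gb:,.0f} GB")
print(f"Monthly: {monthly_chars:,} chars, {monthly_units:,} units, "
      f"{monthly_gb:,.0f} GB, {monthly_docs:,} docs")
```

Under these assumptions the month comes to 300 billion characters, which is 3 billion Comprehend units and 300 GB.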

This is not a hypothetical — it's roughly the steady-state volume of a regional bank's customer-channel workload, or a mid-sized hospital network's clinical-note throughput. Run the numbers:

	Philter (self-hosted)	AWS Comprehend PII	Google Cloud DLP
Billing unit	Instance-hour	Per 100 characters	Per GB
List unit price	~$2/hr (mid-tier instance)	$0.0001/unit (post-tier)	$1.00/GB (post-tier)
Capacity / usage at our volume	2× instances, 24/7 HA	3B units/month	300 GB/month
Estimated monthly cost	~$2,900/mo	~$300,000/mo	~$300/mo (DLP only)*

*Google DLP at 300 GB/month looks like a steal here — but DLP charges separately per transform applied (de-identification, replacement, etc.), and complex policies multiply the bill. Custom infoTypes also carry separate per-call charges. The realistic "all-in" number after enabling everything an enterprise actually needs typically lands closer to AWS Comprehend's range.

Comprehend's number stands out because per-character billing punishes verbose content (long support transcripts, chat logs, free-text fields) disproportionately. At list price, a single month at this volume costs roughly what a small engineering team earns in a year, all on one API line item.
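To rerun the table with your own rates, a sketch like the following is enough. The inputs are the approximate list figures from the table above, and the 730-hour month is an assumption; substitute current prices from each provider's pricing page before relying on the output.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours

# Inputs copied from the table above; verify against current pricing pages.
PHILTER_INSTANCES = 2
PHILTER_HOURLY_RATE = 2.00         # USD per instance-hour (approximate)
COMPREHEND_UNITS = 3_000_000_000   # 100-character units per month
COMPREHEND_RATE = 0.0001           # USD per unit
DLP_GB = 300                       # GB inspected per month
DLP_RATE = 1.00                    # USD per GB, detection only

philter = PHILTER_INSTANCES * PHILTER_HOURLY_RATE * HOURS_PER_MONTH
comprehend = COMPREHEND_UNITS * COMPREHEND_RATE
dlp = DLP_GB * DLP_RATE

for name, cost in [("Philter", philter),
                   ("Comprehend PII", comprehend),
                   ("Cloud DLP", dlp)]:
    print(f"{name:<15} ${cost:>12,.2f}/month")
print(f"Comprehend / Philter ratio: {comprehend / philter:.0f}x")
```

The Philter line works out to $2,920, which the table rounds to ~$2,900. The DLP line covers detection only and excludes the per-transform charges described in the footnote.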

The break-even point

The crossover where self-hosted Philter becomes cheaper than the per-volume options is closer than most teams assume:

vs. AWS Comprehend PII at list price: Philter on a marketplace instance breaks even at roughly 250,000 documents per day (assuming similar 2 KB documents). Below that, Comprehend's tiered discount keeps it competitive. Above that, the per-character meter makes Philter the cheaper option every time.
vs. Google Cloud DLP at list price: The crossover depends heavily on which transforms you enable. For raw infoType detection only, Google can stay cheaper to fairly high volumes. Once you start layering de-identification transforms, custom infoTypes, and image redaction, the multipliers stack and self-hosted catches up quickly.
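The crossover itself is one division: the flat monthly cost divided by what one document per day adds to the metered bill over a month. The sketch below uses only the approximate list figures from the worked example (two instances at ~$2/hr, 20 units per document, $0.0001 per unit) and ignores volume tiers, ops time, and egress. The function name is illustrative.

```python
def break_even_docs_per_day(fixed_monthly_cost: float, units_per_doc: int,
                            rate_per_unit: float,
                            days_per_month: int = 30) -> float:
    """Daily volume at which a linear per-unit bill equals a flat monthly cost."""
    # Monthly metered cost added by one extra document per day.
    monthly_cost_per_daily_doc = units_per_doc * rate_per_unit * days_per_month
    return fixed_monthly_cost / monthly_cost_per_daily_doc


# Two instances at ~$2/hr for 730 hours, versus $0.0001 per 100-character unit.
crossover = break_even_docs_per_day(
    fixed_monthly_cost=2 * 2.00 * 730,
    units_per_doc=20,        # ~2 KB document
    rate_per_unit=0.0001,
)
print(f"Break-even: {crossover:,.0f} documents/day")
```

With these bare inputs the crossover lands near 49,000 documents per day, well below the ~250,000 quoted above. That figure presumably folds in costs this sketch leaves out, such as ops time and volume tiers. Treat both numbers as estimates and rerun the calculation with your own rates.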

The general principle: any workload pushing more than ~10 GB of text per day through a redaction pipeline is in the zone where a TCO conversation is worth having.

The hidden costs of "free" SaaS

The per-character number is the visible cost. The less-visible costs that round out the actual TCO:

Data egress. If your data sits in one cloud and the redaction service runs in another (or in a different region of the same cloud), every redaction call moves bytes across a network boundary. Cross-region or cross-cloud egress is metered separately, and at high volume it can match or exceed the redaction bill itself.
Latency tax. Each API call is a network round-trip. For batch workloads, this is amortizable; for real-time pipelines (chatbots, streaming logs, agent-assist tools), it compounds. Self-hosted Philter inside your VPC is one local hop.
Vendor lock-in. Comprehend is AWS-only; DLP is GCP-only. Multi-cloud or migration teams end up writing the same pipeline twice or pinning themselves to one provider's pricing roadmap.
Data residency. Sending text to a SaaS endpoint moves data into the vendor's environment — we covered why that's an architectural problem separate from the cost question.
The training-data trap. SaaS providers' terms of service vary on whether they use submitted text to train future models. Even when the terms say "no," verifying that claim is impossible.

The hidden costs of self-hosted

Honesty cuts both ways. Self-hosted isn't free either:

Ops time. Someone has to patch the OS, update the container image, monitor the instances. The marketplace listings reduce this substantially (Philter ships with everything bundled), but it's not zero. Budget ~10% of an SRE FTE for any non-trivial self-hosted deployment.
Capacity planning. Self-hosted means you pay for headroom you're not always using. Auto-scaling groups soften this but don't eliminate it. Per-call SaaS billing is "spiky workload friendly" in a way that self-hosted isn't.
Initial setup. Provisioning, VPC config, IAM, monitoring wire-up. One-time cost, but real.
Model updates. When new NLP models are released, you have to plan a rollout. SaaS providers do this transparently.

For a workload above the break-even line, these costs are small relative to the per-volume savings. For a workload below the line, they tip the math back toward SaaS.
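One way to fold these items into the comparison is to add them to the instance bill before computing any break-even. The sketch below is a rough model under loud assumptions: the fully loaded SRE cost and the setup cost are hypothetical placeholders, not figures from any vendor or customer, and the function name is invented for this example.

```python
def self_hosted_monthly_tco(instance_cost: float, sre_fraction: float,
                            sre_annual_cost: float, setup_cost: float = 0.0,
                            amortize_months: int = 12) -> float:
    """Monthly self-hosted cost: instances + share of an SRE + amortized setup."""
    staffing = sre_fraction * sre_annual_cost / 12
    setup = setup_cost / amortize_months
    return instance_cost + staffing + setup


tco = self_hosted_monthly_tco(
    instance_cost=2_920.0,       # two instances at ~$2/hr, 730 hours
    sre_fraction=0.10,           # ~10% of an SRE FTE, as suggested above
    sre_annual_cost=180_000.0,   # hypothetical fully loaded annual cost
    setup_cost=12_000.0,         # hypothetical one-time setup effort
)
print(f"Self-hosted TCO: ${tco:,.0f}/month")
```

Because the break-even volume is proportional to the fixed monthly cost, whatever factor these overheads add to the instance bill raises the crossover by the same factor.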

The TCO matrix

Pulling it together, the decision matrix usually comes down to four factors:

Volume. Above ~250K documents/day (at ~2 KB each, measured against Comprehend's list price), self-hosted wins on price alone.
Cloud commitment. Single-cloud organizations save more with the native option (Comprehend on AWS, DLP on GCP). Multi-cloud or on-prem requirements push toward self-hosted.
Data residency. Regulated workloads (healthcare, finance, government) generally need self-hosted regardless of cost — the SaaS data path is the deal-breaker, not the bill.
Customization needs. Custom entity types, domain-specific lenses, or per-policy precision/recall tuning are first-class in Phileas (the open-source engine underneath Philter) and rate-limited or extra-cost in the SaaS options.

A simple decision: low volume + single cloud + generic detectors → SaaS may be cheaper. High volume + multi-cloud or regulated + custom detectors → self-hosted is cheaper, often by an order of magnitude.

The bottom line

"Free" SaaS PII redaction isn't free; it's "pay as you go," and "as you go" turns into real money faster than most teams budget for. At enterprise scale, the per-character meter regularly exceeds the cost of a self-hosted deployment — sometimes by 50× or more — before you factor in egress, lock-in, and residency.

The right answer for your team depends on your specific volume curve, cloud strategy, and regulatory posture. If you want a worked TCO for your actual numbers (we'll model it against your projected volume and current usage), get in touch. We've done this analysis for healthcare, finance, and logistics customers; the surprise is almost always how soon the crossover hits.

If you're already on Comprehend or DLP and the bill has gotten uncomfortable, the next post in this series — "Migrating from AWS Comprehend to Philter" — walks through the API translation and shadow-mode cutover.