Migrating from AWS Comprehend to Philter: A Practical Transition Guide

Teams move from AWS Comprehend PII to Philter for one of three reasons: the bill got uncomfortable, the data path stopped passing compliance review, or the customization surface ran out (custom entity types, domain lenses, per-entity replacement strategies). Whatever the reason, the migration itself is more mechanical than most teams expect.

This guide is the practical playbook: how Comprehend concepts map onto Philter, the code translation for the most common API calls, and a safe shadow-mode cutover pattern that lets you migrate without taking on detection-quality risk.

Concept mapping

Comprehend's PII surface and Philter's API don't line up one-to-one, but they're conceptually close. The translation table:

AWS Comprehend concept	Philter equivalent
`DetectPiiEntities`	`POST /api/find` — returns entity type, span, confidence
`ContainsPiiEntities`	`POST /api/find` with a length check, or use Phileas `FilterResponse.hasEntities()`
`DetectPiiEntitiesBatch`	Loop or batch endpoint (Philter accepts multi-document calls)
`StartPiiEntitiesDetectionJob` (async)	Run Philter against an S3 prefix from your own job runner, or via the Phinder scan tool
Default entity types (NAME, SSN, EMAIL, etc.)	Phileas built-in detectors for the same set, plus custom dictionaries and identifier patterns
Custom entity recognizer (training-based)	Phileas custom dictionary + identifier regex; for entity-level NER, train a custom lens on PhEye
Redaction via `POST /api/v2/redact` in Bedrock Guardrails	`POST /api/filter` — returns redacted text per policy
Confidence threshold	`confidence > N` condition in the policy's filter strategy
Language codes (en, es, fr, ...)	Per-policy language configuration; English is the default
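
As a rough sketch of the `ContainsPiiEntities` row: the check reduces to asking whether `/api/find` returned any span. The helper below assumes the response is a JSON list of objects carrying a `confidence` field, as in the detection example in this guide. The names `contains_pii` and `min_confidence` are illustrative and not part of Philter.

```python
from typing import Iterable, Mapping


def contains_pii(spans: Iterable[Mapping], min_confidence: float = 0.0) -> bool:
    """Return True if any span meets the confidence floor.

    `spans` is assumed to be the parsed JSON list from POST /api/find.
    """
    return any(span.get("confidence", 0.0) >= min_confidence for span in spans)


# Hand-written sample response, not real Philter output.
sample = [{"type": "ssn", "confidence": 0.98, "start": 21, "end": 32}]
print(contains_pii(sample))        # True
print(contains_pii(sample, 0.99))  # False
print(contains_pii([]))            # False
```

Keeping the check as a pure function over the parsed response means the same helper works whether the spans come from a live call or from a recorded fixture.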

Side-by-side: detecting PII

The simplest API translation. Comprehend before:

import boto3
client = boto3.client("comprehend")

resp = client.detect_pii_entities(
    Text="Patient John Doe SSN 123-45-6789.",
    LanguageCode="en",
)
for entity in resp["Entities"]:
    print(entity["Type"], entity["Score"], entity["BeginOffset"], entity["EndOffset"])

Philter after:

import requests

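text = "Patient John Doe SSN 123-45-6789."

resp = requests.post(
    "http://philter.internal:8080/api/find",
    params={"c": "default", "p": "general"},
    # Send UTF-8 bytes: a str body is otherwise sent as Latin-1 and
    # fails on characters outside that range.
    data=text.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=utf-8"},
    timeout=5,
)
resp.raise_for_status()  # fail loudly instead of parsing an error body
for entity in resp.json():
    print(entity["type"], entity["confidence"], entity["start"], entity["end"])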

The field names differ (Type → type, Score → confidence, BeginOffset → start, EndOffset → end) but the fields map one-to-one. A small adapter function wraps either API behind a common interface during the cutover; we'll show that pattern below.
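
The field mapping can be sketched as a pair of normalizers that turn either response into one common record. This is a minimal illustration: `PiiSpan`, `from_comprehend`, and `from_philter` are hypothetical names, and the Philter response shape is assumed to match the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PiiSpan:
    """Engine-neutral view of one detected entity."""
    type: str
    confidence: float
    start: int
    end: int


def from_comprehend(resp: dict) -> List[PiiSpan]:
    """Normalize a detect_pii_entities response."""
    return [
        PiiSpan(e["Type"], e["Score"], e["BeginOffset"], e["EndOffset"])
        for e in resp["Entities"]
    ]


def from_philter(spans: list) -> List[PiiSpan]:
    """Normalize the JSON list from POST /api/find (shape assumed)."""
    return [
        PiiSpan(s["type"], s["confidence"], s["start"], s["end"])
        for s in spans
    ]


# Hand-written sample payloads, not real service output.
comprehend_resp = {"Entities": [
    {"Type": "SSN", "Score": 0.99, "BeginOffset": 21, "EndOffset": 32},
]}
philter_resp = [{"type": "ssn", "confidence": 0.97, "start": 21, "end": 32}]

print(from_comprehend(comprehend_resp))
print(from_philter(philter_resp))
```

Type labels pass through untouched here. The two engines may name the same entity differently, so a label map is likely needed before comparing results across them.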

Side-by-side: redacting PII

Comprehend's synchronous API doesn't redact: you receive entity spans and write your own substitution code (only the asynchronous detection job can emit redacted output). Philter detects and redacts in one call:

# Comprehend pattern (you handle the redaction)
resp = client.detect_pii_entities(Text=text, LanguageCode="en")
redacted = text
for ent in sorted(resp["Entities"], key=lambda e: -e["BeginOffset"]):
    redacted = redacted[:ent["BeginOffset"]] + "***" + redacted[ent["EndOffset"]:]

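# Philter pattern (one call returns the redacted text)
resp = requests.post(
    "http://philter.internal:8080/api/filter",
    params={"c": "default", "p": "general"},
    data=text.encode("utf-8"),  # bytes, so non-Latin-1 text survives
    headers={"Content-Type": "text/plain; charset=utf-8"},
    timeout=5,
)
resp.raise_for_status()  # never mistake an error body for redacted text
redacted = resp.text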

The Philter side is shorter for a reason: the policy file specifies how each entity type is handled (mask vs. encrypt vs. replace vs. drop), so the API call doesn't need to know. That's the customization surface that's hard to replicate with Comprehend + handwritten substitution.

Mapping Comprehend entity types to a Philter policy

A minimal Philter policy that covers Comprehend's default PII entity set:

{
  "name": "comprehend-equivalent",
  "identifiers": {
    "ssn": { "ssnFilterStrategies": [{ "strategy": "REDACT", "redactionFormat": "***" }] },
    "creditCard": { "creditCardFilterStrategies": [{ "strategy": "REDACT" }] },
    "phoneNumber": { "phoneNumberFilterStrategies": [{ "strategy": "REDACT" }] },
    "emailAddress": { "emailAddressFilterStrategies": [{ "strategy": "REDACT" }] },
    "ipAddress": { "ipAddressFilterStrategies": [{ "strategy": "REDACT" }] },
    "url": { "urlFilterStrategies": [{ "strategy": "REDACT" }] },
    "person": { "personFilterStrategies": [{ "strategy": "REDACT" }] },
    "date": { "dateFilterStrategies": [{ "strategy": "REDACT" }] }
  }
}

You'll usually want to swap REDACT for something more useful per entity type once you've migrated. Mask the last four of credit cards; replace names with consistent synthetic names; shift dates rather than dropping them. The Redaction Policy Editor exists to make per-entity strategies a clickable choice rather than a JSON hand-edit.
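
Because the policy follows a regular shape, it can be generated from the list of Comprehend types you act on today. The sketch below is an assumption-heavy illustration: the mapping from Comprehend type names to policy identifier keys is our guess at the correspondence and should be checked against your own policy, and `build_policy` is a hypothetical helper. Only `REDACT` appears in this guide, so any other strategy name you pass in needs verifying against the Phileas documentation.

```python
import json

# Assumed mapping from Comprehend PII types to the identifier keys used in
# the policy above. Verify it before relying on it.
COMPREHEND_TO_PHILTER = {
    "SSN": "ssn",
    "CREDIT_DEBIT_NUMBER": "creditCard",
    "PHONE": "phoneNumber",
    "EMAIL": "emailAddress",
    "IP_ADDRESS": "ipAddress",
    "URL": "url",
    "NAME": "person",
    "DATE_TIME": "date",
}


def build_policy(name, comprehend_types, strategies=None):
    """Build a policy dict; `strategies` optionally overrides per identifier."""
    strategies = strategies or {}
    identifiers = {}
    for ctype in comprehend_types:
        key = COMPREHEND_TO_PHILTER[ctype]  # KeyError flags an unmapped type
        strategy = strategies.get(key, "REDACT")
        identifiers[key] = {f"{key}FilterStrategies": [{"strategy": strategy}]}
    return {"name": name, "identifiers": identifiers}


policy = build_policy("comprehend-equivalent", ["SSN", "NAME"])
print(json.dumps(policy, indent=2))
```

Letting an unmapped type raise `KeyError` is deliberate: a type silently dropped from the policy is a detection gap that is better caught at build time than in production.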

Custom entity types

This is where many migrations actually start. Comprehend's custom entity recognizer requires training a model with a tagged dataset — meaningful effort, opaque model behavior, and a recurring cost per inference. Phileas covers most "custom entity" use cases without training:

Custom identifiers (medical record numbers, transaction IDs, internal client codes): regex pattern in the policy, often with checksum validation.
Custom dictionaries (project codenames, drug brand names, internal product names): newline-delimited word list referenced from the policy.
Domain-specific NER (e.g., medication names, financial product names): switch the lens on PhEye to a domain-specific model rather than training from scratch.
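
To make "regex pattern with checksum validation" concrete, here is a small Python illustration of the idea. It is not Phileas policy syntax: the identifier format (`MRN-` followed by eight digits whose last digit is a Luhn check digit) is invented for the example, and `find_mrns` is a hypothetical helper.

```python
import re

# Hypothetical format: "MRN-" plus 8 digits, the last being a Luhn check digit.
MRN_PATTERN = re.compile(r"\bMRN-(\d{8})\b")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_mrns(text: str):
    """Return (start, end, match) for candidates that pass the checksum."""
    return [
        (m.start(), m.end(), m.group(0))
        for m in MRN_PATTERN.finditer(text)
        if luhn_valid(m.group(1))
    ]


print(find_mrns("Chart MRN-12345674 and typo MRN-12345678."))
# [(6, 18, 'MRN-12345674')]
```

The checksum step is what keeps precision up: the second candidate matches the pattern but fails validation, so it is not reported.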

For genuinely novel entity types that require a trained model, our consulting team trains custom lenses on customer-provided gold-standard data. The deliverable is an ONNX model that drops into PhEye — same deployment surface as the stock lenses.

Deployment options

Comprehend is implicitly deployed (it's a managed AWS service). Philter has three deployment options, each fitting a different operational style:

AWS Marketplace. One-click deploy into your VPC; per-instance-hour billing through your existing AWS account. The closest analogue to "managed" without giving up the in-VPC data path.
Container. Pull the Docker image; deploy via your existing Kubernetes / ECS / Cloud Run pipeline. Maximum control.
Air-gapped. Self-contained image, local model repositories, no phone-home. Detailed pattern here.

For the migration, the AWS Marketplace deployment is usually the lowest-friction first step — you stay inside the same billing surface as Comprehend, you just route differently.

The shadow-mode cutover

Migration risk is detection-quality risk: what if Philter misses something Comprehend would have caught (or over-redacts something Comprehend would have left alone)? The safe pattern is to run both in shadow for a window, compare, then switch.

# Shadow-mode adapter
import json

import boto3
import requests

COMPREHEND = boto3.client("comprehend")
PHILTER = "http://philter.internal:8080/api/find"
# This log stores raw input text, and therefore raw PII: restrict access to it.
DIVERGE_LOG = "/var/log/redaction-divergences.jsonl"

def detect_pii(text: str) -> list:
    """Returns Comprehend output but logs divergences vs. Philter."""
    cw = COMPREHEND.detect_pii_entities(Text=text, LanguageCode="en")

    try:
        resp = requests.post(PHILTER, params={"c": "default", "p": "general"},
                             data=text.encode("utf-8"),
                             headers={"Content-Type": "text/plain; charset=utf-8"},
                             timeout=5)
        resp.raise_for_status()
        pf = resp.json()
    except (requests.RequestException, ValueError):
        # A failure on the shadow path must never break production.
        return cw["Entities"]

    # Labels are only upper-cased here; map them explicitly if the two
    # engines name the same entity type differently.
    cw_set = {(e["Type"], e["BeginOffset"], e["EndOffset"]) for e in cw["Entities"]}
    pf_set = {(e["type"].upper(), e["start"], e["end"]) for e in pf}

    if cw_set != pf_set:
        with open(DIVERGE_LOG, "a") as f:
            f.write(json.dumps({
                "text": text,
                "only_in_comprehend": list(cw_set - pf_set),
                "only_in_philter": list(pf_set - cw_set),
            }) + "\n")

    return cw["Entities"]  # production still uses Comprehend during shadow

Production keeps using Comprehend. Every call also goes to Philter; divergences get logged for review. After a few days you'll have a clear picture of where the two engines disagree. Common patterns:

Philter catches more in domains where you've added custom identifiers (MRNs, transaction IDs) that Comprehend doesn't know about.
Comprehend catches more in entity types where its default detectors are tuned aggressively (often at the cost of precision).
The two engines disagree on confidence. On the Philter side, the confidence threshold in the policy is the easiest knob to tune: raise it to be more conservative, lower it to catch more.

Tune the Philter policy until divergences cluster in expected places, then flip the production code path. Philter Scope formalizes this against a gold-standard set if you want a numeric pass/fail gate before the cutover — we wrote about the measurement pattern.
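
To see where divergences cluster, the JSONL log can be rolled up by entity type. The sketch below assumes the record shape written by the shadow-mode adapter (`only_in_comprehend` and `only_in_philter` as lists of `[type, start, end]`); `summarize_divergences` is a hypothetical helper name.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_divergences(lines: Iterable[str]) -> dict:
    """Count one-sided detections per entity type across JSONL records."""
    only_comprehend = Counter()
    only_philter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        for entity_type, _start, _end in record["only_in_comprehend"]:
            only_comprehend[entity_type] += 1
        for entity_type, _start, _end in record["only_in_philter"]:
            only_philter[entity_type] += 1
    return {
        "only_in_comprehend": dict(only_comprehend),
        "only_in_philter": dict(only_philter),
    }


# In-memory sample records; in practice, pass the open log file instead.
sample_log = [
    '{"text": "...", "only_in_comprehend": [["NAME", 8, 16]], "only_in_philter": []}',
    '{"text": "...", "only_in_comprehend": [], "only_in_philter": [["MRN", 4, 16]]}',
]
print(summarize_divergences(sample_log))
```

For the real log, something like `with open("/var/log/redaction-divergences.jsonl") as f: summarize_divergences(f)` would do. A type that appears only on one side day after day is usually a policy gap or a label mismatch rather than noise.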

Performance expectations

Two things to set expectations for when leaving Comprehend:

Latency goes down. Comprehend is a network hop to a multi-tenant service; Philter inside your VPC is a single local hop. For batched workloads, the difference is invisible; for real-time pipelines, the saving is typically 50-200ms per call.
Throughput is your problem now. Comprehend handles scaling transparently (within its rate limits); a self-hosted Philter instance has a fixed capacity ceiling determined by CPU and RAM. Plan for horizontal scaling (multiple instances behind a load balancer or one per Kafka partition) for any pipeline above moderate volume.
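
A back-of-the-envelope sizing for the horizontal-scaling point might look like the sketch below. The function name, the headroom default, and the figures in the example are illustrative assumptions, not measured Philter numbers; measure per-instance throughput on your own documents and hardware.

```python
import math


def instances_needed(peak_docs_per_sec: float,
                     docs_per_sec_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve peak load while keeping spare capacity.

    `headroom` is the fraction of each instance left unused (0.3 = run at 70%).
    """
    if docs_per_sec_per_instance <= 0:
        raise ValueError("per-instance throughput must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = docs_per_sec_per_instance * (1 - headroom)
    return max(1, math.ceil(peak_docs_per_sec / usable))


# Made-up figures: 400 docs/s at peak, 60 docs/s measured per instance.
print(instances_needed(400, 60))  # 10
```

The headroom term matters more than it looks: sizing to exactly 100% utilization leaves no room for traffic spikes or for an instance dropping out of the load balancer.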

The bottom line

The Comprehend → Philter migration is mostly a swap of an API endpoint, plus a translation of "implicit defaults" into an explicit policy file. The harder work is the policy tuning: deciding which entity types to act on, what confidence threshold to use, and what replacement strategy fits each. That's intentional: explicit configuration is the price of having a customization surface at all.

The shadow-mode pattern de-risks the cutover. After a few days of comparison, you'll know whether Philter's defaults match your needs, and you'll have a tuned policy ready for production.

If you'd rather have someone walk the migration with you — especially if you have custom Comprehend recognizers that need replicating — let's talk. We've done this transition for healthcare, financial services, and logistics teams; the patterns are stable, the policy details vary.