Talk to an Expert

Tell us about your stack and the privacy problems you're trying to solve. We typically respond within one business day.

Prefer to skip the form? Pick a time on our calendar →
or send a message

← All posts

Architecting Privacy in Kafka: Real-Time Redaction for Streaming Data

Most organizations think about PII the same way they think about backups: as a thing they'll worry about once it lands in a database. But in a streaming architecture, that's already too late. The moment a message hits a Kafka topic, dozens of downstream consumers — analytics jobs, search indexers, archival sinks, ML pipelines — can read it. Once a single SSN reaches one downstream system, it's everywhere.

This is the "PII at rest" problem inverted. The right place to redact streaming data isn't downstream — it's in flight, between the producers and the first consumer that doesn't need the raw values. We learned this pattern years ago building Phirestream; today the same approach is implemented with Philter and the underlying Phileas library, depending on the constraints of your stack.

This post walks through three production-grade patterns for adding real-time redaction to a Kafka (or Kinesis) pipeline, with code you can adapt.

The basic architecture

Whatever specific technology you use, the shape of the architecture stays the same:

           ┌──────────────┐
producers ─▶│  raw-events  │  (Kafka topic; restricted access)
           └──────┬───────┘
                  │
                  ▼
        ┌──────────────────┐
        │ Redaction stage  │ ◀── policy.json
        │ (Phileas/Philter)│
        └──────┬───────────┘
               │ (failures)        ┌─────────┐
               ├──────────────────▶│  DLQ    │
               │                    └─────────┘
               ▼
        ┌──────────────────┐
        │ redacted-events  │  (Kafka topic; broad consumer access)
        └──────────────────┘

Three things matter about this layout:

  • The raw topic is locked down. Only the redaction stage and a small set of audited consumers (compliance, incident response) can read it. ACLs and topic-level encryption do the work.
  • The redacted topic is open to the rest of your data platform — warehouses, search clusters, ML feature stores, downstream services. By the time data leaves the redaction stage, it's safe to fan out widely.
  • Failures don't disappear silently. A dead-letter queue captures any message the redaction stage couldn't process. Without it, you'll find out about a Phileas misconfiguration when it's already produced a million unredacted messages.

Now to the three ways to implement the redaction stage.

Pattern 1: Kafka Streams with Phileas embedded

If your stream processing is already on the JVM, this is the lowest-latency option. Phileas is a Java library; you add it as a dependency, instantiate the filter service once, and call it from your Kafka Streams topology. There's no network hop and no extra service to deploy.

// build.gradle
implementation 'io.philterd:phileas:2.12.0'
implementation 'org.apache.kafka:kafka-streams:3.7.0'
import io.philterd.phileas.model.policy.Policy;
import io.philterd.phileas.services.PhileasFilterService;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;
import java.util.UUID;

public class RedactionStream {

  public static void main(String[] args) throws Exception {
    // Load the policy once at startup. Reload on a schedule
    // if you want hot updates without a restart.
    Policy policy = Policy.fromFile("policies/healthcare.json");
    PhileasFilterService phileas = new PhileasFilterService(policy);

    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "philter-redactor");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    config.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> input = builder.stream("raw-events");

    KStream<String, String>[] branches = input
        .mapValues(text -> {
          try {
            return phileas
                .filter("healthcare", "default", UUID.randomUUID().toString(), text)
                .getFilteredText();
          } catch (Exception e) {
            return null;  // signal failure
          }
        })
        .branch(
            (k, v) -> v != null,
            (k, v) -> true       // catch-all -> DLQ
        );

    branches[0].to("redacted-events");
    branches[1].to("philter-dlq");

    KafkaStreams streams = new KafkaStreams(builder.build(), config);
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    streams.start();
  }
}

When to use this: JVM stream processing, tight latency SLAs (single-digit milliseconds per message), high message volume. Embedding Phileas means no network round-trip and minimal serialization overhead.

Trade-offs: All your redaction logic now lives inside your Streams app. Policy updates require an application redeploy unless you build a hot-reload mechanism. Scaling the redaction throughput means scaling the whole Streams app.

Pattern 2: Sidecar consumer/producer hitting Philter's HTTP API

If your stack is polyglot — Python data engineers, Go service teams, Node.js producers — deploying Philter as a sidecar service and calling its REST API from a small consumer/producer process keeps every team using the same redaction engine without forcing a JVM dependency on anyone.

# pip install kafka-python requests
import json
import requests
from kafka import KafkaConsumer, KafkaProducer

PHILTER = "http://philter:8080/api/filter"

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="kafka:9092",
    group_id="philter-redactor",
    enable_auto_commit=False,
    value_deserializer=lambda v: v.decode("utf-8"),
)

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    acks="all",
    value_serializer=lambda v: v.encode("utf-8"),
)

session = requests.Session()  # reuse connections

for msg in consumer:
    try:
        r = session.post(
            PHILTER,
            params={"c": "default", "p": "healthcare"},
            data=msg.value,
            headers={"Content-Type": "text/plain"},
            timeout=2,
        )
        r.raise_for_status()
        producer.send("redacted-events", r.text)
    except Exception as exc:
        producer.send("philter-dlq", json.dumps({
            "original": msg.value,
            "error": str(exc),
            "offset": msg.offset,
            "partition": msg.partition,
        }))
    finally:
        consumer.commit()

For HA, deploy multiple Philter instances behind a load balancer (or in a Kubernetes Service). Because Philter is stateless and policy-driven, you can horizontally scale it independently of the redaction app.

When to use this: polyglot teams, organizations standardizing on one redaction service across many pipelines, deployments where the operations team is more comfortable scaling stateless HTTP services than embedded libraries.

Trade-offs: network round-trip per message (typically a few milliseconds inside a VPC). Worth amortizing with batching if your throughput is high — Philter's API accepts batched documents per call.

Pattern 3: Kafka Connect Single Message Transform (SMT)

If you're already moving data with Kafka Connect — pulling from a database, sinking to S3, mirroring across clusters — the cleanest place to add redaction is as a Single Message Transform. The same SMT can sit on a source connector (redact before data enters Kafka), a sink connector (redact before data leaves Kafka), or a mirror (redact during cross-cluster replication).

public class PhileasRedaction<R extends ConnectRecord<R>>
        implements Transformation<R> {

  private PhileasFilterService phileas;
  private String policyName;
  private String context;

  @Override
  public void configure(Map<String, ?> props) {
    String policyPath = (String) props.get("policy.path");
    this.policyName  = (String) props.getOrDefault("policy.name", "default");
    this.context     = (String) props.getOrDefault("context", "default");
    this.phileas = new PhileasFilterService(Policy.fromFile(policyPath));
  }

  @Override
  public R apply(R record) {
    Object value = record.value();
    if (!(value instanceof String)) return record;
    try {
      String redacted = phileas.filter(
          policyName, context, UUID.randomUUID().toString(), (String) value
      ).getFilteredText();
      return record.newRecord(
          record.topic(), record.kafkaPartition(),
          record.keySchema(), record.key(),
          record.valueSchema(), redacted,
          record.timestamp()
      );
    } catch (Exception e) {
      throw new ConnectException("Phileas redaction failed", e);
    }
  }

  @Override public ConfigDef config() { /* ... */ return new ConfigDef(); }
  @Override public void close() { /* nothing to clean up */ }
}

Wire it into a connector config:

{
  "name": "s3-sink-redacted",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "raw-events",
    "s3.bucket.name": "analytics-redacted",
    "transforms": "philter",
    "transforms.philter.type": "io.example.PhileasRedaction",
    "transforms.philter.policy.path": "/etc/policies/healthcare.json",
    "transforms.philter.policy.name": "healthcare"
  }
}

When to use this: existing Kafka Connect deployments, declarative pipelines where the team prefers config over code, scenarios where the same redaction needs to apply across many connectors with one consistent implementation.

Amazon Kinesis: the same pattern

Kinesis isn't Kafka, but the redaction architecture is identical — just substitute KCL/KPL for Streams. A minimal record processor:

public class RedactionProcessor implements ShardRecordProcessor {

  private final PhileasFilterService phileas;
  private final KinesisProducer producer;

  @Override
  public void processRecords(ProcessRecordsInput input) {
    for (KinesisClientRecord record : input.records()) {
      String text = StandardCharsets.UTF_8
          .decode(record.data()).toString();
      try {
        String redacted = phileas.filter(
            "healthcare", "default",
            record.sequenceNumber(), text
        ).getFilteredText();
        producer.addUserRecord(
            "redacted-events", record.partitionKey(),
            ByteBuffer.wrap(redacted.getBytes())
        );
      } catch (Exception e) {
        // route to DLQ stream
      }
    }
  }
  // initialize/leaseLost/shardEnded/shutdownRequested omitted
}

The library you embed and the policy file you load are exactly the same; only the streaming runtime changes.

Operational considerations

Dead-letter queues are non-negotiable

In a stream-processing context, "drop on failure" is a stealthy way to lose compliance evidence. Always route exceptions to a separate DLQ topic with metadata (original payload, partition, offset, error message). An operator can replay those messages once the underlying issue (bad policy, missing entity type, malformed input) is fixed.

Schema Registry compatibility

If you're using Avro or Protobuf with a Schema Registry, the redacted topic should reuse the same schema as the raw topic — redaction modifies values, not shapes. For text fields nested inside a record, redact field-by-field rather than swapping out a whole payload as a string.

Throughput and scaling

Phileas's pattern layer is essentially CPU-free; the NLP layer is the variable cost. For high throughput:

  • Pre-load policies and the underlying model server (PhEye) at startup so cold-start latency doesn't hit the first batch.
  • Tune the policy: deterministic detectors (SSN, credit card, IBAN) are nearly free. Domain lenses (Healthcare, COVID-19) cost more — only enable the lenses you actually need.
  • Batch where possible. The Philter HTTP API accepts batched calls; the Phileas library has a streaming entry point that amortizes model invocation overhead.
  • Partition appropriately. The redaction stage scales horizontally exactly like any Kafka consumer group — one consumer instance per partition is the ceiling.

Monitoring the privacy pipeline itself

Once redaction is in place, the next question becomes "is it working?" Two complementary tools cover the answer:

  • Philter Scope measures the policy's precision and recall against a gold-standard test set — we wrote more about this elsewhere. Run it as part of your CI for the policy file, so a bad commit gets caught before it ships to production.
  • Phield watches the live redaction stream and alerts on anomalies — sudden spikes in a particular entity type, volume drops that suggest the pipeline has stalled, or detection patterns that drift from the established baseline.

The downstream LLM trap

One last thing: if any of your downstream consumers feed Kafka data into a generative AI provider (OpenAI, Anthropic, Bedrock), in-stream redaction protects you up to the LLM's network boundary — but not past it. For that hop, see the Philter AI Proxy, which applies the same redaction philosophy to outbound LLM traffic.

The bottom line

Streaming data multiplies risk: every consumer is a potential exfiltration path. The fix is to put the redaction stage between the producers and the rest of your data platform, not after — so that by the time messages reach analytics, search, ML, or archival, the sensitive bits are already gone.

Whether you pick the embedded Phileas pattern, the Philter HTTP sidecar, or a Kafka Connect SMT depends on your team and your existing pipeline. All three use the same policy format, the same detectors, and the same audit story. The "right" answer is whichever one fits next to the code you already have.

Need help mapping this to your specific Kafka or Kinesis architecture? Talk to our consulting team — we've stood up streaming redaction pipelines for healthcare claims, log aggregation, and financial transaction flows, and we'd be glad to compare notes on yours.