Building a Privacy-Aware RAG System

Retrieval-augmented generation is the dominant pattern for enterprise AI in 2026. Every customer-support assistant, internal-docs Q&A bot, and clinical-summary tool you've seen pitched in the last twelve months is, under the hood, a RAG system. The architecture is straightforward: index your documents into a vector store, retrieve the top-K relevant chunks for a user query, paste them into an LLM prompt, return the answer.

The privacy problem is also straightforward, and it's bigger than most teams realize. RAG has two distinct PII leak vectors — one at ingestion, one at inference — and protecting against only one of them is the same as protecting against neither.

This post lays out a defense-in-depth blueprint for building a RAG system that doesn't leak sensitive data, with concrete code and the trade-offs at each stage.

The two leak vectors

Diagram a vanilla RAG system and the failure modes light up immediately:

  Source docs ──▶ Chunker ──▶ Embedder ──▶ Vector store
                                                 │
                                                 ▼
  User query ──▶ Retrieval ──▶ Top-K chunks ──┐
                                              ▼
                                         LLM prompt
                                              │
                                              ▼
                                       Hosted LLM API
                                              │
                                              ▼
                                          Response

The two leak points:

Ingestion. Source documents contain PII (customer names, SSNs, medical context). When they're chunked and embedded, the raw text gets stored in the vector database alongside the embeddings — usually as the "metadata" or "content" payload that's returned with each retrieval. Anyone with read access to the vector store has access to the raw PII.
Inference. Even with the vector store locked down, the retrieved chunks get pasted into the LLM prompt. If you're calling a hosted provider (OpenAI, Anthropic, AWS Bedrock), any PII in those chunks leaves your perimeter on every query. The provider's terms of service may or may not promise not to train on it; either way, you've widened your blast radius.

Both leaks need to be addressed. Addressing only ingestion stores clean chunks, but PII in user queries, and anything the ingestion policy missed, still flows to the hosted model. Addressing only inference leaves raw PII sitting in your vector store waiting for an access-control mistake. The general API antipattern applies here as much as anywhere.

The privacy-aware architecture

The fix is to redact at both gates, using the same policy, with separate monitoring:

  Source docs ──▶ Chunker ──▶ Philter ──▶ Embedder ──▶ Vector store
                              (ingestion             (only redacted
                               policy)                content stored)
                                                          │
                                                          ▼
  User query ──▶ Retrieval ──▶ Top-K (clean) ──┐
                                               ▼
                                     Philter AI Proxy ──▶ LLM API
                                     (defense-in-depth        │
                                      + outbound scan)        │
                                                              ▼
                                                         Response
                                                          (scanned
                                                           for leakage)

          [Phield monitors detection patterns across both stages]

Three changes from the vanilla architecture:

Ingestion-time redaction. Every chunk passes through Philter before embedding. Sensitive entities are replaced according to a policy. Only the redacted version reaches the vector store.
Inference-time defense in depth. The Philter AI Proxy sits between the application and the hosted LLM. Even if a chunk slipped past ingestion, the proxy redacts before forwarding the prompt. The proxy also scans responses for PII the model might have hallucinated.
Continuous monitoring. Phield watches both stages and alerts on entity-type drift, volume spikes, or anomalies that suggest the policy needs an update.

Stage 1: ingestion-time redaction

The simplest version of the ingestion pipeline:

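import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Philter filter endpoint; host and port are deployment-specific.
PHILTER = "http://philter.internal:8080/api/filter"
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)

def redact(text: str) -> str:
    """Send one chunk to Philter and return the redacted text."""
    r = requests.post(
        PHILTER,
        params={"c": "rag-ingest", "p": "rag-policy"},  # c = context, p = policy
        # Encode explicitly so non-ASCII text is sent as UTF-8
        # regardless of the HTTP stack's default.
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        timeout=5,
    )
    # Fail closed: an error here must stop ingestion, not store a raw chunk.
    r.raise_for_status()
    # Assumes Philter responds in UTF-8; requests would otherwise guess
    # ISO-8859-1 for text/plain responses that carry no charset.
    return r.content.decode("utf-8")

def ingest(document: str, source_id: str, store: Chroma) -> int:
    """Chunk, redact, and store a document; returns the number of chunks."""
    chunks = splitter.split_text(document)
    redacted_chunks = [redact(c) for c in chunks]
    metadatas = [{"source": source_id, "chunk_id": i}
                 for i in range(len(chunks))]
    # Only the redacted text is passed to the store, so only it gets embedded.
    store.add_texts(redacted_chunks, metadatas=metadatas)
    return len(chunks)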

A few non-obvious trade-offs to be aware of at this stage:

Redact before embed, not after. Embeddings derived from redacted text are different from embeddings derived from raw text. If you embed raw and redact only the stored content, retrieval still works on the raw embeddings — meaning a query containing PII still finds the right chunks even though the chunks themselves are clean. That's a feature for some workloads and a leak for others; pick deliberately.
Use consistent pseudonyms across chunks. If "Patient Smith" appears in chunk 12 and chunk 47, the same synthetic name should replace both — otherwise the retrieved context becomes incoherent. The Phileas policy supports this via context-keyed replacement.
Watch your chunk boundaries. Sensitive entities (a phone number, a person's full name) can be split across chunk boundaries. Configure the chunker with overlap and prefer sentence-aware splitters to minimize this. If you can't avoid splits, run Philter after chunking with overlap rather than before.
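To make the consistent-pseudonym point concrete, here is a minimal sketch of deterministic replacement. It illustrates the idea only and is not how Phileas implements context-keyed replacement: the `pseudonym` function, the `SURNAMES` pool, and the secret value are all hypothetical. The trick is to derive the replacement from a keyed hash of the context and the normalized entity, so the same entity in the same context always maps to the same fake name without keeping a lookup table of real names.

```python
import hashlib
import hmac

# Hypothetical replacement pool; a real deployment would use a much larger
# list, since a small pool means distinct people can share a pseudonym.
SURNAMES = ["Avery", "Brooks", "Castillo", "Dalton",
            "Ellis", "Foster", "Grant", "Hayes"]


def pseudonym(entity: str, context: str, secret: bytes) -> str:
    """Map (context, entity) to a stable replacement surname."""
    # Normalize case and whitespace so "Smith" and " SMITH " agree.
    normalized = " ".join(entity.lower().split())
    message = f"{context}\x00{normalized}".encode("utf-8")
    # Keyed hash: without the secret, the mapping cannot be recomputed.
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(SURNAMES)
    return SURNAMES[index]


# Assumption: in practice the secret comes from a secrets manager.
secret = b"example-secret"
chunk_12 = pseudonym("Smith", "rag-ingest", secret)
chunk_47 = pseudonym("  SMITH ", "rag-ingest", secret)
assert chunk_12 == chunk_47  # same entity, same context, same replacement
```

Keying on the context means the same person gets a different pseudonym in a different context, which limits linkage across datasets. Whether that is desirable depends on whether retrieval needs to join across those contexts.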

Stage 2: defense-in-depth at inference

Even with ingestion locked down, two things can still leak PII at inference time:

The user's query itself may contain PII (a clinician asking about a specific patient). This goes straight to the LLM provider unless you stop it.
A chunk that should have been redacted at ingestion slipped through because the policy missed a novel entity type or a custom identifier.

The Philter AI Proxy handles both. It speaks the OpenAI, Anthropic, and Bedrock wire protocols, so your existing SDK code keeps working — you change one base URL:

from openai import OpenAI

# Before
client = OpenAI(api_key=KEY)

# After
client = OpenAI(
    api_key=KEY,
    base_url="https://philter-proxy.internal/v1",
)

# Same client. Same code paths. Prompts get redacted
# before they reach OpenAI; responses get scanned on the way back.

This is the defense-in-depth principle applied at the LLM boundary. Even if ingestion-time redaction has a hole, the proxy is the last line.
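Conceptually, the outbound half of that last line looks something like the sketch below. This is not the proxy's implementation: the two regexes are stand-ins for a policy-driven detector, and `scrub` and `build_prompt` are hypothetical names. What it shows is the shape of the check: scrub the query and every retrieved chunk before the prompt is assembled, and count hits in the chunks separately, because a hit there means something got past ingestion.

```python
import re

# Stand-in detectors for illustration only; a real policy covers far more
# entity types and does not rely on two regexes.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scrub(text: str) -> tuple[str, int]:
    """Replace each match with a {{TYPE}} token; return (clean_text, hits)."""
    hits = 0
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn("{{" + label + "}}", text)
        hits += n
    return text, hits


def build_prompt(query: str, chunks: list[str]) -> tuple[str, int]:
    """Scrub query and chunks; return (prompt, hits found in chunks)."""
    clean_query, _ = scrub(query)
    clean_chunks = []
    chunk_hits = 0
    for chunk in chunks:
        clean, n = scrub(chunk)
        clean_chunks.append(clean)
        chunk_hits += n  # nonzero means ingestion-time redaction missed something
    context = "\n---\n".join(clean_chunks)
    return f"Context:\n{context}\n\nQuestion: {clean_query}", chunk_hits


prompt, leaked = build_prompt(
    "What is the claim status for 123-45-6789?",
    ["Contact jane.doe@example.com for follow-up.", "Claim approved."],
)
if leaked:
    print(f"warning: {leaked} entity hit(s) in retrieved chunks")
```

Hits in the query are expected, since users type PII. Hits in retrieved chunks are the signal worth alerting on.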

The embedding-as-data-leak problem

A subtler issue worth flagging: even without the raw text, embeddings themselves can leak.

Recent research (Morris et al., 2023; Song & Raghunathan, 2020) shows that sentence embeddings from popular models are partially invertible: an attacker with access to your vector store can reconstruct meaningful approximations of the source text from the embeddings alone. The implication: locking down read access to the chunk metadata isn't sufficient if the embeddings themselves are exposed.

Mitigations:

Keep the vector store inside your perimeter. Self-host Chroma, Qdrant, Milvus, or pgvector. Don't hand your embeddings to a hosted vector database that's outside your trust boundary.
Apply Local Differential Privacy to embeddings. Adding calibrated noise to embeddings before storage makes inversion attacks substantially harder while preserving retrieval quality within bounded tolerance. Philter Diffuse implements the formal ε-budget framework for this.
Encrypt embeddings at rest. Most vector databases support this transparently via the underlying storage layer.
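For intuition about the noise-based mitigation, here is a minimal sketch of the textbook Laplace mechanism applied to a single embedding. It is not Philter Diffuse's mechanism, and the function names and the `bound` parameter are assumptions made for this example. Each vector is clipped to an L1 norm of at most `bound`, so any two inputs differ by at most `2 * bound` in L1 distance, and independent Laplace noise with scale `2 * bound / epsilon` is added to each coordinate.

```python
import random
from typing import Optional


def clip_l1(vec: list[float], bound: float) -> list[float]:
    """Scale vec down so that its L1 norm is at most bound."""
    norm = sum(abs(x) for x in vec)
    if norm <= bound:
        return list(vec)
    factor = bound / norm
    return [x * factor for x in vec]


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize(vec: list[float], epsilon: float, bound: float = 1.0,
              rng: Optional[random.Random] = None) -> list[float]:
    """Clip to L1 norm <= bound, then add per-coordinate Laplace noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    clipped = clip_l1(vec, bound)
    # L1 sensitivity between any two clipped vectors is at most 2 * bound.
    scale = 2.0 * bound / epsilon
    return [x + laplace(scale, rng) for x in clipped]


noisy = privatize([0.12, -0.40, 0.33, 0.05], epsilon=4.0)
```

A caveat on utility: with embeddings of several hundred dimensions, per-coordinate noise at this scale can swamp the signal unless epsilon is large, which is why practical schemes calibrate more carefully. Measure retrieval quality before and after on your own data rather than assuming the trade-off is acceptable.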

Auditing the privacy posture

RAG systems evolve constantly: new document sources get added, chunking strategies change, models are swapped. Each change is an opportunity to silently break the redaction layer. Three habits keep the system honest:

CI policy regression tests. Run Philter Scope against a gold-standard set on every policy change. If recall drops below your threshold, the build fails. We covered the CI pattern in detail elsewhere.
Live monitoring with Phield. Track the entity-type distribution on both stages. A sudden drop in detections at ingestion usually means a new document format the policy doesn't handle. A spike in detections at the proxy means raw PII is reaching inference — meaning ingestion has a hole.
Sample-and-review. Periodically sample retrieved chunks (post-redaction) and verify by hand that they're acceptable. Automated metrics tell you what the policy thinks it's catching; manual review tells you what it's actually catching.
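The CI habit boils down to a recall gate, which can be sketched in a few lines. This is a generic illustration, not Philter Scope's interface: `Span`, `recall`, and `gate` are hypothetical names, and the 0.95 threshold is a placeholder. Gold spans are the hand-labeled entities in your test set; predicted spans are what the policy detected on the same text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "SSN"


def overlaps(a: Span, b: Span) -> bool:
    """True if the spans share a label and at least one character."""
    return a.label == b.label and a.start < b.end and b.start < a.end


def recall(gold: list[Span], predicted: list[Span]) -> float:
    """Fraction of gold spans that some predicted span overlaps."""
    if not gold:
        return 1.0
    found = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    return found / len(gold)


def gate(gold: list[Span], predicted: list[Span],
         threshold: float = 0.95) -> None:
    """Exit with an error message (non-zero status) if recall is too low."""
    score = recall(gold, predicted)
    if score < threshold:
        raise SystemExit(
            f"redaction recall {score:.3f} is below threshold {threshold:.3f}"
        )


gold = [Span(0, 10, "NAME"), Span(25, 36, "SSN")]
predicted = [Span(0, 10, "NAME"), Span(25, 36, "SSN")]
gate(gold, predicted)  # passes silently when recall meets the threshold
```

Note that overlap matching is lenient: a prediction that covers only part of a gold span still counts as a catch, even though the uncovered characters would leak. For redaction, requiring the prediction to fully cover the gold span is the stricter and often more appropriate choice.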

For regulated workloads

RAG in healthcare, finance, or legal has additional concerns beyond what this post covers:

Healthcare RAG needs to satisfy HIPAA Safe Harbor at both ingestion and inference. The Safe Harbor blueprint covers the specific identifier mapping.
Cross-border RAG (a model hosted in one region, retrieved data sourced in another) needs explicit data-residency analysis. The proxy can enforce regional routing rules — sometimes the right answer is "this query never goes to the SaaS model at all."
Air-gapped or government-cloud RAG removes the hosted-LLM hop entirely; use a self-hosted open-weights model (Llama, Mistral, etc.) and run the entire pipeline behind an air-gapped Philter deployment.
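The regional-routing idea from the cross-border case can be sketched as a lookup that is allowed to return "nowhere". Everything here is hypothetical, including the region names, the allow-list, and the endpoint URLs. It is not the proxy's configuration format, only the shape of the decision.

```python
from typing import Optional

# Hypothetical policy: (data_region, model_region) pairs that are permitted.
ALLOWED = {("eu", "eu"), ("us", "us"), ("us", "eu")}

# Hypothetical endpoints; placeholders, not real services.
ENDPOINTS = {
    "eu": "https://llm.eu.example.internal/v1",
    "us": "https://llm.us.example.internal/v1",
}


def route(data_region: str, preferred: list[str]) -> Optional[str]:
    """Return the first permitted endpoint, or None if no route is allowed."""
    for model_region in preferred:
        if (data_region, model_region) in ALLOWED and model_region in ENDPOINTS:
            return ENDPOINTS[model_region]
    return None  # the query never leaves; the caller must refuse or go local


endpoint = route("eu", ["us", "eu"])
if endpoint is None:
    raise RuntimeError("no permitted model region for this data")
```

Returning `None` instead of falling back to a default endpoint keeps the failure mode closed: an unmapped region blocks the request rather than silently sending it somewhere.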

The bottom line

A privacy-aware RAG system isn't fundamentally different from a vanilla one — it's the vanilla architecture with two redaction gates and continuous monitoring. The toolkit to do that is small: Philter at ingestion, Philter AI Proxy at inference, Phield for monitoring, Philter Scope for measurement. None of them requires changing your retrieval or model code.

The hard part isn't the engineering — it's the discipline of treating privacy as a property of the pipeline, not a property of any single component. Pick the right policy once, apply it everywhere PII can leak, measure continuously.

If you're building a RAG system for a regulated workload and want to talk through the specifics, get in touch. We've done this for clinical chart-summarization, financial-services compliance assistants, and internal-docs bots over sensitive engineering wikis — the patterns are stable; the policy details vary.