Philter
Philterd builds open source, self-hosted privacy software for the workloads where a PII leak isn't an option — healthcare, finance, legal, and government pipelines that need redaction wired in from day one.
Every tool we ship runs inside your cloud. Your data never leaves your perimeter, never reaches a third-party API, and never lands in someone else's logs.
New · Open Source
Arbiter is the newest addition to our open source toolkit for PII privacy — a human-in-the-loop review surface for redaction pipelines. Reviewers see every detection in context, accept or override automated decisions, and apply structured exemption codes that flow into your audit trail. Built on Philter; designed for AI training-data prep and regulated everyday workflows.
A complete stack for finding, redacting, monitoring, and auditing sensitive data — from low-level libraries to turnkey services. Each project is Apache 2.0 licensed and developed in the open on GitHub.
Turnkey, self-hosted PII redaction with a clean API. Drops into any pipeline that needs sensitive data removed from text — and runs entirely inside your cloud.
The core redaction, anonymization, masking, and replacement library underneath Philter. Available in Java, Python, .NET, and Go.
The trained AI and NLP models that find PII and PHI in text, plus the service that hosts them. Designed to plug directly into Phileas and Philter.
High-speed discovery scanner that crawls files and storage to map where sensitive information actually lives across your environment.
Intelligent monitoring that tracks PII flow across the organization and alerts on suspicious activity or unusual trends.
Drop-in proxy that redacts PII and PHI before prompts reach LLM providers like OpenAI and Anthropic Claude.
Standalone audit tool that scores redaction policies on precision and recall, so policy changes can be measured rather than guessed at.
Human-in-the-loop PII redaction. Search, review, and override automated detection decisions with structured exemption codes — built for AI training-data prep and regulated everyday workflows.
Privacy-first analytics that applies differential privacy to PII counts, preserving statistical utility without exposing individuals.
Web console that lets non-technical users build and deploy redaction rules through a visual, no-code interface.
Client SDKs for Java, .NET, and Go are available alongside the rest of the toolkit at github.com/philterd.
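As a rough illustration of how differential privacy can be applied to PII counts, the sketch below adds Laplace noise to a count. It assumes the standard Laplace mechanism with a sensitivity of 1 (one individual changes a count by at most 1). The function name `noisy_count` and the `epsilon` parameter are hypothetical and are not taken from any Philterd project.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return true_count plus Laplace noise with scale 1/epsilon.

    Hypothetical helper for illustration. Assumes a counting query with
    sensitivity 1. Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two independent Exponential(rate=epsilon) draws
    # follows a Laplace distribution with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random()
    # Report how many documents contained an SSN without exposing the exact count.
    print(noisy_count(true_count=128, epsilon=0.5, rng=rng))
```

Individual results vary from run to run, but averages over many records stay close to the true value, which is what preserves statistical utility.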
Same redaction engine, three paths. Pick the one that fits your team.
Free forever
$0 · Apache 2.0
Run the entire Philterd toolkit yourself. Full source on GitHub — no license keys, no usage caps, no commercial review.
Best for: Engineering-led teams who want to own every layer.
Per-hour billing
Pay-as-you-go
Deploy Philter into your cloud from AWS, Google Cloud, or Azure marketplaces. Production-ready in minutes; billed through your existing cloud account.
Best for: Teams that want production-ready Philter without managing builds or ops.
Available on AWS, Google Cloud, and Azure.
Engagement-based
Custom
Work directly with the people who built the toolkit. Custom NLP models, privacy architecture, embedded engineering, and production deployment with full handoff.
Best for: Healthcare, finance, and government workloads with custom requirements.
Everything we build sits on top of three commitments: data stays with you, the source code stays open, and the AI underneath stays purpose-built for the job.
Philter and the rest of the Philterd toolkit run inside your cloud. Your data never leaves your perimeter, never reaches a third-party API, and never lands in someone else's logs.
Transparency is the only way to verify privacy software. Our core engine is Apache 2.0 licensed — your engineers can read every line, audit every decision, and extend the stack on their own terms.
Generic LLMs make poor privacy filters. We train and ship specialized NLP and deep-learning models built specifically for PII and PHI detection — accurate, tunable, and operationally affordable at scale.
Our specialized large language models are trained exclusively for high-accuracy PII discovery, classification, and redaction. An effective PII strategy combines pattern matching for structured data with AI for the unstructured text where rigid patterns fall short.
Predefined character sequences detect structured data like credit-card numbers, SSNs, and email addresses. Fast, predictable, and lightweight to run — but rigid in the face of unstructured text.
Trained models read the linguistic context and intent around sensitive data. Highly accurate, adaptable across languages, and effective on the unstructured text where patterns alone fail.
Pattern matching provides a high-speed foundation; LLMs add the intelligent oversight unstructured text demands. The two methods complement each other inside every Philterd deployment.
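A minimal sketch of how the two methods can be combined, assuming detections are represented as (start, end, label) spans. The regular expressions are simplified for illustration, and `toy_model` is a hypothetical stand-in for a trained model. None of these names come from the Philter or Phileas APIs.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Simplified illustrative patterns; production detectors need stricter validation.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def pattern_spans(text: str) -> List[Span]:
    """Find structured identifiers with regular expressions."""
    spans = []
    for label, rx in PATTERNS.items():
        for m in rx.finditer(text):
            spans.append((m.start(), m.end(), label))
    return spans


def toy_model(text: str) -> List[Span]:
    """Hypothetical stand-in for a trained model: flags one hard-coded name."""
    name = "Jane Doe"
    i = text.find(name)
    return [(i, i + len(name), "PERSON")] if i >= 0 else []


def redact(text: str, model: Callable[[str], List[Span]]) -> str:
    """Merge pattern and model detections, then replace each with a label token."""
    spans = sorted(pattern_spans(text) + model(text))
    out, cursor = [], 0
    for start, end, label in spans:
        if start < cursor:  # skip a span that overlaps one already redacted
            continue
        out.append(text[cursor:start])
        out.append("{{" + label + "}}")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    print(redact("Jane Doe, SSN 123-45-6789, jane@example.com", toy_model))
    # prints: {{PERSON}}, SSN {{SSN}}, {{EMAIL}}
```

The patterns catch the fixed-format identifiers cheaply, while the model callable covers entities such as names that have no fixed shape.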
Three deliberate stages turn general-purpose model architectures into specialized PII and PHI detectors — how the data is sourced, how the model is trained, and how the result is measured. Each stage is engineered so accuracy never comes at the cost of privacy.
We assemble high-fidelity datasets spanning global PII and PHI entities — tax identifiers, medical terminology, financial records. Where privacy demands it, we generate millions of realistic synthetic records so models never see real sensitive data during training.
Models are fine-tuned to identify entities based on the words surrounding them, learning linguistic intent rather than surface patterns — so they can distinguish sensitive information from harmless data that happens to share a similar structure.
Every model release is evaluated against a gold-standard benchmark suite that measures precision, recall, and F1 on the entity types that matter — so policy and architecture decisions rest on measurable performance, not vendor claims.
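For readers unfamiliar with the metrics, here is a small sketch of how precision, recall, and F1 can be computed for entity detection. It assumes exact-match scoring over (start, end, label) spans, which is one common convention and not necessarily the one used in our benchmark suite. The `score` function is a hypothetical helper.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def score(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    tp = len(gold & predicted)   # detections that match an annotated entity
    fp = len(predicted - gold)   # detections with no matching annotation
    fn = len(gold - predicted)   # annotated entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = {(0, 8, "PERSON"), (14, 25, "SSN"), (27, 43, "EMAIL")}
    predicted = {(0, 8, "PERSON"), (14, 25, "SSN"), (50, 55, "PERSON")}
    print(score(gold, predicted))
```

In the usage example, two of three predictions are correct and two of three annotated entities are found, so precision, recall, and F1 all come out to two thirds.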
Curious about the technical details? Read about our hybrid approach →
Practical posts on PII redaction, AI privacy, and self-hosted compliance.
Philter, Redaction
Automated redaction handles most of the volume; humans handle the last few percent that automation can't. Arbiter is the open source review surface that bridges the two — built on Philter, designed for AI training data and regulated everyday workflows.
Read post →
AWS, Philter
Per-character SaaS pricing looks cheap at demo scale and gets eye-watering at production scale. A worked-example TCO comparison: AWS Comprehend, Google Cloud DLP, and self-hosted Philter on the marketplace.
Read post →
Redaction, Philter
PII is the term everyone uses and few people define the same way. A practitioner's guide to what counts as PII, how to find it in real data, and how to handle it without breaking everything downstream.
Read post →


Philterd provides a zero-trust architecture for HIPAA, GDPR, and CCPA compliance. The discovery engine operates entirely within your infrastructure — 100% data sovereignty, no external API dependencies, no third-party data training.
To satisfy HIPAA Safe Harbor requirements, we pair high-speed pattern matching for structured identifiers with specialized AI models for everything else, capturing all 18 protected identifiers listed under 45 CFR § 164.514(b)(2). Healthcare and life-sciences organizations can automate de-identification across massive datasets while preserving the utility the data needs for research and innovation.
Your Data. Your Infrastructure. Total Sovereignty.
Most redaction solutions require a trade-off between intelligence and privacy, forcing you to send sensitive data to third-party APIs for processing. We remove this risk with a privacy-first architecture designed for zero-trust environments.
Our AI models and processing engines run entirely within your own VPC or on-premise hardware. No sensitive data ever leaves your secure perimeter.
Engineered for high-security sectors, the Philterd suite can operate in completely offline environments with no outbound internet dependency.
We do not and cannot see your data. Our tools process information in-memory, ensuring that your raw inputs are never logged, stored, or used to train our models.
By keeping the entire PII lifecycle — from discovery to redaction — local, you maintain a clean chain of custody that satisfies the most stringent global security audits.
Every API call is processed independently — no session state, no shared cache, no cross-request memory. One request can't leak information from a prior one, and a restarted instance is functionally identical to a fresh one.
Every line of the redaction engine is Apache 2.0 licensed and inspectable on GitHub. No black-box AI, no proprietary binaries — your security team can audit the code that touches your data.
High-Performance Intelligence Without Privacy Compromise.
We believe the tools used to protect privacy should be built with the highest privacy standards. Our AI model development process is designed to ensure the "brains" of our systems are powerful, ethical, and secure.
We use high-fidelity synthetic data to train our models. By generating millions of realistic data scenarios — from medical records to financial statements — we train our AI models to recognize sensitive entities without ever exposing them to real-world PII.
Because our models are trained on synthetic datasets, they have no real-world PII to memorize, so an LLM cannot reveal sensitive training data in its output.
Every model version is rigorously tested against Philter Scope to ensure it meets our strict standards for accuracy, recall, and the reduction of false positives before it is ever released to your environment.
Need help mapping your HIPAA, GDPR, or PCI posture to a Philter deployment? Get an architecture review →
Accelerate compliance and reduce leak risk by working directly with the creators of Philter. We design, build, and deploy the privacy infrastructure your team will own — not a black box you have to renew every year.
We design end-to-end PII protection for your cloud and AI workloads — data flows, redaction layers, audit trails, and the guardrails that keep them aligned with HIPAA, GDPR, and CCPA.
Off-the-shelf models miss the entities that matter most in your domain. We train specialized PII/PHI detectors on your data, evaluated against precision and recall you can measure.
Pre-launch privacy review of generative AI and RAG systems. We trace PII through prompts, retrieval context, tool calls, logs, and vector stores — then ship the redaction and policy changes.
Full-scale evaluation of your existing privacy posture against the regulatory requirements you actually have to meet — and a prioritized remediation roadmap your team can execute.
Rapid triage when a privacy incident hits production. We scope exposure, contain the leak, instrument detection, and document the timeline for regulators and counsel.
Work directly with the creators of Philter. We pair with your developers, contribute production-grade code to your repos, and leave behind systems your team owns.
Have a specific project in mind? Schedule a 30-min call →
Philterd was founded in 2017 by Jeff Zemerick on a single principle: your most sensitive data should never leave your control. Years later, we're still the people building the privacy software our clients run in production: every line of code, every model, every release.
Philterd was founded by Jeff Zemerick after watching commercial privacy tools turn into proprietary black boxes — APIs that required sending sensitive data to the cloud just to redact it. We believed there was a better way.
We started by building Phileas as an open source library — auditable, embeddable, and free for anyone to use. It was the proof that privacy software didn't have to be opaque. The library quickly grew into the engine behind Philter, the enterprise-grade redaction API used today by healthcare, legal, and financial organizations.
Unlike vendors that wrap third-party APIs and resell the result, we own the models, the runtime, and the policy engine. Every component of the Philterd ecosystem is engineered in-house and released under Apache 2.0 — code you can read, audit, and extend.
When you email us, you reach the engineers who wrote the line of code in question. No outsourced support tier, no ticket triage gauntlet — just direct access to the maintainers.
Our leadership includes the PMC Chair of Apache OpenNLP and an Apache Software Foundation Member, backed by 15+ years of production NLP work. The models behind Philterd are built by the people who build the frameworks underneath them.
Every product we ship runs entirely inside your perimeter. No outbound API calls, no third-party data sharing, no surprise pricing changes. The architecture isn't a marketing choice — it's a structural commitment to the original principle.
Read the full story: From Phileas to Philter →
Tell us about your stack and the privacy problems you're trying to solve. We'll get back to you within one business day.