What is PII? A Practical Guide for Engineers and Compliance Teams

"PII" is one of those terms that everyone in technology uses and almost nobody defines the same way. Ask a compliance lawyer, a database engineer, and an ML scientist what PII is, and you'll get three different answers — all of them partially right. The disagreement doesn't matter much in conversation, but it matters a great deal when someone has to actually act on it: write a redaction policy, audit a data lake, or sign off on a release.

This post is a practical guide to what PII is, what it isn't, where it shows up, and how to handle it in production systems. It is aimed at engineers and compliance leads who need a working definition rather than a regulatory citation.

The definition that actually matters

The textbook definition: Personally Identifiable Information is any data that can be used — on its own or in combination with other data — to identify a specific individual.

The "on its own or in combination with" clause is the part most people miss. A first name on its own isn't PII. A first name plus a zip code plus a date of birth almost certainly is. This is why the regulatory definitions across NIST SP 800-122, GDPR Article 4(1), CCPA, and HIPAA all converge on the same idea: identifiability is the test, not a fixed list of fields.

That said, regulators and engineers both need lists to work with. So PII gets split into two practical categories.

Direct identifiers

Data that identifies a person by itself. The classic list:

Full name
Social Security Number (SSN) or Tax Identification Number (TIN)
Driver's license number
Passport number
Email address (in most contexts)
Phone number
Street address
Bank account number
Credit card number
Biometric data (fingerprint, face scan, voice print)

Quasi-identifiers (or "linkable" data)

Data that doesn't identify by itself but, when combined with other data, narrows down to an individual. Latanya Sweeney's well-known analysis of 1990 U.S. census data estimated that 87% of the U.S. population is uniquely identifiable from just three fields: 5-digit zip code, gender, and date of birth. Quasi-identifiers include:

Date of birth
Zip code (especially ones covering few residents)
Gender
Race / ethnicity
Occupation
Employer
Marital status
IP address
Device fingerprints / cookies
License plate
Educational background

A good rule of thumb: if you can think of a query that joins this field with two other fields and returns one person, it's a quasi-identifier.
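
The rule of thumb can be turned into a quick check: group records by a candidate set of fields and see how many combinations occur exactly once. The sketch below is a minimal illustration on invented toy records; the function name and field names are assumptions made for the example, not part of any library.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Return the fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    # Build one key per record from the chosen fields.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical toy records, invented for illustration only.
records = [
    {"zip": "02139", "gender": "F", "dob": "1984-03-12"},
    {"zip": "02139", "gender": "F", "dob": "1984-03-12"},
    {"zip": "02139", "gender": "M", "dob": "1990-07-01"},
    {"zip": "94110", "gender": "F", "dob": "1975-11-30"},
]

print(unique_fraction(records, ["zip"]))                   # 0.25
print(unique_fraction(records, ["zip", "gender", "dob"]))  # 0.5
```

Adding fields can only keep or raise the unique fraction, which is why a field that looks harmless alone can become identifying once joined with others.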

PII vs PHI vs NPPI

Three terms that get used interchangeably and shouldn't be:

PII (Personally Identifiable Information). The broad umbrella term. Applies to almost any regulated context.
PHI (Protected Health Information). The HIPAA-specific subset: PII plus health-related context. An SSN by itself is PII; an SSN attached to a diagnosis is PHI, and the regulatory exposure is much higher. See our HIPAA Safe Harbor blueprint for the 18 specific identifiers that trigger PHI rules.
NPPI (Nonpublic Personal Information). The Gramm-Leach-Bliley Act's term for financial PII: account numbers, transaction history, financial statements, tax records. Subject to the GLBA Safeguards Rule.

The practical implication: redacting "PII" without specifying which framework you're operating under is how you end up out of compliance with the one that actually applies to you. A redaction policy for a bank looks different from one for a hospital, even if both are nominally "removing PII."

Why this matters operationally

Three reasons engineers should care:

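Regulatory exposure. GDPR fines can reach EUR 20 million or 4% of global annual turnover, whichever is higher. HIPAA civil penalties are tiered by culpability and adjusted for inflation: at the upper end they run to tens of thousands of dollars per violation, with an annual cap of about $1.9M per violation category. State laws in California, Texas, Colorado, and Virginia add their own teeth.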
Breach amplification. The cost of a data breach scales with how much PII is exposed. A breach of 100,000 customer emails is bad; a breach of 100,000 records each containing name+SSN+account number is reportable, headline-worthy, and litigable.
Downstream contamination. Once PII enters your data lake, it propagates into analytics warehouses, search indexes, ML feature stores, vector databases, log archives, and backup snapshots. Cleaning up after the fact is far harder than catching it at ingestion.

Where PII actually lives

If you've never gone looking for PII in your own systems, the volume tends to be surprising. The usual suspects:

Customer service tickets and chat transcripts. Customers paste their account numbers, SSNs, addresses, and credit card numbers into chat windows constantly.
Application logs. Email addresses, IP addresses, and session tokens regularly end up in log statements that get shipped to centralized log aggregators.
Free-text comment fields. Order notes, support tickets, doctor's notes, anywhere a user types prose into a system. This is where unstructured PII hides — and where regex fails.
Email and document archives. Decades of correspondence containing every imaginable category of PII, mostly unindexed.
Test fixtures and seed data. Developers copying production data "just for a repro" is a common and easily overlooked source of PII leaks.
ML training sets and vector embeddings. The newest category. Once a model has been trained on PII, removing it is functionally impossible short of retraining.

The first step in any privacy program is usually a discovery pass to find out where PII actually is — not where you think it is. Phinder exists for exactly that pass: point it at a bucket or filesystem, get back a structured inventory of which entity types appear in which files.
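
To make the idea of a discovery pass concrete, here is a rough sketch of a file-level inventory using only the Python standard library. It is not Phinder and does not reflect its interface: the `inventory` function, the `PATTERNS` table, and the two deliberately simplistic regexes are assumptions made for illustration.

```python
import re
from pathlib import Path

# Illustrative patterns only; real detectors need far more care.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def inventory(root):
    """Map each file under root to a count of pattern hits per entity type."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Undecodable bytes are ignored so binary files do not abort the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = {}
        for name, pattern in PATTERNS.items():
            count = len(pattern.findall(text))
            if count:
                hits[name] = count
        if hits:
            report[str(path)] = hits
    return report

# Usage (hypothetical path): report = inventory("/data/exports")
```

Files with no hits are left out of the report, so the output reads as a list of places that need attention rather than a full directory listing.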

How to find PII reliably

Three approaches, each with trade-offs:

Pattern matching (regex). Fast, deterministic, and well suited to structured identifiers (SSNs, credit cards, phone numbers). Weak on unstructured text, where it can't tell a name from a noun.
Named Entity Recognition (NER). A trained model reads text and tags entities like Person, Location, Organization. Catches what regex can't, but can over-flag and is slower.
Hybrid. Run pattern matching first (cheap and certain), then layer NER on top for the unstructured residue. This is what production-grade redaction looks like — we wrote about why neither approach alone is enough.
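
A minimal sketch of the hybrid idea, under stated assumptions: the regexes are simplified, a Luhn checksum filters out digit runs that are not card numbers, and `toy_ner` is a hard-coded stand-in for a trained NER model. None of the names below come from a real library.

```python
import re
from typing import Callable, List, NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    label: str

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators

def luhn_ok(number: str) -> bool:
    """Luhn checksum, used to discard digit runs that are not card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def pattern_spans(text: str) -> List[Span]:
    spans = [Span(m.start(), m.end(), "US_SSN") for m in SSN_RE.finditer(text)]
    spans += [Span(m.start(), m.end(), "CREDIT_CARD")
              for m in CARD_RE.finditer(text) if luhn_ok(m.group())]
    return spans

def hybrid_spans(text: str, ner: Callable[[str], List[Span]]) -> List[Span]:
    """Pattern hits win; NER spans are kept only where they do not overlap one."""
    found = pattern_spans(text)
    for span in ner(text):
        if all(span.end <= f.start or span.start >= f.end for f in found):
            found.append(span)
    return sorted(found)

def toy_ner(text: str) -> List[Span]:
    # Stand-in for a trained NER model: flags one hard-coded name.
    return [Span(m.start(), m.end(), "PERSON") for m in re.finditer(r"Jane Doe", text)]

text = "Jane Doe paid with 4111 1111 1111 1111; SSN 123-45-6789."
for s in hybrid_spans(text, toy_ner):
    print(s.label, repr(text[s.start:s.end]))
```

Letting pattern hits take priority is one possible design choice: the deterministic stage is cheap and precise, so the model only has to decide about text the patterns did not claim.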

Phileas — the open source library underneath Philter — implements the hybrid approach with policy-driven configuration. You declare which entity types matter, what confidence threshold you'll accept, and how each entity should be handled if found.

How to handle PII once you've found it

Finding PII is half the problem. The other half is deciding what to do with each instance. Common strategies:

Redact. Replace with a placeholder (e.g., *** or [REDACTED]). Maximum safety, lowest data utility.
Mask. Keep the format, hide the content (e.g., a credit card becomes ****-****-****-1234). Useful when downstream systems need the last few digits for verification.
Encrypt. Reversibly transform with a key. The original can be recovered by a system holding the key; everyone else sees ciphertext. Format-preserving encryption keeps the data shape intact.
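Hash / pseudonymize. One-way keyed hash (for example, HMAC with a secret key; an unkeyed hash of a low-entropy value such as an SSN can be reversed by brute force). Allows joins across records without exposing the underlying value: a single patient appears consistently as the same opaque ID.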
Replace with synthetic. Substitute a realistic-looking fake value. Best for keeping data useful for analytics or ML training; ensures downstream systems behave the same as on real data.
Shift (for dates). Add a random offset per individual, preserving intervals between events. Required for some research workflows.
Drop. Remove the field entirely. Cleanest, but breaks records that depend on it.
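
Three of these strategies are easy to sketch with the standard library. The functions below are illustrative assumptions, not a production implementation: the date shift derives its per-person offset from a keyed hash rather than a stored random number, which is one way to keep the offset stable for each individual without keeping a lookup table.

```python
import hashlib
import hmac
from datetime import date, timedelta

def mask_card(card: str, keep: int = 4) -> str:
    """Replace every digit except the last `keep` with '*', leaving separators in place."""
    total_digits = sum(c.isdigit() for c in card)
    seen = 0
    out = []
    for c in card:
        if c.isdigit():
            seen += 1
            out.append(c if seen > total_digits - keep else "*")
        else:
            out.append(c)
    return "".join(out)

def pseudonymize(value: str, key: bytes) -> str:
    """Keyed one-way hash (HMAC-SHA256): same input and key give the same opaque ID."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def shift_date(d: date, person_id: str, key: bytes, max_days: int = 365) -> date:
    """Shift by a per-person offset derived from the key, so intervals are preserved."""
    digest = hmac.new(key, person_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)

key = b"demo-key-not-for-production"  # placeholder; load a real key from a secret store
print(mask_card("4111-1111-1111-1234"))  # ****-****-****-1234
print(pseudonymize("patient-42", key))
admitted = shift_date(date(2020, 1, 1), "patient-42", key)
discharged = shift_date(date(2020, 1, 15), "patient-42", key)
print((discharged - admitted).days)  # 14
```

Truncating the HMAC to 16 hex characters keeps IDs short at the cost of a higher collision chance; whether that trade is acceptable depends on how many distinct values you need to cover.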

A good policy picks a strategy per entity type, not a single strategy for the whole document. SSNs get encrypted; names get replaced with synthetic; dates get shifted; account numbers get masked to the last four; everything else drops. The Redaction Policy Editor exists to make this kind of per-entity configuration approachable without writing JSON by hand.
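
As a rough picture of what per-entity handling means in code, the sketch below maps entity types to handlers and drops anything unlisted. The `POLICY` table and `apply_policy` function are hypothetical and are not the Phileas policy format; the SSN handler uses a placeholder because real encryption would need a vetted cryptography library.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], str]

# Hypothetical policy: entity type -> handler. Not the Phileas policy format.
POLICY: Dict[str, Handler] = {
    "US_SSN": lambda v: "[SSN]",                              # placeholder instead of encryption
    "ACCOUNT_NUMBER": lambda v: "*" * (len(v) - 4) + v[-4:],  # mask to the last four
    "PERSON": lambda v: "Alex Example",                       # fixed synthetic name
}
DEFAULT: Handler = lambda v: ""  # anything unlisted is dropped

def apply_policy(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Rewrite each non-overlapping (start, end, label) span using its label's handler."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])  # untouched text before the span
        out.append(POLICY.get(label, DEFAULT)(text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "Jane Doe, acct 000123456789, SSN 123-45-6789"
spans = [(0, 8, "PERSON"), (15, 27, "ACCOUNT_NUMBER"), (33, 44, "US_SSN")]
print(apply_policy(text, spans))  # Alex Example, acct ********6789, SSN [SSN]
```

Keeping detection (the spans) separate from handling (the policy table) means the same detector output can feed different policies for different frameworks.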

Measure, don't trust

No PII discovery system is perfect. The right question isn't "did we catch everything?" — the right question is "what percentage did we catch, and is it good enough for our domain?"

Philter Scope scores any redaction pipeline on precision (of what we redacted, how much was actually PII?) and recall (of all the PII that existed, how much did we catch?). Healthcare workflows need recall close to 100% to satisfy HIPAA Safe Harbor; marketing workflows often weight precision higher to keep copy readable. Same toolkit, different thresholds — the threshold depends on the domain.
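
For readers who want the arithmetic spelled out, here is a small sketch of both metrics over exact-match spans. It is not how Philter Scope computes its scores; exact span matching and the `precision_recall` name are assumptions made for this example.

```python
def precision_recall(predicted, gold):
    """Score predicted spans against hand-labeled gold spans using exact matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # With nothing predicted (or nothing to find), treat the metric as perfect.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Spans as (start, end, label); values invented for illustration.
gold = {(0, 8, "PERSON"), (15, 27, "ACCOUNT"), (33, 44, "US_SSN"), (50, 60, "PHONE")}
predicted = {(0, 8, "PERSON"), (15, 27, "ACCOUNT"), (33, 44, "US_SSN"),
             (70, 75, "PERSON"), (80, 84, "PERSON")}

print(precision_recall(predicted, gold))  # (0.6, 0.75)
```

Here three of five redactions were real PII (precision 0.6) and three of four real items were caught (recall 0.75). A healthcare pipeline would treat the missed phone number as the bigger problem; a marketing pipeline might care more about the two false positives.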

The bottom line

"PII" sounds like a binary thing — either data is PII or it isn't — but in practice it's a spectrum of identifiability that varies with what other data you have. The job of a privacy program is to find PII reliably across structured and unstructured data, apply the right handling strategy per entity type, and measure performance against your actual workload.

That's what the Philterd toolkit is built for: discovery (Phinder), redaction (Philter), monitoring (Phield), and measurement (Philter Scope) — all open source, all self-hosted, all sharing the same policy format. If you want to talk through how that maps to your specific data, get in touch.