Talk to an Expert

Tell us about your stack and the privacy problems you're trying to solve. We typically respond within one business day.

Prefer to skip the form? Pick a time on our calendar →
or send a message

← All posts

The Phileas Trino Connector: PII Redaction as SQL

Apache Trino (formerly PrestoSQL) is the federated query engine that more and more organizations are using to query across data lakes, warehouses, and relational sources without ETL'ing data into a single place. The architectural promise is "query data where it lives"; the privacy implication is that sensitive data in any connected source can land in the result set of any query a user runs. PII that was carefully gated in one system becomes exposed the moment Trino joins it with anything else.

The Phileas Trino connector addresses this by making redaction a SQL primitive. You apply Phileas's policy-driven PII detection to any varchar column as part of a normal SQL query — redaction happens inside Trino's compute layer, the result rows go to the user already cleaned. No external service, no ETL stage, no application-layer redaction.

This post walks through what the connector does, how to install and configure it, real query patterns, and where it fits architecturally next to the other ways of doing redaction in a Trino-centric data platform.

What the connector is

The Phileas Trino connector is an Apache 2.0 plugin (Maven coordinates ai.philterd:phileas-connector) that registers a scalar SQL function inside Trino:

phileas_redact(varchar) → varchar

The function takes a string column value, applies the policy file you've configured, and returns the redacted string. Because it's a scalar function rather than a separate connector that exposes its own tables, it composes naturally with everything else Trino can do — SELECT, JOIN, INSERT INTO, CREATE TABLE AS SELECT, subqueries, CTEs, window functions, all of it.

Under the hood, the connector embeds the Phileas library directly inside Trino's worker JVMs. Redaction happens in-process on each worker as rows are processed; there's no network hop to an external service, no per-row API call, no serialization overhead beyond the normal column-value pipeline.

Prerequisites and installation

Three prerequisites:

  • Apache Trino installed and running. The connector tracks Trino's release numbering; the current artifact version (475 at the time of writing) aligns with Trino 475.
  • Java 24 for the build environment if you're building from source.
  • A Phileas policy file describing what to redact.

Install in three steps:

# 1. Build (or pull the release JAR)
git clone https://github.com/philterd/phileas-connector
cd phileas-connector
mvn clean package

# 2. Copy the connector plugin into the Trino plugin directory
cp -r ./target/phileas-connector-475 $TRINO_HOME/plugin/phileas

# 3. Create the catalog config
mkdir -p $TRINO_HOME/etc/catalog
cat > $TRINO_HOME/etc/catalog/phileas.properties << 'EOF'
connector.name=phileas
phileas.policy.file=/etc/trino/policies/default-policy.json
EOF

# 4. Restart Trino
$TRINO_HOME/bin/launcher restart

Verify the function is registered after restart:

trino> SELECT phileas_redact('My email is alice@example.com');
                 _col0
---------------------------------------
 My email is ******************
(1 row)

The policy file

The policy file is the same Phileas policy JSON format used everywhere else in the Philterd toolkit. A minimal example that covers email, SSN, credit card, and phone:

{
  "name": "default-policy",
  "identifiers": {
    "emailAddress": {
      "emailAddressFilterStrategies": [{ "strategy": "REDACT" }]
    },
    "ssn": {
      "ssnFilterStrategies": [{
        "strategy": "REDACT",
        "redactionFormat": "***-**-####"
      }]
    },
    "creditCard": {
      "creditCardFilterStrategies": [{
        "strategy": "REDACT",
        "redactionFormat": "****-****-****-####"
      }]
    },
    "phoneNumber": {
      "phoneNumberFilterStrategies": [{ "strategy": "REDACT" }]
    }
  }
}

The full policy surface is much richer — per-entity strategies (mask, encrypt with format-preserving keys, replace with synthetic, drop, hash), custom dictionaries, custom identifier patterns, conditional logic ("only redact zip codes where population < 20,000"), and language settings. The Phileas documentation covers the full schema; or use the Redaction Policy Editor to build a policy visually and export the JSON.

Query patterns

The connector becomes interesting when you start composing redaction into real queries.

Pattern 1: redact a column in a SELECT

The simplest case — query against any catalog (Postgres, Hive, Iceberg, etc.) and redact a free-text column on the way out:

SELECT
    ticket_id,
    created_at,
    customer_tier,
    phileas_redact(transcript) AS transcript_redacted
FROM postgres.support.tickets
WHERE created_at > DATE '2026-04-01';

Pattern 2: materialize a redacted copy with CTAS

Create a clean table from raw data in one query. The destination catalog can be anywhere Trino can write — Hive, Iceberg, Delta, etc. Useful for one-shot ETL or rebuilding a downstream analytics table after a policy change.

CREATE TABLE iceberg.analytics.tickets_redacted AS
SELECT
    ticket_id,
    created_at,
    customer_tier,
    phileas_redact(transcript) AS transcript,
    phileas_redact(subject) AS subject
FROM postgres.support.tickets;

Pattern 3: redact during a JOIN across federated sources

Trino's federated query story is what makes the connector especially useful. You can JOIN a customer table from Postgres with a ticket table from Iceberg and a chat log from S3, applying redaction to free-text columns in each, all in one query.

SELECT
    c.customer_id,
    c.tier,
    phileas_redact(t.transcript) AS ticket_text,
    phileas_redact(cl.message)   AS chat_text
FROM   postgres.crm.customers   c
JOIN   iceberg.support.tickets  t  ON c.customer_id = t.customer_id
JOIN   hive.logs.chat_messages  cl ON c.customer_id = cl.customer_id
WHERE  c.created_at > DATE '2026-04-01';

Pattern 4: build a view that downstream consumers query

Wrap the redaction logic in a SQL view; expose only the view to analytics consumers. The raw table stays accessible only to the small set of users who have a legitimate need.

CREATE VIEW analytics.tickets_safe AS
SELECT
    ticket_id,
    created_at,
    customer_tier,
    phileas_redact(transcript) AS transcript,
    phileas_redact(subject)    AS subject
FROM postgres.support.tickets;

-- Grant analytics teams access to the view, not the underlying table
GRANT SELECT ON analytics.tickets_safe TO ROLE analytics;

This is the cleanest pattern for most analytics use cases — redaction is centralized in one view definition, governed via standard SQL grants, and applied transparently to every query that hits the view.

Pattern 5: combine with Trino's row-level security

Trino supports row filters via the access-control.properties framework. Combine row-level filtering with column-level redaction for fine-grained governance: an analyst sees only the rows for their region, with sensitive text columns redacted.

-- View definition redacts text columns
CREATE VIEW analytics.tickets_safe AS
SELECT
    ticket_id,
    region,
    phileas_redact(transcript) AS transcript
FROM postgres.support.tickets;

-- Row-level filter (defined via access-control plugin) restricts
-- which regions an analyst can see. Both work together.

Where the connector fits architecturally

The Trino connector is one of three good options for SQL-resident redaction in a data platform. Each fits a different operational profile:

  • Phileas Trino connector (this post) — federated queries across many data sources; redaction happens inside Trino's compute. Best when Trino is your query layer and you want redaction to be a SQL primitive.
  • Snowflake patternsexternal functions, Java UDFs, ETL-stage redaction. Best when Snowflake is the warehouse and you want SQL-resident redaction there.
  • Streaming-stage redactionKafka or Kinesis pipelines. Best when you want to redact before data lands in any warehouse, so every downstream consumer (including Trino itself) sees only clean data.

Many production stacks combine all three: streaming-stage redaction as the primary defense (everything lands clean), with Trino and Snowflake patterns available as belt-and-suspenders for cases where raw data exists historically or for ad-hoc analyst queries against legacy sources.

Performance characteristics

A few rules of thumb from running the connector at moderate scale:

  • Per-call cost. Phileas's redaction time on a typical row (a few KB of text) is sub-millisecond for pattern-only detectors, single-digit milliseconds when NLP detectors fire. The connector adds essentially zero overhead beyond Phileas itself — the function call is in-process, no serialization.
  • Worker parallelism. Trino distributes work across all worker nodes; redaction parallelizes naturally. A 10-worker cluster gives you 10× the throughput of a single-worker setup.
  • Policy load time. Policies are loaded once per worker on first use. A large policy with many custom dictionaries may take a second or two to initialize the first time a query hits it; subsequent queries are warm.
  • Memory. The Phileas library + loaded NLP models add some baseline memory to each Trino worker. For deployments with large healthcare or domain-specific lenses, plan an extra ~500MB per worker.

Operating considerations

A handful of things to think through before standing this up in production:

  • Policy distribution. The policy file path is per-worker. For a multi-node cluster, make sure the policy file is in the same place on every worker (NFS mount, configuration-management deploy, or container image bake-in).
  • Policy updates. The connector loads the policy on initialization. Updating the policy currently requires a worker restart. For workloads where the policy changes frequently, the HTTP-based Philter pattern (external service that loads policies on demand) is the better fit.
  • Audit logging. Trino's query logs capture what was queried; combine with Phileas's optional detection logging to produce the artifact regulators look for — not just "this query ran" but "this query redacted these entity types in these counts."
  • CI for the policy. Treat the policy file as code. Run Philter Scope against a gold-standard test set on every policy change — the compliance-as-code pattern applies identically to Trino-deployed policies.

The bottom line

The Phileas Trino connector turns PII redaction into a SQL primitive, which is exactly what you want when Trino is your query layer. Wrap it in a view; expose the view instead of the underlying table; downstream consumers query SQL and get clean data without ever knowing redaction happened. For federated data platforms that pull from many sources, the connector is often the cleanest place to put the redaction step — redaction happens once, at the layer that consumers actually touch.

The connector is Apache 2.0 on GitHub; the Maven coordinates are ai.philterd:phileas-connector; the underlying library is Phileas. If you're standing up Trino-resident PII redaction for the first time or want help tuning policies against your real data, get in touch.