Closing the RLS hole in TwoOps

The deep security audit on TwoOps surfaced something I'd been worried about but hadn't put on paper: row-level security was enabled on roughly 30 tables and doing nothing.

Two reasons. First, the app connects to Postgres as the table owner, and Postgres lets the owner bypass non-FORCED RLS. Second, withTenantContext — the helper that sets app.current_tenant_id for the session — had zero production call sites. So the only thing standing between tenant A and tenant B's data was the discipline of hand-writing .where(eq(tenantId, ...)) correctly across fifty query files, forever, with no defense in depth.

That's not a security model. That's a hope.
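For readers who haven't seen this kind of helper, here is a minimal sketch of what a withTenantContext-style wrapper can look like. It is not the TwoOps implementation: the SqlClient interface and the function signature are assumptions made for illustration, and the sketch scopes the setting to a transaction (the third argument to Postgres's set_config) rather than to the whole session, so a pooled connection cannot carry one tenant's id into the next request.

```typescript
// Hypothetical minimal client shape so the sketch stays driver-agnostic.
// Any client exposing query(text, params) on a single connection would do.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Sketch only: the real withTenantContext signature is not shown in this post.
async function withTenantContext<T>(
  client: SqlClient,
  tenantId: string,
  fn: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // is_local = true: the setting reverts when the transaction ends.
    await client.query(
      "SELECT set_config('app.current_tenant_id', $1, true)",
      [tenantId],
    );
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    // Undo any partial writes, then let the caller see the original error.
    await client.query("ROLLBACK");
    throw err;
  }
}
```

The point of the shape is that every query issued inside fn runs on the same connection and inside the same transaction as the set_config call, which is what lets an RLS policy read the tenant id back.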

Phase 1: the plumbing

The first commit (SEC-1 Phase 1) flipped the database role to a non-owner, added FORCE ROW LEVEL SECURITY to the affected tables, and wired withTenantContext into every tRPC procedure. tRPC was the easy half — the context is already scoped per request, the tenant id is already on the session.
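To make the FORCE step concrete, here is a rough sketch of generating the migration statements for a list of tables. The table names, the tenant_id column, and the policy name are placeholders, not taken from the TwoOps schema; the example policy is only there to show how a policy can read the setting back.

```typescript
// Quote an identifier the way Postgres expects: wrap in double quotes and
// double any embedded double quote.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// One statement per table. FORCE makes RLS apply to the table owner too;
// without it, the owner bypasses policies even when RLS is enabled.
function forceRlsStatements(tables: string[]): string[] {
  return tables.map(
    (t) => `ALTER TABLE ${quoteIdent(t)} FORCE ROW LEVEL SECURITY`,
  );
}

// Hypothetical policy shape (column and policy names are assumptions).
// current_setting(..., true) returns NULL when the setting is missing, so a
// query with no tenant context matches no rows instead of raising an error.
function examplePolicy(table: string): string {
  return (
    `CREATE POLICY tenant_isolation ON ${quoteIdent(table)} ` +
    `USING (tenant_id::text = current_setting('app.current_tenant_id', true))`
  );
}
```

Calling forceRlsStatements(["alerts", "drift_findings"]) yields two ALTER TABLE statements that can be dropped into a migration. Superusers and roles with BYPASSRLS still skip the policies, which is why the role flip matters as much as the FORCE.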

The hard half was everything that doesn't go through tRPC.

Phase 2: the audit

I went through every non-tRPC caller of the DB — schedulers, webhook handlers, background jobs — and wrote the punch list into docs/security-rls-rollout.md. Ten files. Seven scheduler jobs (alert-rule-evaluator, drift detection, recommendation refresh, etc.), three webhook handlers (Stripe, Polar, GitHub).

The scheduler jobs were mechanical: iterate tenants, wrap the per-tenant block in withTenantContext. The webhook handlers were the interesting case. The customer-id-to-tenant-id lookup has to run before you can set tenant context, because there's no tenant yet, just a Stripe customer id arriving from the outside. Once TWOOPS_FORCE_RLS=true, that lookup would run with no tenant set, match no rows under the policies, silently return empty, and the webhook would 200-OK its way into a black hole.

Fix: a separate adminDb handle that uses DATABASE_URL_ADMIN (a role with BYPASSRLS), used exclusively for the resolver step. The handler then enters withTenantContext for the actual write. One narrow, named exit from the RLS regime, instead of an entire connection pool that pretends the regime exists.
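The two-step shape of a webhook handler under this scheme can be sketched as follows. Only the names adminDb and withTenantContext come from the post; the interfaces, the function name, and the choice to answer an unknown customer with a non-2xx status are my assumptions for illustration.

```typescript
type TenantId = string;

// Hypothetical admin handle: backed by the BYPASSRLS role, and deliberately
// exposing nothing except the resolver lookup.
interface AdminDb {
  findTenantByStripeCustomer(customerId: string): Promise<TenantId | null>;
}

// Assumed signature for the tenant-scoped wrapper.
type WithTenantContext = <T>(
  tenantId: TenantId,
  fn: () => Promise<T>,
) => Promise<T>;

interface WebhookResult {
  status: number;
  body: string;
}

async function handleStripeEvent(
  adminDb: AdminDb,
  withTenantContext: WithTenantContext,
  customerId: string,
  applyEvent: (tenantId: TenantId) => Promise<void>,
): Promise<WebhookResult> {
  // Step 1: resolve the tenant outside RLS. This is the only admin query.
  const tenantId = await adminDb.findTenantByStripeCustomer(customerId);
  if (tenantId === null) {
    // Assumption: surface the miss instead of acknowledging with a 200.
    return { status: 404, body: `no tenant for customer ${customerId}` };
  }
  // Step 2: do the actual write under RLS, scoped to the resolved tenant.
  await withTenantContext(tenantId, () => applyEvent(tenantId));
  return { status: 200, body: "ok" };
}
```

Keeping the admin interface down to a single resolver method is what makes the exit narrow: a handler cannot accidentally do its writes through the bypass role, because the handle offers no way to write.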

The other two blockers

Same sweep closed SEC-2 and SEC-3. The dev-auth bypass was the worst of the three on paper: TWOOPS_ENABLE_DEV_AUTH was an OR-gate next to the NODE_ENV !== "production" check, meaning a single env var on a staging environment that shared the prod DB enabled password-less email sign-in. Removed the env-var leg. Added a runtime check inside the authorize callback that refuses the sign-in when NODE_ENV === "production", as belt-and-suspenders. Rate limiting on auth endpoints got reattached; it had been silently dropped during a middleware refactor months ago.
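The gate logic before and after can be sketched like this. The function names, the Env shape, and the assumption that the flag was compared against the string "true" are illustrative; only the two variable names come from the actual change.

```typescript
interface Env {
  NODE_ENV?: string;
  TWOOPS_ENABLE_DEV_AUTH?: string;
}

// Before: either leg of the OR enables dev auth, so the flag alone is
// enough even when NODE_ENV is "production".
function devAuthAllowedBefore(env: Env): boolean {
  return env.NODE_ENV !== "production" || env.TWOOPS_ENABLE_DEV_AUTH === "true";
}

// After: the env-var leg is gone; only a non-production NODE_ENV enables it.
function devAuthAllowed(env: Env): boolean {
  return env.NODE_ENV !== "production";
}

// Startup-time gate: decides whether the dev provider is registered at all.
function buildProviders(env: Env): string[] {
  const providers = ["credentials"];
  if (devAuthAllowed(env)) providers.push("dev-email");
  return providers;
}

// Request-time gate: even if the provider were registered by mistake, the
// callback itself refuses to sign anyone in when running in production.
function authorizeDevSignIn(env: Env, email: string): { email: string } | null {
  if (env.NODE_ENV === "production") return null;
  return { email };
}
```

With { NODE_ENV: "production", TWOOPS_ENABLE_DEV_AUTH: "true" }, the old gate returns true and the new one returns false, which is the whole fix in one comparison.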

The launch-readiness bits

Alongside the security work, a small cluster of operational polish: /api/health now pings DB, Redis, and the ai-engine in parallel and returns 503 if any of them is degraded, so the readiness probe actually drops a sick pod from the LB. /api/health/live stays lightweight so a downstream blip doesn't cascade-kill the web pod. Compliance got a row-level "Generate Fix" button for parity with drift's ergonomics. Dark mode tints on the compliance policy cards finally stopped looking broken.
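The readiness/liveness split can be sketched as below. The probe functions are stand-ins supplied by the caller and the report shape is an assumption; the real endpoints may differ in timeouts, naming, and response body.

```typescript
type Probe = () => Promise<void>;
type CheckState = "ok" | "fail";

interface HealthReport {
  status: 200 | 503;
  checks: Record<string, CheckState>;
}

// Readiness: run every dependency probe concurrently. Each probe turns its
// own failure into "fail", so one rejection cannot hide the other results.
async function checkReadiness(probes: Record<string, Probe>): Promise<HealthReport> {
  const settled = await Promise.all(
    Object.entries(probes).map(([name, probe]) =>
      Promise.resolve()
        .then(() => probe())
        .then(
          (): [string, CheckState] => [name, "ok"],
          (): [string, CheckState] => [name, "fail"],
        ),
    ),
  );
  const checks: Record<string, CheckState> = {};
  let degraded = false;
  settled.forEach(([name, state]) => {
    checks[name] = state;
    if (state === "fail") degraded = true;
  });
  return { status: degraded ? 503 : 200, checks };
}

// Liveness: no downstream calls, so a dependency outage never restarts the pod.
function checkLiveness(): HealthReport {
  return { status: 200, checks: {} };
}
```

A call such as checkReadiness({ db: pingDb, redis: pingRedis, aiEngine: pingAiEngine }), with hypothetical ping functions, takes as long as the slowest probe rather than the sum of all three.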

What I'm flipping next

TWOOPS_FORCE_RLS=true in prod. The whole point of the last two weeks was making that flag a safe behavior change rather than a silent-failure trap. We'll see how clean the cutover is.

Boring security work. The kind that nobody notices when it ships, which is the whole job.