The problem with most healthcare K8s migrations
Healthcare engineering teams that move to Kubernetes without explicitly designing for HIPAA tend to fall into one of two patterns. Either they over-engineer — building a bespoke control plane, custom audit pipelines, and homegrown encryption layers that produce a system only the original team can run. Or they under-engineer — running stock K8s without addressing the controls auditors actually inspect, and discovering the gap during the next assessment.
Both failure modes are avoidable. The patterns below come from a recent engagement at a Fortune 500 health system: 340 microservices, two regions, multi-EHR integrations, and a quarterly audit that had been finding patch-latency issues for three years running. The migration cleared the next audit with zero material findings.
Pattern 1: PHI logging at the platform, not the application
Every application logging PHI directly is a HIPAA exposure waiting to happen. The platform pattern: an admission webhook that injects a sidecar (or eBPF agent) that captures and tokenizes PHI in logs before they leave the pod. Applications log freely; the platform redacts deterministically; auditors see consistent treatment across services.
What we used: OPA Gatekeeper for admission policies, custom webhook for sidecar injection, and a Loki-based aggregator that enforces a redaction filter at ingestion time. PHI never enters cluster-wide log storage in plaintext.
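A redaction filter at ingestion time can be sketched with Promtail pipeline stages in front of Loki. This is a minimal illustration, not the engagement's actual configuration: the regexes are placeholder patterns for SSN- and MRN-shaped tokens, and a real deployment needs a vetted pattern set tuned to your EHR formats.

```yaml
# Illustrative Promtail scrape config: redact PHI-shaped tokens
# before log lines leave the node for Loki. Regexes are examples only.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Replace SSN-shaped tokens (ddd-dd-dddd) at ingestion time.
      - replace:
          expression: '(\d{3}-\d{2}-\d{4})'
          replace: '[REDACTED-SSN]'
      # Replace MRN-shaped tokens (format assumed; adjust per EHR).
      - replace:
          expression: '(MRN[:=]?\s*\d{6,10})'
          replace: '[REDACTED-MRN]'
```

The key property is where the filter runs: on the collection path, before cluster-wide storage, so plaintext PHI never lands in the aggregator regardless of what any application logs.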
Pattern 2: Network microsegmentation by namespace as tenant
Default-deny network policies, with explicit ingress/egress declarations per namespace. We treat each namespace as a tenant: clinical workloads can't talk to billing, billing can't talk to research, and so on, regardless of what any individual service intends.
Tools: Cilium for L3/L4/L7 policy with eBPF observability, plus NetworkPolicy resources versioned in Git alongside application code. Policy violations get caught in CI before deploy, not at runtime.
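The namespace-as-tenant baseline can be sketched as a pair of standard NetworkPolicy resources. The `billing` namespace and `tenant: gateway` label are illustrative names, not the engagement's actual topology: a default-deny applied at namespace creation, plus one explicit allowance per legitimate peer.

```yaml
# Default-deny for a tenant namespace ("billing" is illustrative).
# Applied automatically when the namespace is created.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: billing
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allowance: billing pods accept traffic only from the
# gateway tenant's namespace, on port 8443. Everything else stays denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-gateway
  namespace: billing
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: gateway
      ports:
        - protocol: TCP
          port: 8443
```

Because these are plain YAML resources in Git, CI can lint every proposed allowance against the tenant model before anything reaches the cluster.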
Pattern 3: Workload identity, not service accounts with passwords
The death of long-lived service-account credentials. Every workload authenticates via short-lived tokens issued by a workload-identity provider — AWS IAM Roles for Service Accounts, GKE Workload Identity, or Azure Workload Identity Federation depending on your cloud.
The audit value: every API call from a workload is traceable to a specific pod identity at a specific point in time. No shared credentials, no rotation drama, no "we don't know who used that password."
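On EKS, the IRSA binding is a single annotation on a ServiceAccount. A minimal sketch, with a hypothetical role ARN and workload name: pods running under this ServiceAccount receive short-lived credentials scoped to exactly one IAM role, and the resulting CloudTrail entries identify the workload rather than a shared key.

```yaml
# Hypothetical IRSA binding (AWS EKS; names and account ID illustrative).
# The EKS pod identity webhook injects short-lived credentials for this
# role into any pod that uses this ServiceAccount -- no stored secrets.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: claims-processor
  namespace: billing
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/claims-processor
```

GKE Workload Identity and Azure Workload Identity Federation follow the same shape: an annotation or label binds the Kubernetes identity to a cloud identity, and token exchange replaces long-lived credentials.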
Pattern 4: BAA-cleared cluster configuration as code
The cluster configuration that satisfies your cloud provider's BAA is not the default cluster configuration. Encryption-at-rest with customer-managed keys, audit logging to immutable storage, restricted node-to-control-plane traffic, and explicit BAA scope for every dependency.
Pattern: codify the BAA-cleared baseline in Terraform modules. Every cluster is born compliant. Drift detection catches anyone deviating. The compliance team reviews the modules once; subsequent clusters are audit-ready by default.
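A sketch of what the baseline module looks like, assuming AWS EKS; resource names and variables are illustrative, and the real module carries many more controls (node policies, logging sinks, BAA scope tags). The point is that the non-default settings live in code, not in a runbook.

```hcl
# Hypothetical BAA baseline module (AWS EKS assumed; names illustrative).
# Customer-managed key for Kubernetes secret encryption.
resource "aws_kms_key" "cluster_secrets" {
  description         = "CMK for Kubernetes secret encryption"
  enable_key_rotation = true
}

resource "aws_eks_cluster" "baseline" {
  name     = var.cluster_name
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_public_access  = false   # control plane reachable only in-VPC
    endpoint_private_access = true
  }

  # Envelope-encrypt secrets with the customer-managed key.
  encryption_config {
    provider {
      key_arn = aws_kms_key.cluster_secrets.arn
    }
    resources = ["secrets"]
  }

  # All control-plane log types enabled; ship onward to immutable
  # storage (e.g. S3 with Object Lock) outside this module.
  enabled_cluster_log_types = [
    "api", "audit", "authenticator", "controllerManager", "scheduler",
  ]
}
```

Drift detection then reduces to `terraform plan` in CI: any cluster that diverges from the module surfaces as a diff, not as an audit finding.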
Pattern 5: Phased cutover with parallel run
Don't big-bang. We use a phased migration with three properties:
- Lowest-risk workloads first. Reporting services that can run in either system without consequence. Build operational muscle memory before touching PHI-handling services.
- Parallel run for at least 30 days. Both systems serving traffic, results compared bit-for-bit. Differences investigated before cutover.
- PHI-touching services with red-team review. The penultimate phase. Senior security engineers from your side review the migration design before any patient record touches the new platform.
Pattern 6: Postmortem culture from day one
The platform you build is only as good as the operational discipline running it. The migration is the time to install postmortem practice — blameless, written, root-cause-driven. Auditors love seeing this in interviews; engineers benefit from the institutional memory; future incidents get shorter every quarter.
We use a simple template: incident summary, impact, timeline, root cause analysis (5 whys), corrective actions with owners and dates. Every postmortem reviewed at a weekly engineering forum.
What this combination produced
For the engagement that generated these patterns:
- Deploy time: 4 hours → 12 minutes
- Audit findings on patch latency: cleared in next cycle
- PHI access auditing: compliance queries answered in seconds, not days
- Operational headcount freed: 2 engineers reassigned from manual operations to platform improvement work
None of these patterns is novel in isolation. The leverage comes from the combination — and from the discipline to ship them as a system rather than as separate initiatives that lose energy halfway through.