Executive Summary
After researching PostgreSQL operators, multi-tenancy patterns, self-hosted alternatives, HA configurations, storage options, performance tuning, backup strategies, and monitoring -- here is the unified recommendation for our Hetzner K3s cluster.
Recommendation: Use CloudNativePG operator with a hybrid multi-tenancy approach (start with one shared HA cluster, promote critical projects to dedicated clusters as needed). Use local NVMe storage, PgBouncer in transaction mode, and backups to Hetzner Object Storage.
1. Operator Comparison
CloudNativePG (CNPG) -- STRONGLY RECOMMENDED
The clear winner for our use case. CNCF Sandbox project (applying for Incubation). 7,700+ GitHub stars, 880 commits/year, 132M+ image downloads. Apache 2.0 license.
- HA: Kubernetes-native failover (no Patroni/etcd dependency). Quorum-based failover (stable in v1.28). Self-healing with auto pod restart, replica promotion, rolling updates.
- Backup: Built on Barman. S3-compatible storage, continuous WAL archiving, full PITR, scheduled base backups, compression & encryption.
- Monitoring: Built-in Prometheus exporter with customizable SQL metrics. PodMonitor auto-creation. Official Grafana dashboard.
- Connection Pooling: Native PgBouncer via a dedicated `Pooler` CRD. Separate, scalable PgBouncer pods.
- K3s/Hetzner: Proven in production on K3s + Hetzner (Brella case study: zero issues after 7 months).
- GitOps: Fully declarative CRDs -- perfect for infrastructure-as-code repos.
- Multi-tenancy: Namespace-based isolation. Cluster-wide or namespace-scoped operator installation.
One caveat: Failover time on Hetzner K3s can be ~5 minutes for node failures (vs ~30s on managed cloud providers) because Kubernetes node-failure detection is slower on Hetzner. This is an infrastructure-level limitation, not a CNPG issue.
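To make the GitOps point concrete, a minimal CNPG `Cluster` manifest looks like the sketch below. The cluster name, namespace, storage size, and `storageClass` are placeholders for our environment, not values from any existing deployment:

```yaml
# Hypothetical minimal CNPG cluster; names and sizes are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: shared-pg
  namespace: databases
spec:
  instances: 3              # one primary + two replicas
  storage:
    size: 50Gi
    storageClass: local-path
  monitoring:
    enablePodMonitor: true  # auto-create the Prometheus PodMonitor
```

Applying this single resource gives a replicated, self-healing cluster; everything else in this document layers on top of the same CRD.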
Zalando Postgres Operator -- 3rd Place
- ~4,100 GitHub stars. NOT a CNCF project. Release cadence slowing.
- Built on Patroni + Spilo. Proven at scale inside Zalando.
- Unique Team API for multi-tenancy (best among all operators).
- WAL-G for backups.
- Community momentum shifting to CNPG. Harder to recommend for new deployments in 2026.
CrunchyData PGO -- 2nd Place
- ~3,900 GitHub stars. Oldest operator (production since 2017). Backed by Crunchy Data.
- Built on Patroni. Best reliability test results in independent benchmarks.
- pgBackRest for backup (gold standard for large databases -- block-level incremental, parallel backup/restore).
- More complex initial setup than CNPG. Kustomize-first approach.
- Strong choice if you need pgBackRest or already have Patroni expertise.
Percona Operator -- 4th Place
- ~72 GitHub stars. Built on top of CrunchyData PGO.
- Smallest community -- significant risk for small teams needing community support.
- PMM integration for monitoring (can be heavy).
- Not recommended unless already invested in the Percona ecosystem.
Operator Comparison Table
| Feature | CloudNativePG | Zalando | CrunchyData PGO | Percona |
|---|---|---|---|---|
| GitHub Stars | ~7,700 | ~4,100 | ~3,900 | ~72 |
| CNCF Status | Sandbox (applying for Incubation) | None | None | None |
| HA Foundation | K8s-native | Patroni | Patroni | Patroni (via PGO) |
| Backup Tool | Barman | WAL-G | pgBackRest | pgBackRest |
| PgBouncer | Yes (Pooler CRD) | Yes (sidecar) | Yes | Yes |
| Prometheus | Built-in exporter | Community add-on | Built-in | PMM / Prometheus |
| K3s/Hetzner Tested | Yes (production) | Yes | Yes | Yes (documented) |
| Multi-tenancy | Namespace RBAC | Team API (best) | Namespace RBAC | Namespace RBAC |
| Release Cadence | Very high | Slowing | Moderate | Moderate |
| Complexity | Low | Medium | Medium-High | Medium-High |
2. Multi-Tenancy Patterns
Pattern 1: One PG Instance Per Project/Namespace
Each project gets its own dedicated PostgreSQL cluster (primary + replicas).
Pros: Full isolation, independent scaling, independent backups/PITR, independent upgrades, strongest security boundary, no noisy neighbors.
Cons: ~2GB memory per project minimum (primary + replica). 10 projects = ~20-40GB memory. More clusters to monitor, more backup schedules, storage fragmentation.
When to use: Strict compliance/security requirements, very different workload profiles, when you can afford the resource overhead.
Pattern 2: Shared HA Cluster with Multiple Databases
One HA PostgreSQL cluster shared across all projects with database-level isolation.
Pros: Resource efficient (4-8GB for 10-15 databases vs 20-40GB). One cluster to monitor/backup/upgrade. Simpler networking. PgBouncer routes per-database.
Cons: Full blast radius (cluster down = ALL projects down). Noisy neighbor risk. PITR is all-or-nothing (cannot restore one database independently). Shared upgrade cycle. Security relies on PostgreSQL RBAC, not network isolation.
When to use: Resource-constrained environments, small team, similar workload profiles, non-critical applications.
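In the shared-cluster pattern, database-level isolation can be managed declaratively. Recent CNPG releases (1.25+) add a `Database` CRD for this; the sketch below assumes that version and uses placeholder names (`shared-pg`, `projecta`):

```yaml
# Hedged sketch: declarative per-project database on a shared cluster.
# Requires CNPG >= 1.25; all names are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Database
metadata:
  name: projecta-db
  namespace: databases
spec:
  cluster:
    name: shared-pg   # the shared HA cluster
  name: projecta      # database created inside it
  owner: projecta     # owning role (declare roles via the Cluster spec)
```

Each project then ships its own `Database` manifest in its GitOps repo, while the cluster itself stays under platform-team control.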
Pattern 3: Hybrid Approach -- RECOMMENDED
Best of both worlds. Critical apps get dedicated PG instances; less critical apps share a common cluster.
Tier 1 (Dedicated): Production-critical, high-traffic, or compliance-sensitive projects.
Tier 2 (Shared): Internal tools, dev environments, low-traffic microservices.
A project should get a dedicated instance when:
- It handles PII, payment data, or has compliance requirements
- High write throughput or large dataset (>50GB)
- Higher SLA than other projects
- Needs independent scaling or upgrade schedule
A project can use the shared cluster when:
- Internal tool or low-traffic service
- Non-sensitive data
- Occasional latency spikes are acceptable
- Small dataset (<5GB)
Resource Comparison (10 projects)
| Resource | 10 Dedicated | 1 Shared | Hybrid (2+1) |
|---|---|---|---|
| Memory | 20-40 GB | 4-8 GB | 8-16 GB |
| CPU | 5-10 cores | 2-4 cores | 3-6 cores |
| PVCs | 20-30 | 3 | 9 |
| PgBouncer Instances | 10 | 1 | 3 |
| Backup Schedules | 10 | 1 | 3 |
3. Self-Hosted Alternatives Assessment
| Solution | Production Ready | K3s/Hetzner | Complexity | Verdict |
|---|---|---|---|---|
| Neon (self-hosted) | No | Poor (needs NVMe + S3) | Very High | NOT RECOMMENDED |
| Supabase (self-hosted) | Partial | Yes | High | NOT RECOMMENDED (overkill) |
| Bitnami Helm Charts | No (deprecated) | Yes | Low | NOT RECOMMENDED |
| StackGres | Yes | Yes | Medium | CONDITIONALLY RECOMMENDED |
| Tembo | Early GA | Yes (untested) | Medium | WATCH |
- Neon: Serverless PG features (scale-to-zero, branching) unavailable in self-hosted mode. Operationally demanding. Not production-ready.
- Supabase: Full platform (auth, APIs, realtime) is overkill if you just need PostgreSQL. Community-supported only, no official K8s support.
- Bitnami: Being deprecated. No automated failover, no backup management, no PITR out of the box.
- StackGres: Solid choice if you want batteries-included with a web console. Patroni HA + WAL-G backups + PgBouncer + Prometheus. Heavier pod footprint than CNPG.
- Tembo: Interesting Rust-based operator with 200+ extensions and pre-built Stacks. Too young for critical production bet.
Key takeaway: None beat a well-configured CloudNativePG operator for our use case.
4. High Availability
Replication
- Async replication (recommended default): Primary doesn't wait for replicas. RPO = replication lag (1-5s). No write latency penalty.
- Sync replication: Primary waits for replica confirmation. RPO approaches zero. Adds 1-3ms latency within same DC. Use only for zero-RPO databases.
- Quorum-based sync: `ANY 1 (replica1, replica2)` provides synchronous durability without depending on a single replica.
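For databases that do need zero RPO, quorum commit can be declared directly on the cluster. The sketch below assumes the CNPG >= 1.24 `synchronous` API, which renders `synchronous_standby_names` as an `ANY <number> (...)` quorum over the cluster's standbys; the cluster name is a placeholder:

```yaml
# Hedged sketch: quorum-based synchronous replication (CNPG >= 1.24 API).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: zero-rpo-pg
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # quorum commit: ANY <number> of the standbys
      number: 1     # one standby must confirm each commit
```

This keeps the 1-3ms same-DC latency penalty confined to the clusters that actually require it.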
Automatic Failover
- CNPG uses K8s-native leader election (no Patroni/etcd needed)
- Self-healing: auto pod restart, replica promotion, rolling updates
- Split-brain prevention through K8s leader election primitives
Failover Timing
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Single replica failure | 0 (no impact) | 0 | Operator auto-recreates |
| Primary failure (same DC) | 30-60s | 0-5s (async) / 0 (sync) | Auto-failover |
| Full DC failure | 5-15 min | Replication lag | Promote cross-DC replica |
| Data corruption | 15-60 min | To point before corruption | PITR restore |
| Complete cluster loss | 1-4 hours | Last WAL archived | Restore from S3 backup |
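The two recovery rows above (PITR and full restore) both map to a CNPG bootstrap-from-backup manifest. A hedged sketch, with bucket path, endpoint, secret names, and the target timestamp all as placeholders:

```yaml
# Hedged sketch: PITR restore into a fresh cluster from object storage.
# All names, paths, and the timestamp are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: restored-pg
spec:
  instances: 3
  storage:
    size: 50Gi
  bootstrap:
    recovery:
      source: shared-pg
      recoveryTarget:
        targetTime: "2026-04-01 12:00:00+00"  # point just before the corruption
  externalClusters:
    - name: shared-pg
      barmanObjectStore:
        destinationPath: s3://pg-backups/shared-pg
        endpointURL: https://fsn1.your-objectstorage.com  # placeholder S3 endpoint
        s3Credentials:
          accessKeyId:
            name: backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: SECRET_ACCESS_KEY
```

Omitting `recoveryTarget` restores to the last archived WAL instead, which is the complete-cluster-loss case.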
5. Storage
Hetzner Storage Options
| Option | IOPS | Latency | Replication | Best For |
|---|---|---|---|---|
| Local NVMe (recommended) | 100K+ | Microseconds | None (use PG replication) | Primary DB, max performance |
| Longhorn | ~19K | Higher | Built-in 2-3x | Simpler ops |
| OpenEBS Mayastor | ~28K | NVMe-over-TCP | Configurable | High-perf with replication |
| Hetzner Volumes | ~15K | Milliseconds | Hetzner-managed | Avoid for primary PG |
Recommendations:
- Use local NVMe via LocalPV for primary PostgreSQL (within 5-10% of bare-metal performance)
- Use Longhorn if you prefer storage-level replication as a safety net (~30-40% IOPS cost)
- Use a separate WAL volume via the CNPG `walStorage` spec for parallel I/O
- StorageClass: `reclaimPolicy: Retain`, `allowVolumeExpansion: true`, `volumeBindingMode: WaitForFirstConsumer`
Minimum IOPS targets: 3,000+ random read, 1,000+ random write, <1ms p99 latency for 8K reads.
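The StorageClass settings above can be sketched as a single manifest. The provisioner shown (Rancher local-path) is an assumption about the LocalPV implementation; note also that volume expansion is only honored if the chosen provisioner supports it:

```yaml
# Sketch of the recommended StorageClass; the provisioner is an
# assumption and depends on the LocalPV implementation chosen.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pg-local-nvme
provisioner: rancher.io/local-path
reclaimPolicy: Retain                     # keep the data if the PVC is deleted
allowVolumeExpansion: true                # honored only if the provisioner supports it
volumeBindingMode: WaitForFirstConsumer   # bind on the node that schedules the pod
```

`WaitForFirstConsumer` matters for local volumes: it defers binding until the pod is scheduled, so the PV and the pod land on the same node.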
6. Performance Tuning
PostgreSQL Configuration
| Parameter | Formula | Example (8GB RAM, 4 CPU) |
|---|---|---|
| shared_buffers | 25% of RAM | 2GB |
| effective_cache_size | 50-75% of RAM | 6GB |
| work_mem | RAM / (max_connections * 4) | 16MB |
| maintenance_work_mem | 5-10% of RAM | 512MB |
| max_connections | Low (use PgBouncer) | 100-200 |
| random_page_cost | 1.1 for NVMe/SSD | 1.1 |
| effective_io_concurrency | 200 for NVMe/SSD | 200 |
| max_wal_size | 2-4GB for write-heavy | 4GB |
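The memory formulas in the table can be expressed as a small helper. This is an illustrative sketch, not part of any tool: the function name is ours, and the `maintenance_work_mem` fraction (~6% of RAM) was chosen to match the table's 512MB example for 8GB:

```python
def pg_memory_settings(ram_mb: int, max_connections: int) -> dict:
    """Apply the memory sizing formulas from the table above (values in MB).

    Illustrative helper; the maintenance_work_mem fraction (~6%) is an
    assumption chosen to reproduce the 512MB example for 8GB RAM.
    """
    return {
        "shared_buffers_mb": ram_mb // 4,                    # 25% of RAM
        "effective_cache_size_mb": ram_mb * 3 // 4,          # upper bound: 75% of RAM
        "work_mem_mb": ram_mb // (max_connections * 4),      # RAM / (max_connections * 4)
        "maintenance_work_mem_mb": min(ram_mb * 6 // 100, 2048),  # ~6% of RAM, capped at 2GB
    }

# Example sizing for 8GB RAM and 128 connections
print(pg_memory_settings(8192, 128))
```

For 8192MB and 128 connections this reproduces the example column: 2048MB shared_buffers, 6144MB effective_cache_size, 16MB work_mem.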
Critical Notes
K8s defaults `/dev/shm` to 64MB. If `shared_buffers` exceeds this, PostgreSQL will fail to start. Most operators handle this automatically -- verify yours does.

CPU pinning is the single biggest tuning lever. Use Guaranteed QoS (requests == limits) with the CPU Manager static policy. Benchmarks show +22% average read/write TPS and -76% write latency with NUMA affinity.
PgBouncer Sizing
| Setting | Value | Rationale |
|---|---|---|
| pool_mode | transaction | Stateless apps (most common) |
| default_pool_size | 20-30 | Per user/database pair |
| max_client_conn | 1000-5000 | PgBouncer connections are lightweight |
| max_db_connections | 100 | Should be < max_connections |
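These settings map directly onto the CNPG `Pooler` CRD. A hedged sketch with the table's values; the pooler and cluster names are placeholders:

```yaml
# Hedged sketch of a CNPG Pooler with the sizing from the table above.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: shared-pg-pooler-rw
spec:
  cluster:
    name: shared-pg
  instances: 2          # scale PgBouncer independently of PostgreSQL
  type: rw              # route to the primary
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
      max_client_conn: "2000"
      max_db_connections: "100"
```

Applications then point at the pooler's Service instead of the cluster's `-rw` Service.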
K8s vs Bare-Metal Performance
- Local NVMe + CPU pinning + huge pages: Within 5-10% of bare-metal
- Local NVMe, no CPU pinning: Within 15-20%
- Longhorn/network storage: 30-50% slower
7. Backup & Recovery
Strategy
- Full backup: Weekly (Sunday)
- Incremental backup: Daily
- WAL archiving: Continuous (every completed 16MB WAL segment)
- Retention: 30 days of backups, 7 days of WAL
- Target: Hetzner Object Storage (S3-compatible) or MinIO
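The strategy above translates into a `backup` section on the Cluster plus a `ScheduledBackup`. A hedged sketch; bucket path, endpoint, and secret names are placeholders, and note that CNPG's schedule uses a six-field cron (seconds first):

```yaml
# Hedged sketch: continuous WAL archiving + weekly base backup to S3.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: shared-pg
spec:
  instances: 3
  storage:
    size: 50Gi
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://pg-backups/shared-pg
      endpointURL: https://fsn1.your-objectstorage.com  # placeholder S3 endpoint
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip   # continuous WAL archiving, compressed
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: shared-pg-weekly
spec:
  cluster:
    name: shared-pg
  schedule: "0 0 3 * * 0"   # six-field cron: 03:00 every Sunday
```

WAL archiving starts as soon as the `backup` section is present; the `ScheduledBackup` only governs base backups.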
Backup Verification
Backups don't protect your business -- proven restores do. Schedule weekly automated restore tests to a temporary cluster. Monitor backup age (alert if >25 hours), size trends, and WAL archiving lag.
DR Testing Calendar
- Weekly: Backup restore verification (automated)
- Quarterly: Simulated primary failure + failover drill
- Semi-annually: Full DR exercise (restore from backup to fresh cluster)
8. Monitoring & Alerting
Key Metrics to Monitor
- Health: `pg_up`, postmaster uptime
- Connections: Active count by state, utilization vs max_connections (alert > 80%)
- Performance: Cache hit ratio (should be > 99%), TPS, deadlocks
- Replication: Replay lag in seconds (alert > 30s warning, > 300s critical)
- Storage: Database size growth, disk usage (alert > 85%)
- Backups: Last backup age (alert > 25 hours), WAL archiving failures
Minimum Alert Rules
- PostgreSQL down (critical)
- Connection utilization > 80% (warning)
- Replication lag > 30s / > 300s (warning / critical)
- Cache hit ratio < 99% (warning)
- Backup age > 25 hours (critical)
- Disk usage > 85% (warning)
- WAL archiving failures (warning)
- Deadlocks detected (warning)
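Two of the rules above, sketched as a Prometheus Operator `PrometheusRule`. The metric names assume CNPG exporter and kubelet defaults and may need adjusting to the deployed stack; the PVC label matcher is a placeholder:

```yaml
# Hedged sketch: replication-lag and disk-usage alerts.
# Metric names assume CNPG exporter / kubelet defaults.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-alerts
spec:
  groups:
    - name: postgresql
      rules:
        - alert: PostgresReplicationLagHigh
          expr: cnpg_pg_replication_lag > 30   # seconds behind primary
          for: 5m
          labels:
            severity: warning
        - alert: PostgresDiskNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"shared-pg.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"shared-pg.*"} > 0.85
          for: 10m
          labels:
            severity: warning
```

The remaining rules (backup age, cache hit ratio, deadlocks) follow the same shape with the corresponding exporter metrics.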
Grafana Dashboards
Recommended community dashboards: IDs 9628 (PostgreSQL Database) and 14114 (PostgreSQL Overview).
9. Final Architecture
| Layer | Choice | Rationale |
|---|---|---|
| Operator | CloudNativePG | CNCF, K3s-proven, lightest, most active |
| Multi-tenancy | Hybrid (start shared) | Resource efficient, clear upgrade path |
| Storage | Local NVMe via LocalPV | Best IOPS, within 5-10% of bare-metal |
| WAL Volume | Separate (CNPG walStorage) | Parallel I/O, disk-full protection |
| Replication | Async (default) | Sync only for zero-RPO databases |
| Connection Pooling | PgBouncer (CNPG Pooler CRD) | Transaction mode, 20-30 pool size |
| Backup Target | Hetzner Object Storage (S3) | Weekly full + daily incr + continuous WAL |
| Retention | 30 days | With weekly automated restore verification |
| Monitoring | Built-in CNPG Prometheus + Grafana | Connections, TPS, replication, cache, disk |
10. Sources
Operators
- CloudNativePG -- cloudnative-pg.io
- Zalando Postgres Operator -- github.com/zalando/postgres-operator
- CrunchyData PGO -- github.com/CrunchyData/postgres-operator
- Percona Operator -- github.com/percona/percona-postgresql-operator
- Brella Case Study (CNPG on Hetzner K3s)
- Palark Operator Comparison -- blog.palark.com
- simplyblock Operator Comparison -- simplyblock.io/blog
Alternatives
- Neon Operator by Molnett -- molnett.com/blog
- Supabase Self-Hosting -- supabase.com/docs/guides/self-hosting
- StackGres -- stackgres.io
- Tembo -- tembo.io
Multi-Tenancy
- CloudNativePG Architecture Docs -- cloudnative-pg.io/documentation
- CNPG Discussion #497 -- Multiple Databases
- CNPG Discussion #2357 -- Multi-tenant Architecture
- Neon Noisy Neighbor Blog -- neon.com/blog
Infrastructure
- PostgreSQL Tuning for Kubernetes best practices
- PgBouncer Multi-Tenant at Scale -- dzone.com
- Zalando Engineering -- PgBouncer on Kubernetes
Research conducted April 2026 by a Claude Code agent team: K8s Operator Specialist, Database Architect, Solutions Architect, and Infrastructure Engineer.