High availability
A highly available Wardengate deployment is three control plane replicas across failure domains, a Postgres with synchronous replication, a Redis or Valkey with a failover story, and a load balancer in front. There is nothing unusual in that list — the platform is deliberately shared-nothing so standard patterns apply.
Reference architecture
The diagram below shows three availability zones, each running one control plane replica, with shared Postgres and Redis that live either in managed services or in HA clusters of their own.
                +-----------------+
                |  Load balancer  |  (L4, preserves client IP)
                +--------+--------+
                         |
         +---------------+---------------+
         |               |               |
    +----v----+     +----v----+     +----v----+
    |  CP #1  |     |  CP #2  |     |  CP #3  |  (AZ-a, b, c)
    +----+----+     +----+----+     +----+----+
         |               |               |
         +-------+-------+-------+-------+
                 |                       |
           +-----v-----+           +-----v-----+
           | Postgres  |   async   |  Valkey   |  (Sentinel /
           |  primary  +---------->+  primary  |   cluster)
           +-----+-----+           +-----+-----+
                 |                       |
            +----v----+             +----v----+
            | replica |             | replica |
            +---------+             +---------+

Gateways are first-class citizens of the control plane binary — the same image runs both. In larger deployments you can dedicate node pools to gateways and to the control plane, but for most topologies the same replica set handles both roles.
Stateless control plane
The control plane holds no durable state on local disk beyond TLS material and log scratch. Durable state lives in Postgres, ephemeral session state lives in Redis, and recordings drain to object storage. A replica can be lost and recycled without touching the others — there is no leader election for most operations.
A small set of background jobs (periodic cleanup, recording manifest signing, notification fan-out) use a Redis-based lease to elect a temporary leader. Leases expire quickly, so a pod crash is visible in metrics but not in user-facing behaviour.
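The lease pattern described above can be sketched as follows. This is a minimal illustration, not Wardengate's implementation: `LeaseStore`, the key name `job:cleanup`, and the replica names are hypothetical, and the in-memory dictionary stands in for Redis, where the same effect is commonly achieved with `SET key value NX PX <ttl>`.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LeaseStore:
    """In-memory stand-in for a Redis-style lease with a time-to-live."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}  # key -> (holder, expiry)

    def holder(self, key: str) -> Optional[str]:
        """Return the current leaseholder, or None if the lease is free or expired."""
        entry = self._leases.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def acquire(self, key: str, replica: str, ttl: float) -> bool:
        """Take the lease if it is free, or renew it if `replica` already holds it."""
        current = self.holder(key)
        if current is not None and current != replica:
            return False  # another replica is the temporary leader
        self._leases[key] = (replica, self._clock() + ttl)
        return True


# Usage with a fake clock: cp-1 leads until its lease lapses, then cp-2 takes over.
now = [0.0]
store = LeaseStore(clock=lambda: now[0])
store.acquire("job:cleanup", "cp-1", ttl=5.0)   # cp-1 becomes leader
store.acquire("job:cleanup", "cp-2", ttl=5.0)   # refused while cp-1's lease is live
now[0] = 6.0                                    # cp-1 crashed and never renewed
store.acquire("job:cleanup", "cp-2", ttl=5.0)   # cp-2 picks the job up
```

A short TTL is what keeps a crashed leader from blocking the job for long: the next replica to ask after expiry simply wins the lease.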
Sticky sessions are not required
User-facing session state (live SSH channels, RDP streams, approval requests in flight) lives in Redis, keyed by session ID and signed with the control plane’s internal key. Any replica can serve any request. Disable session affinity on your load balancer — it just adds skew under pod churn without buying anything.
The one exception is a long-lived streaming request (web-based terminal, session playback). The LB must not cut the TCP connection mid-stream, so keep read timeouts generous — 30 minutes is a reasonable starting point.
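To illustrate why any replica can serve any request, here is a rough sketch of session state that is keyed by session ID and signed with a shared key. The record layout, the HMAC-SHA256 choice, and the `seal`/`unseal` names are assumptions made for the example; the actual wire format is not documented here.

```python
import hashlib
import hmac
import json


def seal(key: bytes, session_id: str, state: dict) -> str:
    """Serialize session state and prefix it with an HMAC tag."""
    payload = json.dumps({"sid": session_id, "state": state},
                         sort_keys=True, separators=(",", ":"))
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{tag}.{payload}"


def unseal(key: bytes, session_id: str, blob: str) -> dict:
    """Verify the tag and the session ID, then return the stored state."""
    tag, _, payload = blob.partition(".")  # the hex tag never contains a dot
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch")
    record = json.loads(payload)
    if record["sid"] != session_id:
        raise ValueError("record stored under the wrong session ID")
    return record["state"]


# One replica writes the record; any other replica holding the same key can read it.
shared_key = b"example-internal-key"
blob = seal(shared_key, "sess-123", {"user": "alice", "channel": "ssh"})
state = unseal(shared_key, "sess-123", blob)
```

Because verification needs only the shared key and the stored record, no replica has to remember which sessions it started, which is what makes load balancer affinity unnecessary.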
Postgres HA
Postgres is the hard part of the dependency chain — the platform cannot accept new sessions when the database is unavailable. Pick the HA pattern your team already runs. Anything that presents a stable endpoint to Wardengate works.
- Managed (recommended) — RDS Multi-AZ, Cloud SQL HA, Azure Flexible Server zone-redundant. Failover is the provider’s problem.
- Patroni + etcd/Consul — streaming replication with automated failover. Point Wardengate at the Patroni-managed VIP or HAProxy.
- CrunchyData Postgres Operator — Kubernetes-native, pgBackRest built in, replicas managed as CR objects.
- Streaming replication with a hot standby — simplest but requires manual failover; acceptable for small teams with strong runbooks.
Regardless of flavour: synchronous replication is strongly recommended, connection pooling (PgBouncer, pgcat, RDS Proxy) is strongly recommended for anything beyond the smallest tier, and short statement_timeout limits keep runaway queries from stalling the fleet.
Redis / Valkey HA
Wardengate treats Redis as a cache with durable semantics for session state. If it goes away, active sessions drop and new ones refuse to start, but no session history is lost. Your two sensible options:
- Sentinel — primary plus at least two replicas, plus three Sentinels across failure domains. Wardengate understands Sentinel URLs natively.
- Cluster — for large deployments where you want to shard state across nodes. Use a cluster-aware URL.
# Sentinel
WG_REDIS_URL=sentinel://:password@s1.example:26379,s2.example:26379,s3.example:26379/mymaster
# Cluster
WG_REDIS_URL=rediss://cluster.internal:6379?cluster=true
Load balancer health checks
Send health checks to /healthz on port 8443, not /readyz. /readyz reports downstream dependencies and will flap a replica out of rotation on a brief Postgres hiccup, which tends to make the outage worse rather than shorter.
- Interval: 5 seconds
- Unhealthy threshold: 3 consecutive failures
- Healthy threshold: 2 consecutive successes
- Timeout: 2 seconds
- Path: GET /healthz
- Protocol: HTTPS (skip CA verification if you use an internal CA)
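The thresholds above interact as a small state machine: a replica leaves rotation only after 3 consecutive failed probes and returns only after 2 consecutive successes. The sketch below models that behaviour in general terms; `HealthTracker` is a hypothetical name, and real load balancers may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one backend using consecutive-result thresholds."""

    unhealthy_threshold: int = 3
    healthy_threshold: int = 2
    healthy: bool = True
    _streak: int = 0  # consecutive probe results that disagree with the current state

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return whether the backend is in rotation."""
        if ok == self.healthy:
            self._streak = 0  # the result confirms the current state
        else:
            self._streak += 1
            limit = self.healthy_threshold if ok else self.unhealthy_threshold
            if self._streak >= limit:
                self.healthy = ok
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
probes = [False, False, True, False, False, False, True, True]
rotation = [tracker.observe(p) for p in probes]
# rotation == [True, True, True, True, True, False, False, True]
```

With a 5 second interval, three consecutive failures mean a dead replica is removed after roughly 15 seconds, while a single slow probe (the two isolated failures at the start of the sequence) does not eject anything.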
Graceful drain
Before a replica is terminated, mark it for drain so in-flight sessions finish and new sessions go elsewhere. Kubernetes triggers this automatically via preStop; on bare metal call it explicitly.
wgctl system drain --timeout 10m
# returns 204 once all owned sessions have ended
# then systemd stop / kubectl delete pod is safe
drain flips the replica’s readiness bit, sets an eviction window on live sessions (they keep running, new ones route elsewhere), and returns when either everything has drained or the timeout elapses. The chart’s default terminationGracePeriodSeconds is 300s; increase it if your policies allow longer sessions.
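The drain sequence described above (stop advertising readiness, then wait for owned sessions to end or for the timeout) can be sketched generically. This is not the wgctl implementation; `drain`, `active_sessions`, and `set_ready` are hypothetical names, and the clock and sleep functions are injectable only so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def drain(active_sessions: Callable[[], int],
          set_ready: Callable[[bool], None],
          timeout: float,
          poll: float = 1.0,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if all sessions ended, False if the timeout elapsed first."""
    set_ready(False)  # new sessions now route to other replicas
    deadline = clock() + timeout
    while active_sessions() > 0:
        if clock() >= deadline:
            return False  # caller decides whether to force termination
        sleep(poll)
    return True


# Simulated run: three sessions, one of which ends during each poll interval.
now = [0.0]
remaining = [3]
ready_calls = []


def fake_sleep(seconds: float) -> None:
    now[0] += seconds
    remaining[0] = max(0, remaining[0] - 1)


drained = drain(lambda: remaining[0], ready_calls.append, timeout=10.0,
                clock=lambda: now[0], sleep=fake_sleep)
```

Whatever timeout is passed here has to fit inside the pod's termination grace period; otherwise the pod is killed before the drain returns.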
Split-brain considerations
Wardengate does not invent its own consensus. All consistency guarantees come from Postgres — two replicas cannot both commit conflicting writes because Postgres is the serialisation point. In a genuine split where one zone is isolated from the others, the isolated replica:
- Loses its database connection within the statement timeout
- Fails its liveness probe within a health check cycle or two
- Gets pulled from the LB
- Refuses new sessions and disconnects existing ones with a recoverable error the client retries against a different replica
Plan for the dependency’s split-brain story — Patroni fencing, Sentinel quorum, managed service failover. Wardengate inherits whatever guarantees they offer.
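As a worked example of the quorum arithmetic behind the Sentinel case: Redis Sentinel needs a majority of all Sentinel processes to authorize a failover, so with three Sentinels spread across three zones an isolated zone can never promote its own primary. The helper names below are illustrative only.

```python
def majority(total: int) -> int:
    """Smallest number of voters that is more than half of `total`."""
    return total // 2 + 1


def side_can_fail_over(reachable: int, total: int) -> bool:
    """True if one side of a partition still sees a majority of voters."""
    return reachable >= majority(total)


# Three Sentinels, one per zone; one zone is cut off from the other two.
isolated_zone = side_can_fail_over(reachable=1, total=3)   # False
remaining_zones = side_can_fail_over(reachable=2, total=3)  # True
```

The same arithmetic explains why two voters are no better than one: `majority(2)` is 2, so losing either one blocks failover entirely.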
Observability for HA
- wg_replica_up — one per control plane replica; page on < N-1 healthy
- wg_db_roundtrip_seconds — Postgres latency from each replica; look for divergence
- wg_redis_failover_total — counter that increments on each Redis topology change
- wg_sessions_active — per-replica session count; sudden zeros after a rollout usually mean sticky sessions got re-enabled somewhere
- wg_leader_leases_gauge — background job leaseholders; should tick across replicas over time
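Two of these signals reduce to simple checks, sketched below. The function names, the use of the median as a baseline, and the factor of 3 are assumptions chosen for illustration; in practice these rules would live in your alerting system rather than in application code.

```python
from statistics import median
from typing import Dict, List


def should_page(healthy: int, expected: int) -> bool:
    """Page when fewer than N-1 replicas report up (one loss is tolerated)."""
    return healthy < expected - 1


def divergent_replicas(latency: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Replicas whose Postgres round trip is far above the fleet median."""
    baseline = median(latency.values())
    return sorted(name for name, value in latency.items()
                  if value > factor * baseline)


# Sample values in seconds, as wg_db_roundtrip_seconds might report per replica.
sample = {"cp-1": 0.004, "cp-2": 0.005, "cp-3": 0.050}
outliers = divergent_replicas(sample)      # ["cp-3"]
page = should_page(healthy=1, expected=3)  # True
```

A single replica with much higher database latency than its peers usually points at a zone-level network problem rather than at Postgres itself.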
Multi-region considerations
A single control plane region plus gateway fleets in other regions is supported out of the box — gateways dial the control plane over the agent gRPC port and stay up even if the control plane is across the world. A fully active-active multi-region control plane is not supported; Postgres is the bottleneck and cross-region synchronous replication is a trade-off most teams do not actually want. For cross-region DR, keep an asynchronous read replica and follow the restore procedure in Backup & restore.
Related
- Kubernetes & Helm — the underlying chart
- Backup & restore — DR for the stateful tier
- Upgrading — rolling upgrades across an HA deployment