High availability

A highly available Wardengate deployment is three control plane replicas across failure domains, a Postgres with synchronous replication, a Redis or Valkey with a failover story, and a load balancer in front. There is nothing unusual in that list — the platform is deliberately shared-nothing so standard patterns apply.

Reference architecture

The diagram below shows three availability zones, each running one control plane replica, with shared Postgres and Redis that live either in managed services or in HA clusters of their own.

               +-----------------+
               |   Load balancer |   (L4, preserves client IP)
               +--------+--------+
                        |
        +---------------+---------------+
        |               |               |
   +----v----+     +----v----+     +----v----+
   |  CP #1  |     |  CP #2  |     |  CP #3  |   (AZ-a, b, c)
   +----+----+     +----+----+     +----+----+
        |               |               |
        +-------+-------+-------+-------+
                |                       |
          +-----v-----+           +-----v-----+
          | Postgres  |  async    |  Valkey   |  (Sentinel /
          | primary   +---------->+  primary  |   cluster)
          +-----+-----+           +-----+-----+
                |                       |
           +----v----+             +----v----+
           | replica |             | replica |
           +---------+             +---------+

Gateways are first-class citizens of the control plane binary — the same image runs both. In larger deployments you can dedicate node pools to gateways and to the control plane, but for most topologies the same replica set handles both roles.

Stateless control plane

The control plane holds no durable state on local disk beyond TLS material and log scratch. Durable state lives in Postgres, ephemeral session state lives in Redis, and recordings drain to object storage. A replica can be lost and recycled without touching the others — there is no leader election for most operations.

A small set of background jobs (periodic cleanup, recording manifest signing, notification fan-out) use a Redis-based lease to elect a temporary leader. Leases expire quickly, so a pod crash is visible in metrics but not in user-facing behaviour.
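The lease pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not Wardengate's implementation: `FakeLeaseStore` is an in-memory stand-in for the behaviour of Redis `SET key value NX PX <ttl>`, and the key name `jobs:cleanup`, the replica names, and the 10-second TTL are all hypothetical.

```python
class ManualClock:
    """Test clock so expiry can be shown without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class FakeLeaseStore:
    """In-memory stand-in for a Redis key written with NX and a TTL."""

    def __init__(self, clock):
        self._clock = clock
        self._leases = {}  # key -> (holder, expires_at)

    def acquire(self, key, holder, ttl_seconds):
        """Take or renew the lease; fail if another holder's lease is still live."""
        now = self._clock()
        current = self._leases.get(key)
        if current is not None and current[1] > now and current[0] != holder:
            return False
        self._leases[key] = (holder, now + ttl_seconds)
        return True

    def holder(self, key):
        """Return the current leaseholder, or None if the lease has expired."""
        current = self._leases.get(key)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]


clock = ManualClock()
store = FakeLeaseStore(clock)

a_leads = store.acquire("jobs:cleanup", "replica-a", ttl_seconds=10)
b_blocked = store.acquire("jobs:cleanup", "replica-b", ttl_seconds=10)

# replica-a crashes and stops renewing; the lease lapses on its own.
clock.advance(11)
b_takes_over = store.acquire("jobs:cleanup", "replica-b", ttl_seconds=10)
```

The short TTL is what bounds the gap after a crash: no replica has to detect the failure, the lease simply runs out and the next caller wins it. Against a real Redis, renewal by the current holder would need a compare-and-set step rather than a plain `SET NX`.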

Sticky sessions are not required

User-facing session state (live SSH channels, RDP streams, approval requests in flight) lives in Redis, keyed by session ID and signed with the control plane’s internal key. Any replica can serve any request. Disable session affinity on your load balancer — it just adds skew under pod churn without buying anything.
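To make the "any replica can serve any request" property concrete, here is a minimal sketch of signed, shared session state. It assumes an HMAC-SHA256 tag and a JSON payload; the actual signature scheme and record layout are not specified here, and `INTERNAL_KEY`, `sign_state`, `verify_state`, and the plain dict standing in for Redis are all hypothetical.

```python
import hashlib
import hmac
import json

INTERNAL_KEY = b"shared-control-plane-key"  # hypothetical; every replica holds the same key


def sign_state(key, session_id, state):
    """Serialise the session state and prepend an HMAC tag."""
    payload = json.dumps(
        {"sid": session_id, "state": state}, sort_keys=True, separators=(",", ":")
    )
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return tag + "." + payload


def verify_state(key, blob):
    """Check the tag in constant time, then decode the payload."""
    tag, _, payload = blob.partition(".")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("session state failed signature check")
    return json.loads(payload)


shared_store = {}  # stand-in for Redis, keyed by session ID

# Replica A writes the state...
shared_store["sess-123"] = sign_state(INTERNAL_KEY, "sess-123", {"channel": "ssh"})

# ...and replica B, holding the same key, can read and trust it.
record = verify_state(INTERNAL_KEY, shared_store["sess-123"])
```

Because nothing in the record is tied to the replica that wrote it, the load balancer is free to send the next request anywhere.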

The one exception is a long-lived streaming request (web-based terminal, session playback). The LB must not cut the TCP connection mid-stream, so keep read timeouts generous — 30 minutes is a reasonable starting point.

Postgres HA

Postgres is the hard part of the dependency chain — the platform cannot accept new sessions when the database is unavailable. Pick the HA pattern your team already runs. Anything that presents a stable endpoint to Wardengate works.

  • Managed (recommended) — RDS Multi-AZ, Cloud SQL HA, Azure Flexible Server zone-redundant. Failover is the provider’s problem.
  • Patroni + etcd/Consul — streaming replication with automated failover. Point Wardengate at the Patroni-managed VIP or HAProxy.
  • CrunchyData Postgres Operator — Kubernetes-native, pgBackRest built in, replicas managed as CR objects.
  • Streaming replication with a hot standby — simplest but requires manual failover; acceptable for small teams with strong runbooks.

Regardless of flavour, synchronous replication is strongly recommended, as is connection pooling (PgBouncer, pgcat, RDS Proxy) for anything beyond the smallest tier. A short statement_timeout keeps runaway queries from stalling the fleet.

Redis / Valkey HA

Wardengate treats Redis as a cache with durable semantics for session state. If Redis goes away, active sessions drop and new ones refuse to start, but no session history is lost. There are two sensible options:

  • Sentinel — primary plus at least two replicas, plus three Sentinels across failure domains. Wardengate understands Sentinel URLs natively.
  • Cluster — for large deployments where you want to shard state across nodes. Use a cluster-aware URL.

# Sentinel
WG_REDIS_URL=sentinel://:password@s1.example:26379,s2.example:26379,s3.example:26379/mymaster

# Cluster
WG_REDIS_URL=rediss://cluster.internal:6379?cluster=true
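The Sentinel URL above packs three things into one string: an optional password, a comma-separated list of Sentinel endpoints, and the monitored master name as the path. The sketch below pulls those parts out so the value can be sanity-checked before deployment. It is not Wardengate's own parser; `parse_sentinel_url` and the fallback to port 26379 are assumptions made for illustration.

```python
DEFAULT_SENTINEL_PORT = 26379  # assumed fallback when an endpoint omits the port


def parse_sentinel_url(url):
    """Split a sentinel:// URL into password, endpoints, and master name."""
    scheme, sep, rest = url.partition("://")
    if scheme != "sentinel" or not sep:
        raise ValueError("expected a sentinel:// URL")

    authority, _, master = rest.partition("/")
    userinfo, at, host_list = authority.rpartition("@")
    # userinfo looks like ":password" (empty username) when present.
    password = userinfo.partition(":")[2] if at else ""

    sentinels = []
    for item in host_list.split(","):
        if ":" in item:
            host, _, port = item.rpartition(":")
            sentinels.append((host, int(port)))
        else:
            sentinels.append((item, DEFAULT_SENTINEL_PORT))

    return {"password": password or None, "sentinels": sentinels, "master": master}


parsed = parse_sentinel_url(
    "sentinel://:password@s1.example:26379,s2.example:26379,s3.example:26379/mymaster"
)
```

A quick check worth automating is that the parsed list has at least three endpoints, matching the three Sentinels recommended above.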

Load balancer health checks

Send health checks to /healthz on 8443 — not /readyz. /readyz reports downstream dependencies and will flap a replica out of rotation on a brief Postgres hiccup, which tends to make the outage worse rather than shorter.

  • Interval: 5 seconds
  • Unhealthy threshold: 3 consecutive failures
  • Healthy threshold: 2 consecutive successes
  • Timeout: 2 seconds
  • Path: GET /healthz
  • Protocol: HTTPS (skip CA verification if you use an internal CA)
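These settings bound how long a dead replica can stay in rotation. The arithmetic below is a rough sketch, assuming the load balancer starts a probe every interval regardless of how long the previous one took; real load balancers differ in the details, and the function names are hypothetical.

```python
INTERVAL_S = 5
UNHEALTHY_THRESHOLD = 3
HEALTHY_THRESHOLD = 2
TIMEOUT_S = 2


def worst_case_detection(interval, threshold, timeout):
    """Replica dies just after a passing probe and every failing probe hangs until timeout."""
    return threshold * interval + timeout


def best_case_detection(interval, threshold):
    """Replica dies just before a probe and each probe fails immediately."""
    return (threshold - 1) * interval


def worst_case_recovery(interval, threshold):
    """Replica recovers just after a probe; it needs `threshold` passing probes to rejoin."""
    return threshold * interval


detect_max = worst_case_detection(INTERVAL_S, UNHEALTHY_THRESHOLD, TIMEOUT_S)
detect_min = best_case_detection(INTERVAL_S, UNHEALTHY_THRESHOLD)
rejoin_max = worst_case_recovery(INTERVAL_S, HEALTHY_THRESHOLD)
```

Under these assumptions a failed replica leaves rotation after roughly 10 to 17 seconds and rejoins within about 10 seconds of recovering. Tightening the interval shortens both, at the cost of more probe traffic and more sensitivity to brief stalls.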

Graceful drain

Before a replica is terminated, mark it for drain so in-flight sessions finish and new sessions go elsewhere. Kubernetes triggers this automatically via preStop; on bare metal call it explicitly.

wgctl system drain --timeout 10m
# returns 204 once all owned sessions have ended
# then systemd stop / kubectl delete pod is safe

drain flips the replica’s readiness bit, sets an eviction window on live sessions (they keep running, new ones route elsewhere), and returns when either everything has drained or the timeout elapses. The chart’s default terminationGracePeriodSeconds is 300s; increase it if your policies allow longer sessions.
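On Kubernetes the drain timeout and the grace period interact: the grace period countdown includes the time spent in the preStop hook, so a drain that runs longer than terminationGracePeriodSeconds is cut short when the pod is killed. The check below is a small sketch for catching that mismatch. `to_seconds`, `drain_fits`, and the 15-second shutdown margin are hypothetical, and it assumes the preStop hook runs drain with the timeout shown in the example above.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration):
    """Convert a simple duration such as '90s', '10m', or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def drain_fits(drain_timeout, grace_period_seconds, margin_seconds=15):
    """True if the drain, plus a margin for process shutdown, fits in the grace period."""
    return to_seconds(drain_timeout) + margin_seconds <= grace_period_seconds


fits_default = drain_fits("10m", grace_period_seconds=300)
fits_raised = drain_fits("10m", grace_period_seconds=660)
```

With the values on this page, a 10m drain does not fit inside the 300s default, so either the grace period goes up or the drain timeout comes down.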

Split-brain considerations

Wardengate does not invent its own consensus. All consistency guarantees come from Postgres — two replicas cannot both commit conflicting writes because Postgres is the serialisation point. In a genuine split where one zone is isolated from the others, the isolated replica:

  • Loses its database connection within the statement timeout
  • Fails its liveness probe within a health check cycle or two
  • Gets pulled from the LB
  • Refuses new sessions and disconnects existing ones with a recoverable error the client retries against a different replica

Plan for the dependency’s split-brain story — Patroni fencing, Sentinel quorum, managed service failover. Wardengate inherits whatever guarantees they offer.

Observability for HA

  • wg_replica_up — one per control plane replica; page on < N-1 healthy
  • wg_db_roundtrip_seconds — Postgres latency from each replica; look for divergence
  • wg_redis_failover_total — counter that increments on each Redis topology change
  • wg_sessions_active — per-replica session count; sudden zeros after a rollout usually mean sticky sessions got re-enabled somewhere
  • wg_leader_leases_gauge — background job leaseholders; should tick across replicas over time
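The paging rule for wg_replica_up can be written as a one-line predicate. The sketch below assumes the metric is 1 for a healthy replica and 0 otherwise, and that samples arrive as a mapping from replica name to value; `should_page` and the replica names are hypothetical, and in practice this logic would live in your alerting system rather than in a script.

```python
def should_page(replica_up, expected_replicas):
    """Page when fewer than N-1 replicas report healthy."""
    healthy = sum(1 for value in replica_up.values() if value == 1)
    return healthy < expected_replicas - 1


one_down = should_page({"cp-1": 1, "cp-2": 1, "cp-3": 0}, expected_replicas=3)
two_down = should_page({"cp-1": 1, "cp-2": 0, "cp-3": 0}, expected_replicas=3)
```

With three replicas, losing one is tolerated quietly and losing two pages, which matches a deployment sized to survive a single failure domain.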

Multi-region considerations

A single control plane region plus gateway fleets in other regions is supported out of the box — gateways dial the control plane over the agent gRPC port and stay up even if the control plane is across the world. A fully active-active multi-region control plane is not supported; Postgres is the bottleneck and cross-region synchronous replication is a trade-off most teams do not actually want. For cross-region DR, keep an asynchronous read replica and follow the restore procedure in Backup & restore.
