Replacing bastion fleets: a migration playbook
Most bastion estates grew by accident. This is how engineering teams unwind them on purpose — with rollback, evidence, and zero change-freeze surprises.
If you run privileged infrastructure long enough, you accumulate bastions. One per region became one per VPC, then one per tier, then one per acquisition. The fleet is rarely documented end-to-end, and nobody is quite sure which keys still work or which ACLs are load-bearing. A migration to a centralized gateway is less about the shiny new control plane and more about surgically removing a dozen accumulated workarounds without the estate noticing.
We work with teams running anywhere from four to four hundred jump hosts, and the migrations that go cleanly share a shape. This playbook captures that shape: what to inventory first, how to stand up Wardengate in parallel, how to cut over a protocol at a time, and how to retire the old fleet with enough evidence that your auditors nod instead of wince.
Stage 1: inventory before you design
Before proposing a target architecture, compile a factual map of what exists. Not what the wiki says exists, but what production actually does. Every team we have helped discovers at least one bastion nobody owns and at least one standing SSH tunnel that has been open for longer than the engineer who created it has been employed. At minimum, the map should record:
- Which hosts accept inbound SSH, RDP, or database connections from operators today, and from which source ranges.
- Which identities can reach those hosts: human accounts, shared service users, long-lived API keys, and sidecar agents.
- What authorization mechanism actually gates access: an IdP group, a local /etc/passwd entry, a shared private key, or nothing beyond network reachability.
- Which sessions leave evidence, where that evidence is stored, and how long it is retained.
- Which bastions have dependencies on their own hostnames — scripts, monitoring, or runbooks that hard-code bastion.region.internal and will silently break when the host disappears.
This inventory is the backbone of the migration. Every later decision routes through it. Skip it and you will discover the missing rows in production, at 03:00, during an incident where the on-call engineer cannot reach the database tier.
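One way to keep the inventory checkable is to hold it as a flat file and flag risky rows mechanically. The sketch below assumes a hypothetical CSV layout (host, owner, protocol, auth mechanism, evidence store); the column names, the auth labels, and the flag_inventory_gaps helper are illustrative and not part of any tool named in this playbook.

```shell
#!/bin/sh
# Flag inventory rows that need attention before design starts.
# Assumed (hypothetical) CSV columns: host,owner,protocol,auth,evidence
#   auth     - idp-group | local-user | shared-key | network-only
#   evidence - where session records land; empty means none
flag_inventory_gaps() {
  awk -F',' '
    NR == 1              { next }   # skip the header row
    $2 == ""             { print $1 ": no owner" }
    $4 == "network-only" { print $1 ": gated only by network reachability" }
    $4 == "shared-key"   { print $1 ": shared private key" }
    $5 == ""             { print $1 ": no session evidence" }
  '
}

# Example run on an inline inventory.
flag_inventory_gaps <<'EOF'
host,owner,protocol,auth,evidence
bastion-a,platform,ssh,idp-group,s3-audit
bastion-b,,ssh,shared-key,
bastion-c,data,postgres,network-only,syslog
EOF
```

In this example bastion-a passes cleanly, bastion-b is flagged three times, and bastion-c is flagged once. The rows that get flagged are the ones to resolve before anyone draws a target architecture.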
Stage 2: stand the gateway up in shadow mode
Deploy Wardengate alongside the existing bastion fleet, not in place of it. The goal at this stage is to issue real entitlements, broker real sessions, and generate real evidence — but without being anyone's only path. Treat it as a shadow control plane for two to three weeks.
Give a pilot team real targets. Ask them to prefer the gateway but keep bastion access as a fallback. What you are looking for is not outages; it is friction. Every complaint that sounds like 'I tried to run ansible and it timed out' is a policy gap you want to find now, while the old path still works.
Stage 3: cut over by protocol, not by team
The most common migration mistake is cutting over one team at a time. It feels safer, but it leaves the old fleet alive indefinitely — because there is always one more team to migrate, and the long tail never ends. Cut over by protocol instead.
Move SSH first. It is the most heavily used protocol in almost every estate, which means it generates the most feedback fastest. Narrow the source allowlists on the bastions so that they accept SSH only from the gateway, watch the logs for legitimate traffic that hits a denied rule, and adjust policy. Two weeks of SSH on the gateway will surface more policy gaps than two months of careful planning.
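If the bastion allowlists are enforced with security groups and VPC Flow Logs are enabled, the denied attempts can be summarized straight from the flow log records. This is a rough sketch that assumes the default flow log format arriving on standard input; the summarize_rejected_ssh name and the sample addresses are made up for illustration.

```shell
#!/bin/sh
# Count rejected SSH connection attempts per source address.
# Assumes the default VPC Flow Logs format, one record per line:
#   version account-id interface-id srcaddr dstaddr srcport dstport
#   protocol packets bytes start end action log-status
summarize_rejected_ssh() {
  awk '
    $7 == 22 && $8 == 6 && $13 == "REJECT" { count[$4]++ }   # denied TCP/22
    END { for (src in count) print count[src], src }
  ' | sort -k1,1nr -k2,2
}

# Example run on a few sample records (addresses are placeholders).
summarize_rejected_ssh <<'EOF'
2 123456789012 eni-0bastion 10.0.4.17 10.0.1.5 50122 22 6 3 180 1700000000 1700000060 REJECT OK
2 123456789012 eni-0bastion 10.0.4.17 10.0.1.5 50123 22 6 3 180 1700000060 1700000120 REJECT OK
2 123456789012 eni-0bastion 10.0.9.2 10.0.1.5 40001 22 6 5 300 1700000000 1700000060 ACCEPT OK
2 123456789012 eni-0bastion 10.0.7.8 10.0.1.5 40100 22 6 1 60 1700000000 1700000060 REJECT OK
EOF
```

Sources that keep appearing in the output are either a gateway policy gap or a client that still targets the bastion directly. Either way they are worth chasing while the old path can still be reopened.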
Then RDP, then databases, then anything bespoke. The last protocol is always the weird one: a vendor management console, a hardware appliance with a captive telnet shell, a third-party support tool. Plan for it to take longer than everything else combined.
Stage 4: collapse the evidence chain
Before you retire a single bastion, confirm that the session evidence it was generating is also being captured at the gateway — and ideally improved. Auditors will not be impressed that you consolidated access if the consolidation happens to also consolidate your evidence into a gap. Export keystroke, file transfer, and command records from Wardengate into your SIEM and run a week of parallel collection.
A useful pattern is to pick a specific auditor scenario — 'show me every privileged session against the cardholder data environment last quarter, with named identity and approval record' — and run that query against both the old and new evidence stores. If the new store answers the question more completely, you are ready to cut.
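The comparison can be made mechanical by exporting the sessions each store returns for the scenario and diffing them. The sketch below assumes each export is a plain text file with one session key per line (for example identity, target, and start time joined into one string); that export format and the sessions_missing_from_new helper are hypothetical.

```shell
#!/bin/sh
# Print session keys that the old evidence store returned for a scenario
# but the new store did not. Empty output means the new store covers the old.
# Assumes one session key per line in each file (hypothetical export format).
sessions_missing_from_new() {
  old_export="$1"
  new_export="$2"
  sorted_new="$(mktemp)"
  sort -u "$new_export" > "$sorted_new"
  # comm needs sorted input; -23 keeps only lines unique to the first input.
  sort -u "$old_export" | comm -23 - "$sorted_new"
  rm -f "$sorted_new"
}

# Example run with two small placeholder exports.
tmp="$(mktemp -d)"
printf '%s\n' 'alice|cde-db-1|2024-03-01T10:02' 'bob|cde-db-2|2024-03-04T16:40' > "$tmp/old.txt"
printf '%s\n' 'alice|cde-db-1|2024-03-01T10:02' > "$tmp/new.txt"
sessions_missing_from_new "$tmp/old.txt" "$tmp/new.txt"
rm -rf "$tmp"
```

Here the only line printed is bob's session, which the new store failed to return. Any output at all is a reason to delay retirement until the gap is explained.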
Stage 5: retire and record
Now the old bastions can go. Disable inbound access first, but leave the hosts running for a grace period so that anything still depending on them surfaces as a monitoring alarm. After two weeks of silence, terminate the hosts and remove the associated security group rules. Capture the teardown in a change record your auditors can see alongside the consolidation rationale.
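# Typical bastion retirement check: no accepted flows on the old host's interface.
# Assumes VPC Flow Logs (default format) are delivered to CloudWatch Logs;
# the log group name and ENI ID below are placeholders.
$ aws logs filter-log-events --log-group-name /vpc/flow-logs \
--start-time "$(date -d '14 days ago' +%s)000" \
--filter-pattern '[version, account, eni="eni-0bastion", source, destination, srcport, destport, protocol, packets, bytes, windowstart, windowend, action="ACCEPT", flowlogstatus]' \
--output json | jq '.events | length'
0
What changes for your operators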
Your operators do not want a new tool. They want their old tool — ssh, their IDE, their database client — to keep working. A good migration does not ask them to learn a new UI for connecting. They still run ssh user@target; Wardengate just happens to be in front of it. The new behavior is that their session is now tied to a named identity, is time-bound, and is recorded. That is the point. The day your operators forget the migration happened is the day it succeeded.
A one-page migration ledger
Finally, keep the migration honest. A single-page ledger, updated weekly, tends to beat an elaborate tracker that nobody opens. Track four numbers: bastions still active, targets reachable through the gateway, identities bound to policy, and sessions recorded in the last seven days. When the first number hits zero and the other three stop growing, the migration is done. Write that up, file it, and move on.
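The "done" rule is simple enough to check mechanically. As a minimal sketch, assume the ledger is a text file with one whitespace-separated row per week; the column layout and the ledger_status helper are hypothetical, and "stop growing" is read here as "did not increase since the previous week".

```shell
#!/bin/sh
# Decide whether the migration is done from a weekly ledger.
# Assumed (hypothetical) whitespace-separated columns, one row per week:
#   week  bastions_active  targets_via_gateway  identities_bound  sessions_last_7d
# "Done" here means: the latest row has zero active bastions and none of the
# other three numbers grew compared with the previous row.
ledger_status() {
  awk '
    NR > 1 {
      grew = ($3 + 0 > prev_targets) || ($4 + 0 > prev_identities) || ($5 + 0 > prev_sessions)
    }
    {
      bastions        = $2 + 0
      prev_targets    = $3 + 0
      prev_identities = $4 + 0
      prev_sessions   = $5 + 0
    }
    END {
      if (NR < 2)                      print "need at least two weeks of data"
      else if (bastions == 0 && !grew) print "done"
      else                             print "in progress"
    }
  '
}

# Example run: three weeks of placeholder numbers.
ledger_status <<'EOF'
2024-W10 6 140 310 2210
2024-W11 2 152 318 2405
2024-W12 0 152 318 2390
EOF
```

For these sample rows the helper reports done, because the last week shows zero bastions and no growth in the other three columns. A week in which any of those three numbers still rises keeps the status at in progress even when the bastion count is already zero.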