Server migrations between cloud providers have a reputation for going wrong — and that reputation is earned. The combination of irreversible DNS changes, stateful database synchronization, and compressed cutover windows creates conditions where mistakes become production incidents quickly.
Most migrations don't fail because of technical complexity. They fail because of predictable, avoidable mistakes that have happened enough times to be documented: not reducing DNS TTL before cutover, not setting up streaming database replication before switching traffic, not having a tested rollback path. The pattern of failures is consistent enough that it's possible to define a standard migration procedure that eliminates virtually all of them.
The Five-Phase Migration Pattern
DNS TTL: The Most Skipped Step
DNS TTL (Time To Live) controls how long DNS resolvers around the world cache a record before re-checking it. A TTL of 3600 means that after you change your DNS record, some resolvers will continue serving the old IP for up to one hour.
For a migration, this creates a specific problem: if you change DNS at your cutover time without having pre-reduced the TTL, you have two servers receiving production traffic simultaneously — the old one serving requests from resolvers with the old TTL cached, and the new one serving requests where the new record has propagated. This dual-serving period is where data consistency issues happen.
The TTL currently set on your DNS record determines how long a change takes to propagate, because resolvers keep serving their cached copy, old TTL included, until it expires. A record with TTL=86400 (24 hours) can take up to 24 hours after you lower it before every resolver sees the lower value. Start the TTL reduction 48–72 hours before your cutover. Reduce in steps if needed: 86400 → 3600 → 300 → 60, with each change taking up to one full period of the previous TTL to take effect.
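The timing of a stepped reduction can be worked out backwards from the cutover. The following is a minimal sketch, not a prescribed tool: the function name `ttl_reduction_schedule` and the cutover date are hypothetical, and it assumes resolvers honor the published TTL exactly.

```python
from datetime import datetime, timedelta


def ttl_reduction_schedule(current_ttl, steps, cutover):
    """Return (new_ttl, latest_change_time) pairs, working back from cutover.

    A lowered TTL is only guaranteed to be in effect once the previous TTL
    has expired from resolver caches, so each change must be made at least
    one *previous* TTL period before the next change (or the cutover).
    """
    previous_ttls = [current_ttl] + list(steps[:-1])
    schedule = []
    deadline = cutover
    # Walk backwards: the final step must be live by cutover, and so on.
    for new_ttl, old_ttl in reversed(list(zip(steps, previous_ttls))):
        deadline = deadline - timedelta(seconds=old_ttl)
        schedule.append((new_ttl, deadline))
    schedule.reverse()
    return schedule


cutover = datetime(2025, 6, 14, 2, 0)  # hypothetical cutover time
plan = ttl_reduction_schedule(86400, [3600, 300, 60], cutover)
for ttl, when in plan:
    print(f"set TTL={ttl:>5} no later than {when:%Y-%m-%d %H:%M}")
```

For the 86400 → 3600 → 300 → 60 sequence the minimum lead time is 86400 + 3600 + 300 seconds, about 25 hours. The 48–72 hour recommendation leaves margin on top of that for resolvers that cache longer than they should.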
Database Migration: Streaming, Not Dump-and-Restore
The dump-and-restore approach to database migration — take a snapshot, stop writes, export, copy to new server, import, resume writes — requires a maintenance window proportional to database size. For databases over a few gigabytes, this is measured in minutes to hours of downtime. It's the wrong approach for any service with availability requirements.
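A rough estimate makes the size dependence concrete. This sketch assumes the export, copy, and import run one after another with writes stopped throughout. The throughput figures are hypothetical placeholders, so measure your own before relying on the result.

```python
def dump_restore_downtime_minutes(size_gb, export_mb_s, transfer_mb_s, import_mb_s):
    """Estimate downtime for a sequential export -> copy -> import.

    Writes are stopped for the whole sequence, so the three phases add up.
    """
    size_mb = size_gb * 1024
    seconds = (
        size_mb / export_mb_s
        + size_mb / transfer_mb_s
        + size_mb / import_mb_s
    )
    return seconds / 60


# Hypothetical figures: 50 GB database, 50 MB/s export,
# 100 MB/s network copy, 25 MB/s import (index rebuilds slow this phase).
estimate = dump_restore_downtime_minutes(50, 50, 100, 25)
print(f"estimated downtime: {estimate:.1f} minutes")
```

With these assumed rates a 50 GB database needs roughly an hour of downtime, and the estimate scales linearly with database size.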
Streaming replication eliminates the maintenance window by keeping the destination database synchronized with the source throughout the migration. The cutover sequence with streaming replication is:
- Verify replication lag is under 1 second on the destination database
- Update application configuration to connect to the destination database
- Verify application is writing to destination and reads are succeeding
- Stop replication once the application is no longer writing to the source (any writes that reach the source after this point will not be copied to the destination)
The total window where both databases may be slightly out of sync is measured in milliseconds to seconds — not hours.
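The first step of that sequence, the lag check, can be turned into an explicit gate rather than a glance at a dashboard. This is a sketch under assumptions: `ready_for_cutover` is a hypothetical helper, the lag samples are expected to come from your own monitoring, and requiring several consecutive readings under the threshold is a design choice added here, not part of the sequence above.

```python
def ready_for_cutover(lag_samples, max_lag_seconds=1.0, required_consecutive=5):
    """Return True if the most recent samples are all under the threshold.

    lag_samples: replication lag readings in seconds, oldest first.
    A single low reading can be a fluke, so several in a row are required.
    """
    if len(lag_samples) < required_consecutive:
        return False
    recent = lag_samples[-required_consecutive:]
    return all(lag < max_lag_seconds for lag in recent)


# Hypothetical readings taken a few seconds apart before cutover.
print(ready_for_cutover([3.2, 0.4, 0.3, 0.2, 0.2, 0.1]))  # lag has settled
print(ready_for_cutover([0.2, 0.1, 0.3, 2.5, 0.2]))       # recent spike
```

The first call passes because the last five readings are all under one second. The second fails because of the 2.5-second spike.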
For PostgreSQL logical replication, set wal_level = logical in postgresql.conf on the source (this requires a server restart), then run CREATE PUBLICATION on the source and CREATE SUBSCRIPTION on the destination. By default, CREATE SUBSCRIPTION creates the replication slot on the source, so you only need to create one manually if you disable that behavior. Monitor replication lag on the source via pg_stat_replication. For migrations from PostgreSQL versions before 10, which lack built-in logical replication, pglogical provides equivalent functionality.[1]
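The statements involved are short enough to generate from a script. The sketch below only builds the SQL text and does not connect to anything. The publication name, subscription name, and connection string are hypothetical. It assumes PostgreSQL 10 or later, that the table definitions already exist on the destination (logical replication does not copy schema), and that the identifiers are simple, trusted names.

```python
def logical_replication_sql(publication, subscription, source_conninfo):
    """Build the SQL for a basic publication/subscription pair (PostgreSQL 10+)."""
    # Escape single quotes so the connection string is a valid SQL literal.
    conninfo = source_conninfo.replace("'", "''")
    return {
        "source": [
            # Requires wal_level = logical and a server restart beforehand.
            f"CREATE PUBLICATION {publication} FOR ALL TABLES;",
        ],
        "destination": [
            # By default this also creates the replication slot on the source.
            f"CREATE SUBSCRIPTION {subscription} "
            f"CONNECTION '{conninfo}' PUBLICATION {publication};",
        ],
        "monitor_on_source": [
            "SELECT application_name, replay_lag, "
            "pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes "
            "FROM pg_stat_replication;",
        ],
    }


sql = logical_replication_sql(
    "migration_pub",   # hypothetical publication name
    "migration_sub",   # hypothetical subscription name
    "host=old-db.example.internal dbname=app user=replicator",  # hypothetical
)
for target, statements in sql.items():
    for statement in statements:
        print(f"[{target}] {statement}")
```

The monitoring query reports lag both as a time interval (`replay_lag`) and in bytes of WAL not yet replayed, which is useful when the source is idle and the interval is NULL.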
The Rollback Plan: Not Optional
A rollback plan that hasn't been tested is just a document. For a migration, the rollback plan must be tested in a staging environment that mirrors the production migration conditions — not just theorized.
Before executing any production cutover, you must have answers to:
- What's the rollback trigger? Define the specific error rate, latency percentile, or health check failure that triggers a rollback decision. Don't make this a judgment call under incident stress.
- Who can authorize rollback? Have a named person with the authority to make the call present at the cutover keyboard.
- How long will rollback take? With TTL at 60 seconds, DNS rollback takes roughly 60 seconds for resolvers that honor the TTL; some resolvers and clients cache for longer. Test this in staging.
- What data is at risk if rollback happens after writes? If users write to the new environment before a rollback occurs, those writes may not be present on the old environment. Plan for this specifically.
- How long does the rollback window last? Set a specific end time — e.g., 72 hours post-cutover — after which the old environment is decommissioned regardless.
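Writing the trigger and the window down as data removes the judgment call. The following is an illustrative sketch only: the class, the function names, and every threshold value are hypothetical, apart from the 72-hour window taken from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RollbackPolicy:
    max_error_rate: float          # fraction of failed requests, 0.02 = 2%
    max_p99_latency_ms: float      # 99th-percentile latency ceiling
    max_failed_health_checks: int  # consecutive failures tolerated
    window: timedelta              # how long rollback stays available


def breached_triggers(policy, error_rate, p99_latency_ms, failed_health_checks):
    """Return the names of breached triggers; any entry means roll back."""
    breached = []
    if error_rate > policy.max_error_rate:
        breached.append("error_rate")
    if p99_latency_ms > policy.max_p99_latency_ms:
        breached.append("p99_latency")
    if failed_health_checks > policy.max_failed_health_checks:
        breached.append("health_checks")
    return breached


def rollback_still_available(policy, cutover_time, now):
    """True while the old environment is still inside the rollback window."""
    return now < cutover_time + policy.window


# Hypothetical thresholds agreed before the cutover.
policy = RollbackPolicy(0.02, 800, 3, timedelta(hours=72))
print(breached_triggers(policy, error_rate=0.05, p99_latency_ms=400,
                        failed_health_checks=0))
```

Returning the list of breached triggers, rather than a bare boolean, gives the person authorizing the rollback the reason along with the decision.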
The Five Most Common Migration Failure Patterns
These failure patterns appear consistently in post-migration incident reports:
- DNS TTL not reduced. Team discovers at cutover time that the current TTL is 3600. Traffic continues hitting the old server for up to an hour post-change. During this window, configuration differences between old and new environments cause inconsistent behavior for different users.
- Database dump-and-restore with underestimated size. Export takes 4× longer than expected due to database growth. The maintenance window runs far past its planned end. Team must either continue through the early morning or cancel and reschedule.
- Environment variable differences. The new server is missing an environment variable that was set directly on the old server — never in source control. A specific feature breaks silently. Discovered by users before the team notices in monitoring.
- Firewall/security group differences. The new cloud provider requires explicit outbound allow rules that the old provider permitted by default. Outbound API calls to third-party services fail silently until logs are checked.
- SSL certificate not installed on new server. Certificate was manually installed on the old server years ago and never documented. HTTPS fails on the new server immediately after DNS cutover.
Items 3, 4, and 5 (environment variables, firewall rules, and the SSL certificate) are prevented by a thorough validation phase (Phase 3), which is why running partial real traffic through the new environment before full cutover is essential, not optional.
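Two of these checks are easy to script ahead of the validation phase. This is a minimal sketch with hypothetical helper names and example values. It assumes you can export the environment from both servers as name/value mappings, and that a plain TCP connection is an adequate first test of outbound firewall rules.

```python
import socket


def missing_env_vars(old_env, new_env, ignore=()):
    """Names set on the old server but absent on the new one."""
    return sorted(set(old_env) - set(new_env) - set(ignore))


def can_connect(host, port, timeout=3.0):
    """True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical environment dumps from each server.
old_env = {"DATABASE_URL": "...", "PAYMENT_API_KEY": "...", "HOSTNAME": "old-1"}
new_env = {"DATABASE_URL": "...", "HOSTNAME": "new-1"}
print(missing_env_vars(old_env, new_env))
```

Run `can_connect` from the new server against each third-party endpoint the application calls, for example `can_connect("api.example.com", 443)` with a hypothetical host. A False result before cutover is far cheaper than a silent failure after it.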
References
- PostgreSQL Global Development Group (2024). Logical Replication. PostgreSQL 16 documentation, Chapter 31. Covers publication/subscription setup, monitoring replication lag via pg_stat_replication, and handling schema changes during logical replication. postgresql.org/docs/current/logical-replication.html
- IETF RFC 1034 (1987). Domain Names — Concepts and Facilities. Section 3.6 defines the DNS TTL field semantics. RFC 1912 §2.2 provides operational guidance recommending TTLs of 1 day for stable records, with ability to reduce temporarily for planned changes. rfc-editor.org/rfc/rfc1034