Server migrations between cloud providers have a reputation for going wrong, and that reputation is earned. The combination of slow-to-reverse DNS changes, stateful database synchronization, and compressed cutover windows creates conditions where mistakes quickly become production incidents.

Most migrations don't fail because of technical complexity. They fail because of predictable, avoidable mistakes that have happened enough times to be documented: not reducing DNS TTL before cutover, not setting up streaming database replication before switching traffic, not having a tested rollback path. The pattern of failures is consistent enough that it's possible to define a standard migration procedure that eliminates virtually all of them.

The Five-Phase Migration Pattern

Phase 1: Prepare (environment setup and DNS TTL reduction)
Provision the new cloud environment (matching the old server's configuration as closely as possible — same OS version, same software stack, same environment variables). Critically: reduce your DNS TTL to 60 seconds at this stage, not at cutover. DNS TTL changes take time to propagate through the resolver chain. If you make the change now, by the time you cut over, the short TTL will already be in effect everywhere.
⏱ Start: 48–72 hours before planned cutover
Phase 2: Replicate (live data synchronization)
Set up streaming replication from the old environment to the new one. For PostgreSQL, this means logical replication or pglogical. For MySQL/MariaDB, binary log replication. For Redis, built-in replication (REPLICAOF) or a custom sync script; RDB and AOF are persistence formats, not replication mechanisms. The new database receives all writes continuously from this point. Allow the replication to run for 24+ hours before moving to validation, which gives you confidence that it's stable.
⏱ Duration: 24–48 hours of stable replication before proceeding
Phase 3: Validate (test under real conditions)
Direct a small percentage of production traffic to the new environment using weighted DNS or a load balancer split. Monitor error rates, latency, and application behavior. Run your full integration test suite against the new environment. Check that all dependent services (external APIs, email providers, payment processors) can reach the new servers. This phase is what catches the environment differences that aren't caught by pre-migration testing.
⏱ Duration: 24–72 hours under partial real traffic
Phase 4: Cutover (traffic switch with rollback readiness)
Update your DNS record to point at the new server's IP. With TTL already at 60 seconds, traffic from resolvers that honor the TTL shifts within 60–120 seconds. Monitor dashboards actively for the first 30 minutes. Have the rollback DNS change pre-typed and ready to execute. Keep the old environment running. The replication set up in Phase 2 flows from old to new, so it does not carry post-cutover writes back to the old database; if you need a rollback without data loss, replicate writes from the new database back to the old one, otherwise a rollback loses whatever was written after cutover.
⏱ Execute during lowest traffic window — Sunday 2–6 AM in primary market timezone
Phase 5: Decommission (after the rollback window closes)
Keep the old environment running for 24–72 hours post-cutover. This is your rollback window — if anything unexpected surfaces after the initial healthy monitoring period, you can revert. Once the rollback window has closed and the new environment is confirmed stable, terminate old servers, cancel old account resources, and restore DNS TTL to a standard value (300–3600 seconds).
⏱ Begin: 24–72 hours after successful cutover
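
The partial traffic split in Phase 3 is normally configured in your DNS provider's weighted records or in a load balancer, not in application code. Purely to illustrate the idea, here is a minimal sketch of a deterministic weighted split. It assumes each request carries some stable client identifier; the function name `routes_to_new` and the percentage values are hypothetical.

```python
import hashlib


def routes_to_new(client_id: str, new_weight_percent: int) -> bool:
    """Return True if this client should be served by the new environment.

    Hashing the client id keeps the decision stable: the same client
    always lands on the same side of the split for a given weight.
    """
    if not 0 <= new_weight_percent <= 100:
        raise ValueError("new_weight_percent must be between 0 and 100")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < new_weight_percent
```

Raising the weight in small steps (for example 5, then 25, then 50) only moves additional clients to the new environment; clients already routed there stay there, which makes error reports easier to attribute.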

DNS TTL: The Most Skipped Step

DNS TTL (Time To Live) controls how long DNS resolvers around the world cache a record before re-checking it. A TTL of 3600 means that after you change your DNS record, some resolvers will continue serving the old IP for up to one hour.

For a migration, this creates a specific problem: if you change DNS at your cutover time without having pre-reduced the TTL, you have two servers receiving production traffic simultaneously — the old one serving requests from resolvers with the old TTL cached, and the new one serving requests where the new record has propagated. This dual-serving period is where data consistency issues happen.

TTL Reduction Timing

The TTL that was in effect before you lower it determines how long the change takes to reach every resolver. A record with TTL=86400 (24 hours) can stay cached for up to 24 hours after you reduce it, so the shorter TTL is only guaranteed to be in effect everywhere once that period has passed. Start the TTL reduction 48–72 hours before your cutover. Reduce in steps if needed: 86400 → 3600 → 300 → 60, waiting out the previous TTL after each change before making the next.
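
The timing arithmetic can be sketched as a small calculation. This is an illustrative helper, not part of any DNS tooling: the function name `ttl_reduction_schedule` is hypothetical, and it assumes the stepped approach where each change is made only after the previous TTL has fully expired.

```python
from datetime import datetime, timedelta


def ttl_reduction_schedule(cutover: datetime, ttl_steps: list[int]):
    """Work backwards from the cutover time.

    Each TTL change needs one full *previous* TTL to reach every resolver,
    so the latest safe time for a change is the next deadline minus the
    TTL being replaced. Returns (change_at, old_ttl, new_ttl) tuples in
    chronological order.
    """
    schedule = []
    deadline = cutover
    pairs = list(zip(ttl_steps, ttl_steps[1:]))  # (old_ttl, new_ttl) per step
    for old_ttl, new_ttl in reversed(pairs):
        change_at = deadline - timedelta(seconds=old_ttl)
        schedule.append((change_at, old_ttl, new_ttl))
        deadline = change_at
    schedule.reverse()
    return schedule
```

For the steps 86400 → 3600 → 300 → 60, the first change has to happen at least 86400 + 3600 + 300 = 90300 seconds (about 25 hours) before cutover. Starting 48–72 hours ahead leaves margin on top of that minimum.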

Database Migration: Streaming, Not Dump-and-Restore

The dump-and-restore approach to database migration — take a snapshot, stop writes, export, copy to new server, import, resume writes — requires a maintenance window proportional to database size. For databases over a few gigabytes, this is measured in minutes to hours of downtime. It's the wrong approach for any service with availability requirements.
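
To make "proportional to database size" concrete, here is a rough back-of-the-envelope estimate. The function name and all throughput figures are hypothetical placeholders; real export, transfer, and import speeds vary widely and should be measured on your own systems.

```python
def dump_restore_window_minutes(
    db_size_gb: float,
    export_mb_per_s: float,
    transfer_mb_per_s: float,
    import_mb_per_s: float,
) -> float:
    """Estimate downtime for a sequential export, copy, and import.

    Assumes the three steps run one after another and that each runs at
    a constant throughput, which is a simplification.
    """
    rates = (export_mb_per_s, transfer_mb_per_s, import_mb_per_s)
    if db_size_gb < 0 or any(rate <= 0 for rate in rates):
        raise ValueError("size must be non-negative and rates must be positive")
    size_mb = db_size_gb * 1024
    total_seconds = sum(size_mb / rate for rate in rates)
    return total_seconds / 60
```

With the assumed rates of 50, 100, and 25 MB/s, a 50 GB database needs roughly an hour of stopped writes, and the estimate doubles when the database does.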

Streaming replication eliminates the maintenance window by keeping the destination database synchronized with the source throughout the migration. The cutover sequence with streaming replication is:

  1. Verify replication lag is under 1 second on the destination database
  2. Update application configuration to connect to the destination database
  3. Verify application is writing to destination and reads are succeeding
  4. Stop replication once the source no longer receives application writes (any write that still lands on the source after this point will not reach the destination)

The total window where both databases may be slightly out of sync is measured in milliseconds to seconds — not hours.
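
The sequence above can be sketched as a gated procedure. This is a minimal outline under stated assumptions, not a complete runbook: the four callables are hypothetical hooks you would implement against your own database and deployment tooling, and the 1-second threshold comes from step 1.

```python
from typing import Callable


def run_cutover(
    get_lag_seconds: Callable[[], float],
    switch_app_config: Callable[[], None],
    verify_app: Callable[[], bool],
    stop_replication: Callable[[], None],
    max_lag_seconds: float = 1.0,
) -> str:
    """Run the four cutover steps in order, stopping at the first failure."""
    # Step 1: refuse to proceed while the destination is behind.
    if get_lag_seconds() >= max_lag_seconds:
        return "aborted_lag_too_high"
    # Step 2: point the application at the destination database.
    switch_app_config()
    # Step 3: confirm reads and writes succeed before touching replication.
    if not verify_app():
        return "verification_failed"
    # Step 4: only now stop replication from the source.
    stop_replication()
    return "done"
```

Returning a status instead of raising keeps the decision about what happens next (retry, roll back, escalate) with the person running the cutover. Note that replication is left running when verification fails, so the source stays a usable fallback.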

PostgreSQL Replication Setup

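For PostgreSQL logical replication, set wal_level = logical in postgresql.conf on the source (this requires a server restart), then use CREATE PUBLICATION on the source and CREATE SUBSCRIPTION on the destination. By default, CREATE SUBSCRIPTION creates the replication slot on the source for you. Logical replication does not copy schema definitions, so create the tables on the destination first. Monitor replication lag via pg_stat_replication on the source. Built-in logical replication is available from PostgreSQL 10 onward; for migrations from older versions, pglogical provides equivalent functionality.[1]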
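
As a sketch of what those statements look like, the helper below only builds the SQL text; it does not connect to any database. The function name, publication name, subscription name, and connection string are hypothetical, and FOR ALL TABLES is one possible choice (you may prefer to list tables explicitly). Identifiers are interpolated without quoting, so pass only trusted, simple names.

```python
def logical_replication_sql(publication: str, subscription: str, source_conninfo: str) -> dict:
    """Build the statements for a basic PostgreSQL logical replication setup.

    Assumes wal_level = logical is already set on the source and that the
    destination already has matching table definitions.
    """
    conninfo = source_conninfo.replace("'", "''")  # escape quotes in the literal
    return {
        # Run on the source database.
        "source": f"CREATE PUBLICATION {publication} FOR ALL TABLES;",
        # Run on the destination; by default this also creates the slot on the source.
        "destination": (
            f"CREATE SUBSCRIPTION {subscription} "
            f"CONNECTION '{conninfo}' PUBLICATION {publication};"
        ),
        # Run on the source: bytes of WAL the subscriber has not yet replayed.
        "lag_check": (
            "SELECT application_name, "
            "pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes "
            "FROM pg_stat_replication;"
        ),
    }
```

The lag query reports bytes rather than seconds. A value that stays near zero under normal write load is the signal to look for before cutover.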

The Rollback Plan: Not Optional

A rollback plan that hasn't been tested is just a document. For a migration, the rollback plan must be tested in a staging environment that mirrors the production migration conditions — not just theorized.

Before executing any production cutover, you must have answers to:

  • What's the rollback trigger? Define the specific error rate, latency percentile, or health check failure that triggers a rollback decision. Don't make this a judgment call under incident stress.
  • Who can authorize rollback? Have a named person with the authority to make the call present at the cutover keyboard.
  • How long will rollback take? With TTL at 60 seconds, DNS rollback takes ~60 seconds. Test this in staging.
  • What data is at risk if rollback happens after writes? If users write to the new environment before a rollback occurs, those writes may not be present on the old environment. Plan for this specifically.
  • How long does the rollback window last? Set a specific end time — e.g., 72 hours post-cutover — after which the old environment is decommissioned regardless.
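
One way to keep the rollback trigger from becoming a judgment call is to write it down as data before the cutover. The sketch below is illustrative only: the class, field names, and threshold values are hypothetical, and the metrics would come from whatever monitoring you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    max_error_rate: float          # fraction of failed requests, e.g. 0.02 for 2%
    max_p99_latency_ms: float      # 99th percentile latency ceiling
    max_failed_health_checks: int  # consecutive failures tolerated


def should_roll_back(
    trigger: RollbackTrigger,
    error_rate: float,
    p99_latency_ms: float,
    failed_health_checks: int,
) -> list[str]:
    """Return the list of breached criteria; an empty list means stay the course."""
    reasons = []
    if error_rate > trigger.max_error_rate:
        reasons.append(f"error rate {error_rate:.2%} exceeds {trigger.max_error_rate:.2%}")
    if p99_latency_ms > trigger.max_p99_latency_ms:
        reasons.append(f"p99 latency {p99_latency_ms:.0f} ms exceeds {trigger.max_p99_latency_ms:.0f} ms")
    if failed_health_checks > trigger.max_failed_health_checks:
        reasons.append(f"{failed_health_checks} failed health checks exceeds {trigger.max_failed_health_checks}")
    return reasons
```

Returning the breached criteria rather than a bare boolean gives the person with rollback authority the reason alongside the decision, which also makes the post-incident write-up easier.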

The Five Most Common Migration Failure Patterns

These failure patterns appear consistently in post-migration incident reports:

  1. DNS TTL not reduced. Team discovers at cutover time that the current TTL is 3600. Traffic continues hitting the old server for up to an hour post-change. During this window, configuration differences between old and new environments cause inconsistent behavior for different users.
  2. Database dump-and-restore with underestimated size. Export takes 4× longer than expected due to database growth. The maintenance window extends far past the planned window. Team must either continue through the early morning or cancel and reschedule.
  3. Environment variable differences. The new server is missing an environment variable that was set directly on the old server — never in source control. A specific feature breaks silently. Discovered by users before the team notices in monitoring.
  4. Firewall/security group differences. The new cloud provider requires explicit outbound allow rules that the old provider permitted by default. Outbound API calls to third-party services fail silently until logs are checked.
  5. SSL certificate not installed on new server. Certificate was manually installed on the old server years ago and never documented. HTTPS fails on the new server immediately after DNS cutover.

Items 3, 4, and 5 are prevented by a thorough validation phase (Phase 3) — which is why running partial real traffic through the new environment before full cutover is essential, not optional.
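
Failure pattern 3 in particular can also be checked mechanically before any traffic is routed. The following is a minimal sketch that compares two environment dumps, assuming you have captured the output of a command such as `env` from a process on each server. The function names are hypothetical, and the simple parser does not handle values that span multiple lines.

```python
def parse_env_dump(text: str) -> dict:
    """Parse KEY=VALUE lines; everything after the first '=' is the value."""
    variables = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        variables[key] = value
    return variables


def diff_environment(old: dict, new: dict) -> dict:
    """Report variables that are missing, changed, or new on the new server."""
    return {
        "missing_on_new": sorted(set(old) - set(new)),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "only_on_new": sorted(set(new) - set(old)),
    }
```

Anything listed under `missing_on_new` deserves an explicit decision before cutover. Some differences, such as hostnames or database endpoints, are expected to appear under `changed`.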

Frequently Asked Questions

Five phases: (1) Prepare — provision new environment, reduce DNS TTL to 60s 48–72 hours before cutover; (2) Replicate — set up live streaming database replication; (3) Validate — route partial real traffic and monitor for 24–72 hours; (4) Cutover — update DNS, monitor dashboards, keep old environment running for rollback; (5) Decommission — after 24–72 hour rollback window closes, terminate old resources.
DNS TTL is the number of seconds resolvers cache a record before re-checking. If your TTL is 3600, traffic continues hitting the old server for up to one hour after you change the record. For migrations, reduce TTL to 60 seconds 48–72 hours before cutover, so the short TTL is already propagated everywhere when you flip the switch. With TTL=60, traffic shifts within 60–120 seconds instead of up to an hour.
Use streaming replication, not dump-and-restore. Set up logical replication (PostgreSQL) or binary log replication (MySQL) from source to destination before cutover. Let the destination sync continuously. At cutover, verify replication lag is near zero, then switch application connections. Total consistency gap: milliseconds to seconds. Dump-and-restore requires a maintenance window proportional to database size — wrong approach for any service with availability requirements.
A tested rollback plan includes: the rollback decision window (how long after cutover the old environment stays live); specific rollback trigger criteria (error rate/latency threshold — not judgment calls); the exact DNS change to execute; the procedure for any data written to the new environment before rollback; the person with rollback authority present at the cutover. Must be tested in staging, not just documented.
The most common failures: (1) DNS TTL not reduced before cutover — up to one hour of split traffic; (2) Dump-and-restore database migration with underestimated size; (3) Environment variables set directly on old server, never documented or replicated; (4) Firewall/security group differences blocking outbound calls to third-party services; (5) SSL certificates manually installed on old server, missing from new environment. Items 3–5 are caught by routing real traffic in validation phase before full cutover.

References

  1. PostgreSQL Global Development Group (2024). Logical Replication. PostgreSQL 16 documentation, Chapter 31. Covers publication/subscription setup, monitoring replication lag via pg_stat_replication, and handling schema changes during logical replication. postgresql.org/docs/current/logical-replication.html
  2. IETF RFC 1034 (1987). Domain Names — Concepts and Facilities. Section 3.6 defines the DNS TTL field semantics. RFC 1912 §2.2 provides operational guidance recommending TTLs of 1 day for stable records, with ability to reduce temporarily for planned changes. rfc-editor.org/rfc/rfc1034