Q: How do you handle database migration during a cloud provider move?

For stateful applications with persistent databases, the migration strategy depends on the database type. For PostgreSQL or MySQL, set up logical replication from the source to destination before cutover — the destination receives all writes as they happen, and when you're ready to cut over, you confirm the replication lag is near zero, flip application connections to the new database, and continue. For NoSQL databases, tools like mongomirror (MongoDB) or DMS (AWS Database Migration Service) provide similar streaming replication. The key is that you should never perform a dump-and-restore migration for production databases — that requires downtime. Streaming replication allows the new database to be fully in sync before any traffic is switched.

Zero-Downtime Server Migration: Moving Between Cloud Providers Without Breaking Production

Q: How do you migrate a server to a different cloud provider with zero downtime?

Zero-downtime server migration follows five phases: (1) Prepare — provision the new environment and reduce DNS TTL to 60 seconds well before cutover; (2) Replicate — set up live data synchronization from old to new environment and run in parallel; (3) Validate — run the new environment under real traffic by shadowing or using canary routing, verify health for 24–72 hours; (4) Cutover — switch DNS to point at the new environment, monitor for errors, keep old environment running for rollback window; (5) Decommission — after the rollback window closes, terminate old resources. The critical prerequisites are reducing DNS TTL early (to control propagation time) and having a tested rollback plan before starting.

Q: What is DNS TTL and why does it matter for cloud migrations?

DNS TTL (Time To Live) is the number of seconds that DNS resolvers cache a record before re-checking it. If your DNS TTL is 3600 (one hour), resolvers around the world can keep serving the old IP for up to one hour after you change it. For a migration, this means your cutover is not instantaneous: traffic continues hitting the old server while DNS caches expire. To control this, reduce your DNS TTL to 60 seconds at least 24–48 hours before your planned cutover window. Then, when you update the record, most traffic shifts within about 60 seconds rather than up to an hour.

Q: What should a cloud migration rollback plan include?

A migration rollback plan must be tested before the actual cutover, not written as a theoretical document. It should include: the maximum rollback decision window (how long after cutover you'll keep the old environment running); the rollback trigger criteria (the specific error rate, latency threshold, or health check failure that triggers a rollback decision); the exact DNS change to revert the cutover; the procedure for re-syncing any data written to the new environment back to the old one if the rollback occurs after writes have been accepted; and the person with authority to make the rollback call. Without testing the rollback procedure in staging, you don't know if it actually works under the conditions of a real migration.

Q: What are the most common causes of migration failures?

The most common migration failure causes are: (1) DNS TTL not reduced before cutover, so traffic continues hitting old servers for up to an hour after the change; (2) Replication lag at cutover, where the new database isn't fully synchronized and data written before cutover is missing; (3) Environment differences: configuration, environment variables, or dependencies that exist on the old server but weren't replicated to the new one; (4) Untested rollback, where the rollback procedure doesn't work under real conditions; (5) Cutover during peak traffic, when it should always happen during the lowest traffic window of the week.

Server migrations between cloud providers have a reputation for going wrong, and that reputation is earned. The combination of slow-to-propagate DNS changes, stateful database synchronization, and compressed cutover windows creates conditions where mistakes quickly become production incidents.

Most migrations don't fail because of technical complexity. They fail because of predictable, avoidable mistakes that have happened enough times to be documented: not reducing DNS TTL before cutover, not setting up streaming database replication before switching traffic, not having a tested rollback path. The pattern of failures is consistent enough that it's possible to define a standard migration procedure that eliminates virtually all of them.

The Five-Phase Migration Pattern

Phase 1

Prepare — Environment setup and DNS TTL reduction

Provision the new cloud environment (matching the old server's configuration as closely as possible — same OS version, same software stack, same environment variables). Critically: reduce your DNS TTL to 60 seconds at this stage, not at cutover. DNS TTL changes take time to propagate through the resolver chain. If you make the change now, by the time you cut over, the short TTL will already be in effect everywhere.

⏱ Start: 48–72 hours before planned cutover

Phase 2

Replicate — Live data synchronization

Set up streaming replication from the old environment to the new one. For PostgreSQL, this means built-in logical replication or the pglogical extension. For MySQL/MariaDB, binary log replication. For Redis, built-in replication (REPLICAOF) or a custom sync script; RDB and AOF are persistence formats, not replication mechanisms. The new database receives all writes continuously from this point. Let replication run for at least 24 hours before moving to validation, so you can confirm it is stable.

⏱ Duration: 24–48 hours of stable replication before proceeding

Phase 3

Validate — Test under real conditions

Direct a small percentage of production traffic to the new environment using weighted DNS or a load balancer split. Monitor error rates, latency, and application behavior. Run your full integration test suite against the new environment. Check that all dependent services (external APIs, email providers, payment processors) can reach the new servers. This phase is what catches the environment differences that aren't caught by pre-migration testing.

⏱ Duration: 24–72 hours under partial real traffic
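
One way to picture the partial traffic split is a deterministic hash bucket per client, similar in spirit to a weighted load balancer rule with sticky routing. The sketch below is only an illustration under that assumption, not a substitute for weighted DNS or your load balancer's own configuration. The function name `routes_to_new`, the client ID format, and the 5% figure are all hypothetical.

```python
import hashlib


def routes_to_new(client_id: str, canary_percent: int) -> bool:
    """Decide whether a client belongs to the canary slice.

    Hashing the client ID keeps the decision stable across requests, so a
    given client always sees the same environment for a given percentage.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < canary_percent


# Hypothetical usage: estimate the share of 10,000 made-up client IDs
# that a 5% canary would send to the new environment.
clients = [f"client-{n}" for n in range(10_000)]
share = sum(routes_to_new(c, 5) for c in clients) / len(clients)
print(f"share routed to new environment: {share:.1%}")
```

Because the bucket depends only on the client ID, raising the percentage later keeps every client already on the new environment there and only adds new ones.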

Phase 4

Cutover — Traffic switch with rollback readiness

Update your DNS record to point at the new server's IP. With TTL already at 60 seconds, most traffic shifts within 60–120 seconds. Monitor dashboards actively for the first 30 minutes. Have the rollback DNS change pre-typed and ready to execute. Keep the old environment running and, if you can, reverse the direction of replication so that writes accepted by the new database flow back to the old one. Only then does a rollback restore a consistent state without data loss.

⏱ Execute during lowest traffic window — Sunday 2–6 AM in primary market timezone
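
A small polling helper can confirm that the record change is visible before you start watching dashboards. This is a minimal sketch using only the standard library. It reports what the local resolver sees, so other resolvers may still hold the old record. The hostname and IP addresses are placeholders, and `resolves_to` and `wait_for_dns` are hypothetical names.

```python
import socket
import time


def resolves_to(hostname, expected_ip, resolve=socket.gethostbyname):
    """Return True if the resolver currently maps hostname to expected_ip."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def wait_for_dns(hostname, expected_ip, attempts=10, interval_s=15,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Poll until the record points at expected_ip.

    Returns the attempt number that succeeded, or None if it never did.
    """
    for attempt in range(1, attempts + 1):
        if resolves_to(hostname, expected_ip, resolve):
            return attempt
        if attempt < attempts:
            sleep(interval_s)
    return None


# Hypothetical usage at cutover time (placeholder hostname and IP):
# wait_for_dns("app.example.com", "203.0.113.10")
```

The `resolve` and `sleep` parameters exist so the same function can be rehearsed in staging with a fake resolver, which fits the advice to test the rollback path rather than only document it.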

Phase 5

Decommission — After the rollback window closes

Keep the old environment running for 24–72 hours post-cutover. This is your rollback window — if anything unexpected surfaces after the initial healthy monitoring period, you can revert. Once the rollback window has closed and the new environment is confirmed stable, terminate old servers, cancel old account resources, and restore DNS TTL to a standard value (300–3600 seconds).

⏱ Begin: 24–72 hours after successful cutover

DNS TTL: The Most Skipped Step

DNS TTL (Time To Live) controls how long DNS resolvers around the world cache a record before re-checking it. A TTL of 3600 means that after you change your DNS record, some resolvers will continue serving the old IP for up to one hour.

For a migration, this creates a specific problem: if you change DNS at cutover time without having pre-reduced the TTL, two servers receive production traffic simultaneously for as long as the old TTL lasts. The old one serves requests from resolvers that still have the old record cached, and the new one serves requests where the new record has propagated. This dual-serving period is where data consistency issues happen.

TTL Reduction Timing

The TTL currently set on your DNS record determines how long any change to that record takes to propagate, including a change to the TTL itself. A record with TTL=86400 (24 hours) needs up to 24 hours after you lower it before the lower TTL is in effect everywhere. Start the TTL reduction 48–72 hours before your cutover. Reduce in steps if needed: 86400 → 3600 → 300 → 60, with each change taking up to one full period of the previous TTL to take effect.
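
The timing arithmetic can be worked backwards from the cutover time. The sketch below assumes each step must wait one full period of the previous TTL, as described above. The function name `ttl_reduction_schedule` and the cutover date are made up for illustration.

```python
from datetime import datetime, timedelta


def ttl_reduction_schedule(steps, cutover):
    """Return (apply_at, old_ttl, new_ttl) tuples with the latest safe times.

    Each reduction must be published at least one *previous* TTL before the
    next change, because resolvers may hold the old record for that long.
    """
    schedule = []
    deadline = cutover  # moment by which the final TTL must be in effect
    # Walk backwards from the last reduction to the first.
    for old_ttl, new_ttl in reversed(list(zip(steps, steps[1:]))):
        apply_at = deadline - timedelta(seconds=old_ttl)
        schedule.append((apply_at, old_ttl, new_ttl))
        deadline = apply_at
    schedule.reverse()
    return schedule


# Hypothetical cutover time; the steps are the ones from the text above.
cutover = datetime(2025, 1, 12, 3, 0)
for apply_at, old_ttl, new_ttl in ttl_reduction_schedule([86400, 3600, 300, 60], cutover):
    print(f"{apply_at:%Y-%m-%d %H:%M}  lower TTL {old_ttl} -> {new_ttl}")
```

With these steps, the first change has to be published at least 86400 + 3600 + 300 = 90,300 seconds (about 25 hours) before cutover, so a 48–72 hour head start leaves comfortable slack.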

Database Migration: Streaming, Not Dump-and-Restore

The dump-and-restore approach to database migration (stop writes, export a snapshot, copy it to the new server, import, resume writes) requires a maintenance window proportional to database size. For databases over a few gigabytes, that means minutes to hours of downtime. It's the wrong approach for any service with availability requirements.

Streaming replication eliminates the maintenance window by keeping the destination database synchronized with the source throughout the migration. The cutover sequence with streaming replication is:

(1) Verify replication lag is under 1 second on the destination database
(2) Update application configuration to connect to the destination database
(3) Verify the application is writing to the destination and reads are succeeding
(4) Stop replication (source writes can continue if you're running both environments post-cutover)

The total window where both databases may be slightly out of sync is measured in milliseconds to seconds — not hours.
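
The lag check in the first step is easier to trust when it looks at several recent samples rather than a single reading. A minimal sketch, assuming lag is sampled in seconds by whatever monitoring you already have; `ready_for_cutover` and its defaults are hypothetical, with the 1-second threshold taken from the step above.

```python
def ready_for_cutover(lag_samples_s, max_lag_s=1.0, min_samples=5):
    """Return True only if the most recent samples are all under the threshold.

    Requiring several consecutive low readings avoids cutting over on a
    single lucky measurement taken between two bursts of writes.
    """
    recent = lag_samples_s[-min_samples:]
    return len(recent) >= min_samples and all(lag < max_lag_s for lag in recent)


# Hypothetical samples, oldest first: an early spike followed by stable readings.
samples = [5.0, 0.2, 0.3, 0.1, 0.4, 0.2]
print(ready_for_cutover(samples))
```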

PostgreSQL Replication Setup

For PostgreSQL logical replication, set wal_level = logical in postgresql.conf on the source (this requires a restart), then run CREATE PUBLICATION on the source and CREATE SUBSCRIPTION on the destination. By default, CREATE SUBSCRIPTION creates the replication slot on the source for you. Monitor replication lag via pg_stat_replication on the source. For migrations from an older PostgreSQL version where built-in logical replication isn't available (it was added in PostgreSQL 10), pglogical provides equivalent functionality.^[1]
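
The statements involved are short enough to list in one place. The sketch below only builds the SQL text, so it can be reviewed before anyone runs it. The publication name, subscription name, and connection string are placeholders, and `logical_replication_sql` is a hypothetical helper. It assumes PostgreSQL 10 or later and that the table definitions already exist on the destination.

```python
def logical_replication_sql(publication, subscription, source_conninfo):
    """Build the SQL for a basic publication/subscription pair.

    Names are interpolated directly, so pass only trusted identifiers.
    """
    return {
        # Run on the source; wal_level must already report 'logical'.
        "source": [
            "SHOW wal_level;",
            f"CREATE PUBLICATION {publication} FOR ALL TABLES;",
        ],
        # Run on the destination, after the table definitions exist there.
        "destination": [
            f"CREATE SUBSCRIPTION {subscription} "
            f"CONNECTION '{source_conninfo}' "
            f"PUBLICATION {publication};",
        ],
        # Run on the source to see how far each subscriber is behind, in bytes.
        "monitor": [
            "SELECT application_name, "
            "pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes "
            "FROM pg_stat_replication;",
        ],
    }


# Hypothetical names and connection string.
plan = logical_replication_sql(
    "migration_pub", "migration_sub",
    "host=old-db.example.com dbname=app user=replicator",
)
for where, statements in plan.items():
    for statement in statements:
        print(f"[{where}] {statement}")
```

The monitoring query reports lag in bytes of WAL rather than seconds, so treat a value that stays near zero as the signal that the destination has caught up.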

The Rollback Plan: Not Optional

A rollback plan that hasn't been tested is just a document. For a migration, the rollback plan must be tested in a staging environment that mirrors the production migration conditions — not just theorized.

Before executing any production cutover, you must have answers to:

What's the rollback trigger? Define the specific error rate, latency percentile, or health check failure that triggers a rollback decision. Don't make this a judgment call under incident stress.
Who can authorize rollback? Have a named person with the authority to make the call present at the cutover keyboard.
How long will rollback take? With TTL at 60 seconds, DNS rollback takes ~60 seconds. Test this in staging.
What data is at risk if rollback happens after writes? If users write to the new environment before a rollback occurs, those writes may not be present on the old environment. Plan for this specifically.
How long does the rollback window last? Set a specific end time — e.g., 72 hours post-cutover — after which the old environment is decommissioned regardless.
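
Writing the trigger down as data makes it harder to renegotiate in the middle of an incident. A minimal sketch of that idea follows. The threshold values are invented for illustration and are not recommendations, and `RollbackCriteria` and `rollback_reasons` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriteria:
    max_error_rate: float          # fraction of failed requests, 0.02 means 2%
    max_p95_latency_ms: float      # 95th percentile latency ceiling
    max_failed_health_checks: int  # consecutive failures tolerated


def rollback_reasons(criteria, error_rate, p95_latency_ms, failed_health_checks):
    """Return the breached thresholds; an empty list means no trigger fired."""
    reasons = []
    if error_rate > criteria.max_error_rate:
        reasons.append(f"error rate {error_rate:.2%} exceeds {criteria.max_error_rate:.2%}")
    if p95_latency_ms > criteria.max_p95_latency_ms:
        reasons.append(f"p95 latency {p95_latency_ms:.0f} ms exceeds {criteria.max_p95_latency_ms:.0f} ms")
    if failed_health_checks > criteria.max_failed_health_checks:
        reasons.append(f"{failed_health_checks} consecutive health check failures")
    return reasons


# Hypothetical thresholds agreed before cutover, then a made-up reading.
criteria = RollbackCriteria(max_error_rate=0.02, max_p95_latency_ms=800, max_failed_health_checks=2)
for reason in rollback_reasons(criteria, error_rate=0.05, p95_latency_ms=450, failed_health_checks=0):
    print("ROLLBACK TRIGGER:", reason)
```

The function only reports which thresholds were crossed. The decision itself still belongs to the named person with rollback authority.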

The Five Most Common Migration Failure Patterns

These failure patterns appear consistently in post-migration incident reports:

(1) DNS TTL not reduced. Team discovers at cutover time that the current TTL is 3600. Traffic continues hitting the old server for up to an hour post-change. During this window, configuration differences between old and new environments cause inconsistent behavior for different users.
(2) Database dump-and-restore with underestimated size. Export takes 4× longer than expected due to database growth. The maintenance window extends far past the planned window. Team must either continue through the early morning or cancel and reschedule.
(3) Environment variable differences. The new server is missing an environment variable that was set directly on the old server and never committed to source control. A specific feature breaks silently. Discovered by users before the team notices in monitoring.
(4) Firewall/security group differences. The new cloud provider requires explicit outbound allow rules that the old provider permitted by default. Outbound API calls to third-party services fail silently until logs are checked.
(5) SSL certificate not installed on new server. Certificate was manually installed on the old server years ago and never documented. HTTPS fails on the new server immediately after DNS cutover.

Items 3, 4, and 5 are prevented by a thorough validation phase (Phase 3) — which is why running partial real traffic through the new environment before full cutover is essential, not optional.
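
The environment variable failure in particular can be checked mechanically before any traffic moves. The sketch below compares two snapshots of variable names and values. How you capture the snapshots (for example, `dict(os.environ)` on each server) is left as an assumption, and `diff_environment` and the variable names are hypothetical.

```python
def diff_environment(old, new, ignore=()):
    """Compare two environment snapshots (dicts of name -> value).

    Only names are reported, never values, so secrets don't end up in logs.
    """
    ignored = set(ignore)
    missing = sorted(k for k in old if k not in new and k not in ignored)
    changed = sorted(
        k for k in old
        if k in new and k not in ignored and old[k] != new[k]
    )
    return {"missing": missing, "changed": changed}


# Hypothetical snapshots from the old and new servers.
old_env = {"DATABASE_URL": "postgres://old", "SMTP_HOST": "mail.internal", "HOSTNAME": "old-1"}
new_env = {"DATABASE_URL": "postgres://new", "HOSTNAME": "new-1"}
report = diff_environment(old_env, new_env, ignore=["HOSTNAME"])
print("missing on new server:", report["missing"])
print("different values:", report["changed"])
```

Names listed under `missing` are the ones most likely to cause the silent feature breakage described above. Names under `changed` need a human to confirm whether the difference is intentional.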

Frequently Asked Questions

How do you migrate a server to a different cloud provider with zero downtime?

Five phases: (1) Prepare — provision new environment, reduce DNS TTL to 60s 48–72 hours before cutover; (2) Replicate — set up live streaming database replication; (3) Validate — route partial real traffic and monitor for 24–72 hours; (4) Cutover — update DNS, monitor dashboards, keep old environment running for rollback; (5) Decommission — after 24–72 hour rollback window closes, terminate old resources.

What is DNS TTL and why does it matter for cloud migrations?

DNS TTL is the number of seconds resolvers cache a record before re-checking. If your TTL is 3600, traffic continues hitting the old server for up to one hour after you change the record. For migrations, reduce TTL to 60 seconds at least 24–48 hours before cutover, so the short TTL is already propagated everywhere when you flip the switch. With TTL=60, traffic shifts within 60–120 seconds instead of up to an hour.

How should database migration be handled during a cloud provider move?

Use streaming replication, not dump-and-restore. Set up logical replication (PostgreSQL) or binary log replication (MySQL) from source to destination before cutover. Let the destination sync continuously. At cutover, verify replication lag is near zero, then switch application connections. Total consistency gap: milliseconds to seconds. Dump-and-restore requires a maintenance window proportional to database size — wrong approach for any service with availability requirements.

What should a cloud migration rollback plan include?

A tested rollback plan includes: the rollback decision window (how long after cutover the old environment stays live); specific rollback trigger criteria (error rate/latency threshold — not judgment calls); the exact DNS change to execute; the procedure for any data written to the new environment before rollback; the person with rollback authority present at the cutover. Must be tested in staging, not just documented.

What are the most common causes of migration failures?

The most common failures: (1) DNS TTL not reduced before cutover — up to one hour of split traffic; (2) Dump-and-restore database migration with underestimated size; (3) Environment variables set directly on old server, never documented or replicated; (4) Firewall/security group differences blocking outbound calls to third-party services; (5) SSL certificates manually installed on old server, missing from new environment. Items 3–5 are caught by routing real traffic in validation phase before full cutover.

References

[1] PostgreSQL Global Development Group (2024). Logical Replication. PostgreSQL 16 documentation, Chapter 31. Covers publication/subscription setup, monitoring replication lag via pg_stat_replication, and handling schema changes during logical replication. postgresql.org/docs/current/logical-replication.html
[2] IETF RFC 1034 (1987). Domain Names: Concepts and Facilities. Section 3.6 defines the DNS TTL field semantics. RFC 1912 §2.2 provides operational guidance: keep TTLs high (roughly a day to a week) for stable records and lower them temporarily ahead of planned changes. rfc-editor.org/rfc/rfc1034
