Live Migration Simulation War Room with Different Failure Scenarios
(Migrating From On-prem SQL Server To SQL Server on AWS EC2)
This is designed as a live migration war room simulation: things go wrong at specific minutes, and you must react correctly.
Scenario Setup
* Database: 25TB (mission-critical financial system)
* Method: Backup + Restore + Log Shipping
* Downtime target: 20 minutes
* Migration window: Saturday 10:00 PM
SIMULATION MODE
We’ll simulate time exactly like a real migration.
You are the DBA. This article tells you what happens at each point.
Your job: understand the correct response.
T-6 HOURS (4:00 PM)
Status: Everything looks good
* Full backup completed
* Restore started on EC2
FAILURE #1: RESTORE IS TOO SLOW
At T-4 hours, restore is only 40% complete.
What this means: You will miss your migration window.
Root Cause:
* Not enough IOPS on EC2
* Backup not split into enough files
Correct Response
Option A (Best if early enough)
* Stop restore
* Increase disk IOPS
* Restart restore
Option B (If too late)
Continue restore and:
* Delay migration window
* Inform stakeholders immediately
Wrong Move
* Ignoring the delay and hoping it finishes
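Before choosing between the options, find out how late you actually are. A quick sketch (standard DMV columns) to estimate the running restore's remaining time:

```sql
-- Check progress and ETA of the running RESTORE
SELECT session_id,
       command,
       percent_complete,
       estimated_completion_time / 60000.0 AS est_minutes_remaining
FROM sys.dm_exec_requests
WHERE command LIKE 'RESTORE%';
```

If est_minutes_remaining pushes you past the window, pick Option A or B now, not later.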
T-60 MINUTES (9:00 PM)
Status: Restore completed, log shipping running
Everything seems fine.
FAILURE #2: LOG SHIPPING SILENTLY FAILED
You check logs:
```sql
SELECT TOP (1) destination_database_name, restore_date
FROM msdb.dbo.restorehistory
ORDER BY restore_date DESC;
```
Last log applied = 30 minutes ago
What this means
EC2 is 30 minutes behind production
Root Cause:
* Copy job failed
* Network interruption
* Disk full on EC2
Correct Response:
Step 1: STOP CUTOVER PLAN
Do NOT proceed.
Step 2: Identify Missing Logs
Check source backup folder.
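To see exactly where the chain stopped, compare the last restored log on EC2 against the .trn files in the source folder. A sketch using the msdb history tables:

```sql
-- On EC2: the most recent log backup that was actually restored
SELECT TOP (1)
       rh.restore_date,
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.restorehistory AS rh
JOIN msdb.dbo.backupset AS bs
  ON rh.backup_set_id = bs.backup_set_id
JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
WHERE rh.restore_type = 'L'  -- 'L' = transaction log restore
ORDER BY rh.restore_date DESC;
```

Every .trn file newer than backup_finish_date is missing and must be re-copied and applied in order.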
Step 3: Re-copy logs
```powershell
robocopy G:\Backup \\EC2-SQL\Backup *.trn
```
Step 4: Apply logs manually
```sql
RESTORE LOG YourDB FROM DISK = 'missing.trn' WITH NORECOVERY;
```
Step 5: Confirm sync is current
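A sketch to confirm the gap is closed: run this on the source and compare the result with the last restore on EC2. The two timestamps should be within one log-shipping interval of each other.

```sql
-- On the source: the most recent log backup taken for the database
SELECT TOP (1) backup_finish_date
FROM msdb.dbo.backupset
WHERE database_name = 'YourDB'
  AND type = 'L'  -- 'L' = transaction log backup
ORDER BY backup_finish_date DESC;
```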
Wrong Move:
* Proceeding with cutover while behind; this can cause data loss
T-15 MINUTES (9:45 PM)
Status: Sync restored, all good
FAILURE #3: ACTIVE TRANSACTIONS WON’T STOP
You try to freeze writes.
```sql
ALTER DATABASE YourDB SET READ_ONLY;
```
It hangs.
Root Cause:
* Long-running transactions still active
Correct Response:
Step 1: Identify blocking sessions
```sql
SELECT session_id, blocking_session_id, command, wait_type
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```
Step 2: Kill long transactions (carefully)
```sql
KILL <session_id>;  -- replace with the blocking session's id
```
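Alternatively, instead of killing sessions one by one, SQL Server can roll back all open transactions for you as part of the state change:

```sql
-- Roll back open transactions immediately, then set the DB read-only
ALTER DATABASE YourDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;
```

Use this only if you have confirmed the blockers are safe to roll back; an in-flight financial batch may be better left to finish.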
Step 3: Confirm zero active writes
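A sketch to verify that nothing is still writing, using the standard transaction DMVs:

```sql
-- Open transactions still attached to the database
SELECT st.session_id,
       dt.database_transaction_begin_time
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions AS st
  ON dt.transaction_id = st.transaction_id
WHERE dt.database_id = DB_ID('YourDB');
```

Zero rows means you are clear to proceed.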
Wrong Move:
* Forcing cutover without stopping transactions, which can cause inconsistent data.
T-5 MINUTES (9:55 PM)
Status: Ready for final log backup
FAILURE #4: FINAL LOG BACKUP FAILS
```sql
BACKUP LOG YourDB TO DISK = 'final_tail.trn' WITH NORECOVERY;
```
Error:
```plaintext
BACKUP LOG cannot be performed because there is no current database backup
```
Root Cause: someone switched recovery model to SIMPLE earlier
CRITICAL SITUATION:
* You cannot continue the log chain
* You cannot guarantee data consistency
Correct Response:
Option A (Best)
Abort the migration:
* Re-enable the application on the source
* Switch the source DB back to READ_WRITE
Option B (If business forces migration)
* Take FULL backup immediately
```sql
BACKUP DATABASE YourDB TO DISK = 'emergency_full.bak';
```
* Restore on EC2 WITH RECOVERY
* Accept downtime with risk
Wrong Move:
* Ignoring error and continuing
T-0 (10:00 PM CUTOVER)
Status: You proceed successfully
FAILURE #5: APPLICATION FAILS
Users report:
* Login failures
* “User not found”
Root Cause:
* Missing SQL logins (SID mismatch)
Correct Response:
Fix orphan users
```sql
ALTER USER username WITH LOGIN = loginname;
```
Or recreate logins with SID
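Recreating a login with its original SID avoids the orphan entirely. A sketch (the SID and password shown are placeholders; copy the real SID from the source server):

```sql
-- On the source: capture the login's SID
SELECT name, sid
FROM sys.server_principals
WHERE name = 'loginname';

-- On EC2: recreate the login with the same SID so the DB user maps automatically
CREATE LOGIN loginname
WITH PASSWORD = 'ChangeMe!123',  -- placeholder; use the real credential
     SID = 0x0105...;            -- placeholder; paste the SID from the source
```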
Wrong Move:
* Restarting SQL Server blindly
T+10 MINUTES
FAILURE #6: DATA MISMATCH FOUND
Finance team says: “Totals don’t match”
Root Cause:
* Final log not applied correctly
* Writes occurred during cutover
CRITICAL DECISION POINT
Correct Response:
If source DB still intact:
1. Take new log backup:
```sql
BACKUP LOG YourDB TO DISK = 'fix.trn';
```
2. Apply to EC2:
```sql
RESTORE LOG YourDB FROM DISK = 'fix.trn' WITH RECOVERY;
```
If not fixable: ROLLBACK
Wrong Move:
* Ignoring the mismatch, which can quickly become a business disaster
T+20 MINUTES
FAILURE #7: PERFORMANCE COLLAPSE
System is online but:
* Queries slow
* CPU 100%
Root Cause:
* Missing statistics
* Different execution plans
Correct Response:
```sql
EXEC sp_updatestats;
```
```sql
ALTER INDEX ALL ON YourTable REBUILD;
```
Wrong Move:
* Increasing instance size immediately without diagnosis
T+30 MINUTES
FINAL DECISION POINT
If everything stable: declare success
If major issue persists: execute rollback
FULL ROLLBACK SIMULATION
---
Step 1: Redirect application back
* Change DNS or connection string
Step 2: Enable writes on source
```sql
ALTER DATABASE YourDB SET READ_WRITE;
```
Step 3: Notify stakeholders
Step 4: Investigate root cause
LESSONS
1. Most failures happen BEFORE cutover
* Restore speed
* Log sync
2. Most dangerous moment
* Final 10 minutes
3. Biggest risks
* Broken log chain
* Data inconsistency
* Missing logins
4. Golden rule
NEVER proceed if unsure
---
HOW TO PRACTICE
You can practice this safely:
Lab Practice
1. Create test DB (100GB+)
2. Run migration
3. Introduce failures:
* Delete a log file
* Break log chain
* Kill network
* Change recovery model
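For example, Failure #4 can be reproduced in the lab in seconds (assuming a test database named TestDB):

```sql
-- Break the log chain: flip the recovery model to SIMPLE and back
ALTER DATABASE TestDB SET RECOVERY SIMPLE;
ALTER DATABASE TestDB SET RECOVERY FULL;
-- BACKUP LOG now fails until a new FULL backup re-seeds the chain
```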
Goal: Train your reaction, not just your knowledge.
---
A real DBA is judged not by how smoothly things go, but by how well they recover when things break.