Wednesday, May 6, 2026

Live Migration Simulation War Room with Different Failure Scenarios

Live Migration Simulation War Room with Different Failure Scenarios

(Migrating From On-prem SQL Server To SQL Server on AWS EC2)


This is designed like a live migration war room scenarios, where things go wrong at specific minutes and you must react correctly.


Scenario Setup


* Database: 25TB (mission-critical financial system)


* Method: Backup + Restore + Log Shipping


* Downtime target: 20 minutes


* Migration window: Saturday 10:00 PM




SIMULATION MODE


We’ll simulate time exactly like a real migration.


You are the DBA. this article tell you what happens.


Your job: understand the correct response.



T-6 HOURS (4:00 PM)



Status: Everything looks good


* Full backup completed


* Restore started on EC2



FAILURE #1: RESTORE IS TOO SLOW


At T-4 hours, restore is only 40% complete.


What this means: You will miss your migration window.


Root Cause:


* Not enough IOPS on EC2


* Backup not split into enough files


Correct Response


Option A (Best if early enough)


* Stop restore


* Increase disk IOPS


* Restart restore


Option B (If too late)


Continue restore and:


* Delay migration window


* Inform stakeholders immediately


Wrong Move


* Ignoring the delay and hoping it finishes



T-60 MINUTES (9:00 PM)


Status: Restore completed, log shipping running


Everything seems fine.



FAILURE #2: LOG SHIPPING SILENTLY FAILED


You check logs:


```sql


SELECT * FROM msdb.dbo.restorehistory;

```


Last log applied = 30 minutes ago


What this means


EC2 is 30 minutes behind production


Root Cause:


* Copy job failed


* Network interruption


* Disk full on EC2



Correct Response:


Step 1: STOP CUTOVER PLAN


Do NOT proceed.



Step 2: Identify Missing Logs


Check source backup folder.



Step 3: Re-copy logs


```powershell


robocopy G:\Backup \\EC2-SQL\Backup *.trn

```


Step 4: Apply logs manually


```sql


RESTORE LOG YourDB FROM DISK = 'missing.trn' WITH NORECOVERY;

```



Step 5: Confirm sync is current



 Wrong Move:


* Proceeding with cutover while behind, that can cause data loss



T-15 MINUTES (9:45 PM)



Status: Sync restored, all good



FAILURE #3: ACTIVE TRANSACTIONS WON’T STOP


You try to freeze writes.


```sql


ALTER DATABASE YourDB SET READ_ONLY;

```


It hangs.



Root Cause:


* Long-running transactions still active



Correct Response:



Step 1: Identify blocking sessions


```sql


SELECT session_id, blocking_session_id

FROM sys.dm_exec_requests;

```



Step 2: Kill long transactions (carefully)


```sql


KILL <session_id>;

```



Step 3: Confirm zero active writes


Wrong Move:


* Forcing cutover without stopping transactions, which can cause inconsistent data.



T-5 MINUTES (9:55 PM)


Status: Ready for final log backup


FAILURE #4: FINAL LOG BACKUP FAILS


```sql


BACKUP LOG YourDB

```


Error:


```plaintext


BACKUP LOG cannot be performed because there is no current database backup

```


Root Cause: someone switched recovery model to SIMPLE earlier



CRITICAL SITUATION:


*You cannot continue log chain

*You cannot guarantee data consistency


Correct Response:


Option A (Best)


Abort migration, which means


* Re-enable application


* Switch DB back to READ_WRITE


Option B (If business forces migration)


* Take FULL backup immediately


```sql


BACKUP DATABASE YourDB TO DISK = 'emergency_full.bak';

```


* Restore on EC2 WITH RECOVERY


* Accept downtime with risk



Wrong Move:


* Ignoring error and continuing



T-0 (10:00 PM CUTOVER)


Status: You proceed successfully



FAILURE #5: APPLICATION FAILS


Users report:


* Login failures


* “User not found”


Root Cause:


* Missing SQL logins (SID mismatch)



Correct Response:


Fix orphan users


```sql


ALTER USER username WITH LOGIN = loginname;

```



Or recreate logins with SID


Wrong Move:


* Restarting SQL Server blindly




T+10 MINUTES



FAILURE #6: DATA MISMATCH FOUND


Finance team says: “Totals don’t match”



Root Cause:


* Final log not applied correctly


* Writes occurred during cutover


CRITICAL DECISION POINT



Correct Response:


If source DB still intact:


1. Take new log backup:


```sql


BACKUP LOG YourDB TO DISK = 'fix.trn';

```


2. Apply to EC2:


```sql


RESTORE LOG YourDB FROM DISK = 'fix.trn' WITH RECOVERY;

```



If not fixable: ROLLBACK



Wrong Move:


* Ignoring mismatch can become a business disaster



T+20 MINUTES


FAILURE #7: PERFORMANCE COLLAPSE


System is online but:


* Queries slow


* CPU 100%



Root Cause:


* Missing statistics


* Different execution plans


Correct Response:


```sql


EXEC sp_updatestats;

```


```sql


ALTER INDEX ALL ON YourTable REBUILD;

```



Wrong Move:


* Increasing instance size immediately without diagnosis



T+30 MINUTES


FINAL DECISION POINT



If everything stable: declare success



If major issue persists: execute rollback



FULL ROLLBACK SIMULATION


---


Step 1: Redirect application back


* Change DNS or connection string



Step 2: Enable writes on source


```sql


ALTER DATABASE YourDB SET READ_WRITE;

```



Step 3: Notify stakeholders



Step 4: Investigate root cause



LESSONS 



1. Most failures happen BEFORE cutover


* Restore speed


* Log sync



2. Most dangerous moment


* Final 10 minutes



3. Biggest risks


* Broken log chain


* Data inconsistency


* Missing logins



4. Golden rule


NEVER proceed if unsure


---


HOW TO PRACTICE 


You can practice this safely:



Lab Practice


1. Create test DB (100GB+)


2. Run migration


3. Introduce failures:


   * Delete a log file


   * Break log chain


   * Kill network


   * Change recovery model


Goal: Train your reaction, not just knowledge


---


The real DBAs is not judged by how smoothly things go but how well they recover when things break.

No comments:

Post a Comment

Live Migration Simulation War Room with Different Failure Scenarios

Live Migration Simulation War Room with Different Failure Scenarios ( Migrating From On-prem SQL Server To SQL Server on AWS EC2 ) This is d...