Wednesday, May 6, 2026

Live Migration Simulation War Room with Different Failure Scenarios

Live Migration Simulation War Room with Different Failure Scenarios

(Migrating From On-prem SQL Server To SQL Server on AWS EC2)


This is designed like a live migration war room scenarios, where things go wrong at specific minutes and you must react correctly.


Scenario Setup


* Database: 25TB (mission-critical financial system)


* Method: Backup + Restore + Log Shipping


* Downtime target: 20 minutes


* Migration window: Saturday 10:00 PM




SIMULATION MODE


We’ll simulate time exactly like a real migration.


You are the DBA. this article tell you what happens.


Your job: understand the correct response.



T-6 HOURS (4:00 PM)



Status: Everything looks good


* Full backup completed


* Restore started on EC2



FAILURE #1: RESTORE IS TOO SLOW


At T-4 hours, restore is only 40% complete.


What this means: You will miss your migration window.


Root Cause:


* Not enough IOPS on EC2


* Backup not split into enough files


Correct Response


Option A (Best if early enough)


* Stop restore


* Increase disk IOPS


* Restart restore


Option B (If too late)


Continue restore and:


* Delay migration window


* Inform stakeholders immediately


Wrong Move


* Ignoring the delay and hoping it finishes



T-60 MINUTES (9:00 PM)


Status: Restore completed, log shipping running


Everything seems fine.



FAILURE #2: LOG SHIPPING SILENTLY FAILED


You check logs:


```sql


SELECT * FROM msdb.dbo.restorehistory;

```


Last log applied = 30 minutes ago


What this means


EC2 is 30 minutes behind production


Root Cause:


* Copy job failed


* Network interruption


* Disk full on EC2



Correct Response:


Step 1: STOP CUTOVER PLAN


Do NOT proceed.



Step 2: Identify Missing Logs


Check source backup folder.



Step 3: Re-copy logs


```powershell


robocopy G:\Backup \\EC2-SQL\Backup *.trn

```


Step 4: Apply logs manually


```sql


RESTORE LOG YourDB FROM DISK = 'missing.trn' WITH NORECOVERY;

```



Step 5: Confirm sync is current



 Wrong Move:


* Proceeding with cutover while behind, that can cause data loss



T-15 MINUTES (9:45 PM)



Status: Sync restored, all good



FAILURE #3: ACTIVE TRANSACTIONS WON’T STOP


You try to freeze writes.


```sql


ALTER DATABASE YourDB SET READ_ONLY;

```


It hangs.



Root Cause:


* Long-running transactions still active



Correct Response:



Step 1: Identify blocking sessions


```sql


SELECT session_id, blocking_session_id

FROM sys.dm_exec_requests;

```



Step 2: Kill long transactions (carefully)


```sql


KILL <session_id>;

```



Step 3: Confirm zero active writes


Wrong Move:


* Forcing cutover without stopping transactions, which can cause inconsistent data.



T-5 MINUTES (9:55 PM)


Status: Ready for final log backup


FAILURE #4: FINAL LOG BACKUP FAILS


```sql


BACKUP LOG YourDB

```


Error:


```plaintext


BACKUP LOG cannot be performed because there is no current database backup

```


Root Cause: someone switched recovery model to SIMPLE earlier



CRITICAL SITUATION:


*You cannot continue log chain

*You cannot guarantee data consistency


Correct Response:


Option A (Best)


Abort migration, which means


* Re-enable application


* Switch DB back to READ_WRITE


Option B (If business forces migration)


* Take FULL backup immediately


```sql


BACKUP DATABASE YourDB TO DISK = 'emergency_full.bak';

```


* Restore on EC2 WITH RECOVERY


* Accept downtime with risk



Wrong Move:


* Ignoring error and continuing



T-0 (10:00 PM CUTOVER)


Status: You proceed successfully



FAILURE #5: APPLICATION FAILS


Users report:


* Login failures


* “User not found”


Root Cause:


* Missing SQL logins (SID mismatch)



Correct Response:


Fix orphan users


```sql


ALTER USER username WITH LOGIN = loginname;

```



Or recreate logins with SID


Wrong Move:


* Restarting SQL Server blindly




T+10 MINUTES



FAILURE #6: DATA MISMATCH FOUND


Finance team says: “Totals don’t match”



Root Cause:


* Final log not applied correctly


* Writes occurred during cutover


CRITICAL DECISION POINT



Correct Response:


If source DB still intact:


1. Take new log backup:


```sql


BACKUP LOG YourDB TO DISK = 'fix.trn';

```


2. Apply to EC2:


```sql


RESTORE LOG YourDB FROM DISK = 'fix.trn' WITH RECOVERY;

```



If not fixable: ROLLBACK



Wrong Move:


* Ignoring mismatch can become a business disaster



T+20 MINUTES


FAILURE #7: PERFORMANCE COLLAPSE


System is online but:


* Queries slow


* CPU 100%



Root Cause:


* Missing statistics


* Different execution plans


Correct Response:


```sql


EXEC sp_updatestats;

```


```sql


ALTER INDEX ALL ON YourTable REBUILD;

```



Wrong Move:


* Increasing instance size immediately without diagnosis



T+30 MINUTES


FINAL DECISION POINT



If everything stable: declare success



If major issue persists: execute rollback



FULL ROLLBACK SIMULATION


---


Step 1: Redirect application back


* Change DNS or connection string



Step 2: Enable writes on source


```sql


ALTER DATABASE YourDB SET READ_WRITE;

```



Step 3: Notify stakeholders



Step 4: Investigate root cause



LESSONS 



1. Most failures happen BEFORE cutover


* Restore speed


* Log sync



2. Most dangerous moment


* Final 10 minutes



3. Biggest risks


* Broken log chain


* Data inconsistency


* Missing logins



4. Golden rule


NEVER proceed if unsure


---


HOW TO PRACTICE 


You can practice this safely:



Lab Practice


1. Create test DB (100GB+)


2. Run migration


3. Introduce failures:


   * Delete a log file


   * Break log chain


   * Kill network


   * Change recovery model


Goal: Train your reaction, not just knowledge


---


The real DBAs is not judged by how smoothly things go but how well they recover when things break.

No comments:

Post a Comment

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR  FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM --- OVERALL STRUCTURE Breaking execu...