Live Migration Simulation War Room with Different Failure Scenarios
(Migrating From On-prem SQL Server To SQL Server on AWS EC2)
This is designed as a live migration war room simulation: things go wrong at specific minutes, and you must react correctly.
Scenario Setup
* Database: 25TB (mission-critical financial system)
* Method: Backup + Restore + Log Shipping
* Downtime target: 20 minutes
* Migration window: Saturday 10:00 PM
SIMULATION MODE
We’ll simulate time exactly like a real migration.
You are the DBA. This article tells you what happens at each point.
Your job: understand the correct response.
T-6 HOURS (4:00 PM)
Status: Everything looks good
* Full backup completed
* Restore started on EC2
FAILURE #1: RESTORE IS TOO SLOW
At T-4 hours, restore is only 40% complete.
What this means: You will miss your migration window.
Root Cause:
* Not enough IOPS on EC2
* Backup not split into enough files
Correct Response
Option A (Best if early enough)
* Stop restore
* Increase disk IOPS
* Restart restore
Option B (If too late)
Continue restore and:
* Delay migration window
* Inform stakeholders immediately
Wrong Move
* Ignoring the delay and hoping it finishes
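Before choosing between the options, find out how late you actually are. A quick sketch (standard DMV columns) to estimate the running restore's remaining time:

```sql
-- Check progress and ETA of the running RESTORE
SELECT session_id,
       command,
       percent_complete,
       estimated_completion_time / 60000.0 AS est_minutes_remaining
FROM sys.dm_exec_requests
WHERE command LIKE 'RESTORE%';
```

If est_minutes_remaining pushes you past the window, pick Option A or B now, not later.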
T-60 MINUTES (9:00 PM)
Status: Restore completed, log shipping running
Everything seems fine.
FAILURE #2: LOG SHIPPING SILENTLY FAILED
You check logs:
```sql
SELECT TOP (1) destination_database_name, restore_date
FROM msdb.dbo.restorehistory
ORDER BY restore_date DESC;
```
Last log applied = 30 minutes ago
What this means
EC2 is 30 minutes behind production
Root Cause:
* Copy job failed
* Network interruption
* Disk full on EC2
Correct Response:
Step 1: STOP CUTOVER PLAN
Do NOT proceed.
Step 2: Identify Missing Logs
Check source backup folder.
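To see exactly where the chain stopped, compare the last restored log on EC2 against the .trn files in the source folder. A sketch using the msdb history tables:

```sql
-- On EC2: the most recent log backup that was actually restored
SELECT TOP (1)
       rh.restore_date,
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.restorehistory AS rh
JOIN msdb.dbo.backupset AS bs
  ON rh.backup_set_id = bs.backup_set_id
JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
WHERE rh.restore_type = 'L'  -- 'L' = transaction log restore
ORDER BY rh.restore_date DESC;
```

Every .trn file newer than backup_finish_date is missing and must be re-copied and applied in order.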
Step 3: Re-copy logs
```powershell
robocopy G:\Backup \\EC2-SQL\Backup *.trn
```
Step 4: Apply logs manually
```sql
RESTORE LOG YourDB FROM DISK = 'missing.trn' WITH NORECOVERY;
```
Step 5: Confirm sync is current
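A sketch to confirm the gap is closed: run this on the source and compare the result with the last restore on EC2. The two timestamps should be within one log-shipping interval of each other.

```sql
-- On the source: the most recent log backup taken for the database
SELECT TOP (1) backup_finish_date
FROM msdb.dbo.backupset
WHERE database_name = 'YourDB'
  AND type = 'L'  -- 'L' = transaction log backup
ORDER BY backup_finish_date DESC;
```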
Wrong Move:
* Proceeding with cutover while behind; this can cause data loss
T-15 MINUTES (9:45 PM)
Status: Sync restored, all good
FAILURE #3: ACTIVE TRANSACTIONS WON’T STOP
You try to freeze writes.
```sql
ALTER DATABASE YourDB SET READ_ONLY;
```
It hangs.
Root Cause:
* Long-running transactions still active
Correct Response:
Step 1: Identify blocking sessions
```sql
SELECT session_id, blocking_session_id, command, wait_type
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```
Step 2: Kill long transactions (carefully)
```sql
KILL <session_id>;  -- replace with the blocking session's id
```
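Alternatively, instead of killing sessions one by one, SQL Server can roll back all open transactions for you as part of the state change:

```sql
-- Roll back open transactions immediately, then set the DB read-only
ALTER DATABASE YourDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;
```

Use this only if you have confirmed the blockers are safe to roll back; an in-flight financial batch may be better left to finish.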
Step 3: Confirm zero active writes
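A sketch to verify that nothing is still writing, using the standard transaction DMVs:

```sql
-- Open transactions still attached to the database
SELECT st.session_id,
       dt.database_transaction_begin_time
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions AS st
  ON dt.transaction_id = st.transaction_id
WHERE dt.database_id = DB_ID('YourDB');
```

Zero rows means you are clear to proceed.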
Wrong Move:
* Forcing cutover without stopping transactions, which can cause inconsistent data.
T-5 MINUTES (9:55 PM)
Status: Ready for final log backup
FAILURE #4: FINAL LOG BACKUP FAILS
```sql
BACKUP LOG YourDB TO DISK = 'final_tail.trn' WITH NORECOVERY;
```
Error:
```plaintext
BACKUP LOG cannot be performed because there is no current database backup
```
Root Cause: someone switched recovery model to SIMPLE earlier
CRITICAL SITUATION:
* You cannot continue the log chain
* You cannot guarantee data consistency
Correct Response:
Option A (Best)
Abort the migration:
* Re-enable the application on the source
* Switch the source DB back to READ_WRITE
Option B (If business forces migration)
* Take FULL backup immediately
```sql
BACKUP DATABASE YourDB TO DISK = 'emergency_full.bak';
```
* Restore on EC2 WITH RECOVERY
* Accept downtime with risk
Wrong Move:
* Ignoring error and continuing
T-0 (10:00 PM CUTOVER)
Status: You proceed successfully
FAILURE #5: APPLICATION FAILS
Users report:
* Login failures
* “User not found”
Root Cause:
* Missing SQL logins (SID mismatch)
Correct Response:
Fix orphan users
```sql
ALTER USER username WITH LOGIN = loginname;
```
Or recreate logins with SID
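Recreating a login with its original SID avoids the orphan entirely. A sketch (the SID and password shown are placeholders; copy the real SID from the source server):

```sql
-- On the source: capture the login's SID
SELECT name, sid
FROM sys.server_principals
WHERE name = 'loginname';

-- On EC2: recreate the login with the same SID so the DB user maps automatically
CREATE LOGIN loginname
WITH PASSWORD = 'ChangeMe!123',  -- placeholder; use the real credential
     SID = 0x0105...;            -- placeholder; paste the SID from the source
```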
Wrong Move:
* Restarting SQL Server blindly
T+10 MINUTES
FAILURE #6: DATA MISMATCH FOUND
Finance team says: “Totals don’t match”
Root Cause:
* Final log not applied correctly
* Writes occurred during cutover
CRITICAL DECISION POINT
Correct Response:
If source DB still intact:
1. Take new log backup:
```sql
BACKUP LOG YourDB TO DISK = 'fix.trn';
```
2. Apply to EC2:
```sql
RESTORE LOG YourDB FROM DISK = 'fix.trn' WITH RECOVERY;
```
If not fixable: ROLLBACK
Wrong Move:
* Ignoring the mismatch, which can quickly become a business disaster
T+20 MINUTES
FAILURE #7: PERFORMANCE COLLAPSE
System is online but:
* Queries slow
* CPU 100%
Root Cause:
* Missing statistics
* Different execution plans
Correct Response:
```sql
EXEC sp_updatestats;
```
```sql
ALTER INDEX ALL ON YourTable REBUILD;
```
Wrong Move:
* Increasing instance size immediately without diagnosis
T+30 MINUTES
FINAL DECISION POINT
If everything stable: declare success
If major issue persists: execute rollback
FULL ROLLBACK SIMULATION
---
Step 1: Redirect application back
* Change DNS or connection string
Step 2: Enable writes on source
```sql
ALTER DATABASE YourDB SET READ_WRITE;
```
Step 3: Notify stakeholders
Step 4: Investigate root cause
LESSONS
1. Most failures happen BEFORE cutover
* Restore speed
* Log sync
2. Most dangerous moment
* Final 10 minutes
3. Biggest risks
* Broken log chain
* Data inconsistency
* Missing logins
4. Golden rule
NEVER proceed if unsure
---
HOW TO PRACTICE
You can practice this safely:
Lab Practice
1. Create test DB (100GB+)
2. Run migration
3. Introduce failures:
* Delete a log file
* Break log chain
* Kill network
* Change recovery model
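For example, Failure #4 can be reproduced in the lab in seconds (assuming a test database named TestDB):

```sql
-- Break the log chain: flip the recovery model to SIMPLE and back
ALTER DATABASE TestDB SET RECOVERY SIMPLE;
ALTER DATABASE TestDB SET RECOVERY FULL;
-- BACKUP LOG now fails until a new FULL backup re-seeds the chain
```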
Goal: Train your reaction, not just your knowledge.
---
A real DBA is judged not by how smoothly things go, but by how well they recover when things break.