Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)
More importantly:
*How to detect Failures early
*How to recover safely
*How to prevent Failures next time
No theory—this is what breaks in real life.
SCENARIO 1: TRANSACTION LOG CHAIN BROKEN
What Happens:
During log shipping or continuous backup:
* Someone runs a FULL backup outside your process
* Or switches recovery model (FULL → SIMPLE)
Result: You can no longer restore logs on EC2
Symptoms
On restore:
```sql
RESTORE LOG YourDB
FROM DISK = 'file.trn';
```
Error:
```
This log backup cannot be applied because it is out of sequence
```
Root Cause
* Log chain continuity is broken
* LSN mismatch
Recovery (Real Fix)
Option 1 (Best): Restart Sync
1. Take new full backup:
```sql
BACKUP DATABASE YourDB TO DISK = 'new_full.bak' WITH COMPRESSION;
```
2. Restore again on EC2:
```sql
RESTORE DATABASE YourDB
FROM DISK = 'new_full.bak'
WITH NORECOVERY;
```
3. Resume log backups
Option 2 (Emergency)
If downtime is acceptable:
* Take final full backup
* Restore WITH RECOVERY
* Switch immediately
Prevention
* Restrict backup permissions
* Use dedicated backup jobs only
* Monitor LSN continuity
SCENARIO 2: MISSING LOG FILE DURING MIGRATION
What Happens
One `.trn` file fails to copy.
Symptoms
```sql
RESTORE LOG YourDB
```
Error:
The log in this backup set begins at LSN X, which is too recent
```
Root Cause
* Missing log file in sequence
Recovery
Step 1: Identify Missing File
Check restore history:
```sql
SELECT * FROM msdb.dbo.restorehistory
ORDER BY restore_date DESC;
```
Step 2: Re-copy Missing File
Use:
```powershell
robocopy G:\Backup \\EC2\Backup missing_file.trn
```
Step 3: Restore Again
Prevention
* Never delete `.trn` files until applied
* Use checksum validation
* Use automated tracking
SCENARIO 3: RESTORE TAKES TOO LONG (10–100TB)
What Happens
Restore runs for:
* 10+ hours
* 24+ hours
Root Cause
* Low IOPS
* Single backup file
* Small instance size
Recovery (During Migration)
You cannot speed up a running restore much.
Best move:
* Let it finish
* Optimize next run
Prevention (CRITICAL)
Use Multi-File Backup
```sql
BACKUP DATABASE YourDB
TO DISK = 'file1.bak', DISK = 'file2.bak', DISK = 'file3.bak'
WITH COMPRESSION;
```
Increase AWS Storage Performance
* Use io2 or high IOPS gp3
* Pre-provision IOPS
Use Larger EC2 Instance
* Memory + CPU matters
SCENARIO 4: DISK FULL ON EC2
What Happens
Restore fails midway:
```
There is not enough space on disk
```
Recovery
Step 1: Add New EBS Volume
* Attach to EC2
* Format disk
Step 2: Move Files
```sql
ALTER DATABASE YourDB
MODIFY FILE (NAME = YourDB_Data, FILENAME = 'NewPath\YourDB.mdf');
```
Step 3: Restart Restore
Prevention
* Always calculate: DB size × 1.5 (minimum space)
SCENARIO 5: APPLICATION FAILS AFTER CUTOVER
What Happens
Migration succeeded, but:
* App errors
* Login failures
* Queries break
Root Causes
* Missing logins
* Orphan users
* Connection strings not updated
Recovery
Fix Logins
```sql
ALTER USER username WITH LOGIN = loginname;
```
Recreate Logins with SID
Use login script from source.
Fix Connection String
Point app to EC2 hostname.
Prevention
* Test application BEFORE cutover
* Script all logins
SCENARIO 6: DATA MISMATCH AFTER MIGRATION
What Happens
* Row counts differ
* Financial data incorrect
Root Causes
* Missed final log backup
* Writes during cutover
Recovery
If Source Still Running
1. Take final log backup:
```sql
BACKUP LOG YourDB TO DISK = 'final.trn';
```
2. Apply to EC2:
```sql
RESTORE LOG YourDB FROM DISK = 'final.trn' WITH RECOVERY;
```
If Already Switched
Critical situation:
* Compare data
* Manually reconcile
* Possibly rollback
Prevention
* Freeze writes before final backup
* Validate row counts
SCENARIO 7: AWS NETWORK BOTTLENECK
What Happens
* Transfer extremely slow
* Migration takes days
Root Cause
* Limited bandwidth
* No Direct Connect
Recovery
* Switch to AWS DataSync
* Use parallel copy
Prevention
* Test network speed early
* Use multiple streams
SCENARIO 8: ALWAYS ON FAILOVER DOES NOT WORK
What Happens
Failover fails:
```
The availability group is not synchronized
```
Root Cause
* Secondary not synced
* Network issue
Recovery
Check Sync State
```sql
SELECT synchronization_state_desc
FROM sys.dm_hadr_database_replica_states;
```
Fix:
* Resume data movement
* Re-seed database if needed
Prevention
* Monitor sync continuously
* Use synchronous commit for critical DBs
SCENARIO 9: HIGH CPU AFTER MIGRATION
What Happens
System becomes slow after moving to EC2.
Root Cause
* Missing indexes
* Old statistics
* Different hardware profile
Recovery
```sql
EXEC sp_updatestats;
```
```sql
ALTER INDEX ALL ON YourTable REBUILD;
```
Prevention
* Run tuning after migration
* Capture baseline before migration
SCENARIO 10: TOTAL MIGRATION FAILURE (ROLLBACK)
What Happens
Everything goes wrong:
* Data mismatch
* App failure
* Performance collapse
Recovery (Your Lifeline)
Rollback plan
Step 1: Switch Application Back
* Point to on-prem SQL Server
Step 2: Keep EC2 for analysis
DO NOT delete it.
Step 3: Investigate
* Logs
* Data differences
* Performance metrics
Prevention
* Always keep source system intact
* Never delete original until stable
MASTER FAILURE RECOVERY MINDSET
When something fails:
Step 1: STOP
Do not rush fixes.
Step 2: IDENTIFY
* What failed?
* Where?
* Why?
Step 3: ISOLATE
* Data issue?
* Network?
* SQL Server?
Step 4: RECOVER
* Use backups
* Reapply logs
* Rollback if needed
Step 5: DOCUMENT
* What happened
* Fix applied
* Prevention next time
---
For 10TB–100TB migrations, expect:
* Something WILL fail
* The difference is: Prepared DBA will have smooth recovery whereas unprepared DBA will face disaster.