Thursday, May 7, 2026

Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)

 Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)


More importantly:


*How to detect Failures early

*How to recover safely

*How to prevent Failures next time


No theory—this is what breaks in real life.



SCENARIO 1: TRANSACTION LOG CHAIN BROKEN



What Happens:


During log shipping or continuous backup:


* Someone runs a FULL backup outside your process


* Or switches recovery model (FULL → SIMPLE)


Result: You can no longer restore logs on EC2



Symptoms


On restore:


```sql


RESTORE LOG YourDB 

FROM DISK = 'file.trn';

```


Error:


```

This log backup cannot be applied because it is out of sequence

```



Root Cause


* Log chain continuity is broken

* LSN mismatch



Recovery (Real Fix)



Option 1 (Best): Restart Sync


1. Take new full backup:


```sql


BACKUP DATABASE YourDB TO DISK = 'new_full.bak' WITH COMPRESSION;

```


2. Restore again on EC2:


```sql


RESTORE DATABASE YourDB 

FROM DISK = 'new_full.bak'

WITH NORECOVERY;

```


3. Resume log backups



Option 2 (Emergency)


If downtime is acceptable:


* Take final full backup

* Restore WITH RECOVERY

* Switch immediately



Prevention


* Restrict backup permissions

* Use dedicated backup jobs only

* Monitor LSN continuity



SCENARIO 2: MISSING LOG FILE DURING MIGRATION



What Happens


One `.trn` file fails to copy.



Symptoms


```sql


RESTORE LOG YourDB

```


Error:



The log in this backup set begins at LSN X, which is too recent

```



Root Cause


* Missing log file in sequence



Recovery


Step 1: Identify Missing File


Check restore history:


```sql


SELECT * FROM msdb.dbo.restorehistory

ORDER BY restore_date DESC;

```



Step 2: Re-copy Missing File


Use:


```powershell


robocopy G:\Backup \\EC2\Backup missing_file.trn

```



Step 3: Restore Again



Prevention


* Never delete `.trn` files until applied

* Use checksum validation

* Use automated tracking



SCENARIO 3: RESTORE TAKES TOO LONG (10–100TB)



What Happens


Restore runs for:


* 10+ hours

* 24+ hours



Root Cause


* Low IOPS

* Single backup file

* Small instance size



Recovery (During Migration)


You cannot speed up a running restore much.


Best move:


* Let it finish

* Optimize next run



Prevention (CRITICAL)


Use Multi-File Backup


```sql


BACKUP DATABASE YourDB

TO DISK = 'file1.bak', DISK = 'file2.bak', DISK = 'file3.bak'

WITH COMPRESSION;

```



Increase AWS Storage Performance


* Use io2 or high IOPS gp3

* Pre-provision IOPS



Use Larger EC2 Instance


* Memory + CPU matters



SCENARIO 4: DISK FULL ON EC2



What Happens


Restore fails midway:


```

There is not enough space on disk

```



Recovery


Step 1: Add New EBS Volume


* Attach to EC2

* Format disk



Step 2: Move Files


```sql


ALTER DATABASE YourDB 

MODIFY FILE (NAME = YourDB_Data, FILENAME = 'NewPath\YourDB.mdf');

```



Step 3: Restart Restore



Prevention


* Always calculate: DB size × 1.5 (minimum space)




SCENARIO 5: APPLICATION FAILS AFTER CUTOVER


What Happens


Migration succeeded, but:


* App errors

* Login failures

* Queries break



Root Causes


* Missing logins

* Orphan users

* Connection strings not updated



Recovery


Fix Logins


```sql


ALTER USER username WITH LOGIN = loginname;

```



Recreate Logins with SID


Use login script from source.


Fix Connection String


Point app to EC2 hostname.



Prevention


* Test application BEFORE cutover

* Script all logins




SCENARIO 6: DATA MISMATCH AFTER MIGRATION



What Happens


* Row counts differ

* Financial data incorrect



Root Causes


* Missed final log backup

* Writes during cutover



Recovery



If Source Still Running


1. Take final log backup:


```sql


BACKUP LOG YourDB TO DISK = 'final.trn';

```


2. Apply to EC2:


```sql


RESTORE LOG YourDB FROM DISK = 'final.trn' WITH RECOVERY;

```



If Already Switched



Critical situation:


* Compare data

* Manually reconcile

* Possibly rollback



Prevention


* Freeze writes before final backup

* Validate row counts




SCENARIO 7: AWS NETWORK BOTTLENECK



What Happens


* Transfer extremely slow

* Migration takes days



Root Cause


* Limited bandwidth

* No Direct Connect



Recovery


* Switch to AWS DataSync

* Use parallel copy



Prevention


* Test network speed early

* Use multiple streams



SCENARIO 8: ALWAYS ON FAILOVER DOES NOT WORK



What Happens


Failover fails:


```

The availability group is not synchronized

```



Root Cause


* Secondary not synced

* Network issue



Recovery


Check Sync State


```sql


SELECT synchronization_state_desc

FROM sys.dm_hadr_database_replica_states;

```



Fix:


* Resume data movement

* Re-seed database if needed



Prevention


* Monitor sync continuously

* Use synchronous commit for critical DBs



SCENARIO 9: HIGH CPU AFTER MIGRATION



What Happens


System becomes slow after moving to EC2.



Root Cause


* Missing indexes

* Old statistics

* Different hardware profile



Recovery


```sql


EXEC sp_updatestats;

```


```sql


ALTER INDEX ALL ON YourTable REBUILD;

```



Prevention


* Run tuning after migration

* Capture baseline before migration



SCENARIO 10: TOTAL MIGRATION FAILURE (ROLLBACK)



What Happens


Everything goes wrong:


* Data mismatch

* App failure

* Performance collapse


Recovery (Your Lifeline)


Rollback plan


Step 1: Switch Application Back


* Point to on-prem SQL Server



Step 2: Keep EC2 for analysis


DO NOT delete it.



Step 3: Investigate


* Logs

* Data differences

* Performance metrics



Prevention


* Always keep source system intact

* Never delete original until stable



MASTER FAILURE RECOVERY MINDSET


When something fails:



Step 1: STOP


Do not rush fixes.



Step 2: IDENTIFY


* What failed?

* Where?

* Why?



Step 3: ISOLATE


* Data issue?

* Network?

* SQL Server?



Step 4: RECOVER


* Use backups

* Reapply logs

* Rollback if needed



Step 5: DOCUMENT


* What happened

* Fix applied

* Prevention next time


---


For 10TB–100TB migrations, expect:


* Something WILL fail


* The difference is: Prepared DBA will have smooth recovery whereas unprepared DBA will face disaster.


No comments:

Post a Comment

Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)

 Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB) More importantly: *How to detect Fa...