Thursday, May 7, 2026

Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)

 Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)


More importantly:


*How to detect Failures early

*How to recover safely

*How to prevent Failures next time


No theory—this is what breaks in real life.



SCENARIO 1: TRANSACTION LOG CHAIN BROKEN



What Happens:


During log shipping or continuous backup:


* Someone runs a FULL backup outside your process


* Or switches recovery model (FULL → SIMPLE)


Result: You can no longer restore logs on EC2



Symptoms


On restore:


```sql


RESTORE LOG YourDB 

FROM DISK = 'file.trn';

```


Error:


```

This log backup cannot be applied because it is out of sequence

```



Root Cause


* Log chain continuity is broken

* LSN mismatch



Recovery (Real Fix)



Option 1 (Best): Restart Sync


1. Take new full backup:


```sql


BACKUP DATABASE YourDB TO DISK = 'new_full.bak' WITH COMPRESSION;

```


2. Restore again on EC2:


```sql


RESTORE DATABASE YourDB 

FROM DISK = 'new_full.bak'

WITH NORECOVERY;

```


3. Resume log backups



Option 2 (Emergency)


If downtime is acceptable:


* Take final full backup

* Restore WITH RECOVERY

* Switch immediately



Prevention


* Restrict backup permissions

* Use dedicated backup jobs only

* Monitor LSN continuity



SCENARIO 2: MISSING LOG FILE DURING MIGRATION



What Happens


One `.trn` file fails to copy.



Symptoms


```sql


RESTORE LOG YourDB

```


Error:



The log in this backup set begins at LSN X, which is too recent

```



Root Cause


* Missing log file in sequence



Recovery


Step 1: Identify Missing File


Check restore history:


```sql


SELECT * FROM msdb.dbo.restorehistory

ORDER BY restore_date DESC;

```



Step 2: Re-copy Missing File


Use:


```powershell


robocopy G:\Backup \\EC2\Backup missing_file.trn

```



Step 3: Restore Again



Prevention


* Never delete `.trn` files until applied

* Use checksum validation

* Use automated tracking



SCENARIO 3: RESTORE TAKES TOO LONG (10–100TB)



What Happens


Restore runs for:


* 10+ hours

* 24+ hours



Root Cause


* Low IOPS

* Single backup file

* Small instance size



Recovery (During Migration)


You cannot speed up a running restore much.


Best move:


* Let it finish

* Optimize next run



Prevention (CRITICAL)


Use Multi-File Backup


```sql


BACKUP DATABASE YourDB

TO DISK = 'file1.bak', DISK = 'file2.bak', DISK = 'file3.bak'

WITH COMPRESSION;

```



Increase AWS Storage Performance


* Use io2 or high IOPS gp3

* Pre-provision IOPS



Use Larger EC2 Instance


* Memory + CPU matters



SCENARIO 4: DISK FULL ON EC2



What Happens


Restore fails midway:


```

There is not enough space on disk

```



Recovery


Step 1: Add New EBS Volume


* Attach to EC2

* Format disk



Step 2: Move Files


```sql


ALTER DATABASE YourDB 

MODIFY FILE (NAME = YourDB_Data, FILENAME = 'NewPath\YourDB.mdf');

```



Step 3: Restart Restore



Prevention


* Always calculate: DB size × 1.5 (minimum space)




SCENARIO 5: APPLICATION FAILS AFTER CUTOVER


What Happens


Migration succeeded, but:


* App errors

* Login failures

* Queries break



Root Causes


* Missing logins

* Orphan users

* Connection strings not updated



Recovery


Fix Logins


```sql


ALTER USER username WITH LOGIN = loginname;

```



Recreate Logins with SID


Use login script from source.


Fix Connection String


Point app to EC2 hostname.



Prevention


* Test application BEFORE cutover

* Script all logins




SCENARIO 6: DATA MISMATCH AFTER MIGRATION



What Happens


* Row counts differ

* Financial data incorrect



Root Causes


* Missed final log backup

* Writes during cutover



Recovery



If Source Still Running


1. Take final log backup:


```sql


BACKUP LOG YourDB TO DISK = 'final.trn';

```


2. Apply to EC2:


```sql


RESTORE LOG YourDB FROM DISK = 'final.trn' WITH RECOVERY;

```



If Already Switched



Critical situation:


* Compare data

* Manually reconcile

* Possibly rollback



Prevention


* Freeze writes before final backup

* Validate row counts




SCENARIO 7: AWS NETWORK BOTTLENECK



What Happens


* Transfer extremely slow

* Migration takes days



Root Cause


* Limited bandwidth

* No Direct Connect



Recovery


* Switch to AWS DataSync

* Use parallel copy



Prevention


* Test network speed early

* Use multiple streams



SCENARIO 8: ALWAYS ON FAILOVER DOES NOT WORK



What Happens


Failover fails:


```

The availability group is not synchronized

```



Root Cause


* Secondary not synced

* Network issue



Recovery


Check Sync State


```sql


SELECT synchronization_state_desc

FROM sys.dm_hadr_database_replica_states;

```



Fix:


* Resume data movement

* Re-seed database if needed



Prevention


* Monitor sync continuously

* Use synchronous commit for critical DBs



SCENARIO 9: HIGH CPU AFTER MIGRATION



What Happens


System becomes slow after moving to EC2.



Root Cause


* Missing indexes

* Old statistics

* Different hardware profile



Recovery


```sql


EXEC sp_updatestats;

```


```sql


ALTER INDEX ALL ON YourTable REBUILD;

```



Prevention


* Run tuning after migration

* Capture baseline before migration



SCENARIO 10: TOTAL MIGRATION FAILURE (ROLLBACK)



What Happens


Everything goes wrong:


* Data mismatch

* App failure

* Performance collapse


Recovery (Your Lifeline)


Rollback plan


Step 1: Switch Application Back


* Point to on-prem SQL Server



Step 2: Keep EC2 for analysis


DO NOT delete it.



Step 3: Investigate


* Logs

* Data differences

* Performance metrics



Prevention


* Always keep source system intact

* Never delete original until stable



MASTER FAILURE RECOVERY MINDSET


When something fails:



Step 1: STOP


Do not rush fixes.



Step 2: IDENTIFY


* What failed?

* Where?

* Why?



Step 3: ISOLATE


* Data issue?

* Network?

* SQL Server?



Step 4: RECOVER


* Use backups

* Reapply logs

* Rollback if needed



Step 5: DOCUMENT


* What happened

* Fix applied

* Prevention next time


---


For 10TB–100TB migrations, expect:


* Something WILL fail


* The difference is: Prepared DBA will have smooth recovery whereas unprepared DBA will face disaster.


No comments:

Post a Comment

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR  FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM --- OVERALL STRUCTURE Breaking execu...