Making-sense of the Evolution of Data, and Database Technology: Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10

Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)

More importantly:

*How to detect Failures early

*How to recover safely

*How to prevent Failures next time

No theory—this is what breaks in real life.

SCENARIO 1: TRANSACTION LOG CHAIN BROKEN

What Happens:

During log shipping or continuous backup:

* Someone runs a FULL backup outside your process

* Or switches recovery model (FULL → SIMPLE)

Result: You can no longer restore logs on EC2

Symptoms

On restore:

```sql

RESTORE LOG YourDB

FROM DISK = 'file.trn';

```

Error:

```

This log backup cannot be applied because it is out of sequence

```

Root Cause

* Log chain continuity is broken

* LSN mismatch

Recovery (Real Fix)

Option 1 (Best): Restart Sync

1. Take new full backup:

```sql

BACKUP DATABASE YourDB TO DISK = 'new_full.bak' WITH COMPRESSION;

```

2. Restore again on EC2:

```sql

RESTORE DATABASE YourDB

FROM DISK = 'new_full.bak'

WITH NORECOVERY;

```

3. Resume log backups

Option 2 (Emergency)

If downtime is acceptable:

* Take final full backup

* Restore WITH RECOVERY

* Switch immediately

Prevention

* Restrict backup permissions

* Use dedicated backup jobs only

* Monitor LSN continuity

SCENARIO 2: MISSING LOG FILE DURING MIGRATION

What Happens

One `.trn` file fails to copy.

Symptoms

```sql

RESTORE LOG YourDB

```

Error:

The log in this backup set begins at LSN X, which is too recent

```

Root Cause

* Missing log file in sequence

Recovery

Step 1: Identify Missing File

Check restore history:

```sql

SELECT * FROM msdb.dbo.restorehistory

ORDER BY restore_date DESC;

```

Step 2: Re-copy Missing File

Use:

```powershell

robocopy G:\Backup \\EC2\Backup missing_file.trn

```

Step 3: Restore Again

Prevention

* Never delete `.trn` files until applied

* Use checksum validation

* Use automated tracking

SCENARIO 3: RESTORE TAKES TOO LONG (10–100TB)

What Happens

Restore runs for:

* 10+ hours

* 24+ hours

Root Cause

* Low IOPS

* Single backup file

* Small instance size

Recovery (During Migration)

You cannot speed up a running restore much.

Best move:

* Let it finish

* Optimize next run

Prevention (CRITICAL)

Use Multi-File Backup

```sql

BACKUP DATABASE YourDB

TO DISK = 'file1.bak', DISK = 'file2.bak', DISK = 'file3.bak'

WITH COMPRESSION;

```

Increase AWS Storage Performance

* Use io2 or high IOPS gp3

* Pre-provision IOPS

Use Larger EC2 Instance

* Memory + CPU matters

SCENARIO 4: DISK FULL ON EC2

What Happens

Restore fails midway:

```

There is not enough space on disk

```

Recovery

Step 1: Add New EBS Volume

* Attach to EC2

* Format disk

Step 2: Move Files

```sql

ALTER DATABASE YourDB

MODIFY FILE (NAME = YourDB_Data, FILENAME = 'NewPath\YourDB.mdf');

```

Step 3: Restart Restore

Prevention

* Always calculate: DB size × 1.5 (minimum space)

SCENARIO 5: APPLICATION FAILS AFTER CUTOVER

What Happens

Migration succeeded, but:

* App errors

* Login failures

* Queries break

Root Causes

* Missing logins

* Orphan users

* Connection strings not updated

Recovery

Fix Logins

```sql

ALTER USER username WITH LOGIN = loginname;

```

Recreate Logins with SID

Use login script from source.

Fix Connection String

Point app to EC2 hostname.

Prevention

* Test application BEFORE cutover

* Script all logins

SCENARIO 6: DATA MISMATCH AFTER MIGRATION

What Happens

* Row counts differ

* Financial data incorrect

Root Causes

* Missed final log backup

* Writes during cutover

Recovery

If Source Still Running

1. Take final log backup:

```sql

BACKUP LOG YourDB TO DISK = 'final.trn';

```

2. Apply to EC2:

```sql

RESTORE LOG YourDB FROM DISK = 'final.trn' WITH RECOVERY;

```

If Already Switched

Critical situation:

* Compare data

* Manually reconcile

* Possibly rollback

Prevention

* Freeze writes before final backup

* Validate row counts

SCENARIO 7: AWS NETWORK BOTTLENECK

What Happens

* Transfer extremely slow

* Migration takes days

Root Cause

* Limited bandwidth

* No Direct Connect

Recovery

* Switch to AWS DataSync

* Use parallel copy

Prevention

* Test network speed early

* Use multiple streams

SCENARIO 8: ALWAYS ON FAILOVER DOES NOT WORK

What Happens

Failover fails:

```

The availability group is not synchronized

```

Root Cause

* Secondary not synced

* Network issue

Recovery

Check Sync State

```sql

SELECT synchronization_state_desc

FROM sys.dm_hadr_database_replica_states;

```

Fix:

* Resume data movement

* Re-seed database if needed

Prevention

* Monitor sync continuously

* Use synchronous commit for critical DBs

SCENARIO 9: HIGH CPU AFTER MIGRATION

What Happens

System becomes slow after moving to EC2.

Root Cause

* Missing indexes

* Old statistics

* Different hardware profile

Recovery

```sql

EXEC sp_updatestats;

```

```sql

ALTER INDEX ALL ON YourTable REBUILD;

```

Prevention

* Run tuning after migration

* Capture baseline before migration

SCENARIO 10: TOTAL MIGRATION FAILURE (ROLLBACK)

What Happens

Everything goes wrong:

* Data mismatch

* App failure

* Performance collapse

Recovery (Your Lifeline)

Rollback plan

Step 1: Switch Application Back

* Point to on-prem SQL Server

Step 2: Keep EC2 for analysis

DO NOT delete it.

Step 3: Investigate

* Logs

* Data differences

* Performance metrics

Prevention

* Always keep source system intact

* Never delete original until stable

MASTER FAILURE RECOVERY MINDSET

When something fails:

Step 1: STOP

Do not rush fixes.

Step 2: IDENTIFY

* What failed?

* Where?

* Why?

Step 3: ISOLATE

* Data issue?

* Network?

* SQL Server?

Step 4: RECOVER

* Use backups

* Reapply logs

* Rollback if needed

Step 5: DOCUMENT

* What happened

* Fix applied

* Prevention next time

---

For 10TB–100TB migrations, expect:

* Something WILL fail

* The difference is: Prepared DBA will have smooth recovery whereas unprepared DBA will face disaster.

Making-sense of the Evolution of Data, and Database Technology

Thursday, May 7, 2026

Real-world Failure Scenarios During large OnPrem SQL Server to Aws EC2 Migrations (especially 10–100TB)

No comments:

Post a Comment

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM

Report Abuse

Labels