Thursday, May 7, 2026

Real-world Failure Scenarios During Large On-prem SQL Server to AWS EC2 Migrations (especially 10–100TB)



More importantly:


* How to detect failures early

* How to recover safely

* How to prevent failures next time


No theory—this is what breaks in real life.



SCENARIO 1: TRANSACTION LOG CHAIN BROKEN



What Happens:


During log shipping or continuous backup:


* Someone runs a FULL backup outside your process


* Or switches recovery model (FULL → SIMPLE)


Result: You can no longer restore logs on EC2



Symptoms


On restore:


```sql


RESTORE LOG YourDB 

FROM DISK = 'file.trn';

```


Error:


```

This log backup cannot be applied because it is out of sequence

```



Root Cause


* Log chain continuity is broken

* LSN mismatch



Recovery (Real Fix)



Option 1 (Best): Restart Sync


1. Take new full backup:


```sql


BACKUP DATABASE YourDB TO DISK = 'new_full.bak' WITH COMPRESSION;

```


2. Restore again on EC2:


```sql


RESTORE DATABASE YourDB 

FROM DISK = 'new_full.bak'

WITH NORECOVERY;

```


3. Resume log backups



Option 2 (Emergency)


If downtime is acceptable:


* Take final full backup

* Restore WITH RECOVERY

* Switch immediately



Prevention


* Restrict backup permissions

* Use dedicated backup jobs only

* Monitor LSN continuity
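The "monitor LSN continuity" idea can be sketched in a few lines. This is a hypothetical illustration, not part of SQL Server: it assumes you have pulled `first_lsn`/`last_lsn` pairs from `msdb.dbo.backupset`, ordered by backup time, and checks that each log backup starts exactly where the previous one ended.

```python
# Hypothetical sketch of an LSN-continuity check. Input: (first_lsn, last_lsn)
# pairs from msdb.dbo.backupset, ordered by backup_finish_date. A log backup
# whose first_lsn does not equal the previous last_lsn means the chain is
# broken and a new full backup is required.

def find_chain_break(backups):
    """backups: ordered list of (first_lsn, last_lsn) tuples.
    Returns the index of the first out-of-sequence backup, or None."""
    for i in range(1, len(backups)):
        if backups[i][0] != backups[i - 1][1]:
            return i
    return None
```

Running this after every log backup (or on a schedule) lets you detect a broken chain hours before the restore on EC2 fails.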



SCENARIO 2: MISSING LOG FILE DURING MIGRATION



What Happens


One `.trn` file fails to copy.



Symptoms


```sql


RESTORE LOG YourDB

```


Error:


```
The log in this backup set begins at LSN X, which is too recent
```



Root Cause


* Missing log file in sequence



Recovery


Step 1: Identify Missing File


Check restore history:


```sql


SELECT * FROM msdb.dbo.restorehistory

ORDER BY restore_date DESC;

```



Step 2: Re-copy Missing File


Use:


```powershell


robocopy G:\Backup \\EC2\Backup missing_file.trn

```



Step 3: Restore Again



Prevention


* Never delete `.trn` files until applied

* Use checksum validation

* Use automated tracking
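Checksum validation of the copy step can be sketched as below. This is an illustrative Python sketch (not a SQL Server feature): `BACKUP ... WITH CHECKSUM` guards the backup contents, while this guards the file transfer, by refusing to treat a `.trn` file as delivered until the hashes on both sides match.

```python
import hashlib

# Sketch: hash a backup file on source and target and compare, so a
# truncated or corrupted copy is caught before the restore fails on it.

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def copy_is_valid(source_path, target_path):
    return file_sha256(source_path) == file_sha256(target_path)
```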



SCENARIO 3: RESTORE TAKES TOO LONG (10–100TB)



What Happens


Restore runs for:


* 10+ hours

* 24+ hours



Root Cause


* Low IOPS

* Single backup file

* Small instance size



Recovery (During Migration)


You cannot speed up a running restore much.


Best move:


* Let it finish

* Optimize next run



Prevention (CRITICAL)


Use Multi-File Backup


```sql


BACKUP DATABASE YourDB

TO DISK = 'file1.bak', DISK = 'file2.bak', DISK = 'file3.bak'

WITH COMPRESSION;

```



Increase AWS Storage Performance


* Use io2 or high IOPS gp3

* Pre-provision IOPS



Use Larger EC2 Instance


* Memory + CPU matters
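A rough way to reason about the prevention points above is a back-of-envelope restore-time estimate. The sketch below assumes throughput scales with the number of parallel backup files until disk IOPS become the bottleneck, which is a simplification; always measure on your own hardware.

```python
# Back-of-envelope restore-time estimate (illustrative, not a predictor).
# db_size_tb: database size in TB; mb_per_sec: measured single-stream
# restore throughput; streams: number of parallel backup files.

def restore_hours(db_size_tb, mb_per_sec, streams=1):
    size_mb = db_size_tb * 1024 * 1024
    return size_mb / (mb_per_sec * streams) / 3600.0
```

For example, a 25 TB database at 500 MB/s single-stream takes roughly 14.6 hours; splitting into 3 files with matching IOPS cuts that to about 4.9 hours, which is why multi-file backups matter at this scale.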



SCENARIO 4: DISK FULL ON EC2



What Happens


Restore fails midway:


```

There is not enough space on disk

```



Recovery


Step 1: Add New EBS Volume


* Attach to EC2

* Format disk



Step 2: Move Files


If the database is already attached, relocate its files with:


```sql


ALTER DATABASE YourDB 

MODIFY FILE (NAME = YourDB_Data, FILENAME = 'NewPath\YourDB.mdf');

```


If the restore itself failed midway, restart it and use `WITH MOVE` to place the files on the new volume instead.



Step 3: Restart Restore



Prevention


* Always calculate: DB size × 1.5 (minimum space)
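The "DB size × 1.5" rule can be turned into a quick pre-flight check, sketched below (an illustrative helper, not a tool from SQL Server or AWS):

```python
# Pre-flight space check following the DB size x 1.5 rule above.

def required_free_gb(db_size_gb, safety_factor=1.5):
    return db_size_gb * safety_factor

def volume_ok(db_size_gb, free_gb, safety_factor=1.5):
    """True when the target volume has enough free space to restore."""
    return free_gb >= required_free_gb(db_size_gb, safety_factor)
```

Run the check before starting the restore: a 10 TB database needs at least 15 TB free on the target volume.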




SCENARIO 5: APPLICATION FAILS AFTER CUTOVER


What Happens


Migration succeeded, but:


* App errors

* Login failures

* Queries break



Root Causes


* Missing logins

* Orphan users

* Connection strings not updated



Recovery


Fix Logins


```sql


ALTER USER username WITH LOGIN = loginname;

```



Recreate Logins with SID


Use login script from source.


Fix Connection String


Point app to EC2 hostname.



Prevention


* Test application BEFORE cutover

* Script all logins




SCENARIO 6: DATA MISMATCH AFTER MIGRATION



What Happens


* Row counts differ

* Financial data incorrect



Root Causes


* Missed final log backup

* Writes during cutover



Recovery



If Source Still Running


1. Take final log backup:


```sql


BACKUP LOG YourDB TO DISK = 'final.trn';

```


2. Apply to EC2:


```sql


RESTORE LOG YourDB FROM DISK = 'final.trn' WITH RECOVERY;

```



If Already Switched



Critical situation:


* Compare data

* Manually reconcile

* Possibly rollback



Prevention


* Freeze writes before final backup

* Validate row counts
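The row-count validation step can be sketched as follows. This is an illustrative helper: it assumes you have already collected per-table counts from source and target (for example via `sys.partitions`) into dictionaries, and it reports every table that is missing or mismatched.

```python
# Sketch of pre-cutover row-count validation: diff per-table counts
# gathered from source and target.

def row_count_mismatches(source_counts, target_counts):
    """Returns {table: (source_rows, target_rows)} for every table
    that is missing or has a different count on the target."""
    return {
        t: (n, target_counts.get(t))
        for t, n in source_counts.items()
        if target_counts.get(t) != n
    }
```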




SCENARIO 7: AWS NETWORK BOTTLENECK



What Happens


* Transfer extremely slow

* Migration takes days



Root Cause


* Limited bandwidth

* No Direct Connect



Recovery


* Switch to AWS DataSync

* Use parallel copy



Prevention


* Test network speed early

* Use multiple streams



SCENARIO 8: ALWAYS ON FAILOVER DOES NOT WORK



What Happens


Failover fails:


```

The availability group is not synchronized

```



Root Cause


* Secondary not synced

* Network issue



Recovery


Check Sync State


```sql


SELECT synchronization_state_desc

FROM sys.dm_hadr_database_replica_states;

```



Fix:


* Resume data movement

* Re-seed database if needed



Prevention


* Monitor sync continuously

* Use synchronous commit for critical DBs



SCENARIO 9: HIGH CPU AFTER MIGRATION



What Happens


System becomes slow after moving to EC2.



Root Cause


* Missing indexes

* Old statistics

* Different hardware profile



Recovery


```sql


EXEC sp_updatestats;

```


```sql


ALTER INDEX ALL ON YourTable REBUILD;

```



Prevention


* Run tuning after migration

* Capture baseline before migration



SCENARIO 10: TOTAL MIGRATION FAILURE (ROLLBACK)



What Happens


Everything goes wrong:


* Data mismatch

* App failure

* Performance collapse


Recovery (Your Lifeline)


Rollback plan


Step 1: Switch Application Back


* Point to on-prem SQL Server



Step 2: Keep EC2 for analysis


DO NOT delete it.



Step 3: Investigate


* Logs

* Data differences

* Performance metrics



Prevention


* Always keep source system intact

* Never delete original until stable



MASTER FAILURE RECOVERY MINDSET


When something fails:



Step 1: STOP


Do not rush fixes.



Step 2: IDENTIFY


* What failed?

* Where?

* Why?



Step 3: ISOLATE


* Data issue?

* Network?

* SQL Server?



Step 4: RECOVER


* Use backups

* Reapply logs

* Rollback if needed



Step 5: DOCUMENT


* What happened

* Fix applied

* Prevention next time


---


For 10TB–100TB migrations, expect:


* Something WILL fail


* The difference: a prepared DBA recovers smoothly; an unprepared DBA faces disaster.


Runbook for Production Migration from On-prem SQL Server to SQL Server on AWS EC2 with Minute-by-Minute Timing



Scenario Assumptions


* Source: On-prem SQL Server


* Target: SQL Server on Amazon EC2


* Method: Backup + Restore + Log Shipping


* Downtime target: 15–30 minutes


* Database size: 10TB–100TB


* Migration window: Weekend (example: Saturday night)




BIG PICTURE FLOW


* Days before → prepare and sync


* Final hour → freeze + final logs


* Minutes → cutover


* After → validate + stabilize



T-7 DAYS (ONE WEEK BEFORE)


Goal: Eliminate surprises


Tasks:


* Full inventory completed


* Test full backup + restore on EC2


* Measure restore time


* Test log shipping loop


* Validate application connectivity


Must Confirm:


* You can restore DB successfully on EC2

* Log shipping works continuously

* Performance is acceptable


T-1 DAY (FINAL PREPARATION)



Goal: Be 100% ready



Tasks:


* Notify stakeholders


* Confirm downtime window


* Freeze schema changes


* Verify backups


Run Final Pre-Check Script


```sql 


SELECT name, state_desc FROM sys.databases;

DBCC CHECKDB('YourDB') WITH NO_INFOMSGS;

```



T-6 HOURS (START INITIAL LOAD)



Goal: Get most data to EC2 before downtime



Step 1: Take Full Backup


```sql 


BACKUP DATABASE YourDB

TO DISK = 'G:\Backup\YourDB_1.bak',

   DISK = 'G:\Backup\YourDB_2.bak',

   DISK = 'G:\Backup\YourDB_3.bak'

WITH COMPRESSION, STATS = 5;

```



Step 2: Transfer Backup


```powershell 


robocopy G:\Backup \\EC2-SQL\Backup *.bak /MT:16

```



Step 3: Restore on EC2 (NORECOVERY)


```sql 


RESTORE DATABASE YourDB

FROM DISK = 'G:\Backup\YourDB_1.bak',

     DISK = 'G:\Backup\YourDB_2.bak',

     DISK = 'G:\Backup\YourDB_3.bak'

WITH NORECOVERY;

```



T-4 HOURS


Goal: Start continuous sync



Step: Start Log Shipping Loop


Every 5 minutes:


```sql 


BACKUP LOG YourDB TO DISK = 'G:\Backup\log.trn';

```


Copy + restore automatically.



Monitor:


* Logs applying correctly


* No gaps


* No errors


T-60 MINUTES (CRITICAL PHASE STARTS)



Goal: Prepare for cutover



Tasks:


* Notify users: “System going read-only soon”


* Ensure log shipping is current


* Validate EC2 database



Validation Script


```sql 


SELECT COUNT(*) FROM CriticalTable;

```


Compare with source.



T-30 MINUTES



Goal: Reduce final sync gap



Change Log Backup Frequency


* From every 5 min → every 1 min



Confirm:


* No backlog of logs


* EC2 nearly in sync



T-15 MINUTES (POINT OF NO RETURN NEAR)



Goal: Prepare final cutover



Tasks:


* Stop batch jobs


* Pause ETL processes


* Notify: “Final downtime starting”



T-10 MINUTES (WRITE FREEZE)



Goal: Stop all writes



Step 1: Disable Application Writes


Options:


* Stop application services


* Or set DB to read-only


---


```sql 

-- ROLLBACK IMMEDIATE rolls back open transactions so the freeze cannot hang
ALTER DATABASE YourDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;

```



Step 2: Verify No Active Transactions


```sql


SELECT * FROM sys.dm_tran_active_transactions;

```


T-5 MINUTES (FINAL SYNC)



Goal: Capture last changes



Step 1: Final Log Backup


```sql 


BACKUP LOG YourDB 

TO DISK = 'G:\Backup\final.trn'

WITH NORECOVERY;

```



Step 2: Copy Final Log


```powershell 


robocopy G:\Backup \\EC2-SQL\Backup final.trn

```



Step 3: Apply Final Log


```sql 


RESTORE LOG YourDB

FROM DISK = 'G:\Backup\final.trn'

WITH RECOVERY;

```



T-0 (CUTOVER MOMENT)


THIS IS THE SWITCH



Step 1: Bring EC2 Database Online


Already done with `WITH RECOVERY`




Step 2: Update Application Connection


Change:


* Connection string


* DNS alias (preferred)



Step 3: Enable Application


* Start services


* Allow connections



T+5 MINUTES


Goal: Smoke Test



Test:


* Login works


* Key queries run


* Application loads



SQL Test


```sql 


SELECT TOP 10 * FROM CriticalTable;

```



T+15 MINUTES


Goal: Validate system health



Run:


```sql 


DBCC CHECKDB('YourDB') WITH NO_INFOMSGS;

```


Check:


* CPU usage


* Disk latency


* Errors



T+30 MINUTES



Goal: Confirm success




Tasks


* Business validation (very important)


* Confirm no missing data


* Monitor performance



T+1 HOUR



Goal: Stabilization



Tasks


* Enable SQL Agent jobs


* Resume ETL


* Monitor logs



T+24 HOURS



Goal: Final confirmation



Tasks:


* Full backup on EC2


* Review performance


* Confirm no issues



ROLLBACK PLAN (MUST EXIST)


If something fails at T+5 or T+15:


Step 1: Redirect Application Back


* Point to on-prem SQL Server



Step 2: Re-enable Writes


```sql 


ALTER DATABASE YourDB SET READ_WRITE;

```



Step 3: Investigate


---



REAL-WORLD TIMING SUMMARY



Bulk Data Movement:


* Happens hours before (T-6h)



Sync Phase:


* Continuous (T-4h → T-10m)



Downtime Window:


Only: T-10 minutes → T+5 minutes



PRO TIPS 


Tip 1: Always Use DNS


Instead of changing app config: Point DNS to EC2



Tip 2: Practice Twice


* First run → learning


* Second run → timing



Tip 3: Record Everything


* Start time


* End time


* Issues



Tip 4: Have War Room


* DBA


* App team


* Network team


---



REALITY Check:


A successful migration is not just about moving data; it is about timing, coordination, and recovery readiness.


Wednesday, May 6, 2026

Live Migration Simulation War Room with Different Failure Scenarios


(Migrating From On-prem SQL Server To SQL Server on AWS EC2)


This is designed like a live migration war room scenario, where things go wrong at specific minutes and you must react correctly.


Scenario Setup


* Database: 25TB (mission-critical financial system)


* Method: Backup + Restore + Log Shipping


* Downtime target: 20 minutes


* Migration window: Saturday 10:00 PM




SIMULATION MODE


We’ll simulate time exactly like a real migration.


You are the DBA; this article tells you what happens.


Your job: understand the correct response.



T-6 HOURS (4:00 PM)



Status: Everything looks good


* Full backup completed


* Restore started on EC2



FAILURE #1: RESTORE IS TOO SLOW


At T-4 hours, restore is only 40% complete.


What this means: You will miss your migration window.


Root Cause:


* Not enough IOPS on EC2


* Backup not split into enough files


Correct Response


Option A (Best if early enough)


* Stop restore


* Increase disk IOPS


* Restart restore


Option B (If too late)


Continue restore and:


* Delay migration window


* Inform stakeholders immediately


Wrong Move


* Ignoring the delay and hoping it finishes



T-60 MINUTES (9:00 PM)


Status: Restore completed, log shipping running


Everything seems fine.



FAILURE #2: LOG SHIPPING SILENTLY FAILED


You check logs:


```sql


SELECT * FROM msdb.dbo.restorehistory;

```


Last log applied = 30 minutes ago


What this means


EC2 is 30 minutes behind production


Root Cause:


* Copy job failed


* Network interruption


* Disk full on EC2



Correct Response:


Step 1: STOP CUTOVER PLAN


Do NOT proceed.



Step 2: Identify Missing Logs


Check source backup folder.



Step 3: Re-copy logs


```powershell


robocopy G:\Backup \\EC2-SQL\Backup *.trn

```


Step 4: Apply logs manually


```sql


RESTORE LOG YourDB FROM DISK = 'missing.trn' WITH NORECOVERY;

```



Step 5: Confirm sync is current



 Wrong Move:


* Proceeding with cutover while behind, which can cause data loss



T-15 MINUTES (9:45 PM)



Status: Sync restored, all good



FAILURE #3: ACTIVE TRANSACTIONS WON’T STOP


You try to freeze writes.


```sql


ALTER DATABASE YourDB SET READ_ONLY;

```


It hangs.



Root Cause:


* Long-running transactions still active



Correct Response:



Step 1: Identify blocking sessions


```sql


SELECT session_id, blocking_session_id

FROM sys.dm_exec_requests;

```



Step 2: Kill long transactions (carefully)


```sql


KILL <session_id>;

```



Step 3: Confirm zero active writes


Wrong Move:


* Forcing cutover without stopping transactions, which can cause inconsistent data.



T-5 MINUTES (9:55 PM)


Status: Ready for final log backup


FAILURE #4: FINAL LOG BACKUP FAILS


```sql


BACKUP LOG YourDB

```


Error:


```plaintext


BACKUP LOG cannot be performed because there is no current database backup

```


Root Cause: someone switched recovery model to SIMPLE earlier



CRITICAL SITUATION:


* You cannot continue the log chain

* You cannot guarantee data consistency


Correct Response:


Option A (Best)


Abort migration, which means


* Re-enable application


* Switch DB back to READ_WRITE


Option B (If business forces migration)


* Take FULL backup immediately


```sql


BACKUP DATABASE YourDB TO DISK = 'emergency_full.bak';

```


* Restore on EC2 WITH RECOVERY


* Accept downtime with risk



Wrong Move:


* Ignoring error and continuing



T-0 (10:00 PM CUTOVER)


Status: You proceed successfully



FAILURE #5: APPLICATION FAILS


Users report:


* Login failures


* “User not found”


Root Cause:


* Missing SQL logins (SID mismatch)



Correct Response:


Fix orphan users


```sql


ALTER USER username WITH LOGIN = loginname;

```



Or recreate logins with SID


Wrong Move:


* Restarting SQL Server blindly




T+10 MINUTES



FAILURE #6: DATA MISMATCH FOUND


Finance team says: “Totals don’t match”



Root Cause:


* Final log not applied correctly


* Writes occurred during cutover


CRITICAL DECISION POINT



Correct Response:


If source DB still intact:


1. Take new log backup:


```sql


BACKUP LOG YourDB TO DISK = 'fix.trn';

```


2. Apply to EC2:


```sql


RESTORE LOG YourDB FROM DISK = 'fix.trn' WITH RECOVERY;

```



If not fixable: ROLLBACK



Wrong Move:


* Ignoring mismatch can become a business disaster



T+20 MINUTES


FAILURE #7: PERFORMANCE COLLAPSE


System is online but:


* Queries slow


* CPU 100%



Root Cause:


* Missing statistics


* Different execution plans


Correct Response:


```sql


EXEC sp_updatestats;

```


```sql


ALTER INDEX ALL ON YourTable REBUILD;

```



Wrong Move:


* Increasing instance size immediately without diagnosis



T+30 MINUTES


FINAL DECISION POINT



If everything stable: declare success



If major issue persists: execute rollback



FULL ROLLBACK SIMULATION


---


Step 1: Redirect application back


* Change DNS or connection string



Step 2: Enable writes on source


```sql


ALTER DATABASE YourDB SET READ_WRITE;

```



Step 3: Notify stakeholders



Step 4: Investigate root cause



LESSONS 



1. Most failures happen BEFORE cutover


* Restore speed


* Log sync



2. Most dangerous moment


* Final 10 minutes



3. Biggest risks


* Broken log chain


* Data inconsistency


* Missing logins



4. Golden rule


NEVER proceed if unsure


---


HOW TO PRACTICE 


You can practice this safely:



Lab Practice


1. Create test DB (100GB+)


2. Run migration


3. Introduce failures:


   * Delete a log file


   * Break log chain


   * Kill network


   * Change recovery model


Goal: Train your reaction, not just knowledge


---


A real DBA is not judged by how smoothly things go, but by how well they recover when things break.

BUILDING PRODUCTION MONITORING & ALERTING SYSTEM After Migrating From On-prem SQL Server To SQL Server on AWS EC2




This guide shows how to build a production-ready monitoring system that can:


* Detect problems early


* Alert the right people


* Help you diagnose fast


* Prevent outages



BIG PICTURE ARCHITECTURE


You are building 4 layers:


1. SQL Server Internal Monitoring (inside SQL Server)


2. OS + EC2 Monitoring (via AWS)


3. Central Logging & Alerting (SQL Agent + email + logs)


4. Response Automation (PowerShell + alerts)




WHAT YOU MUST MONITOR (CRITICAL SIGNALS)




DATABASE HEALTH:


* Database state (ONLINE/OFFLINE)


* Corruption


* Backup success/failure




PERFORMANCE:


* CPU usage


* Slow queries


* Blocking / deadlocks




STORAGE:


* Disk space


* IOPS latency




AVAILABILITY:


* Always On status


* Connection failures




PART 1: SQL SERVER MONITORING (CORE)




1.1 Create Monitoring Database


```sql 


CREATE DATABASE DBA_Monitoring;

GO

```


---


1.2 Table for Alerts


```sql 


USE DBA_Monitoring;


CREATE TABLE Alerts (

    AlertID INT IDENTITY(1,1),

    AlertType NVARCHAR(100),

    Message NVARCHAR(MAX),

    Severity INT,

    CreatedAt DATETIME DEFAULT GETDATE()

);

```




PART 2: CRITICAL MONITORING QUERIES




2.1 Check Database Status


```sql 


INSERT INTO DBA_Monitoring.dbo.Alerts (AlertType, Message, Severity)

SELECT 

    'Database Offline',

    name + ' is not ONLINE',

    1

FROM sys.databases

WHERE state_desc <> 'ONLINE';

```




2.2 Check Failed Backups


```sql 


INSERT INTO DBA_Monitoring.dbo.Alerts (AlertType, Message, Severity)

SELECT 

    'Backup Failure',

    'No backup in last 24 hours',

    1

WHERE NOT EXISTS (

    SELECT 1 FROM msdb.dbo.backupset

    WHERE backup_finish_date > DATEADD(HOUR, -24, GETDATE())

);

```




2.3 Check Disk Space


```sql 


EXEC xp_fixeddrives;

```


Alert if free space < 20%
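The 20% rule can also be checked from outside SQL Server. The sketch below uses Python's standard `shutil.disk_usage` as an illustrative cross-check alongside `xp_fixeddrives`; the threshold is the one stated above.

```python
import shutil

# Sketch of the "alert if free space < 20%" rule for the volume
# holding the given path.

def low_disk_space(path, threshold_pct=20.0):
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100.0
    return free_pct < threshold_pct
```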




2.4 Check Blocking


```sql 


SELECT blocking_session_id, session_id

FROM sys.dm_exec_requests

WHERE blocking_session_id <> 0;

```




2.5 Check Long Running Queries


```sql 


SELECT TOP 10

    total_elapsed_time / execution_count AS AvgTime,

    execution_count

FROM sys.dm_exec_query_stats

ORDER BY AvgTime DESC;

```




2.6 Deadlock Detection


```sql 


DBCC TRACEON (1222, -1);

```




PART 3: SQL AGENT ALERT JOB




3.1 Create Monitoring Job


```sql 


USE msdb;


EXEC sp_add_job @job_name = 'DBA_Monitoring_Job';

```


3.2 Add Step


```sql 


EXEC sp_add_jobstep

@job_name = 'DBA_Monitoring_Job',

@step_name = 'Run Checks',

@subsystem = 'TSQL',

@command = '

-- Placeholder procedure names; replace with your actual check scripts

EXEC dbo.Check_DatabaseStatus;

EXEC dbo.Check_BackupFailure;

';

```


3.3 Schedule Every 5 Minutes


```sql 


EXEC sp_add_schedule

@schedule_name = 'Every5Min',

@freq_type = 4,

@freq_subday_type = 4,

@freq_subday_interval = 5;

```




PART 4: EMAIL ALERTING SYSTEM


4.1 Configure Database Mail


```sql 


EXEC sp_configure 'Database Mail XPs', 1;

RECONFIGURE;

```


4.2 Send Alert Email


```sql 


EXEC msdb.dbo.sp_send_dbmail

@profile_name = 'DBA_Profile',

@recipients = 'dba@company.com',

@subject = 'SQL Alert',

@body = 'Critical issue detected';

```


4.3 Auto Email on Alert


```sql 


IF EXISTS (SELECT 1 FROM DBA_Monitoring.dbo.Alerts WHERE Severity = 1)

BEGIN

    EXEC msdb.dbo.sp_send_dbmail

    @profile_name = 'DBA_Profile',

    @recipients = 'dba@company.com',

    @subject = 'CRITICAL SQL ALERT',

    @body = 'Check DBA_Monitoring database immediately';

END

```




PART 5: AWS CLOUDWATCH MONITORING


5.1 Key Metrics to Enable


Monitor in AWS:


* CPUUtilization

* DiskReadOps / WriteOps

* NetworkIn / Out


5.2 CloudWatch Alarm Example


Trigger alert if:


* CPU > 80% for 5 minutes


5.3 Disk Alert


Trigger if:


* Free space < 15%




PART 6: POWERSHELL ALERT AUTOMATION


6.1 Check Disk Space Script


```powershell 


$drives = Get-PSDrive -PSProvider FileSystem


foreach ($drive in $drives) {

    if ($drive.Free -lt 20GB) {

        Write-Host "Low disk space on $($drive.Name)"

    }

}

```


6.2 Send Alert


```powershell 


Send-MailMessage -To "dba@company.com" `

-From "alert@company.com" `

-Subject "Disk Alert" `

-Body "Low disk space detected" `

-SmtpServer "smtp.server.com"

```




PART 7: ALWAYS ON MONITORING


7.1 Check Replica Health


```sql 


SELECT 

    replica_server_name,

    synchronization_state_desc

FROM sys.dm_hadr_database_replica_states;

```


Alert if: NOT SYNCHRONIZED




PART 8: ALERT SEVERITY MODEL


Severity Levels


* 1 (Critical) → Immediate action (DB down, corruption)


* 2 (Warning) → Investigate (slow queries)


* 3 (Info) → Logging only
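The severity model above maps naturally to alert routing. The sketch below is illustrative; the channel names (`page-oncall`, `email-dba`, `log-only`) are examples, not part of any standard.

```python
# The three-level severity model above expressed as a routing function.

def route_alert(severity):
    if severity == 1:
        return "page-oncall"   # Critical: immediate action (DB down, corruption)
    if severity == 2:
        return "email-dba"     # Warning: investigate (slow queries)
    return "log-only"          # Info: logging only
```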




PART 9: RESPONSE AUTOMATION


Example: Restart Failed Job


```powershell 


Invoke-Sqlcmd -Query "EXEC msdb.dbo.sp_start_job 'JobName'"

```


Example: Kill Blocking Session


```sql 


KILL <session_id>;

```




PART 10: DAILY DBA DASHBOARD QUERY



```sql 


SELECT TOP 50 *

FROM DBA_Monitoring.dbo.Alerts

ORDER BY CreatedAt DESC;

```


PRODUCTION CHECKLIST


You are production-ready if:



SQL Level


* Monitoring job running


* Alerts being logged


* Emails working


AWS Level


* CloudWatch alarms active

* Disk + CPU monitored



Response Level


* Scripts tested


* Alerts actionable


---


Most outages happen because there is no alert, the alert is ignored, or the alert is unclear.


War Room Script During The Actual Migration From On-prem SQL Server To SQL Server on AWS EC2



This separates a “technical person” from someone who can actually run a high-risk production migration. The template includes:


* who speaks


* who acts


* what to say / write


* when to pause, escalate, or rollback


With this script, you can:


* Lead a real production migration


* Speak with authority and clarity


* Handle failures without panic



Scenario:


* Migration: On-prem SQL Server → SQL Server on EC2


* Database: 10–100TB


* Downtime target: 20 minutes


* Environment: Mission-critical (finance / production)


WAR ROOM ROLES with CLEAR RESPONSIBILITIES


You must assign these BEFORE starting.


1. Migration Commander (YOU / LEAD DBA)


* Final decision maker


* Controls timeline


* Approves cutover / rollback



2. SQL Server Execution DBA


* Runs SQL scripts


* Monitors backup/restore/logs


3. Infrastructure Engineer


* EC2, storage, network


* Handles disk / IOPS / connectivity


4. Application Owner


* Validates application


* Confirms business functionality


5. Observer / Scribe


* Writes down:


  * timestamps


  * issues


  * decisions



WAR ROOM RULES (READ THIS OUT LOUD / COMMUNICATE CLEARLY AT START)


Opening Statement by Migration Commander


"This is a controlled production migration. Only the assigned person speaks during each step. No action is taken without confirmation. If any critical issue occurs, we pause immediately. If data consistency is at risk, we rollback without hesitation."



PHASE 1: T-30 MINUTES (PRE-CUTOVER)


Migration Commander:


"We are at T minus 30 minutes.

Confirm all systems are ready.

SQL DBA, give me sync status."


SQL DBA:


"Log shipping is active.

Last log applied is within 1 minute.

No errors detected."


Commander to Infra


"Infrastructure status?"



Infra Engineer:


"EC2 is stable.

Disk usage is within limits.

No network issues."


Commander Decision:


"Proceed to pre-cutover validation."



PHASE 2: T-15 MINUTES (PREPARE FREEZE)


Commander:


"Application team, prepare to stop writes.

Confirm when ready."


Application Owner:


"Application ready for write freeze.

No critical jobs running."


PHASE 3: T-10 MINUTES (WRITE FREEZE)



Commander:


"Execute write freeze now."


SQL DBA (after running):


"Database set to read-only.

Checking active transactions."


SQL DBA (confirmation):


"No active transactions remaining."



IF FAILURE (TRANSACTIONS NOT STOPPING)


SQL DBA:


"We have active blocking transactions.

Requesting permission to terminate sessions."


Commander:


"Approved. Terminate only long-running sessions.

Proceed carefully."



PHASE 4: T-5 MINUTES (FINAL SYNC)



Commander:


"Proceed with final log backup."


SQL DBA:


"Final log backup started."



SQL DBA (after backup):


"Final log backup completed.

Copying to EC2."


SQL DBA:


"Final log restored with recovery.

Target database is online."



FAILURE BRANCH (CRITICAL)



If final log fails



SQL DBA:


"Final log backup failed.

Log chain may be broken."



Commander (STRICT RESPONSE):


"Stop all cutover activities.

We are aborting migration.

Re-enable source database immediately."



PHASE 5: T-0 (CUTOVER)



Commander:


"We are at cutover point.

Switch application connections to EC2."



Application Owner:


"Connection updated.

Starting application services."



PHASE 6: T+5 MINUTES (SMOKE TEST)




Commander:


"Run smoke tests now.

Application team, report status."


Application Owner:


"Application is accessible.

Basic functions are working."



Commander:


"SQL DBA, give me Database performance status."



SQL DBA:


"Database performance normal.

No errors in logs."



FAILURE BRANCH (APP FAILURE)


Application Owner:


"Users cannot log in.

Authentication errors detected."


SQL DBA:


"Likely orphaned users or missing logins.

Fixing now."


Commander:


"You have 10 minutes to resolve.

If unresolved, we rollback."




PHASE 7: T+15 MINUTES (VALIDATION)




Commander:


"Business validation in progress.

Application team, confirm data integrity."




Application Owner:


"Data appears consistent.

Key reports match expectations."




FAILURE BRANCH (DATA MISMATCH)




Application Owner:


"Data mismatch detected in financial totals."




Commander (NO DELAY)


"Stop.

We are initiating rollback.

Switch application back to source immediately."




ROLLBACK SCRIPT




Commander:


"Rollback initiated.

Application team, revert connection now."




SQL DBA:


"Source database set to read-write.

System restored."




PHASE 8: T+30 MINUTES (SUCCESS DECLARATION)




Commander:


"Migration successful.

System stable for 30 minutes.

Proceed to monitoring phase."




CRITICAL COMMANDER DECISION RULES




RULE 1: NEVER GUESS


If unsure:


"Pause all actions.

We need confirmation before proceeding."




RULE 2: DATA MORE IMPORTANT THAN SPEED


If risk:


"We are prioritizing data integrity over timeline."




RULE 3: CLEAR TIMEBOX


Every issue:


"You have 10 minutes to resolve.

After that, we rollback."



Most migrations don’t fail because of technology but because of confusion,  poor communication, and no clear leader.

Tuesday, May 5, 2026

USE OF AUTOMATION IN THE MIGRATING OnPrem SQL Server to AWS EC2 With VLDBs



The automation uses:


* PowerShell → orchestration, file movement, scheduling


* SQL scripts → backup, restore, validation, HADR



Automate:


1. Full backup (multi-file, compressed)

2. Transaction log backups (loop)

3. File transfer (robocopy)

4. Restore (full + logs)

5. Continuous sync loop

6. Final cutover

7. Post-migration validation

8. Optional HADR setup (Always On)



BEFORE YOU RUN ANYTHING: customize these variables everywhere.


```powershell


$SourceServer = "ONPREM-SQL"

$TargetServer = "EC2-SQL"

$Database = "YourDB"

$BackupPath = "G:\Backup"

$TargetBackupPath = "\\EC2-SQL\Backup"

$DataPath = "D:\Data"

$LogPath = "E:\Logs"

```



PART 1: FULL BACKUP AUTOMATION (SQL)


Run on source (on-prem SQL Server)



1.1 Full Backup Script (Multi-file + Compression)


```sql


DECLARE @DBName NVARCHAR(100) = 'YourDB';

DECLARE @Path NVARCHAR(200) = 'G:\Backup\';

DECLARE @SQL NVARCHAR(MAX);


SET @SQL = '

BACKUP DATABASE [' + @DBName + ']

TO 

DISK = ''' + @Path + @DBName + '_1.bak'',

DISK = ''' + @Path + @DBName + '_2.bak'',

DISK = ''' + @Path + @DBName + '_3.bak'',

DISK = ''' + @Path + @DBName + '_4.bak''

WITH COMPRESSION, STATS = 5;

';


EXEC(@SQL);

```


PART 2: FILE TRANSFER AUTOMATION (PowerShell)


Run on source (on-prem SQL Server)



2.1 Copy Backup Files to EC2


```powershell


$Source = "G:\Backup"

$Destination = "\\EC2-SQL\Backup"


robocopy $Source $Destination *.bak /E /Z /MT:16 /R:3 /W:5 /LOG:C:\copy.log

```



PART 3: RESTORE FULL BACKUP (SQL)


Run on target (EC2 SQL Server)



3.1 Restore with NORECOVERY


```sql


RESTORE DATABASE [YourDB]

FROM 

DISK = 'G:\Backup\YourDB_1.bak',

DISK = 'G:\Backup\YourDB_2.bak',

DISK = 'G:\Backup\YourDB_3.bak',

DISK = 'G:\Backup\YourDB_4.bak'

WITH 

MOVE 'YourDB_Data' TO 'D:\Data\YourDB.mdf',

MOVE 'YourDB_Log' TO 'E:\Logs\YourDB.ldf',

NORECOVERY,

STATS = 5;

```


PART 4: LOG SHIPPING AUTOMATION (CONTINUOUS SYNC)


This is the core automation loop



4.1 SQL: Transaction Log Backup (Source)


```sql


DECLARE @DBName NVARCHAR(100) = 'YourDB';

DECLARE @Path NVARCHAR(200) = 'G:\Backup\';

DECLARE @FileName NVARCHAR(200);


SET @FileName = @Path + @DBName + '_log_' 

    + REPLACE(CONVERT(VARCHAR, GETDATE(), 120), ':', '-') + '.trn';


BACKUP LOG [YourDB]

TO DISK = @FileName

WITH COMPRESSION, STATS = 5;

```
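The file-naming convention above has a direct Python equivalent, shown here as an illustrative mirror: `CONVERT(VARCHAR, GETDATE(), 120)` yields `YYYY-MM-DD HH:MM:SS`, and the colons are replaced so the result is a valid file name.

```python
from datetime import datetime

# Python mirror of the T-SQL log-backup naming above. Style 120 is
# 'YYYY-MM-DD HH:MM:SS'; colons become dashes for a valid file name.

def log_backup_name(db_name, when):
    stamp = when.strftime("%Y-%m-%d %H-%M-%S")
    return f"{db_name}_log_{stamp}.trn"
```

A useful side effect: these names sort lexicographically in chronological order, which the restore loop below relies on.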



4.2 PowerShell: Continuous Copy + Restore Loop


Run this on a utility server or EC2


```powershell


while ($true)

{

    Write-Host "Starting sync cycle..."


    # Step 1: Backup log (remote execution)


    Invoke-Sqlcmd -ServerInstance $SourceServer -Query "

    BACKUP LOG [$Database]

    TO DISK = '$BackupPath\$($Database)_log.trn'

    WITH COMPRESSION;

    "


    # Step 2: Copy logs


    robocopy $BackupPath $TargetBackupPath *.trn /MT:16 /R:2 /W:2


    # Step 3: Restore logs



    $files = Get-ChildItem "$TargetBackupPath\*.trn"


    foreach ($file in $files)

    {

        $query = "

        RESTORE LOG [$Database]

        FROM DISK = '$($file.FullName)'

        WITH NORECOVERY;

        "


        Invoke-Sqlcmd -ServerInstance $TargetServer -Query $query


        # Move processed file


        Move-Item $file.FullName "$TargetBackupPath\processed\" -Force

    }


    Start-Sleep -Seconds 300  # 5 minutes

}

```
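The restore half of the loop above reduces to one ordering rule: logs must be applied strictly in sequence (the timestamped names sort correctly), and processing must stop at the first failure so no log is ever applied out of order. A minimal sketch of that logic, in Python for illustration:

```python
# Sketch of the ordered-restore rule from the sync loop above.
# restore_fn(file) -> bool is assumed to apply one log and report success.

def apply_logs_in_order(files, restore_fn):
    """Apply logs in sorted (chronological) order; stop at first failure.
    Returns the list of files successfully applied."""
    applied = []
    for f in sorted(files):
        if not restore_fn(f):
            break          # never skip ahead past a failed restore
        applied.append(f)
    return applied
```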



PART 5: FINAL CUTOVER AUTOMATION


5.1 Final Log Backup (Source)


```sql


BACKUP LOG [YourDB]

TO DISK = 'G:\Backup\YourDB_final.trn'

WITH NORECOVERY;

```



5.2 PowerShell: Final Copy + Restore


```powershell
# Copy final log
robocopy $BackupPath $TargetBackupPath YourDB_final.trn

# Restore final log WITH RECOVERY to bring the database online
Invoke-Sqlcmd -ServerInstance $TargetServer -Query "
RESTORE LOG [$Database]
FROM DISK = '$TargetBackupPath\YourDB_final.trn'
WITH RECOVERY;
"
```



PART 6: POST-MIGRATION VALIDATION AUTOMATION



6.1 Check Database Status


```powershell


Invoke-Sqlcmd -ServerInstance $TargetServer -Query "

SELECT name, state_desc FROM sys.databases WHERE name = '$Database';

"

```



6.2 Run Integrity Check


```powershell


Invoke-Sqlcmd -ServerInstance $TargetServer -Query "

DBCC CHECKDB('$Database') WITH NO_INFOMSGS;

"

```



6.3 Update Statistics


```powershell


Invoke-Sqlcmd -ServerInstance $TargetServer -Query "

USE [$Database];

EXEC sp_updatestats;

"

```



PART 7: AUTOMATED BACKUP JOB (POST-MIGRATION)



7.1 SQL Agent Job Script


```sql


USE msdb;


EXEC sp_add_job

@job_name = 'FullBackupJob';


EXEC sp_add_jobstep

@job_name = 'FullBackupJob',

@step_name = 'BackupStep',

@subsystem = 'TSQL',

@command = '

BACKUP DATABASE YourDB

TO DISK = ''G:\Backup\YourDB_full.bak''

WITH COMPRESSION;

';


EXEC sp_add_schedule

@schedule_name = 'DailySchedule',

@freq_type = 4,

@freq_interval = 1;


EXEC sp_attach_schedule

@job_name = 'FullBackupJob',

@schedule_name = 'DailySchedule';


EXEC sp_add_jobserver

@job_name = 'FullBackupJob';

```


PART 8: ALWAYS ON AUTOMATION (HADR)



8.1 Enable Always On


Always On cannot be enabled with `sp_configure`. Enable it per instance via SQL Server Configuration Manager, or with the SqlServer PowerShell module (a service restart is required):

```powershell
# -Force suppresses the confirmation prompt and restarts the SQL Server service
Enable-SqlAlwaysOn -ServerInstance 'EC2Node1' -Force
```



8.2 Create Availability Group


```sql


CREATE AVAILABILITY GROUP YourAG

FOR DATABASE YourDB

REPLICA ON 

'N1' WITH (

    ENDPOINT_URL = 'TCP://N1:5022',

    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,

    FAILOVER_MODE = AUTOMATIC

),

'N2' WITH (

    ENDPOINT_URL = 'TCP://N2:5022',

    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,

    FAILOVER_MODE = AUTOMATIC

);

```



PART 9: ERROR HANDLING (IMPORTANT)


Add logging in PowerShell:


```powershell
try {
    Invoke-Sqlcmd -ServerInstance $TargetServer -Query "SELECT 1"
}
catch {
    Write-Host "Error occurred: $_"

    # Pipe the error into the log file (Out-File needs input)
    "$(Get-Date -Format s)  $_" | Out-File -FilePath "C:\error.log" -Append
}
```



PART 10: FULL AUTOMATION FLOW 


Here’s the mental automation pipeline:



STEP 1: Run full backup (SQL)


STEP 2: Copy backup (PowerShell)



STEP 3: Restore (SQL)



STEP 4: Start log loop (PowerShell + SQL)



STEP 5: Final log + recovery



STEP 6: Switch application



STEP 7: Run validation + monitoring
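The pipeline above can be sketched as a single driver script. This is a minimal sketch: the server names, paths, and file names are placeholders, and Steps 4–7 reuse the loops shown in Parts 4–6.

```powershell
# Hypothetical end-to-end driver; adjust names and paths for your environment
$Source = 'OnPremSQL01'; $Target = 'EC2SQL01'; $Db = 'YourDB'

# STEP 1-3: seed the target with a full backup restored WITH NORECOVERY
Invoke-Sqlcmd -ServerInstance $Source -Query "BACKUP DATABASE [$Db] TO DISK = 'G:\Backup\$Db.bak' WITH COMPRESSION;"
robocopy 'G:\Backup' '\\EC2SQL01\Backup' "$Db.bak"
Invoke-Sqlcmd -ServerInstance $Target -Query "RESTORE DATABASE [$Db] FROM DISK = '\\EC2SQL01\Backup\$Db.bak' WITH NORECOVERY;"

# STEP 4: run the log-shipping loop from Part 4 until cutover
# STEP 5: final tail-log backup, then RESTORE ... WITH RECOVERY (Part 5)
# STEP 6-7: repoint the application, then run the Part 6 validation queries
```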


POST-MIGRATION of OnPrem SQL Server to AWS EC2 With VLDBs



Think of this phase as:


“Make the system stable, fast, resilient, and production-ready”


We will go step by step in a logical sequence, just like a real DBA would operate in the first days and weeks after migration.


1. Immediate Post-Migration Validation (First 1–2 Hours)


Before doing anything advanced, confirm the system is correct and usable.



1.1 Verify Database State



```sql
SELECT name, state_desc 
FROM sys.databases;
```


Expected Result: `ONLINE`



1.2 Run DBCC CHECKDB (Integrity Check)


This is non-negotiable for mission-critical systems.



```sql
DBCC CHECKDB ('YourDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```


1.3 Validate Data Consistency


Check:


* Row counts

* Key business tables

* Financial totals (if applicable)


Example:



```sql
SELECT COUNT(*) FROM CriticalTable;
```



1.4 Validate Application Connectivity


* Application connects successfully

* No login failures

* No timeout errors



2. Fix Common Post-Migration Issues


---


2.1 Orphan Users



```sql
ALTER USER username WITH LOGIN = loginname;
```
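To find the orphaned users before remapping them, a standard catalog-view query works (skipping the built-in principals):

```sql
-- Database users whose SID has no matching server login
SELECT dp.name AS orphaned_user
FROM sys.database_principals dp
LEFT JOIN sys.server_principals sp ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type IN ('S', 'U')   -- SQL and Windows users
  AND dp.principal_id > 4;    -- skip dbo, guest, sys, INFORMATION_SCHEMA
```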



2.2 Missing SQL Agent Jobs


List the jobs on the target, compare with the source instance, and script out any that are missing:



```sql
USE msdb;
SELECT name FROM sysjobs;
```


2.3 Fix Broken Linked Servers



```sql
EXEC sp_addlinkedserver 
    @server = 'LinkedServerName',
    @srvproduct = '',
    @provider = 'SQLNCLI',   -- SQLNCLI is deprecated; prefer MSOLEDBSQL on newer builds
    @datasrc = 'RemoteServer';
```



3. Configure BACKUP STRATEGY (CRITICAL)


If you skip this, you are running blind.



3.1 Full Backup Strategy



```sql
BACKUP DATABASE YourDB
TO DISK = 'G:\Backup\YourDB_full.bak'
WITH COMPRESSION;
```


Frequency:


* Daily (large DBs: weekly full + differential)



3.2 Differential Backup



```sql
BACKUP DATABASE YourDB
TO DISK = 'G:\Backup\YourDB_diff.bak'
WITH DIFFERENTIAL;
```



3.3 Transaction Log Backup



```sql
BACKUP LOG YourDB
TO DISK = 'G:\Backup\YourDB_log.trn';
```


Frequency:


* Every 5–15 minutes
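When you restore from this strategy, order matters: last full, then the latest differential, then every log backup taken after it. A sketch (file names are placeholders):

```sql
RESTORE DATABASE YourDB FROM DISK = 'G:\Backup\YourDB_full.bak' WITH NORECOVERY;
RESTORE DATABASE YourDB FROM DISK = 'G:\Backup\YourDB_diff.bak' WITH NORECOVERY;
RESTORE LOG YourDB FROM DISK = 'G:\Backup\YourDB_log1.trn' WITH NORECOVERY;
-- ...apply the remaining logs in sequence...
RESTORE LOG YourDB FROM DISK = 'G:\Backup\YourDB_logN.trn' WITH RECOVERY;
```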



3.4 Store Backups in AWS S3


Best practice:


* Copy backups to S3 for durability

* Use lifecycle policies
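A simple way to do this from the EC2 instance is the AWS CLI (the bucket name is a placeholder; the instance needs a role or credentials with `s3:PutObject`):

```powershell
# Sync local backups to S3; lifecycle rules on the bucket handle aging and expiry
aws s3 sync G:\Backup s3://your-backup-bucket/sqlserver/YourDB/ --exclude "*" --include "*.bak" --include "*.trn"
```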



4. Setup High Availability (HADR)


Now we make the system resilient.



OPTION 1: ALWAYS ON AVAILABILITY GROUPS (BEST PRACTICE)



4.1 Architecture


* Primary replica → EC2 instance 1

* Secondary replica → EC2 instance 2

* Optional → third replica



4.2 Enable Always On



Always On cannot be enabled with `sp_configure`. Enable it per instance via SQL Server Configuration Manager, or with the SqlServer PowerShell module (a service restart is required):

```powershell
# -Force suppresses the confirmation prompt and restarts the SQL Server service
Enable-SqlAlwaysOn -ServerInstance 'EC2SQL01' -Force
```



4.3 Create Availability Group



```sql
CREATE AVAILABILITY GROUP YourAG
FOR DATABASE YourDB
REPLICA ON 
    'PrimaryServer' WITH (
        ENDPOINT_URL = 'TCP://PrimaryServer:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC
    ),
    'SecondaryServer' WITH (
        ENDPOINT_URL = 'TCP://SecondaryServer:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC
    );
```



4.4 Join Secondary Replica



```sql
-- Run on the secondary replica
ALTER DATABASE YourDB 
SET HADR AVAILABILITY GROUP = YourAG;
```


4.5 Test Failover



```sql
-- Run on the synchronized secondary you want to promote
ALTER AVAILABILITY GROUP YourAG FAILOVER;
```


OPTION 2: LOG SHIPPING (Simpler DR)



Setup Steps:


1. Backup logs from primary

2. Copy to secondary

3. Restore with NORECOVERY



OPTION 3: AWS Multi-AZ (Manual Design on EC2)


* Use multiple EC2 instances

* Combine with Always On



5. Performance Tuning (VERY IMPORTANT)


After migration, performance often changes.



5.1 Update Statistics



```sql
EXEC sp_updatestats;
```



5.2 Rebuild Indexes



```sql
ALTER INDEX ALL ON YourTable REBUILD;
```



5.3 Check Slow Queries



```sql
SELECT TOP 10
    qs.total_elapsed_time / qs.execution_count AS AvgTime,
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS QueryText
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY AvgTime DESC;
```



5.4 Configure MAXDOP



```sql
-- 8 is an example value; size MAXDOP to your core count and NUMA layout
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;
```



5.5 Optimize TempDB


* Multiple data files

* Equal size
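A common rule of thumb is one tempdb data file per core, up to eight, all sized equally. For example (paths and sizes are placeholders):

```sql
ALTER DATABASE tempdb 
MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 1GB);

ALTER DATABASE tempdb 
ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 8GB, FILEGROWTH = 1GB);
-- repeat for tempdev3, tempdev4, ...
```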



6. Monitoring and Alerting



6.1 Use AWS CloudWatch


Monitor:


* CPU

* Disk IOPS

* Network



6.2 SQL Server Monitoring Queries



```sql
SELECT 
    wait_type, 
    wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```


6.3 Enable Alerts


* Disk full

* Job failure

* High CPU
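These can be wired up as SQL Agent alerts. For example, a severity-17 (insufficient resources) alert routed to an operator (the operator name is a placeholder you must create first):

```sql
USE msdb;

EXEC sp_add_alert 
    @name = 'Severity 17 - Insufficient Resources',
    @severity = 17,
    @notification_message = 'Check disk space and memory.';

EXEC sp_add_notification 
    @alert_name = 'Severity 17 - Insufficient Resources',
    @operator_name = 'DBA_Team',
    @notification_method = 1;  -- email
```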



7. Security Hardening



7.1 Disable Unused Accounts



```sql
ALTER LOGIN username DISABLE;
```



7.2 Enforce Encryption


* Use SSL connections

* Enable TDE if required
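Enabling TDE is a short sequence (the password and certificate name are placeholders; back up the certificate immediately, or you cannot restore the database anywhere else):

```sql
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE certificate';

USE YourDB;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDECert;

ALTER DATABASE YourDB SET ENCRYPTION ON;
```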



7.3 Restrict Ports


Only allow:


* 1433 (SQL)

* 3389 (RDP, restricted)



8. Cost Optimization (AWS-Specific)



8.1 Right-Size EC2


* Avoid over-provisioning

* Monitor usage



8.2 Optimize Storage


* Use gp3 if possible

* Reduce unused volumes



8.3 Stop Non-Production Servers



9. Disaster Recovery Planning



9.1 Cross-Region Backup


* Copy backups to another AWS region



9.2 Test Restore Regularly


Never assume backups work—test them.
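At minimum, verify that backup files are readable; ideally, restore them to a scratch instance on a schedule. A quick readability check:

```sql
-- WITH CHECKSUM only validates if the backup itself was taken WITH CHECKSUM
RESTORE VERIFYONLY 
FROM DISK = 'G:\Backup\YourDB_full.bak'
WITH CHECKSUM;
```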



9.3 Define RPO and RTO


* RPO → Data loss tolerance

* RTO → Recovery time



10. Real DBA Daily Operations


After migration, your daily job includes:



Daily Tasks


* Check backups

* Check failed jobs

* Monitor performance
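"Check backups" can be a single query against msdb, for example flagging databases with no full backup in the last 24 hours:

```sql
SELECT d.name
FROM sys.databases d
WHERE d.name <> 'tempdb'
  AND NOT EXISTS (
      SELECT 1 FROM msdb.dbo.backupset b
      WHERE b.database_name = d.name
        AND b.type = 'D'   -- full backup
        AND b.backup_finish_date > DATEADD(HOUR, -24, GETDATE())
  );
```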



Weekly Tasks


* Index maintenance

* Review slow queries



Monthly Tasks


* Capacity planning

* Cost review



11. Troubleshooting Common Issues




Issue: High CPU


Fix:


* Identify query

* Optimize index
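To identify the query, rank the cached plans by CPU time:

```sql
SELECT TOP 10
    qs.total_worker_time / qs.execution_count AS avg_cpu_time,
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;
```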


---


Issue: Slow Disk


Fix:


* Increase IOPS

* Move to faster storage


---


Issue: Blocking



```sql
-- Sessions currently blocked by another session
SELECT session_id, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```


---


Issue: Deadlocks


Enable trace:



```sql
DBCC TRACEON (1222, -1);
```


---


12. Final Mental Model (Production Readiness)


You are production-ready if:


* Backups are running

* HA is configured

* Monitoring is active

* Performance is stable

* DR is tested


---

SUMMARY


BEFORE MIGRATION:


* Inventory complete

* AWS ready

* Backup tested

* Migration plan documented


---


DURING MIGRATION:


* Full restore completed

* Logs continuously applied

* Final cutover executed

* Application switched


---


AFTER MIGRATION:


* DB validated

* Backups configured

* HA implemented

* Performance tuned

* Monitoring active

