Wednesday, May 6, 2026

BUILDING MONITORING & ALERTING SYSTEM After Migrating From On-prem SQL Server To SQL Server on AWS EC2

 BUILDING PRODUCTION MONITORING & ALERTING SYSTEM After Migrating From On-prem SQL Server To SQL Server on AWS EC2



This guides to build a production-ready monitoring system that can:


* Detects problems early


* Alerts the right people


* Helps you diagnose fast


* Prevents outages



BIG PICTURE ARCHITECTURE


You are building 4 layers:


1. SQL Server Internal Monitoring (inside SQL Server)


2. OS + EC2 Monitoring(via AWS)


3. Central Logging & Alerting (SQL Agent + email + logs)


4. Response Automation(PowerShell + alerts)




WHAT YOU MUST MONITOR (CRITICAL SIGNALS)




DATABASE HEALTH:


* Database state (ONLINE/OFFLINE)


* Corruption


* Backup success/failure




PERFORMANCE:


* CPU usage


* Slow queries


* Blocking / deadlocks




STORAGE:


* Disk space


* IOPS latency




AVAILABILITY:


* Always On status


* Connection failures




PART 1: SQL SERVER MONITORING (CORE)




1.1 Create Monitoring Database


```sql 


CREATE DATABASE DBA_Monitoring;

GO

```


---


1.2 Table for Alerts


```sql 


USE DBA_Monitoring;


CREATE TABLE Alerts (

    AlertID INT IDENTITY(1,1),

    AlertType NVARCHAR(100),

    Message NVARCHAR(MAX),

    Severity INT,

    CreatedAt DATETIME DEFAULT GETDATE()

);

```




PART 2: CRITICAL MONITORING QUERIES




2.1 Check Database Status


```sql 


INSERT INTO DBA_Monitoring.dbo.Alerts (AlertType, Message, Severity)

SELECT 

    'Database Offline',

    name + ' is not ONLINE',

    1

FROM sys.databases

WHERE state_desc <> 'ONLINE';

```




2.2 Check Failed Backups


```sql 


INSERT INTO DBA_Monitoring.dbo.Alerts (AlertType, Message, Severity)

SELECT 

    'Backup Failure',

    'No backup in last 24 hours',

    1

WHERE NOT EXISTS (

    SELECT 1 FROM msdb.dbo.backupset

    WHERE backup_finish_date > DATEADD(HOUR, -24, GETDATE())

);

```




2.3 Check Disk Space


```sql 


EXEC xp_fixeddrives;

```


Alert if free space < 20%




2.4 Check Blocking


```sql 


SELECT blocking_session_id, session_id

FROM sys.dm_exec_requests

WHERE blocking_session_id <> 0;

```




2.5 Check Long Running Queries


```sql 


SELECT TOP 10

    total_elapsed_time / execution_count AS AvgTime,

    execution_count

FROM sys.dm_exec_query_stats

ORDER BY AvgTime DESC;

```




2.6 Deadlock Detection


```sql 


DBCC TRACEON (1222, -1);

```




PART 3: SQL AGENT ALERT JOB




3.1 Create Monitoring Job


```sql 


USE msdb;


EXEC sp_add_job @job_name = 'DBA_Monitoring_Job';

```


3.2 Add Step


```sql 


EXEC sp_add_jobstep

@job_name = 'DBA_Monitoring_Job',

@step_name = 'Run Checks',

@subsystem = 'TSQL',

@command = '

EXEC check_script for Database Status;

EXEC check_script for Backup Failure;

';

```


3.3 Schedule Every 5 Minutes


```sql 


EXEC sp_add_schedule

@schedule_name = 'Every5Min',

@freq_type = 4,

@freq_subday_type = 4,

@freq_subday_interval = 5;

```




PART 4: EMAIL ALERTING SYSTEM


4.1 Configure Database Mail


```sql 


EXEC sp_configure 'Database Mail XPs', 1;

RECONFIGURE;

```


4.2 Send Alert Email


```sql 


EXEC msdb.dbo.sp_send_dbmail

@profile_name = 'DBA_Profile',

@recipients = 'dba@company.com',

@subject = 'SQL Alert',

@body = 'Critical issue detected';

```


4.3 Auto Email on Alert


```sql 


IF EXISTS (SELECT 1 FROM DBA_Monitoring.dbo.Alerts WHERE Severity = 1)

BEGIN

    EXEC msdb.dbo.sp_send_dbmail

    @profile_name = 'DBA_Profile',

    @recipients = 'dba@company.com',

    @subject = 'CRITICAL SQL ALERT',

    @body = 'Check DBA_Monitoring database immediately';

END

```




PART 5: AWS CLOUDWATCH MONITORING


5.1 Key Metrics to Enable


Monitor in AWS:


* CPUUtilization

* DiskReadOps / WriteOps

* NetworkIn / Out


5.2 CloudWatch Alarm Example


Trigger alert if:


* CPU > 80% for 5 minutes


5.3 Disk Alert


Trigger if:


* Free space < 15%




PART 6: POWERSHELL ALERT AUTOMATION


6.1 Check Disk Space Script


```powershell 


$drives = Get-PSDrive -PSProvider FileSystem


foreach ($drive in $drives) {

    if ($drive.Free -lt 20GB) {

        Write-Host "Low disk space on $($drive.Name)"

    }

}

```


6.2 Send Alert


```powershell 


Send-MailMessage -To "dba@company.com" `

-From "alert@company.com" `

-Subject "Disk Alert" `

-Body "Low disk space detected" `

-SmtpServer "smtp.server.com"

```




PART 7: ALWAYS ON MONITORING


7.1 Check Replica Health


```sql 


SELECT 

    replica_server_name,

    synchronization_state_desc

FROM sys.dm_hadr_database_replica_states;

```


Alert if: NOT SYNCHRONIZED




PART 8: ALERT SEVERITY MODEL


Severity Levels


* 1 (Critical) → Immediate action (DB down, corruption)


* 2 (Warning) → Investigate (slow queries)


* 3 (Info) → Logging only




ART 9: RESPONSE AUTOMATION


Example: Restart Failed Job


```powershell 


Invoke-Sqlcmd -Query "EXEC msdb.dbo.sp_start_job 'JobName'"

```


Example: Kill Blocking Session


```sql 


KILL <session_id>;

```




PART 10: DAILY DBA DASHBOARD QUERY



```sql 


SELECT TOP 50 *

FROM DBA_Monitoring.dbo.Alerts

ORDER BY CreatedAt DESC;

```


PRODUCTION CHECKLIST


You are production-ready if:



SQL Level


* Monitoring job running


* Alerts being logged


* Emails working


AWS Level


* CloudWatch alarms active

* Disk + CPU monitored



Response Level


* Scripts tested


* Alerts actionable


---


Most outages happen because there is No alert or Alert ignored or Alert unclear.


No comments:

Post a Comment

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR  FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM --- OVERALL STRUCTURE Breaking execu...