BUILDING PRODUCTION MONITORING & ALERTING SYSTEM After Migrating From On-prem SQL Server To SQL Server on AWS EC2
This guides to build a production-ready monitoring system that can:
* Detects problems early
* Alerts the right people
* Helps you diagnose fast
* Prevents outages
BIG PICTURE ARCHITECTURE
You are building 4 layers:
1. SQL Server Internal Monitoring (inside SQL Server)
2. OS + EC2 Monitoring(via AWS)
3. Central Logging & Alerting (SQL Agent + email + logs)
4. Response Automation(PowerShell + alerts)
WHAT YOU MUST MONITOR (CRITICAL SIGNALS)
DATABASE HEALTH:
* Database state (ONLINE/OFFLINE)
* Corruption
* Backup success/failure
PERFORMANCE:
* CPU usage
* Slow queries
* Blocking / deadlocks
STORAGE:
* Disk space
* IOPS latency
AVAILABILITY:
* Always On status
* Connection failures
PART 1: SQL SERVER MONITORING (CORE)
1.1 Create Monitoring Database
```sql
CREATE DATABASE DBA_Monitoring;
GO
```
---
1.2 Table for Alerts
```sql
USE DBA_Monitoring;
CREATE TABLE Alerts (
AlertID INT IDENTITY(1,1),
AlertType NVARCHAR(100),
Message NVARCHAR(MAX),
Severity INT,
CreatedAt DATETIME DEFAULT GETDATE()
);
```
PART 2: CRITICAL MONITORING QUERIES
2.1 Check Database Status
```sql
INSERT INTO DBA_Monitoring.dbo.Alerts (AlertType, Message, Severity)
SELECT
'Database Offline',
name + ' is not ONLINE',
1
FROM sys.databases
WHERE state_desc <> 'ONLINE';
```
2.2 Check Failed Backups
```sql
INSERT INTO DBA_Monitoring.dbo.Alerts (AlertType, Message, Severity)
SELECT
'Backup Failure',
'No backup in last 24 hours',
1
WHERE NOT EXISTS (
SELECT 1 FROM msdb.dbo.backupset
WHERE backup_finish_date > DATEADD(HOUR, -24, GETDATE())
);
```
2.3 Check Disk Space
```sql
EXEC xp_fixeddrives;
```
Alert if free space < 20%
2.4 Check Blocking
```sql
SELECT blocking_session_id, session_id
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```
2.5 Check Long Running Queries
```sql
SELECT TOP 10
total_elapsed_time / execution_count AS AvgTime,
execution_count
FROM sys.dm_exec_query_stats
ORDER BY AvgTime DESC;
```
2.6 Deadlock Detection
```sql
DBCC TRACEON (1222, -1);
```
PART 3: SQL AGENT ALERT JOB
3.1 Create Monitoring Job
```sql
USE msdb;
EXEC sp_add_job @job_name = 'DBA_Monitoring_Job';
```
3.2 Add Step
```sql
EXEC sp_add_jobstep
@job_name = 'DBA_Monitoring_Job',
@step_name = 'Run Checks',
@subsystem = 'TSQL',
@command = '
EXEC check_script for Database Status;
EXEC check_script for Backup Failure;
';
```
3.3 Schedule Every 5 Minutes
```sql
EXEC sp_add_schedule
@schedule_name = 'Every5Min',
@freq_type = 4,
@freq_subday_type = 4,
@freq_subday_interval = 5;
```
PART 4: EMAIL ALERTING SYSTEM
4.1 Configure Database Mail
```sql
EXEC sp_configure 'Database Mail XPs', 1;
RECONFIGURE;
```
4.2 Send Alert Email
```sql
EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'DBA_Profile',
@recipients = 'dba@company.com',
@subject = 'SQL Alert',
@body = 'Critical issue detected';
```
4.3 Auto Email on Alert
```sql
IF EXISTS (SELECT 1 FROM DBA_Monitoring.dbo.Alerts WHERE Severity = 1)
BEGIN
EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'DBA_Profile',
@recipients = 'dba@company.com',
@subject = 'CRITICAL SQL ALERT',
@body = 'Check DBA_Monitoring database immediately';
END
```
PART 5: AWS CLOUDWATCH MONITORING
5.1 Key Metrics to Enable
Monitor in AWS:
* CPUUtilization
* DiskReadOps / WriteOps
* NetworkIn / Out
5.2 CloudWatch Alarm Example
Trigger alert if:
* CPU > 80% for 5 minutes
5.3 Disk Alert
Trigger if:
* Free space < 15%
PART 6: POWERSHELL ALERT AUTOMATION
6.1 Check Disk Space Script
```powershell
$drives = Get-PSDrive -PSProvider FileSystem
foreach ($drive in $drives) {
if ($drive.Free -lt 20GB) {
Write-Host "Low disk space on $($drive.Name)"
}
}
```
6.2 Send Alert
```powershell
Send-MailMessage -To "dba@company.com" `
-From "alert@company.com" `
-Subject "Disk Alert" `
-Body "Low disk space detected" `
-SmtpServer "smtp.server.com"
```
PART 7: ALWAYS ON MONITORING
7.1 Check Replica Health
```sql
SELECT
replica_server_name,
synchronization_state_desc
FROM sys.dm_hadr_database_replica_states;
```
Alert if: NOT SYNCHRONIZED
PART 8: ALERT SEVERITY MODEL
Severity Levels
* 1 (Critical) → Immediate action (DB down, corruption)
* 2 (Warning) → Investigate (slow queries)
* 3 (Info) → Logging only
ART 9: RESPONSE AUTOMATION
Example: Restart Failed Job
```powershell
Invoke-Sqlcmd -Query "EXEC msdb.dbo.sp_start_job 'JobName'"
```
Example: Kill Blocking Session
```sql
KILL <session_id>;
```
PART 10: DAILY DBA DASHBOARD QUERY
```sql
SELECT TOP 50 *
FROM DBA_Monitoring.dbo.Alerts
ORDER BY CreatedAt DESC;
```
PRODUCTION CHECKLIST
You are production-ready if:
SQL Level
* Monitoring job running
* Alerts being logged
* Emails working
AWS Level
* CloudWatch alarms active
* Disk + CPU monitored
Response Level
* Scripts tested
* Alerts actionable
---
Most outages happen because there is No alert or Alert ignored or Alert unclear.
No comments:
Post a Comment