Thursday, April 2, 2026

Enterprise-Scale SQL Server Common Production Troubleshooting and Architectural Design scenarios

Enterprise-Scale SQL Server Common Production Troubleshooting and Architectural  Design scenarios

Below are real-world, production-style troubleshooting scenarios designed to make you think and act like a database performance physician in live environments. Each case follows a medical workflow.

πŸ₯ SCENARIO 1: “The Slow Morning Report” (High CPU Issue)

🚨 Symptom (Patient Complaint)

  • Every morning at 9 AM, reports become extremely slow

  • CPU spikes to 90–100%

  • Users complain system is “frozen”


πŸ” Step 1: Initial Assessment

Run:

EXEC sp_who2;
SELECT TOP 5 
    total_worker_time/execution_count AS avg_cpu,
    execution_count,
    text
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_cpu DESC;

🧠 Think First

πŸ‘‰ What is causing CPU spike only at 9 AM?


πŸ§ͺ Diagnosis

You discover:

  • A reporting query runs every morning

  • Query scans millions of rows

  • Uses SELECT *

  • No proper indexes


πŸ’Š Treatment

Fix 1: Create covering index

CREATE NONCLUSTERED INDEX idx_report
ON Sales(OrderDate)
INCLUDE (CustomerID, TotalAmount);

Fix 2: Optimize query

SELECT CustomerID, SUM(TotalAmount)
FROM Sales
WHERE OrderDate >= '2025-01-01'
GROUP BY CustomerID;

πŸ“Š Monitoring

SELECT *
FROM sys.dm_exec_query_stats;

πŸ›‘ Prevention

  • Schedule reports off-peak

  • Use Query Store

  • Avoid SELECT *


✅ Outcome

  • CPU drops from 95% → 40%

  • Report runs in seconds


πŸ₯ SCENARIO 2: “The Frozen Checkout System” (Blocking Issue)

🚨 Symptom

  • Users cannot complete transactions

  • Application “hangs”

  • Queries waiting indefinitely


πŸ” Step 1: Check Blocking

SELECT blocking_session_id, wait_type, wait_time, session_id
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;

🧠 Think

πŸ‘‰ What causes blocking in SQL Server?


πŸ§ͺ Diagnosis

You find:

  • One transaction running for 10 minutes

  • It is holding locks

  • Other queries are waiting


πŸ’Š Treatment

Immediate Fix

KILL <blocking_session_id>;

Root Fix

Bad code:

BEGIN TRAN
UPDATE Orders SET Status = 'Processed'
-- No COMMIT

Fix:

BEGIN TRAN
UPDATE Orders SET Status = 'Processed'
COMMIT;

πŸ“Š Monitoring

SELECT * FROM sys.dm_tran_locks;

πŸ›‘ Prevention

  • Keep transactions short

  • Use proper indexing

  • Use READ COMMITTED SNAPSHOT


✅ Outcome

  • System unfreezes

  • Users can transact


πŸ₯ SCENARIO 3: “The Disk is Always Full” (Storage & Growth Issue)

🚨 Symptom

  • Database size growing rapidly

  • Disk almost full

  • System slowdown


πŸ” Step 1: Check Database Size

EXEC sp_spaceused;

🧠 Think

πŸ‘‰ What grows a database?


πŸ§ͺ Diagnosis

You discover:

  • Large unused indexes

  • Old data not archived

  • Transaction log not shrinking


πŸ’Š Treatment

Fix 1: Remove unused index

DROP INDEX idx_unused ON Orders;

Fix 2: Backup and shrink log

BACKUP LOG YourDB TO DISK = 'log.bak';
DBCC SHRINKFILE (YourDB_log, 1000);

Fix 3: Archive old data

DELETE FROM Orders
WHERE OrderDate < '2020-01-01';

πŸ“Š Monitoring

SELECT * FROM sys.master_files;

πŸ›‘ Prevention

  • Regular cleanup jobs

  • Monitor file growth

  • Use proper autogrowth settings


✅ Outcome

  • Disk usage reduced

  • Performance improves


πŸ₯ SCENARIO 4: “Random Slow Queries” (Parameter Sniffing)

🚨 Symptom

  • Same query sometimes fast, sometimes slow

  • No code changes


πŸ” Step 1: Capture Query Plan

Enable execution plan and run query multiple times.


🧠 Think

πŸ‘‰ Why inconsistent performance?


πŸ§ͺ Diagnosis

  • Parameter sniffing issue

  • SQL reuses bad execution plan


πŸ’Š Treatment

Fix 1: OPTION RECOMPILE

SELECT *
FROM Orders
WHERE CustomerID = @CustomerID
OPTION (RECOMPILE);

Fix 2: Use local variable

DECLARE @C INT = @CustomerID;

SELECT *
FROM Orders
WHERE CustomerID = @C;

πŸ“Š Monitoring

Use Query Store to compare plans.


πŸ›‘ Prevention

  • Use OPTIMIZE FOR

  • Monitor plan cache


✅ Outcome

  • Stable performance


πŸ₯ SCENARIO 5: “Deadlock Crisis” (Concurrency Emergency)

🚨 Symptom

  • Errors: “Deadlock victim”

  • Transactions fail randomly


πŸ” Step 1: Capture Deadlock

SELECT * FROM sys.dm_tran_locks;

🧠 Think

πŸ‘‰ What causes deadlocks?


πŸ§ͺ Diagnosis

Two queries:

  • Access tables in different order


πŸ’Š Treatment

Fix 1: Standardize access order

-- Always access Customers first, then Orders

Fix 2: Use deadlock priority

SET DEADLOCK_PRIORITY LOW;

πŸ“Š Monitoring

Use Extended Events.


πŸ›‘ Prevention

  • Consistent query patterns

  • Proper indexing


✅ Outcome

  • Deadlocks eliminated


πŸ₯ SCENARIO 6: “The Silent Performance Killer” (Missing Indexes)

🚨 Symptom

  • Gradual slowdown

  • No obvious issue


πŸ” Step 1: Check Missing Indexes

SELECT *
FROM sys.dm_db_missing_index_details;

🧠 Think

πŸ‘‰ Are all indexes helpful?


πŸ§ͺ Diagnosis

  • Important indexes missing


πŸ’Š Treatment

CREATE NONCLUSTERED INDEX idx_missing
ON Orders(CustomerID);

πŸ“Š Monitoring

Compare query performance before/after.


πŸ›‘ Prevention

  • Regular index review


✅ Outcome

  • Faster queries


πŸ₯ SCENARIO 7: “The Memory Pressure Case”

🚨 Symptom

  • Queries slow under load

  • High PAGE LIFE EXPECTANCY drop


πŸ” Step 1: Check Memory

SELECT * FROM sys.dm_os_sys_memory;

🧠 Think

πŸ‘‰ Is memory sufficient?


πŸ§ͺ Diagnosis

  • Memory pressure

  • Too many large queries


πŸ’Š Treatment

  • Add RAM

  • Optimize queries

  • Reduce cache bloat


πŸ›‘ Prevention

  • Monitor memory usage


✅ Outcome

  • Stable performance


2, Building eEnterprise-scale Architectural Design scenarios


🏒 SCENARIO 1: Global E-Commerce Platform (High Throughput OLTP System)


🧬 System Anatomy

  • 50,000+ concurrent users

  • OLTP workload (orders, payments, carts)

  • SQL Server primary database (~2 TB)

  • Microservices architecture


🚨 Symptoms

  • Checkout latency spikes during peak traffic

  • Deadlocks increasing

  • CPU consistently above 85%


πŸ” Investigation

Run:

SELECT TOP 10 *
FROM sys.dm_exec_requests
ORDER BY cpu_time DESC;

Check waits:

SELECT TOP 10 *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

πŸ§ͺ Diagnosis

  • Heavy write contention on Orders table

  • Hotspot index (clustered index on identity column)

  • Deadlocks due to concurrent updates


πŸ’Š Treatment (Immediate Fixes)

1. Add Row Versioning

ALTER DATABASE EcommerceDB
SET READ_COMMITTED_SNAPSHOT ON;

2. Optimize Index Design

CREATE NONCLUSTERED INDEX idx_orders_customer
ON Orders(CustomerID)
INCLUDE (OrderDate, Status);

3. Reduce Transaction Scope

  • Break large transactions into smaller ones


πŸ“Š Monitoring

  • Query Store

  • Extended Events for deadlocks


πŸ›‘ Prevention

  • Use retry logic in application

  • Limit transaction time


πŸ— Architecture Upgrade (Enterprise Level)

  • Implement read replicas (Always On Availability Groups)

  • Introduce caching layer (e.g., Redis)

  • Use queue-based writes for heavy operations


✅ Outcome

  • 70% reduction in deadlocks

  • 40% faster checkout


🏦 SCENARIO 2: Banking System (High Consistency + Security)


🧬 System Anatomy

  • Financial transactions (ACID critical)

  • 24/7 uptime requirement

  • Strict auditing


🚨 Symptoms

  • Slow transaction processing

  • Locking and blocking

  • Audit queries slow


πŸ” Investigation

SELECT * FROM sys.dm_tran_locks;

πŸ§ͺ Diagnosis

  • Overuse of SERIALIZABLE isolation

  • Large audit table without partitioning


πŸ’Š Treatment

1. Optimize Isolation Levels

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

2. Partition Audit Table

CREATE PARTITION FUNCTION pf_AuditDate (DATE)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2025-01-01');

3. Add Filtered Index

CREATE INDEX idx_audit_recent
ON AuditLogs(Date)
WHERE Date > '2025-01-01';

πŸ“Š Monitoring

  • Transaction latency

  • Lock wait time


πŸ›‘ Prevention

  • Periodic archiving

  • Index review


πŸ— Architecture Upgrade

  • Use Always On Availability Groups

  • Separate OLTP and reporting databases

  • Implement data encryption (TDE)


✅ Outcome

  • Faster transactions

  • Reduced blocking


πŸ₯ SCENARIO 3: Healthcare Data System (Hybrid OLTP + Analytics)


🧬 System Anatomy

  • Patient records + analytics queries

  • Mixed workload (reads + writes)

  • Large historical data


🚨 Symptoms

  • Reports slow during business hours

  • Doctors complain system lags


πŸ” Investigation

SELECT *
FROM sys.dm_exec_query_stats;

πŸ§ͺ Diagnosis

  • Analytical queries competing with OLTP

  • Table scans on large datasets


πŸ’Š Treatment

1. Create Columnstore Index

CREATE CLUSTERED COLUMNSTORE INDEX cci_visits
ON Visits;

2. Offload Reporting

  • Use read replica


3. Data Partitioning

CREATE PARTITION SCHEME ps_Visits
AS PARTITION pf_AuditDate ALL TO ([PRIMARY]);

πŸ“Š Monitoring

  • Query duration

  • IO usage


πŸ›‘ Prevention

  • Separate workloads

  • Schedule heavy queries


πŸ— Architecture Upgrade

  • Introduce data warehouse (ETL pipeline)

  • Use Azure Synapse / Fabric (if cloud)


✅ Outcome

  • Reports run faster

  • OLTP unaffected


πŸ“‘ SCENARIO 4: Telecom System (Massive Data Ingestion)


🧬 System Anatomy

  • Millions of records per minute

  • Logging, call records, events

  • Append-only workload


🚨 Symptoms

  • Insert performance slowing

  • Log file growing rapidly


πŸ” Investigation

SELECT * FROM sys.dm_db_log_space_usage;

πŸ§ͺ Diagnosis

  • Transaction log bottleneck

  • Too many indexes slowing inserts


πŸ’Š Treatment

1. Minimal Index Strategy

  • Drop unnecessary indexes


2. Batch Inserts

INSERT INTO Logs
SELECT * FROM StagingTable;

3. Switch to SIMPLE recovery (if safe)

ALTER DATABASE TelecomDB SET RECOVERY SIMPLE;

πŸ“Š Monitoring

  • Insert throughput

  • Log usage


πŸ›‘ Prevention

  • Use staging tables

  • Regular log backups


πŸ— Architecture Upgrade

  • Use partitioned tables

  • Implement data streaming (Kafka, Event Hub)

  • Archive old data


✅ Outcome

  • Faster ingestion

  • Stable log growth


☁️ SCENARIO 5: SaaS Multi-Tenant System


🧬 System Anatomy

  • Thousands of tenants

  • Shared database

  • Mixed workloads


🚨 Symptoms

  • One tenant slows entire system

  • Uneven performance


πŸ” Investigation

SELECT *
FROM sys.dm_exec_sessions;

πŸ§ͺ Diagnosis

  • No tenant isolation

  • Resource contention


πŸ’Š Treatment

1. Add Tenant Filtering Index

CREATE INDEX idx_tenant
ON Orders(TenantID);

2. Resource Governor

  • Limit heavy tenants


3. Query Optimization per tenant


πŸ“Š Monitoring

  • Per-tenant performance


πŸ›‘ Prevention

  • Workload isolation


πŸ— Architecture Upgrade

  • Move to database-per-tenant model

  • Use elastic pools (cloud)

  • Implement sharding


✅ Outcome

  • Fair resource usage

  • Predictable performance


🌐 SCENARIO 6: Disaster Recovery & High Availability


🧬 System Anatomy

  • Mission-critical system

  • 99.99% uptime required


🚨 Symptoms

  • Server crash

  • Data loss risk


πŸ” Investigation

  • Check backup history


πŸ§ͺ Diagnosis

  • No proper DR strategy


πŸ’Š Treatment

1. Setup Always On

-- Configure Availability Group (conceptual)

2. Regular Backups

BACKUP DATABASE YourDB TO DISK = 'full.bak';

3. Test restore


πŸ“Š Monitoring

  • Failover readiness


πŸ›‘ Prevention

  • DR drills

  • Backup verification


πŸ— Architecture Upgrade

  • Multi-region replication

  • Automated failover


✅ Outcome

  • Zero/low downtime

  • Business continuity

No comments:

Post a Comment

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM

MINUTE BY MINUITE PRODUCTION RUNBOOK FOR  FULLY AUTOMATED MIGRATION FROM SAP ASE TO SQL Server Azure VM --- OVERALL STRUCTURE Breaking execu...