Thursday, April 2, 2026

Enterprise-Scale SQL Server Common Production Troubleshooting and Architectural Design scenarios

Enterprise-Scale SQL Server Common Production Troubleshooting and Architectural  Design scenarios

Below are real-world, production-style troubleshooting scenarios designed to make you think and act like a database performance physician in live environments. Each case follows a medical workflow.

πŸ₯ SCENARIO 1: “The Slow Morning Report” (High CPU Issue)

🚨 Symptom (Patient Complaint)

  • Every morning at 9 AM, reports become extremely slow

  • CPU spikes to 90–100%

  • Users complain system is “frozen”


πŸ” Step 1: Initial Assessment

Run:

EXEC sp_who2;
SELECT TOP 5 
    total_worker_time/execution_count AS avg_cpu,
    execution_count,
    text
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_cpu DESC;

🧠 Think First

πŸ‘‰ What is causing CPU spike only at 9 AM?


πŸ§ͺ Diagnosis

You discover:

  • A reporting query runs every morning

  • Query scans millions of rows

  • Uses SELECT *

  • No proper indexes


πŸ’Š Treatment

Fix 1: Create covering index

CREATE NONCLUSTERED INDEX idx_report
ON Sales(OrderDate)
INCLUDE (CustomerID, TotalAmount);

Fix 2: Optimize query

SELECT CustomerID, SUM(TotalAmount)
FROM Sales
WHERE OrderDate >= '2025-01-01'
GROUP BY CustomerID;

πŸ“Š Monitoring

SELECT *
FROM sys.dm_exec_query_stats;

πŸ›‘ Prevention

  • Schedule reports off-peak

  • Use Query Store

  • Avoid SELECT *


✅ Outcome

  • CPU drops from 95% → 40%

  • Report runs in seconds


πŸ₯ SCENARIO 2: “The Frozen Checkout System” (Blocking Issue)

🚨 Symptom

  • Users cannot complete transactions

  • Application “hangs”

  • Queries waiting indefinitely


πŸ” Step 1: Check Blocking

SELECT blocking_session_id, wait_type, wait_time, session_id
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;

🧠 Think

πŸ‘‰ What causes blocking in SQL Server?


πŸ§ͺ Diagnosis

You find:

  • One transaction running for 10 minutes

  • It is holding locks

  • Other queries are waiting


πŸ’Š Treatment

Immediate Fix

KILL <blocking_session_id>;

Root Fix

Bad code:

BEGIN TRAN
UPDATE Orders SET Status = 'Processed'
-- No COMMIT

Fix:

BEGIN TRAN
UPDATE Orders SET Status = 'Processed'
COMMIT;

πŸ“Š Monitoring

SELECT * FROM sys.dm_tran_locks;

πŸ›‘ Prevention

  • Keep transactions short

  • Use proper indexing

  • Use READ COMMITTED SNAPSHOT


✅ Outcome

  • System unfreezes

  • Users can transact


πŸ₯ SCENARIO 3: “The Disk is Always Full” (Storage & Growth Issue)

🚨 Symptom

  • Database size growing rapidly

  • Disk almost full

  • System slowdown


πŸ” Step 1: Check Database Size

EXEC sp_spaceused;

🧠 Think

πŸ‘‰ What grows a database?


πŸ§ͺ Diagnosis

You discover:

  • Large unused indexes

  • Old data not archived

  • Transaction log not shrinking


πŸ’Š Treatment

Fix 1: Remove unused index

DROP INDEX idx_unused ON Orders;

Fix 2: Backup and shrink log

BACKUP LOG YourDB TO DISK = 'log.bak';
DBCC SHRINKFILE (YourDB_log, 1000);

Fix 3: Archive old data

DELETE FROM Orders
WHERE OrderDate < '2020-01-01';

πŸ“Š Monitoring

SELECT * FROM sys.master_files;

πŸ›‘ Prevention

  • Regular cleanup jobs

  • Monitor file growth

  • Use proper autogrowth settings


✅ Outcome

  • Disk usage reduced

  • Performance improves


πŸ₯ SCENARIO 4: “Random Slow Queries” (Parameter Sniffing)

🚨 Symptom

  • Same query sometimes fast, sometimes slow

  • No code changes


πŸ” Step 1: Capture Query Plan

Enable execution plan and run query multiple times.


🧠 Think

πŸ‘‰ Why inconsistent performance?


πŸ§ͺ Diagnosis

  • Parameter sniffing issue

  • SQL reuses bad execution plan


πŸ’Š Treatment

Fix 1: OPTION RECOMPILE

SELECT *
FROM Orders
WHERE CustomerID = @CustomerID
OPTION (RECOMPILE);

Fix 2: Use local variable

DECLARE @C INT = @CustomerID;

SELECT *
FROM Orders
WHERE CustomerID = @C;

πŸ“Š Monitoring

Use Query Store to compare plans.


πŸ›‘ Prevention

  • Use OPTIMIZE FOR

  • Monitor plan cache


✅ Outcome

  • Stable performance


πŸ₯ SCENARIO 5: “Deadlock Crisis” (Concurrency Emergency)

🚨 Symptom

  • Errors: “Deadlock victim”

  • Transactions fail randomly


πŸ” Step 1: Capture Deadlock

SELECT * FROM sys.dm_tran_locks;

🧠 Think

πŸ‘‰ What causes deadlocks?


πŸ§ͺ Diagnosis

Two queries:

  • Access tables in different order


πŸ’Š Treatment

Fix 1: Standardize access order

-- Always access Customers first, then Orders

Fix 2: Use deadlock priority

SET DEADLOCK_PRIORITY LOW;

πŸ“Š Monitoring

Use Extended Events.


πŸ›‘ Prevention

  • Consistent query patterns

  • Proper indexing


✅ Outcome

  • Deadlocks eliminated


πŸ₯ SCENARIO 6: “The Silent Performance Killer” (Missing Indexes)

🚨 Symptom

  • Gradual slowdown

  • No obvious issue


πŸ” Step 1: Check Missing Indexes

SELECT *
FROM sys.dm_db_missing_index_details;

🧠 Think

πŸ‘‰ Are all indexes helpful?


πŸ§ͺ Diagnosis

  • Important indexes missing


πŸ’Š Treatment

CREATE NONCLUSTERED INDEX idx_missing
ON Orders(CustomerID);

πŸ“Š Monitoring

Compare query performance before/after.


πŸ›‘ Prevention

  • Regular index review


✅ Outcome

  • Faster queries


πŸ₯ SCENARIO 7: “The Memory Pressure Case”

🚨 Symptom

  • Queries slow under load

  • High PAGE LIFE EXPECTANCY drop


πŸ” Step 1: Check Memory

SELECT * FROM sys.dm_os_sys_memory;

🧠 Think

πŸ‘‰ Is memory sufficient?


πŸ§ͺ Diagnosis

  • Memory pressure

  • Too many large queries


πŸ’Š Treatment

  • Add RAM

  • Optimize queries

  • Reduce cache bloat


πŸ›‘ Prevention

  • Monitor memory usage


✅ Outcome

  • Stable performance


2, Building eEnterprise-scale Architectural Design scenarios


🏒 SCENARIO 1: Global E-Commerce Platform (High Throughput OLTP System)


🧬 System Anatomy

  • 50,000+ concurrent users

  • OLTP workload (orders, payments, carts)

  • SQL Server primary database (~2 TB)

  • Microservices architecture


🚨 Symptoms

  • Checkout latency spikes during peak traffic

  • Deadlocks increasing

  • CPU consistently above 85%


πŸ” Investigation

Run:

SELECT TOP 10 *
FROM sys.dm_exec_requests
ORDER BY cpu_time DESC;

Check waits:

SELECT TOP 10 *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

πŸ§ͺ Diagnosis

  • Heavy write contention on Orders table

  • Hotspot index (clustered index on identity column)

  • Deadlocks due to concurrent updates


πŸ’Š Treatment (Immediate Fixes)

1. Add Row Versioning

ALTER DATABASE EcommerceDB
SET READ_COMMITTED_SNAPSHOT ON;

2. Optimize Index Design

CREATE NONCLUSTERED INDEX idx_orders_customer
ON Orders(CustomerID)
INCLUDE (OrderDate, Status);

3. Reduce Transaction Scope

  • Break large transactions into smaller ones


πŸ“Š Monitoring

  • Query Store

  • Extended Events for deadlocks


πŸ›‘ Prevention

  • Use retry logic in application

  • Limit transaction time


πŸ— Architecture Upgrade (Enterprise Level)

  • Implement read replicas (Always On Availability Groups)

  • Introduce caching layer (e.g., Redis)

  • Use queue-based writes for heavy operations


✅ Outcome

  • 70% reduction in deadlocks

  • 40% faster checkout


🏦 SCENARIO 2: Banking System (High Consistency + Security)


🧬 System Anatomy

  • Financial transactions (ACID critical)

  • 24/7 uptime requirement

  • Strict auditing


🚨 Symptoms

  • Slow transaction processing

  • Locking and blocking

  • Audit queries slow


πŸ” Investigation

SELECT * FROM sys.dm_tran_locks;

πŸ§ͺ Diagnosis

  • Overuse of SERIALIZABLE isolation

  • Large audit table without partitioning


πŸ’Š Treatment

1. Optimize Isolation Levels

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

2. Partition Audit Table

CREATE PARTITION FUNCTION pf_AuditDate (DATE)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2025-01-01');

3. Add Filtered Index

CREATE INDEX idx_audit_recent
ON AuditLogs(Date)
WHERE Date > '2025-01-01';

πŸ“Š Monitoring

  • Transaction latency

  • Lock wait time


πŸ›‘ Prevention

  • Periodic archiving

  • Index review


πŸ— Architecture Upgrade

  • Use Always On Availability Groups

  • Separate OLTP and reporting databases

  • Implement data encryption (TDE)


✅ Outcome

  • Faster transactions

  • Reduced blocking


πŸ₯ SCENARIO 3: Healthcare Data System (Hybrid OLTP + Analytics)


🧬 System Anatomy

  • Patient records + analytics queries

  • Mixed workload (reads + writes)

  • Large historical data


🚨 Symptoms

  • Reports slow during business hours

  • Doctors complain system lags


πŸ” Investigation

SELECT *
FROM sys.dm_exec_query_stats;

πŸ§ͺ Diagnosis

  • Analytical queries competing with OLTP

  • Table scans on large datasets


πŸ’Š Treatment

1. Create Columnstore Index

CREATE CLUSTERED COLUMNSTORE INDEX cci_visits
ON Visits;

2. Offload Reporting

  • Use read replica


3. Data Partitioning

CREATE PARTITION SCHEME ps_Visits
AS PARTITION pf_AuditDate ALL TO ([PRIMARY]);

πŸ“Š Monitoring

  • Query duration

  • IO usage


πŸ›‘ Prevention

  • Separate workloads

  • Schedule heavy queries


πŸ— Architecture Upgrade

  • Introduce data warehouse (ETL pipeline)

  • Use Azure Synapse / Fabric (if cloud)


✅ Outcome

  • Reports run faster

  • OLTP unaffected


πŸ“‘ SCENARIO 4: Telecom System (Massive Data Ingestion)


🧬 System Anatomy

  • Millions of records per minute

  • Logging, call records, events

  • Append-only workload


🚨 Symptoms

  • Insert performance slowing

  • Log file growing rapidly


πŸ” Investigation

SELECT * FROM sys.dm_db_log_space_usage;

πŸ§ͺ Diagnosis

  • Transaction log bottleneck

  • Too many indexes slowing inserts


πŸ’Š Treatment

1. Minimal Index Strategy

  • Drop unnecessary indexes


2. Batch Inserts

INSERT INTO Logs
SELECT * FROM StagingTable;

3. Switch to SIMPLE recovery (if safe)

ALTER DATABASE TelecomDB SET RECOVERY SIMPLE;

πŸ“Š Monitoring

  • Insert throughput

  • Log usage


πŸ›‘ Prevention

  • Use staging tables

  • Regular log backups


πŸ— Architecture Upgrade

  • Use partitioned tables

  • Implement data streaming (Kafka, Event Hub)

  • Archive old data


✅ Outcome

  • Faster ingestion

  • Stable log growth


☁️ SCENARIO 5: SaaS Multi-Tenant System


🧬 System Anatomy

  • Thousands of tenants

  • Shared database

  • Mixed workloads


🚨 Symptoms

  • One tenant slows entire system

  • Uneven performance


πŸ” Investigation

SELECT *
FROM sys.dm_exec_sessions;

πŸ§ͺ Diagnosis

  • No tenant isolation

  • Resource contention


πŸ’Š Treatment

1. Add Tenant Filtering Index

CREATE INDEX idx_tenant
ON Orders(TenantID);

2. Resource Governor

  • Limit heavy tenants


3. Query Optimization per tenant


πŸ“Š Monitoring

  • Per-tenant performance


πŸ›‘ Prevention

  • Workload isolation


πŸ— Architecture Upgrade

  • Move to database-per-tenant model

  • Use elastic pools (cloud)

  • Implement sharding


✅ Outcome

  • Fair resource usage

  • Predictable performance


🌐 SCENARIO 6: Disaster Recovery & High Availability


🧬 System Anatomy

  • Mission-critical system

  • 99.99% uptime required


🚨 Symptoms

  • Server crash

  • Data loss risk


πŸ” Investigation

  • Check backup history


πŸ§ͺ Diagnosis

  • No proper DR strategy


πŸ’Š Treatment

1. Setup Always On

-- Configure Availability Group (conceptual)

2. Regular Backups

BACKUP DATABASE YourDB TO DISK = 'full.bak';

3. Test restore


πŸ“Š Monitoring

  • Failover readiness


πŸ›‘ Prevention

  • DR drills

  • Backup verification


πŸ— Architecture Upgrade

  • Multi-region replication

  • Automated failover


✅ Outcome

  • Zero/low downtime

  • Business continuity

No comments:

Post a Comment

The Ultimate Guide to SAP ASE (Sybase ASE) Administration and Management-Part 5

  The Ultimate Guide to SAP ASE (Sybase ASE) Administration and Management-Part 5 Part 5 — Security, Auditing, Automation Frameworks, Capaci...