Enterprise-Scale SQL Server Common Production Troubleshooting and Architectural Design scenarios
π₯ SCENARIO 1: “The Slow Morning Report” (High CPU Issue)
π¨ Symptom (Patient Complaint)
Every morning at 9 AM, reports become extremely slow
CPU spikes to 90–100%
Users complain system is “frozen”
π Step 1: Initial Assessment
Run:
EXEC sp_who2;
SELECT TOP 5
total_worker_time/execution_count AS avg_cpu,
execution_count,
text
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_cpu DESC;
π§ Think First
π What is causing CPU spike only at 9 AM?
π§ͺ Diagnosis
You discover:
A reporting query runs every morning
Query scans millions of rows
Uses
SELECT *No proper indexes
π Treatment
Fix 1: Create covering index
CREATE NONCLUSTERED INDEX idx_report
ON Sales(OrderDate)
INCLUDE (CustomerID, TotalAmount);
Fix 2: Optimize query
SELECT CustomerID, SUM(TotalAmount)
FROM Sales
WHERE OrderDate >= '2025-01-01'
GROUP BY CustomerID;
π Monitoring
SELECT *
FROM sys.dm_exec_query_stats;
π‘ Prevention
Schedule reports off-peak
Use Query Store
Avoid
SELECT *
✅ Outcome
CPU drops from 95% → 40%
Report runs in seconds
π₯ SCENARIO 2: “The Frozen Checkout System” (Blocking Issue)
π¨ Symptom
Users cannot complete transactions
Application “hangs”
Queries waiting indefinitely
π Step 1: Check Blocking
SELECT blocking_session_id, wait_type, wait_time, session_id
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
π§ Think
π What causes blocking in SQL Server?
π§ͺ Diagnosis
You find:
One transaction running for 10 minutes
It is holding locks
Other queries are waiting
π Treatment
Immediate Fix
KILL <blocking_session_id>;
Root Fix
Bad code:
BEGIN TRAN
UPDATE Orders SET Status = 'Processed'
-- No COMMIT
Fix:
BEGIN TRAN
UPDATE Orders SET Status = 'Processed'
COMMIT;
π Monitoring
SELECT * FROM sys.dm_tran_locks;
π‘ Prevention
Keep transactions short
Use proper indexing
Use READ COMMITTED SNAPSHOT
✅ Outcome
System unfreezes
Users can transact
π₯ SCENARIO 3: “The Disk is Always Full” (Storage & Growth Issue)
π¨ Symptom
Database size growing rapidly
Disk almost full
System slowdown
π Step 1: Check Database Size
EXEC sp_spaceused;
π§ Think
π What grows a database?
π§ͺ Diagnosis
You discover:
Large unused indexes
Old data not archived
Transaction log not shrinking
π Treatment
Fix 1: Remove unused index
DROP INDEX idx_unused ON Orders;
Fix 2: Backup and shrink log
BACKUP LOG YourDB TO DISK = 'log.bak';
DBCC SHRINKFILE (YourDB_log, 1000);
Fix 3: Archive old data
DELETE FROM Orders
WHERE OrderDate < '2020-01-01';
π Monitoring
SELECT * FROM sys.master_files;
π‘ Prevention
Regular cleanup jobs
Monitor file growth
Use proper autogrowth settings
✅ Outcome
Disk usage reduced
Performance improves
π₯ SCENARIO 4: “Random Slow Queries” (Parameter Sniffing)
π¨ Symptom
Same query sometimes fast, sometimes slow
No code changes
π Step 1: Capture Query Plan
Enable execution plan and run query multiple times.
π§ Think
π Why inconsistent performance?
π§ͺ Diagnosis
Parameter sniffing issue
SQL reuses bad execution plan
π Treatment
Fix 1: OPTION RECOMPILE
SELECT *
FROM Orders
WHERE CustomerID = @CustomerID
OPTION (RECOMPILE);
Fix 2: Use local variable
DECLARE @C INT = @CustomerID;
SELECT *
FROM Orders
WHERE CustomerID = @C;
π Monitoring
Use Query Store to compare plans.
π‘ Prevention
Use OPTIMIZE FOR
Monitor plan cache
✅ Outcome
Stable performance
π₯ SCENARIO 5: “Deadlock Crisis” (Concurrency Emergency)
π¨ Symptom
Errors: “Deadlock victim”
Transactions fail randomly
π Step 1: Capture Deadlock
SELECT * FROM sys.dm_tran_locks;
π§ Think
π What causes deadlocks?
π§ͺ Diagnosis
Two queries:
Access tables in different order
π Treatment
Fix 1: Standardize access order
-- Always access Customers first, then Orders
Fix 2: Use deadlock priority
SET DEADLOCK_PRIORITY LOW;
π Monitoring
Use Extended Events.
π‘ Prevention
Consistent query patterns
Proper indexing
✅ Outcome
Deadlocks eliminated
π₯ SCENARIO 6: “The Silent Performance Killer” (Missing Indexes)
π¨ Symptom
Gradual slowdown
No obvious issue
π Step 1: Check Missing Indexes
SELECT *
FROM sys.dm_db_missing_index_details;
π§ Think
π Are all indexes helpful?
π§ͺ Diagnosis
Important indexes missing
π Treatment
CREATE NONCLUSTERED INDEX idx_missing
ON Orders(CustomerID);
π Monitoring
Compare query performance before/after.
π‘ Prevention
Regular index review
✅ Outcome
Faster queries
π₯ SCENARIO 7: “The Memory Pressure Case”
π¨ Symptom
Queries slow under load
High PAGE LIFE EXPECTANCY drop
π Step 1: Check Memory
SELECT * FROM sys.dm_os_sys_memory;
π§ Think
π Is memory sufficient?
π§ͺ Diagnosis
Memory pressure
Too many large queries
π Treatment
Add RAM
Optimize queries
Reduce cache bloat
π‘ Prevention
Monitor memory usage
✅ Outcome
Stable performance
2, Building eEnterprise-scale Architectural Design scenarios
π’ SCENARIO 1: Global E-Commerce Platform (High Throughput OLTP System)
𧬠System Anatomy
50,000+ concurrent users
OLTP workload (orders, payments, carts)
SQL Server primary database (~2 TB)
Microservices architecture
π¨ Symptoms
Checkout latency spikes during peak traffic
Deadlocks increasing
CPU consistently above 85%
π Investigation
Run:
SELECT TOP 10 *
FROM sys.dm_exec_requests
ORDER BY cpu_time DESC;
Check waits:
SELECT TOP 10 *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
π§ͺ Diagnosis
Heavy write contention on Orders table
Hotspot index (clustered index on identity column)
Deadlocks due to concurrent updates
π Treatment (Immediate Fixes)
1. Add Row Versioning
ALTER DATABASE EcommerceDB
SET READ_COMMITTED_SNAPSHOT ON;
2. Optimize Index Design
CREATE NONCLUSTERED INDEX idx_orders_customer
ON Orders(CustomerID)
INCLUDE (OrderDate, Status);
3. Reduce Transaction Scope
Break large transactions into smaller ones
π Monitoring
Query Store
Extended Events for deadlocks
π‘ Prevention
Use retry logic in application
Limit transaction time
π Architecture Upgrade (Enterprise Level)
Implement read replicas (Always On Availability Groups)
Introduce caching layer (e.g., Redis)
Use queue-based writes for heavy operations
✅ Outcome
70% reduction in deadlocks
40% faster checkout
π¦ SCENARIO 2: Banking System (High Consistency + Security)
𧬠System Anatomy
Financial transactions (ACID critical)
24/7 uptime requirement
Strict auditing
π¨ Symptoms
Slow transaction processing
Locking and blocking
Audit queries slow
π Investigation
SELECT * FROM sys.dm_tran_locks;
π§ͺ Diagnosis
Overuse of SERIALIZABLE isolation
Large audit table without partitioning
π Treatment
1. Optimize Isolation Levels
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
2. Partition Audit Table
CREATE PARTITION FUNCTION pf_AuditDate (DATE)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2025-01-01');
3. Add Filtered Index
CREATE INDEX idx_audit_recent
ON AuditLogs(Date)
WHERE Date > '2025-01-01';
π Monitoring
Transaction latency
Lock wait time
π‘ Prevention
Periodic archiving
Index review
π Architecture Upgrade
Use Always On Availability Groups
Separate OLTP and reporting databases
Implement data encryption (TDE)
✅ Outcome
Faster transactions
Reduced blocking
π₯ SCENARIO 3: Healthcare Data System (Hybrid OLTP + Analytics)
𧬠System Anatomy
Patient records + analytics queries
Mixed workload (reads + writes)
Large historical data
π¨ Symptoms
Reports slow during business hours
Doctors complain system lags
π Investigation
SELECT *
FROM sys.dm_exec_query_stats;
π§ͺ Diagnosis
Analytical queries competing with OLTP
Table scans on large datasets
π Treatment
1. Create Columnstore Index
CREATE CLUSTERED COLUMNSTORE INDEX cci_visits
ON Visits;
2. Offload Reporting
Use read replica
3. Data Partitioning
CREATE PARTITION SCHEME ps_Visits
AS PARTITION pf_AuditDate ALL TO ([PRIMARY]);
π Monitoring
Query duration
IO usage
π‘ Prevention
Separate workloads
Schedule heavy queries
π Architecture Upgrade
Introduce data warehouse (ETL pipeline)
Use Azure Synapse / Fabric (if cloud)
✅ Outcome
Reports run faster
OLTP unaffected
π‘ SCENARIO 4: Telecom System (Massive Data Ingestion)
𧬠System Anatomy
Millions of records per minute
Logging, call records, events
Append-only workload
π¨ Symptoms
Insert performance slowing
Log file growing rapidly
π Investigation
SELECT * FROM sys.dm_db_log_space_usage;
π§ͺ Diagnosis
Transaction log bottleneck
Too many indexes slowing inserts
π Treatment
1. Minimal Index Strategy
Drop unnecessary indexes
2. Batch Inserts
INSERT INTO Logs
SELECT * FROM StagingTable;
3. Switch to SIMPLE recovery (if safe)
ALTER DATABASE TelecomDB SET RECOVERY SIMPLE;
π Monitoring
Insert throughput
Log usage
π‘ Prevention
Use staging tables
Regular log backups
π Architecture Upgrade
Use partitioned tables
Implement data streaming (Kafka, Event Hub)
Archive old data
✅ Outcome
Faster ingestion
Stable log growth
☁️ SCENARIO 5: SaaS Multi-Tenant System
𧬠System Anatomy
Thousands of tenants
Shared database
Mixed workloads
π¨ Symptoms
One tenant slows entire system
Uneven performance
π Investigation
SELECT *
FROM sys.dm_exec_sessions;
π§ͺ Diagnosis
No tenant isolation
Resource contention
π Treatment
1. Add Tenant Filtering Index
CREATE INDEX idx_tenant
ON Orders(TenantID);
2. Resource Governor
Limit heavy tenants
3. Query Optimization per tenant
π Monitoring
Per-tenant performance
π‘ Prevention
Workload isolation
π Architecture Upgrade
Move to database-per-tenant model
Use elastic pools (cloud)
Implement sharding
✅ Outcome
Fair resource usage
Predictable performance
π SCENARIO 6: Disaster Recovery & High Availability
𧬠System Anatomy
Mission-critical system
99.99% uptime required
π¨ Symptoms
Server crash
Data loss risk
π Investigation
Check backup history
π§ͺ Diagnosis
No proper DR strategy
π Treatment
1. Setup Always On
-- Configure Availability Group (conceptual)
2. Regular Backups
BACKUP DATABASE YourDB TO DISK = 'full.bak';
3. Test restore
π Monitoring
Failover readiness
π‘ Prevention
DR drills
Backup verification
π Architecture Upgrade
Multi-region replication
Automated failover
✅ Outcome
Zero/low downtime
Business continuity
No comments:
Post a Comment