Monday, April 27, 2026

The Ultimate Guide to AWS RDS SQL Server Administration WITH VLDBs(Part 2)

 


The Ultimate Guide to AWS RDS SQL Server Administration WITH VLDBs(Part 2)

9. Transaction Log Management at Scale

In a 100TB environment, the Transaction Log (T-Log) is often the most volatile component. If it fills up, the entire database goes into a "Read-Only" or "Pending" state, causing a massive outage.

A. How RDS Handles Logs

In AWS RDS, you do not have access to the TRUNCATE ONLY command or the ability to change recovery models to SIMPLE. RDS SQL Server is always in Full Recovery Mode to ensure Point-in-Time Recovery (PITR).

  • Log Backups: RDS automatically backs up your logs to S3 every 5 minutes.

  • The Log "Full" Risk: If you run a massive data import (e.g., 5TB at once), the log will grow faster than RDS can back it up.

B. Prevention Strategies

  1. Batching: Never update 100 million rows in one statement. Use a loop to update 5,000 rows at a time and call CHECKPOINT.

  2. Monitoring: Set a CloudWatch Alarm for DiskQueueDepth and FreeStorageSpace.

Script: Check Which Database is Eating Log Space


SELECT 
    name AS DatabaseName,
    CAST(log_reuse_wait_desc AS VARCHAR(30)) AS [Wait Description],
    DB_CHKSUM = (SELECT TOP 1 CAST(s.size/128.0 AS DECIMAL(10,2)) 
                 FROM sys.master_files s 
                 WHERE s.database_id = d.database_id AND s.type = 1)
FROM sys.databases d;

Note: If Wait Description says LOG_BACKUP, it means RDS is currently moving the log to S3. If it says ACTIVE_TRANSACTION, someone left a query running!


10. Advanced Table Partitioning (The 100TB Necessity)

You cannot manage 100TB as a single "flat" table. It is physically impossible to maintain. Partitioning breaks a massive table into smaller, manageable chunks (usually by date).

Why Partition?

  • Maintenance: You can rebuild indexes on a single month of data instead of the whole 10 years.

  • Data Archival: You can "Switch Out" a partition to an archive table and drop it instantly, rather than running a DELETE command that takes 20 hours.

The Logic:

  1. Partition Function: Defines the boundaries (e.g., Every month).

  2. Partition Scheme: Maps the boundaries to filegroups (In RDS, these map to the primary storage).


11. Troubleshooting with Wait Statistics

When your boss asks, "Why is the database slow?", "Wait Statistics" provide the factual answer. They tell you exactly what the CPU was doing while it was waiting.

Common Wait Types in RDS:

  • PAGEIOLATCH_SH/EX: The disk is too slow. You need more IOPS or more RAM (to keep data in memory).

  • CXPACKET: Parallelism is misconfigured. Re-check your MAXDOP setting.

  • SOS_SCHEDULER_YIELD: The CPU is overwhelmed. You need a larger Instance Class (e.g., moving from an R5 to an R6i).

  • LCK_M_X: Extreme blocking. One user is holding a lock and stopping everyone else.

Script: Finding the Top Wait Types


SELECT TOP 10
    wait_type,
    wait_time_ms / 1000.0 AS WaitSec,
    (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceWaitSec,
    signal_wait_time_ms / 1000.0 AS SignalWaitSec,
    max_wait_time_ms,
    percentage = 100.0 * wait_time_ms / SUM(wait_time_ms) OVER()
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK')
ORDER BY wait_time_ms DESC;

12. Maintenance for Very Large Databases (VLDBs)

Standard maintenance scripts will fail on a 100TB database. They will run forever and block users.

A. Smart Indexing

Do not rebuild all indexes every night. Only target fragmented ones.

  • Fragmentation < 30%: Use REORGANIZE (Online, low impact).

  • Fragmentation > 30%: Use REBUILD (Use the ONLINE = ON flag if using Enterprise Edition).

B. Statistics: The Optimizer's Map

The SQL Query Optimizer is only as good as its statistics. On a 100TB table, the "Auto Update Stats" might only sample 1% of the data, leading to bad execution plans.

  • Solution: For your largest tables, schedule a weekly UPDATE STATISTICS TableName WITH FULLSCAN.

Script: Find Highly Fragmented Indexes


SELECT 
    dbschemas.[name] as 'Schema', 
    dbtables.[name] as 'Table', 
    dbindexes.[name] as 'Index',
    indexstats.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, 'LIMITED') AS indexstats
INNER JOIN sys.tables dbtables on dbtables.[object_id] = indexstats.[object_id]
INNER JOIN sys.schemas dbschemas on dbtables.[schema_id] = dbschemas.[schema_id]
INNER JOIN sys.indexes dbindexes on dbindexes.[object_id] = indexstats.[object_id]
    AND indexstats.index_id = dbindexes.index_id
WHERE indexstats.avg_fragmentation_in_percent > 20
ORDER BY indexstats.avg_fragmentation_in_percent DESC;

13. Cost Optimization at Scale

Running a 100TB RDS instance is expensive. As a DBA, you are expected to save the company money.

  1. Reserved Instances (RI): If you know you'll be running this DB for a year, commit to it. This can save up to 60% compared to On-Demand pricing.

  2. Storage Type Optimization: Check if you truly need io2 Block Express. Sometimes, a well-tuned gp3 volume with high provisioned throughput is enough and costs significantly less.

  3. Delete Unused Snapshots: Manual snapshots incur costs. Create a "Snapshot Lifecycle" to delete ones older than 90 days.


14. AWS Tools Every DBA Needs

Beyond the SQL Server engine, you must master these AWS-specific tools:

  • AWS Systems Manager (SSM): Use "Parameter Store" to hold connection strings and "Maintenance Windows" to automate scripts without a local agent.

  • AWS Lambda: Write simple Python scripts to turn off "Dev/Test" RDS instances at night to save money.

  • Amazon EventBridge: Set up an alert so that if an RDS Failover occurs, you get a text message (via SNS) immediately.


15. Summary Checklist for the Beginner DBA

If you are alone and managing a mission-critical system, follow this daily rhythm:

TimeTaskTool
08:00 AMCheck "Performance Insights" for overnight spikes.AWS Console
09:00 AMReview "Error Logs" for failed logins or deadlocks.SSMS / RDS Logs
12:00 PMMonitor Free Storage Space.CloudWatch
WeeklyRun Index Fragmentation report and update statistics.SQL Scripts
MonthlyReview AWS Cost Explorer to find waste.AWS Billing

Final Thoughts

Managing 10TB to 100TB on AWS RDS is a journey of constant refinement. Start simple: get your backups and security right first. Then, move into the "Dark Arts" of performance tuning and partitioning.

You don't need a senior mentor standing over your shoulder if you have a solid grasp of Wait Statistics and CloudWatch Alarms. They are the "heart rate monitor" of your database.

The Ultimate Guide to AWS RDS SQL Server Administration WITH VLDBs(Part 2)

  The Ultimate Guide to AWS RDS SQL Server Administration WITH VLDBs(Part 2) 9. Transaction Log Management at Scale In a 100TB environment, ...