Your Redshift Cluster Is Probably Idle 85% of the Time — And You're Paying for All of It

A production Redshift cluster with a single node running Zero ETL integrations. At 99% disk capacity with recurring storage crises requiring frequent manual intervention. When I ran a workload analysis on this cluster, the result was stark. It was actively processing queries for only 15% of the day. The remaining 85% was idle billing at full hourly rate, doing nothing. This utilization metric is what this article is all about. Because it changes the entire cost comparison between Redshift Provisioned and Redshift Serverless. And most teams fail to measure it. The Question Nobody Answers Concretely "Should we use Provisioned or Serverless?" is one of the most common Redshift questions I encounter. The answer is always some variation of ‘it depends on the workload’. Which is true but not helpful without a method to actually measure your workload and calculate where the crossover point is. This article gives you that method. A SQL query to assess your actual utilization, a formula to calculate your breakeven point, and two production scenarios showing how the math plays out including the stakeholder conversations that almost derailed both decisions. Step 1: Measure Your Actual Utilization Before any cost comparison makes sense, you need to know what your cluster is actually doing. This query segments your workload over the last 7 days into three tiers -- small ( 500GB), and tells you what percentage of each day each tier is active: \n -- Workload Assessment Query: run on your provisioned cluster (last 7 days) -- Segments queries into Small/Medium/Large by data scanned -- Key output: perc_duration_in_day — your utilization percentage WITH scan_list AS (     -- Max bytes scanned per query     SELECT query, MAX(bytes) AS max_scan_bytes     FROM stl_scan WHERE userid > 1     GROUP BY query ), query_list AS (     -- Classify each query by scan size     SELECT w.query, exec_start_time, exec_end_time,         ROUND(total_exec_time/1000/1000.0, 3) AS exec_sec,         max_scan_bytes,         CASE             WHEN max_scan_bytes < 100000000         THEN 'small'   -- < 100 MB             WHEN max_scan_bytes < 500000000000       THEN 'medium'  -- 100 MB – 500 GB             ELSE                                          'large'   -- > 500 GB         END AS size_type     FROM stl_wlm_query w     JOIN scan_list sc ON sc.query = w.query     WHERE exec_start_time >= GETDATE() - 7 ), utilization AS (     -- % of day each workload tier is active (1440 minutes in a day)     -- ... [see full query for hour_list, hour_list_agg, utilization_perc CTEs]     SELECT         AVG(small_activity_perc)  AS avg_small_activity_perc,         AVG(medium_activity_perc) AS avg_medium_activity_perc,         AVG(large_activity_perc)  AS avg_large_activity_perc     FROM utilization_perc  -- derived from minute-level activity bucketing ) -- Final output: one row per workload tier SELECT     workload_type,     perc_of_total_workload,   -- % of total execution time     perc_duration_in_day,     -- YOUR UTILIZATION NUMBER — compare to breakeven     total_query_minutes_in_day, -- used to estimate RPU-hours for Serverless     workload_exec_sec_avg,    -- performance baseline     workload_exec_sec_max,    -- worst-case latency     query_cnt,                -- volume indicator     scan_bytes_avg            -- I/O pattern driving RPU consumption FROM [workload_summary] ORDER BY workload_size_rank; \ What this query delivers: | Metric | What it tells you | Decision impact | |----|----|----| | perc of total workload | % of execution time per size tier | Identifies your dominant workload type | | perc duration in day | % of day each tier is active | Utilization number. Compare directly to breakeven number | | total query minutes in day | Active minutes per tier | Estimates RPU hours needed for Serverless pricing | | workload exec sec avg/max | Avg and max execution time | Performance baseline for post migration comparison | | scan bytes_avg | Average bytes scanned | I/O patterns driving RPU consumption | This is not the full query, but it should give you an idea of how to measure the utilization. The number you care most about from this query is perc duration in_day. The percentage time of the day each workload category is active. That is your utilization data. Step 2: Calculate Your Breakeven Point The breakeven point is the utilization percentage at which Provisioned and Serverless cost exactly the same. Below it, Serverless saves money. Above that, Provisioned is cheaper. Provisioned Monthly = Nodes × On-Demand $/hr × 730 hours Serverless Hourly = RPU × $0.375/RPU-hr \n Breakeven Hours = Provisioned Monthly / Serverless Hourly Breakeven Utilization = Breakeven Hours / 730 × 100 Example 1 — ra3.large (2 nodes) vs 8 RPU Serverless: Provisioned = 2 × $0.250 × 730 = $365/month Serverless/hr = 8 × $0.375 = $3.00/hour \n Breakeven = $365 / $3.00 = 122 hours/month Breakeven Utilization = 122 / 730 = 16.7% If your cluster is active less than 17% of the time, Serverless is cheaper. Full stop. Example 2 — ra3.xlplus (2 nodes) vs 32 RPU Serverless: Provisioned = 2 × $1.086 × 730 = $1,586/month Serverless/hr = 32 × $0.375 = $12.00/hour \n Breakeven = $1,586 / $12.00 = 132 hours/month Breakeven Utilization = 132 / 730 = 18.1% The breakeven threshold is remarkably consistent across node types typically landing between 16–20% utilization regardless of cluster size, which means the utilization question matters far more than which node type you're running. Two Production Clusters — Two Very Different Conversations Case 1: The SaaS Platform (ra3.large, single node) 15% utilization. 99% disk capacity. Recurring storage crises. I would not consider this a production scale design, a single node is a point of failure for a production cluster. The math was clear before I finished the analysis. At 15% utilization below the breakeven point, also adding another node will reduce the utilization even more. Serverless was already cheaper on infrastructure alone with a single node. When I ran the 2-node comparison, the case for Serverless became stronger. For this cluster, here is what the above returned: | Workload | % of Day Active | Queries/Day | |----|----|----| | Small ( 500GB) | 0.0% | 0 | Decision: The utilization data strongly favors Serverless. | Configuration | Monthly Cost | Verdict | |----|----|----| | ra3.large 1 node — current | $182 | At capacity | | ra3.large 2 nodes | $365 | 2× cost | | Serverless 16 RPU, 8h/day | $180 | Same cost, no storage concern as it uses Managed Storage | | Serverless 32 RPU, 8h/day | $310 | Headroom included. Still Cheaper than the standard design. | Note : The correct 2-node Provisioned baseline costs twice as much as Serverless while still requiring manual storage management. Case 2: The Enterprise Data Warehouse (ra3.xlplus, 4 nodes) 60–75% weekday utilization. 12,000–17,000 transactions per day. WLM queue times spiking to 15+ minutes at peak. The numbers pointed clearly toward staying Provisioned with Reserved Instances. Then the stakeholder conversations started. The DBA and infrastructure team pushed back on Serverless. Not on the cost math, but on control. Serverless takes most of the database administration work, so the org standard practices of workload management is thrown out on the day of migration . That's a legitimate concern in a traditional enterprise. Operational muscle memory built around cluster management is a real migration cost that doesn't show up in any pricing calculator. The engineering team challenged the breakeven analysis itself. Seven days of data isn't representative, our peak periods are quarterly, not weekly. Also legitimate. In enterprise settings, a 7-day window can miss month end processing spikes, quarter close reporting surges, or seasonal peaks that would push true utilization well above the measured 15–20%. Both objections were valid, which led to a third option nobody had initially considered. The Hybrid Architecture: Redshift Data Sharing Instead of choosing between Provisioned and Serverless, the recommended architecture used both with Redshift Data Sharing to separate workloads by purpose: Provisioned cluster handles all ETL, CDC, and heavy transformation workload. The high utilization, predictable work that Reserved Instances price efficiently Serverless workgroup serves all BI and reporting consumers via a shared data namespace. Auto scaling for dashboard concurrency spikes without competing with ETL compute This addressed both objections directly. The DBA team retained full WLM control over the ETL cluster. The Serverless layer handled BI burst traffic without requiring the engineering team to trust that a 7 day utilization sample represented their full workload profile. | Configuration | Monthly Cost | Notes | |----|----|----| | ra3.xlplus 4 nodes — current | $3,170 | High utilization justified | | Serverless 64 RPU, ~10h/day | $4,800 | Expensive, loses cluster management control | | Provisioned + Reserved Instances | $2,219 | ETL cluster : 30% RI savings | | Serverless workgroup (BI only) | ~$400–600 | BI serving layer auto-scales | | Hybrid total | ~$2,600–2,800 | Recommended | Decision : Hybrid architecture — Provisioned for ETL, Serverless for BI via Data Sharing. The hybrid came in cheaper than the current setup, eliminated WLM queue time spikes for BI users, and didn't require the DBA team to give up cluster control for their core workload. \ The Decision Framework Run your workload query. Calculate your breakeven. Then use this: | Situation | Recommendation | |----|----| | Utilization 50%, mixed ETL + BI workloads | Hybrid. Provisioned ETL + Serverless BI via Data Sharing | | Utilization >75%, 24/7 continuous | Provisioned + Reserved Instances (55% savings) | The hybrid is the option most comparison articles leave out entirely, and in enterprise settings it's often the most practical answer. What to watch after migration If you move to Serverless or the hybrid architecture, monitor these from day one: \ -- Track RPU consumption patterns post-migration SELECT DATE_TRUNC('hour', start_time) AS hour,     SUM(elapsed_time)/1000000.0 AS total_seconds,     COUNT(*) AS query_count,     AVG(elapsed_time)/1000000.0 AS avg_seconds FROM sys_query_history WHERE start_time >= DATEADD(day, -7, GETDATE()) GROUP BY 1 ORDER BY 1; Watch ComputeSeconds for billing trend, QueryDuration for latency spikes signaling your base RPU needs increasing, and DataStorage for unexpected growth. For the hybrid architecture, monitor both the provisioned cluster's WLM queue times and the Serverless workgroup's RPU consumption side by side to compare the results. The Number That Changes Everything Most teams decide between Provisioned and Serverless based on intuition "we have a lot of queries" or "we only run reports in the morning." The utilization percentage cuts through that intuition with a number you can act on. But as both case studies show, math is only the beginning of the conversation in enterprise settings. The DBA team's concern about losing cluster management/WLM control was legitimate. The engineering team's skepticism about a 7 day sample was legitimate. The right answer wasn't to override those objections with better data, it was to find an architecture that made the objections irrelevant. Pricing based on US East region, 2026 rates. Validate against current AWS pricing before budgeting. Use the AWS Cost Calculator with your specific region and usage profile. Run the query. Calculate your breakeven. Then have the stakeholder conversation with the numbers in hand rather than after it. \n \

View original source — Hacker Noon ↗

ShareShare on X Share on Facebook

Why 95% of enterprise GPUs sit idle while AI startups can't get compute

TechRadar

TechnologyJun 5, 2026 · 1 min

Your Redshift Cluster Is Probably Idle 85% of the Time — And You're Paying for All of It

Related stories

Why 95% of enterprise GPUs sit idle while AI startups can't get compute

Gas injection rates in European storages 20% below last year in June

Curing the Multi Agent Hallucination Contagion in Production Clusters

Crusoe Touts 5 Gigawatts’ Worth of Data Centers Deals, Pauses Wyoming Site