RAG Energy Measurement Framework

Tags: energy, RAG, measurement, distributed-systems, python

Retrieval-Augmented Generation (RAG) systems are deployed as multi-node pipelines, yet their energy footprint remains opaque: existing benchmarks report only end-to-end totals, hiding where energy is actually spent. This framework instruments a 4-node RAG deployment (CITI Knowledge Management System, KMS) to attribute energy at three granularities: per module, per node, and per processing stage. Applied to a 2³ full factorial experiment (1,200 queries × 8 configurations = 9,600 sessions), it reveals that architectural choices, not query volume, dominate energy consumption.

Architecture

The measurement infrastructure wraps each pipeline node with hardware-counter-based energy sensors (RAPL for CPU, NVML for GPU), synchronized through a central orchestrator. Each query execution produces a structured energy trace decomposed by processing stage.

Figure: Measurement architecture. Node-level energy instrumentation: hardware counters capture per-stage energy across the distributed RAG pipeline.
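The per-stage trace described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual code: `StageEnergyTrace`, `counter_delta`, and `fake_counter` are hypothetical names, and the counter is injected as a plain callable so the example runs without hardware. On Linux such a callable would typically read the RAPL `energy_uj` file under `/sys/class/powercap/`, and on NVIDIA GPUs it could wrap NVML's total-energy counter; both are assumptions about how a reader might be wired in, since the source does not show it.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterable, Iterator


def counter_delta(start: int, end: int, max_range: int) -> int:
    """Difference between two readings of a cumulative counter that wraps at max_range."""
    if end >= start:
        return end - start
    # The counter wrapped around between the two readings.
    return end + max_range - start


class StageEnergyTrace:
    """Accumulates energy in joules per named processing stage (hypothetical helper)."""

    def __init__(self, read_uj: Callable[[], int], max_range_uj: int) -> None:
        self._read_uj = read_uj          # returns a cumulative reading in microjoules
        self._max_range_uj = max_range_uj
        self.joules: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._read_uj()
        try:
            yield
        finally:
            end = self._read_uj()
            delta_uj = counter_delta(start, end, self._max_range_uj)
            self.joules[name] = self.joules.get(name, 0.0) + delta_uj / 1e6


def fake_counter(readings: Iterable[int]) -> Callable[[], int]:
    """Stand-in for a hardware counter: replays a fixed sequence of readings."""
    it = iter(readings)
    return lambda: next(it)


# Two stages; the second one spans a counter wrap-around.
trace = StageEnergyTrace(
    read_uj=fake_counter([500_000, 1_500_000, 1_900_000, 100_000]),
    max_range_uj=2_000_000,
)
with trace.stage("embedding"):
    pass  # the embedding call would run here
with trace.stage("retrieval"):
    pass  # the retrieval call would run here
print(trace.joules)
```

Because hardware energy counters are cumulative and wrap, the sketch differences two readings per stage instead of sampling power. A multi-node deployment would additionally need the orchestrator to align stage boundaries across machines, which this sketch leaves out.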

The target system deploys four nodes running distinct pipeline stages: embedding, optional HyDE generation, retrieval with optional reranking, and LLM generation. Three binary module toggles (HyDE, Reranking, Ultra-think) produce 8 distinct configurations evaluated in a full factorial design.

Figure: Node deployment. 4-node CITI KMS deployment topology used for energy attribution.

Figure: Configuration tree. Full 2³ factorial design: 8 configurations from three binary module toggles.
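The full factorial design can be enumerated mechanically. The sketch below is illustrative only: the C1 to C8 labels follow the document, but the mapping from label to toggle values is an assumption, chosen so that HyDE is the slowest-varying toggle (which places the HyDE-enabled configurations at C5–C8, matching the results). The ordering of the other two toggles is a guess.

```python
from itertools import product
from typing import Dict, Tuple

# Toggle names are illustrative; HyDE is listed first so it varies slowest.
MODULES: Tuple[str, ...] = ("hyde", "reranking", "ultra_think")


def factorial_configs(modules: Tuple[str, ...] = MODULES) -> Dict[str, Dict[str, bool]]:
    """Enumerate every on/off combination of the given modules as C1..C(2^k)."""
    configs = {}
    for i, flags in enumerate(product((False, True), repeat=len(modules)), start=1):
        configs[f"C{i}"] = dict(zip(modules, flags))
    return configs


CONFIGS = factorial_configs()
QUERIES = 1_200
SESSIONS = QUERIES * len(CONFIGS)  # 1,200 queries x 8 configurations

for label, toggles in CONFIGS.items():
    enabled = [name for name, on in toggles.items() if on] or ["baseline"]
    print(label, "+".join(enabled))
print("sessions:", SESSIONS)
```

Under this assumed ordering, C1 is the baseline with every module off and C8 is the always-on configuration.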

Results

Energy by Configuration

The 8 configurations split into two distinct energy clusters driven entirely by a single toggle: HyDE.

Figure: Total energy by configuration. Per-request energy across all 8 configurations: HyDE configurations (1,666–1,786 J) vs non-HyDE configurations (318–435 J).

Cluster   Configurations   Energy Range     Defining Feature
High      C5–C8            1,666–1,786 J    HyDE enabled
Low       C1–C4            318–435 J        HyDE disabled

Always-on activation (all modules enabled) costs 5.6× as much energy as the baseline configuration (1,786 J vs 318 J per request), yet the quality improvement remains below 0.06 points on either evaluation metric.
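The headline ratio and the gap between the two clusters follow directly from the reported figures. The short calculation below uses only numbers stated in this section; the variable names are illustrative.

```python
# Per-request energy in joules, as reported in this section.
ALL_ON_J = 1786        # all modules enabled
BASELINE_J = 318       # baseline configuration
HIGH_CLUSTER_MIN_J = 1666   # lowest HyDE-enabled configuration
LOW_CLUSTER_MAX_J = 435     # highest HyDE-disabled configuration

penalty = ALL_ON_J / BASELINE_J                    # always-on vs baseline
extra_joules = ALL_ON_J - BASELINE_J               # absolute extra cost per request
cluster_gap = HIGH_CLUSTER_MIN_J / LOW_CLUSTER_MAX_J  # smallest high/low ratio

print(f"always-on penalty: {penalty:.1f}x ({extra_joules} J extra per request)")
print(f"closest high/low pair still differs by {cluster_gap:.1f}x")
```

Even the cheapest HyDE-enabled configuration uses several times the energy of the most expensive HyDE-disabled one, which is why the two clusters do not overlap.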

Stage-Level Attribution

The stage-level breakdown reveals that HyDE's hypothetical document generation dominates the energy budget, consuming 1,355 J per request, roughly 4× the energy of an entire baseline request (318 J).

Figure: Stage energy breakdown. Energy attribution by processing stage: HyDE generation dwarfs all other stages combined.

Stage                    Energy (J)    Share
HyDE generation          1,355         76% (when enabled)
LLM generation           250–310       14–73%
Retrieval + reranking    45–85         3–20%
Embedding                8–15          <3%
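As a quick arithmetic check on the HyDE row, the share can be recomputed from the per-request totals reported earlier. The 76% figure appears consistent with dividing by the all-modules-enabled total of 1,786 J; that reading is an inference from the numbers, not something the source states. Names in the sketch are illustrative.

```python
HYDE_J = 1355       # HyDE generation energy per request
BASELINE_J = 318    # total for the baseline configuration

# Totals for the two ends of the HyDE-enabled cluster, in joules.
hyde_enabled_totals = {
    "all modules enabled": 1786,
    "lowest HyDE-enabled total": 1666,
}

# HyDE's share of each total, and its size relative to a whole baseline request.
shares = {name: HYDE_J / total for name, total in hyde_enabled_totals.items()}
hyde_vs_baseline = HYDE_J / BASELINE_J

for name, share in shares.items():
    print(f"HyDE share, {name}: {share:.0%}")
print(f"HyDE stage vs whole baseline request: {hyde_vs_baseline:.1f}x")
```

The share rises as the other optional modules are switched off, so the single percentage in the table is best read as a lower bound within the HyDE-enabled cluster.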

Key Findings

  • HyDE dominates: 1,355 J marginal cost per request, producing a clear bimodal energy distribution across configurations
  • Diminishing returns: Activating all optional modules yields a quality improvement below 0.06 points for a 5.6× energy penalty
  • Attribution enables optimization: Stage-level decomposition identifies HyDE as the singular target for energy-aware design
  • Measurement scales: The framework handles 9,600 sessions with per-stage granularity, providing the empirical foundation for routing decisions

Outcomes

The framework provides the first stage-attributed energy dataset for a production-grade modular RAG system. Its primary contribution is not the measurement tooling itself, but the empirical insight it enables: module activation decisions have dramatically asymmetric energy-quality trade-offs. This dataset directly informs the companion project on energy-aware query routing, where per-configuration energy profiles become the optimization target.