RAG Energy Measurement Framework

Tags: energy, RAG, measurement, distributed-systems, python

Retrieval-Augmented Generation systems are deployed as multi-node pipelines, yet their energy footprint remains opaque—existing benchmarks report only end-to-end totals, hiding where energy is actually spent. This framework instruments a 4-node RAG deployment (CITI Knowledge Management System) to attribute energy at three granularities: per module, per node, and per processing stage. Applied to a 2³ factorial experiment (1,200 queries × 8 configurations = 9,600 sessions), it reveals that architectural choices—not query volume—dominate energy consumption.

Architecture

The measurement infrastructure wraps each pipeline node with hardware-counter-based energy sensors (RAPL for CPU, NVML for GPU), synchronized through a central orchestrator. Each query execution produces a structured energy trace decomposed by processing stage.
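As a rough illustration of how a per-stage trace could be built from a hardware counter, the sketch below wraps each stage in a context manager that reads a microjoule counter before and after. It is not the framework's actual code: the `EnergyTrace` class, the `stage` helper, and the stage names are hypothetical, and the sysfs path assumes a Linux host that exposes Intel RAPL through the powercap interface. The counter reader is injected, so the same structure could wrap a GPU reader (NVML reports total energy in millijoules, so the unit conversion would differ).

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumed location of the package-level RAPL counter on a Linux host.
RAPL_ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"


def read_rapl_uj(path: str = RAPL_ENERGY_FILE) -> int:
    """Read the cumulative RAPL energy counter in microjoules."""
    with open(path) as f:
        return int(f.read().strip())


def counter_delta_joules(start_uj: int, end_uj: int, max_range_uj: int) -> float:
    """Energy in joules between two readings of a wrapping counter."""
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:
        # The counter wrapped past its maximum between the two readings.
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj / 1e6


@dataclass
class EnergyTrace:
    """Hypothetical per-query trace: accumulates joules per stage name."""
    read_uj: Callable[[], int]
    max_range_uj: int
    stages: Dict[str, float] = field(default_factory=dict)

    @contextmanager
    def stage(self, name: str):
        start = self.read_uj()
        try:
            yield
        finally:
            end = self.read_uj()
            joules = counter_delta_joules(start, end, self.max_range_uj)
            self.stages[name] = self.stages.get(name, 0.0) + joules


# Usage with a fake counter, so the sketch runs without RAPL hardware.
_fake_readings = iter([1_000_000, 3_500_000, 9_000_000, 500_000])
trace = EnergyTrace(read_uj=lambda: next(_fake_readings), max_range_uj=10_000_000)
with trace.stage("embedding"):
    pass  # the stage's real work would run here
with trace.stage("llm_generation"):
    pass  # the second pair of readings simulates a counter wraparound
```

On a real node, `read_uj` would be `read_rapl_uj` and `max_range_uj` would come from the `max_energy_range_uj` file next to the counter. A single counter read per stage boundary attributes everything the package consumed in that window to the stage, so this only works cleanly when stages do not overlap on the same node.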

Figure: Measurement Architecture. Node-level energy instrumentation architecture: hardware counters capture per-stage energy across the distributed RAG pipeline.

The target system deploys four nodes running distinct pipeline stages: embedding, optional HyDE generation, retrieval with optional reranking, and LLM generation. Three binary module toggles (HyDE, Reranking, Ultra-think) produce 8 distinct configurations evaluated in a full factorial design.

Figure: Node Deployment. 4-node CITI KMS deployment topology used for energy attribution.

Figure: Configuration Tree. Full 2³ factorial design: 8 configurations from three binary module toggles.
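The full factorial design can be enumerated mechanically. The sketch below is one way to do it, with HyDE treated as the most significant toggle so that C1 is the all-off baseline and C5 to C8 have HyDE enabled, which matches the clusters reported in the results. The order of the other two toggles within each half, and the identifier names, are assumptions.

```python
from itertools import product

# Toggle names are illustrative; HyDE comes first so it varies slowest.
TOGGLES = ("hyde", "reranking", "ultra_think")


def factorial_configs(toggles=TOGGLES):
    """Enumerate every on/off combination of the toggles as C1..Cn."""
    configs = {}
    combos = product((False, True), repeat=len(toggles))
    for index, values in enumerate(combos, start=1):
        configs[f"C{index}"] = dict(zip(toggles, values))
    return configs


configs = factorial_configs()
QUERIES_PER_CONFIG = 1200
total_sessions = QUERIES_PER_CONFIG * len(configs)
```

With three toggles this yields 8 configurations, and 1,200 queries per configuration gives the 9,600 sessions of the experiment.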

Results

Energy by Configuration

The 8 configurations split into two distinct energy clusters driven entirely by a single toggle—HyDE.

Figure: Total Energy by Configuration. Per-request energy across all 8 configurations: HyDE configurations (1,666–1,786 J) vs non-HyDE configurations (318–435 J).

| Cluster | Configurations | Energy Range | Defining Feature |
|---------|---------------|-------------|------------------|
| High | C5–C8 | 1,666–1,786 J | HyDE enabled |
| Low | C1–C4 | 318–435 J | HyDE disabled |

Always-on activation (all modules enabled) costs 5.6× the energy of the baseline configuration (1,786 J vs 318 J per request), yet the quality improvement remains below 0.06 points on either evaluation metric.
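The arithmetic behind the 5.6× figure is simple enough to check directly. The short sketch below recomputes it from the two per-request values above and, as an illustration only, scales the difference to the 1,200 queries run per configuration. The variable names are made up for this example.

```python
BASELINE_J = 318          # per-request energy, all modules off
ALL_ON_J = 1786           # per-request energy, all modules on
QUERIES_PER_CONFIG = 1200
JOULES_PER_KWH = 3.6e6

energy_ratio = ALL_ON_J / BASELINE_J            # about 5.6
extra_per_request_j = ALL_ON_J - BASELINE_J     # 1468 J
extra_per_run_kwh = extra_per_request_j * QUERIES_PER_CONFIG / JOULES_PER_KWH

print(f"ratio: {energy_ratio:.1f}x")
print(f"extra per request: {extra_per_request_j} J")
print(f"extra over {QUERIES_PER_CONFIG} queries: {extra_per_run_kwh:.3f} kWh")
```

Always-on activation therefore adds 1,468 J per request, or roughly half a kilowatt-hour over one 1,200-query run.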

Stage-Level Attribution

The stage-level breakdown reveals that HyDE’s hypothetical document generation dominates the energy budget, consuming 1,355 J per request, roughly 4× the energy of an entire baseline request (318 J).

Figure: Stage Energy Breakdown. Energy attribution by processing stage: HyDE generation dwarfs all other stages combined.

| Stage | Energy (J) | Share |
|-------|-----------|-------|
| HyDE generation | 1,355 | 76% (when enabled) |
| LLM generation | 250–310 | 14–73% |
| Retrieval + reranking | 45–85 | 3–20% |
| Embedding | 8–15 | <3% |
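A share column like the one above is just each stage's joules divided by the request total. The sketch below shows that normalization under stated assumptions. The `stage_shares` helper is hypothetical, and the example trace combines the reported HyDE value with the upper ends of the other reported ranges purely for illustration. It is not a measured trace.

```python
from typing import Dict


def stage_shares(stage_energy_j: Dict[str, float]) -> Dict[str, float]:
    """Convert per-stage joules into fractions of the request total."""
    total = sum(stage_energy_j.values())
    if total <= 0:
        raise ValueError("trace has no positive energy to attribute")
    return {stage: joules / total for stage, joules in stage_energy_j.items()}


# Illustrative HyDE-enabled request (placeholder values, not measurements).
example_trace_j = {
    "hyde_generation": 1355.0,
    "llm_generation": 310.0,
    "retrieval_reranking": 85.0,
    "embedding": 15.0,
}

shares = stage_shares(example_trace_j)
dominant_stage = max(shares, key=shares.get)
```

Because the shares are relative to each request's own total, the same stage can have very different shares in different configurations, which is why the LLM generation share spans such a wide range in the table.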

Key Findings

  • HyDE dominates: 1,355J marginal cost per request, producing a clear bimodal energy distribution across configurations
  • Diminishing returns: Activating all optional modules improves quality by less than 0.06 points while costing 5.6× the baseline energy
  • Attribution enables optimization: Stage-level decomposition identifies HyDE as the singular target for energy-aware design
  • Measurement scales: The framework handles 9,600 sessions with per-stage granularity, providing the empirical foundation for routing decisions

Outcomes

The framework provides the first stage-attributed energy dataset for a production-grade modular RAG system. Its primary contribution is not the measurement tooling itself, but the empirical insight it enables: module activation decisions have dramatically asymmetric energy-quality trade-offs. This dataset directly informs the companion project on energy-aware query routing, where per-configuration energy profiles become the optimization target.