ByteThirst™ QueryWeight™ Calculation Methodology
How we estimate the water, energy, and CO₂ cost of every AI interaction
What Is a QueryWeight?
A QueryWeight is ByteThirst’s term for the estimated environmental cost of a single AI interaction. It combines three metrics: estimated water consumption (mL), estimated energy use (Wh), and estimated carbon emissions (g CO₂). All figures are estimates based on publicly available research — not direct measurements of actual resource consumption.
Summary
ByteThirst is a Chrome browser extension that estimates the water consumption, energy use, and carbon emissions of your AI interactions across 14 platforms: ChatGPT, Claude, Gemini, Copilot, Perplexity, Poe, You.com, Mistral, HuggingChat, Figma AI, Lovable.dev, Bolt.new, NotebookLM, and Google AI Studio. All values are estimates, not precise measurements. Every estimate is presented as a range (low / mid / high) to communicate the significant uncertainty inherent in these calculations. We anchor our model to the best available public measurements and apply scaling factors for query complexity and model size.
Calculation Pipeline
Step 1: Token Estimation
ByteThirst does not have direct access to the internal tokenizers used by each AI platform. Instead, we estimate token counts by dividing the character count of your input and the model's output by a platform-specific characters-per-token ratio, calibrated against each platform's publicly available tokenizer tools and documentation. Ratios range from approximately 3.8 to 4.2 depending on the tokenizer architecture (BPE vs. SentencePiece). Platforms that route to multiple underlying models (Perplexity, Poe, HuggingChat) use a default ratio that is adjusted when the specific model can be detected.
These ratios are calibrated for English text. Other languages—particularly CJK languages, Arabic, and Hindi—may have significantly different characters-per-token ratios. We plan to add language-specific adjustments in a future update.
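As a concrete illustration, the character-to-token heuristic can be sketched as follows. The ratios shown are illustrative placeholders rather than ByteThirst's production constants, and `estimateTokens` is a hypothetical helper name.

```typescript
// Sketch of character-based token estimation (ratios are illustrative, not
// ByteThirst's actual calibrated constants).
const CHARS_PER_TOKEN: Record<string, number> = {
  chatgpt: 4.0, // BPE-style tokenizer (assumed value)
  claude: 3.8,  // assumed value
  gemini: 4.2,  // SentencePiece (assumed value)
};
const DEFAULT_RATIO = 4.0; // fallback for multi-model aggregators

function estimateTokens(text: string, platform: string): number {
  const ratio = CHARS_PER_TOKEN[platform] ?? DEFAULT_RATIO;
  return Math.ceil(text.length / ratio);
}
```

For example, a 400-character English prompt on a 4.0-ratio platform estimates to 100 tokens. Real tokenizers will diverge from this heuristic, especially on non-English text and code.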
Multi-Model and Aggregator Platforms
Several supported platforms route queries to different underlying models depending on user settings or query type:
- Copilot (formerly Bing Chat) uses Microsoft-hosted variants of OpenAI's GPT-4 family. The extension activates on both copilot.microsoft.com and bing.com/chat. We apply the same token ratio and energy baseline as ChatGPT.
- Perplexity uses multiple LLMs for different query types. When the specific model is detectable from the page, we apply the corresponding model tier multiplier. Otherwise, we use the standard tier (1.0×) as a conservative default.
- Poe allows users to choose from multiple model providers (GPT-4, Claude, Gemini, and others). The extension attempts to detect the active model from page elements and apply the appropriate tier multiplier and token ratio. If detection fails, the standard tier and default ratio are used.
- NotebookLM is Google’s AI-powered research and note-taking tool, also powered by Gemini models. We apply the same SentencePiece tokenizer ratio and Gemini model tier multipliers.
Model detection on these platforms is heuristic and depends on DOM elements that may change without notice. See Limitation #2 below for details on detection uncertainty.
AI Code Builder Platforms
AI code builders are a newer modality of AI interaction that ByteThirst estimates. Unlike text-based chat, code generation sessions use a split-panel architecture: a chat prompt drives large volumes of structured code output. A single coding session can produce 20,000–200,000 output characters, driving dramatically higher resource consumption than text-based interactions.
According to Couch (2026), a typical AI coding agent session consumes approximately 41 Wh of energy, roughly 130× more than a standard ChatGPT text query. ByteThirst calibrates its code generation estimates against this benchmark. Because these sessions produce far more output tokens than typical conversational queries, applying the standard text-chatbot output weighting would overestimate their energy; we therefore apply a reduced output token weight chosen so that a typical session lands near the 41 Wh figure.
Currently estimated code builder platforms:
- Lovable.dev — Full-stack web application generation from natural language prompts. Default tier: large.
- Bolt.new — Uses Claude 3.5/3.7 Sonnet via StackBlitz WebContainers. Client-side compute (WebContainers runtime) is not included in ByteThirst estimates as it runs locally in the user’s browser, not on remote inference servers.
- Google AI Studio — Developer platform for building with Gemini models. Classified as code-gen because interactions tend to involve long system prompts, tool definitions, and multi-turn agentic sessions, making its output profile closer to code-gen agents than standard chat.
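The calibration described above can be sketched as backing out a per-output-token energy rate from the session benchmark. Only the 41 Wh figure comes from the cited source; the session-size midpoint and chars-per-token ratio below are assumptions for illustration, and the helper names are hypothetical.

```typescript
// Derive a reduced code-gen energy weight from the Couch (2026) ~41 Wh
// session benchmark. Session size and chars/token are illustrative assumptions.
const SESSION_BENCHMARK_WH = 41;              // typical agentic coding session
const TYPICAL_SESSION_OUTPUT_CHARS = 110_000; // assumed midpoint of 20K–200K
const CHARS_PER_TOKEN = 4.0;                  // assumed average ratio

const typicalOutputTokens = TYPICAL_SESSION_OUTPUT_CHARS / CHARS_PER_TOKEN;
const whPerCodeOutputToken = SESSION_BENCHMARK_WH / typicalOutputTokens;

function estimateCodeSessionWh(outputChars: number): number {
  // Energy scales linearly with output size at the calibrated per-token rate.
  return (outputChars / CHARS_PER_TOKEN) * whPerCodeOutputToken;
}
```

By construction, a session emitting the assumed typical 110,000 characters estimates to ~41 Wh, and smaller sessions scale down linearly.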
Cache Token Handling
Some AI platforms employ prompt caching, where previously seen input tokens are served from cache rather than reprocessed by the full model. Cache hits consume far less compute than full inference passes, so ByteThirst applies a reduced multiplier to cache-read tokens when they are detectable.
Extended Thinking Tokens
Models with extended thinking capabilities (such as Claude’s extended thinking mode and OpenAI’s o-series reasoning models) generate internal reasoning tokens that consume inference compute but are not always visible to the user. ByteThirst weights detected thinking tokens consistently with standard output tokens, as they consume equivalent inference compute. When a platform uses both extended thinking and a reasoning model tier multiplier, only the higher factor is applied to avoid double-counting.
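A minimal sketch of the token-weight selection described in the two sections above. The numeric weights are assumptions for illustration, not ByteThirst's production values.

```typescript
// Illustrative per-token weights (assumed values, not production constants).
const TOKEN_WEIGHT = {
  input: 1.0,
  cacheRead: 0.1, // assumed reduced multiplier for cache hits
  output: 4.0,    // thinking tokens are weighted like output tokens
};

// When a query involves both extended thinking and a reasoning-tier
// multiplier, only the higher factor is applied, avoiding double-counting.
function reasoningFactor(thinkingFactor: number, tierMultiplier: number): number {
  return Math.max(thinkingFactor, tierMultiplier);
}
```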
Step 2: Energy Estimation
Energy consumption per query is the most studied and most variable component of our pipeline. We surveyed every major public source available as of early 2026 to anchor our baseline estimate.
| Source | Model | Energy per Query | Date | Notes |
|---|---|---|---|---|
| Google (official) | Gemini median text prompt | 0.24 Wh | Aug 2025 | Most transparent industry disclosure. Includes idle capacity and cooling overhead. |
| OpenAI (Altman) | ChatGPT average query | 0.34 Wh | Aug 2025 | Self-reported, less methodological detail. |
| Epoch AI (independent) | GPT-4o, 500 output tokens | 0.30 Wh | Feb 2025 | Based on H100 GPU compute analysis. Short query baseline. |
| Epoch AI (independent) | GPT-4o, ~7,500 input words | 2.5 Wh | Feb 2025 | Long context query. Demonstrates 8x range based on input length. |
| Jegham et al. (arXiv) | GPT-4o short query | 0.42 Wh ± 0.13 | May 2025 | Academic benchmark with uncertainty bounds. |
Our baseline
We use 0.30 Wh as the baseline energy cost for a standard query of approximately 100 input tokens + 500 output tokens = 600 total tokens. This is anchored to the Epoch AI independent estimate for GPT-4o, which falls in the middle of the industry self-reports (Google's 0.24 Wh and OpenAI's 0.34 Wh).
Scaling by query size
Output tokens require significantly more compute than input tokens because each output token requires a full forward pass through the model, while input tokens are processed in parallel during the prefill stage. We apply a research-based weighting factor to normalize compute cost across varying query lengths, derived from published inference cost analyses comparing prefill and decode compute costs.
Energy is then scaled linearly relative to a standard query's effective token count. A query with twice the effective tokens uses approximately twice the energy.
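The scaling above can be sketched as follows. The output-token weight of 4 is an assumed stand-in for the research-derived prefill/decode weighting; the baseline matches the 0.30 Wh figure from the table above.

```typescript
// Linear energy scaling over weighted ("effective") tokens.
// OUTPUT_WEIGHT is an assumed stand-in for the published weighting factor.
const BASELINE_WH = 0.30;  // standard query: ~100 input + 500 output tokens
const OUTPUT_WEIGHT = 4;   // assumed decode-vs-prefill weight
const STANDARD_EFFECTIVE_TOKENS = 100 + 500 * OUTPUT_WEIGHT; // 2,100

function estimateEnergyWh(inputTokens: number, outputTokens: number): number {
  const effective = inputTokens + outputTokens * OUTPUT_WEIGHT;
  return BASELINE_WH * (effective / STANDARD_EFFECTIVE_TOKENS);
}
```

Doubling both input and output doubles the estimate: `estimateEnergyWh(200, 1000)` yields 0.60 Wh under these assumptions.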
Model tier multipliers
Different models within each platform vary dramatically in compute requirements. We classify models into six tiers:
| Tier | Example Models | Relative Cost | Rationale |
|---|---|---|---|
| Small | GPT-4o-mini, Claude Haiku, Gemini Flash | Significantly below baseline | Smaller parameter count, lower compute. |
| Standard | GPT-4o, Claude Sonnet, Gemini Pro | Baseline | Most common consumer-facing models. |
| Large | GPT-4.1, Claude Opus, Gemini Ultra | Several times baseline | Largest models with highest compute requirements. |
| Reasoning | o1, o3, o4-mini, Claude extended thinking | Substantially higher | These models generate extensive internal chain-of-thought tokens before producing a response. |
| Image generation | DALL-E, Gemini image gen | Highest tier | Per Luccioni et al. (2023), image generation consumes substantially more energy per request than text inference. |
| Code generation | Lovable.dev sessions, Bolt.new | Calibrated separately | Sessions produce 20K–200K output characters. See “AI Code Builder Platforms” in Step 1. |
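As an illustration, tier multipliers can be applied as a simple lookup over the base estimate. The table above is deliberately qualitative about relative costs, so the numbers below are invented placeholders, not our actual constants.

```typescript
// Illustrative tier multipliers (placeholder values only; the table above
// describes the real relative costs qualitatively).
const TIER_MULTIPLIER: Record<string, number> = {
  small: 0.3,
  standard: 1.0,
  large: 3.0,
  reasoning: 6.0,
  image: 15.0,
};

function tierAdjustedWh(baseWh: number, tier: string): number {
  // Unknown or undetected tiers fall back to the standard (1.0×) multiplier.
  return baseWh * (TIER_MULTIPLIER[tier] ?? 1.0);
}
```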
Energy range
To communicate uncertainty, we present three estimates for every query:
- Low: base × 0.6 (optimistic—assumes best-case hardware utilization, latest-generation chips, and efficient batching)
- Mid: base × 1.0 (baseline—our best single-point estimate)
- High: base × 1.8 (conservative—accounts for older hardware, low utilization, and additional overhead)
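The range construction is a pair of fixed multipliers on the base estimate; `energyRange` is a hypothetical helper name.

```typescript
// Low/mid/high range around a base energy estimate (factors from the text).
interface EstimateRange {
  low: number;
  mid: number;
  high: number;
}

function energyRange(baseWh: number): EstimateRange {
  return { low: baseWh * 0.6, mid: baseWh, high: baseWh * 1.8 };
}
```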
Step 3: Water Estimation
Data centers consume water primarily for cooling. We convert energy estimates to water estimates using a water-intensity ratio (milliliters of water per watt-hour of energy consumed).
| Source | Ratio (mL/Wh) | Notes |
|---|---|---|
| Google (official) | 1.08 mL/Wh | Derived: 0.26 mL water per 0.24 Wh query. Comprehensive overhead included. |
| OpenAI (Altman, implied) | 0.94 mL/Wh | Derived: 0.32 mL water per 0.34 Wh query. |
Our range:
- Low: 0.50 mL/Wh (dry climate with air cooling)
- Mid: 0.94 mL/Wh (industry average derived from OpenAI disclosure)
- High: 1.20 mL/Wh (evaporative cooling in warm climates with older infrastructure)
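Converting an energy range to a water range pairs each scenario with its matching water intensity. A sketch, with `waterRangeMl` as a hypothetical helper name:

```typescript
// Water = energy × water intensity, pairing low/mid/high scenarios
// (intensity ratios taken from the text above).
const WATER_ML_PER_WH = { low: 0.5, mid: 0.94, high: 1.2 };

function waterRangeMl(energyWh: { low: number; mid: number; high: number }) {
  return {
    low: energyWh.low * WATER_ML_PER_WH.low,
    mid: energyWh.mid * WATER_ML_PER_WH.mid,
    high: energyWh.high * WATER_ML_PER_WH.high,
  };
}
```

Under these ratios, the 0.30 Wh mid baseline maps to 0.30 × 0.94 ≈ 0.28 mL, consistent with the mid estimate quoted elsewhere in this document.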
A note on viral claims
The widely cited UC Riverside study (Li et al., 2023) estimated that ChatGPT consumes approximately 519 mL of water per 100 words of output—roughly 52 mL per short query. This figure is roughly 150–200× higher than the industry self-reports from Google and OpenAI. The discrepancy arises because the UC Riverside methodology includes the full lifecycle water footprint of electricity generation (so-called "off-site" or "upstream" water), including water consumed at power plants, in fuel extraction, and in the broader energy supply chain. By contrast, Google's and OpenAI's figures report only the direct ("on-site") water consumed at the data center for cooling. Both approaches are valid for different purposes, but they measure fundamentally different things. ByteThirst uses the direct water consumption methodology because it represents the water physically used at data centers and is the figure most comparable across providers.
Step 4: CO₂ Estimation
We estimate carbon emissions by multiplying energy consumption by a grid carbon intensity factor (grams of CO₂ emitted per watt-hour of electricity consumed).
| Source | Intensity | Notes |
|---|---|---|
| EPA eGRID (2023) | 0.39 kg CO₂/kWh | US national average, location-based. |
| Google (location-based) | 0.09 gCO₂e per Gemini query | Based on actual grid mix at data center locations. |
| Google (market-based) | 0.03 gCO₂e per Gemini query | Includes renewable energy certificate purchases. |
Our range:
- Low: 0.20 g/Wh (grids with significant renewable penetration)
- Mid: 0.39 g/Wh (US national average from EPA eGRID)
- High: 0.60 g/Wh (coal-heavy grids or regions with older infrastructure)
We use location-based emissions rather than market-based emissions. While companies like Google and Microsoft purchase renewable energy certificates (RECs) to offset their electricity usage, location-based accounting reflects the actual carbon intensity of the grid where the data center operates. This is more representative of the real-world emissions impact, since RECs do not necessarily reduce the physical carbon intensity of the electricity consumed at the point of use.
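The CO₂ step mirrors the water step: energy multiplied by a grid intensity factor. A sketch using the range values listed above:

```typescript
// CO₂ = energy × grid carbon intensity (g CO₂ per Wh), per scenario.
const CO2_G_PER_WH = { low: 0.2, mid: 0.39, high: 0.6 };

function co2Grams(energyWh: number, scenario: "low" | "mid" | "high"): number {
  return energyWh * CO2_G_PER_WH[scenario];
}
```

For a 0.30 Wh query, the mid scenario gives 0.30 × 0.39 ≈ 0.12 g CO₂.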
Known Limitations & Uncertainty
- Token estimation is approximate. Our character-to-token ratios are averages for English text. Actual tokenization varies by language, content type (code vs. prose), and specific model version. Errors of 10–20% in token estimation are possible.
- Model tier detection is heuristic. ByteThirst infers the active model from DOM elements on each AI platform's interface. If a platform changes its UI, model detection may temporarily misclassify the model tier until we update the extension.
- Energy-per-token varies widely. The energy cost of inference depends on GPU type (H100 vs. A100 vs. TPUv5), batch size, quantization level, and server utilization. Our baseline assumes mid-range conditions, but actual energy consumption for any single query could be 2–3× higher or lower.
- Water consumption depends on local climate and cooling technology. A data center in Iowa using evaporative cooling will consume significantly more water per watt-hour than a data center in Finland using free air cooling. We cannot determine which data center serves any individual query, so we use an industry-average ratio.
- Cached and short-circuited responses are not detected. Some queries may be served from cache or routed to smaller models, consuming far less energy than our estimates suggest. We have no way to detect this from the client side.
- Reasoning model uncertainty is high. Models like o1, o3, and o4-mini generate internal chain-of-thought tokens that are not visible to the user. The number of internal tokens can vary from 2× to 50× the visible output length. Our multiplier is a conservative midpoint, but individual queries may vary significantly.
- All constants are point-in-time. The energy efficiency of AI inference is improving rapidly. Our constants are based on data available as of early 2026 and will be updated as new measurements are published.
We believe the most honest approach is to communicate this uncertainty directly to our users through range-based estimates rather than false-precision single numbers. If you see a ByteThirst estimate of "0.28 mL (low: 0.10 / high: 0.60)," that range is the message: this is our best guess, but the true value could reasonably fall anywhere within it.
What Our Estimates Include and Do Not Include
Included in our estimates:
- Estimated energy consumed by the AI model's inference computation (GPU/TPU processing)
- Estimated direct water consumed for data center cooling during inference
- Estimated CO₂ emitted from electricity generation powering inference hardware
Not included in our estimates:
- Energy consumed by your device (computer, phone, monitor)
- Energy consumed by network transmission (routers, ISPs, CDNs)
- Water used in manufacturing AI chips or server hardware (embodied water)
- Carbon emissions from manufacturing, shipping, or disposing of hardware (embodied carbon)
- Energy or water consumed during model training (only inference is estimated)
- Upstream water used in the energy supply chain (power plant cooling, fuel extraction)
- Energy consumed by non-inference server operations (load balancing, logging, storage)
Our estimates represent the direct operational footprint of AI inference only. The full lifecycle impact of AI usage — including training, hardware manufacturing, and upstream energy production — is substantially higher but falls outside what can be reasonably estimated on a per-query basis from a browser extension.
Individual vs. Cumulative Impact
A single AI query has a very small environmental footprint — typically a fraction of a milliliter of water and a fraction of a watt-hour of energy. At the individual query level, these amounts are negligible.
ByteThirst aggregates these small amounts over time to show daily and weekly totals. The purpose is awareness of cumulative patterns, not to suggest that any individual query causes meaningful environmental harm. With hundreds of millions of AI queries processed globally each day, the aggregate resource consumption is significant — but that aggregate is made up of individually tiny contributions.
We believe informed users make better choices, and understanding scale is the first step.
Unit Conversions
ByteThirst displays standard volume conversions alongside milliliter values: teaspoons (1 tsp = 4.93 mL), tablespoons (1 tbsp = 14.79 mL), fluid ounces (1 fl oz = 29.57 mL), and cups (1 cup = 236.59 mL). Energy is displayed in Wh and kWh. Carbon is displayed in g and kg. No real-world comparisons or analogies are used — only standard unit conversions.
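The display conversions are straightforward divisions by the constants listed above; `formatVolume` below is a hypothetical helper showing one way to pick the largest sensible unit.

```typescript
// Standard volume conversions (constants exactly as listed in the text).
const ML_PER_UNIT = { tsp: 4.93, tbsp: 14.79, flOz: 29.57, cup: 236.59 } as const;

function formatVolume(ml: number): string {
  // Display in the largest unit the value reaches, else raw milliliters.
  if (ml >= ML_PER_UNIT.cup) return `${(ml / ML_PER_UNIT.cup).toFixed(2)} cups`;
  if (ml >= ML_PER_UNIT.flOz) return `${(ml / ML_PER_UNIT.flOz).toFixed(2)} fl oz`;
  if (ml >= ML_PER_UNIT.tsp) return `${(ml / ML_PER_UNIT.tsp).toFixed(2)} tsp`;
  return `${ml.toFixed(2)} mL`;
}
```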
Comparison with Other Estimates
To validate our model, we compare our mid-range estimate for a standard text query against published per-query figures from other sources:
| Source | Per-query water estimate | Our mid estimate | Ratio |
|---|---|---|---|
| Google (official, Gemini) | 0.26 mL | 0.28 mL | ~1.1× |
| OpenAI (Altman, ChatGPT) | 0.32 mL | 0.28 mL | ~0.9× |
| UC Riverside (Li et al.) | ~52 mL | 0.28 mL | ~0.005× |
Our mid estimate aligns closely with the industry self-reports from Google and OpenAI, falling within roughly 15% of both figures. The UC Riverside figure is not directly comparable due to the inclusion of upstream lifecycle water, as discussed in Step 3 above.
Model Efficiency Landscape
Independent research published in 2025–2026 provides the most detailed cross-model environmental comparisons available. These benchmarks help explain why model choice is the single biggest factor in your AI environmental footprint.
| Model | Est. Energy per Query | Est. Water per Query | Source |
|---|---|---|---|
| Gemini Flash | 0.24 Wh | 0.26 mL | Google (2025) |
| GPT-4o (short query) | 0.42 Wh | ~0.40 mL | Jegham et al. (2025) |
| GPT-4o (long query) | 1.79 Wh | ~1.68 mL | Jegham et al. (2025) |
| GPT-5 (medium response) | ~18–19 Wh | ~17–18 mL | Jegham/URI (2025) |
| DeepSeek-R1 (reasoning) | ~29–34 Wh | ~27–32 mL | Jegham et al. (2025) |
The most energy-intensive models (reasoning models like o3 and DeepSeek-R1) consume over 65 times more energy than the most efficient models. ByteThirst captures this range through its model tier multiplier system and effective token scaling.
Choosing a smaller or more efficient model is the single biggest action a user can take to reduce their AI environmental footprint. Switching from a reasoning model to a lightweight model like Gemini Flash can reduce the environmental cost of a query by an order of magnitude or more.
Note: Independent estimates (Jegham et al.) may differ from vendor self-reports (Google, OpenAI) due to methodology differences. Vendor measurements capture full-stack production overhead including cooling and idle capacity, while independent benchmarks typically estimate GPU-level compute only. Both approaches are valid; we present them side by side for transparency. Gemini Flash's efficiency advantage partly reflects Google's custom Ironwood TPU hardware.
Efficiency Is Improving Rapidly
AI inference efficiency is improving at an extraordinary pace. Google documented a 33× reduction in energy per Gemini prompt and a 44× reduction in carbon emissions over a single 12-month period (May 2024 to May 2025), achieved through software optimization, model right-sizing, and custom hardware (Ironwood TPU, which is 30× more power-efficient than Google's first Cloud TPU from 2018).
This means ByteThirst's estimates are point-in-time snapshots. As AI providers continue to optimize their inference infrastructure, the environmental cost per query will continue to decrease. We will recalibrate our constants as new measurements are published.
The direction of AI efficiency is strongly positive. As hardware generations advance and software optimizations compound, ByteThirst's per-query estimates will trend downward over time. We view this as encouraging: the industry is actively reducing the environmental cost of AI inference, and tracking that progress is part of what ByteThirst is designed to do.
Source Citations
- Google, "Environmental Report: AI and Energy Use" (August 2025)
- Altman, S., "AI and Energy" blog post, OpenAI (August 2025)
- Epoch AI, "Estimating the energy consumption of LLM inference" (February 2025)
- Jegham, N. et al., "Energy Consumption of Large Language Models: A Systematic Benchmark" arXiv (May 2025)
- Luccioni, A. et al., "Power Hungry Processing: Watts Driving the Cost of AI Deployment?" FAccT (2023)
- US EPA, "eGRID Summary Tables" (2023 data)
- Li, P. et al., "Making AI Less Thirsty" UC Riverside (2023)
- SemiAnalysis, "Inference Cost Analysis" (2024)
- Couch, S.P. (2026), "Electricity use of AI coding agents" — Per-token energy rates for agentic AI coding sessions
- Lovable.dev product documentation — AI code builder architecture and consumption models
- StackBlitz/Bolt.new documentation — WebContainers architecture; Claude 3.5 Sonnet integration
- Google (2025), "Measuring the Environmental Impact of Delivering AI at Google Scale" (arXiv:2508.15734) — Full-stack production methodology; 0.24 Wh per Gemini prompt; 33× energy / 44× carbon reduction
- Jegham, N. et al. (2025), "How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference" (arXiv:2505.09598) — Cross-model benchmarks across 30 LLMs; infrastructure-aware methodology
Invitation for Peer Review
We welcome corrections, updated data, and methodological improvements from researchers, engineers, and anyone with domain expertise. If you spot an error, have access to better measurements, or can suggest a more rigorous approach to any step in our pipeline, please reach out at hello@bytethirst.com. We will credit all contributors who help improve the accuracy of ByteThirst's estimates.
Changelog
| Date | Change |
|---|---|
| March 3, 2026 | v2.0 — Introduced QueryWeight™ terminology. Expanded from 13 to 14 platforms: added Google AI Studio and NotebookLM. Added “What Is a QueryWeight?” section. Added Cache Token Handling section (reduced multiplier for cache reads). Added Extended Thinking Tokens section (output weight for thinking tokens, no double-counting with reasoning multiplier). Updated all compliance language to estimation framing. |
| February 28, 2026 | v1.3 — Expanded from 7 to 13 platforms. Added “AI Code Builder Impact Methodology” section with code generation calibration (Couch 2026). Added Mistral, HuggingChat, Figma AI, Lovable.dev, Bolt.new, and NotebookLM platform ratios. Added “Model Efficiency Landscape” section with cross-model benchmarks (Jegham et al. 2025, Google 2025). Added “Efficiency Is Improving Rapidly” section citing 33×/44× reduction data and Ironwood TPU 30× efficiency. Expanded source citations from 8 to 13. Replaced real-world equivalents with standard unit conversions only. |
| February 17, 2026 | v1.1 — Added token estimation ratios for Copilot, Perplexity, Poe, and You.com. Added multi-model platform handling section. Added “What Our Estimates Include and Do Not Include” section. Added “Individual vs. Cumulative Impact” section. Added “Real-World Equivalents” section. Consolidated Bing Chat under Copilot. Updated FTC Green Guides compliance language. |
| February 15, 2026 | v1.0 — Initial methodology published |