Can you really reduce AI chatbot emissions with a system prompt?

Introduction

The substantial environmental impact of LLM-based tools is well established and covered widely enough that many carbon-literate users now seek ways to mitigate it. Recently, a system prompt titled PromptZero was presented as a "carbon cutting protocol" that reduces the carbon emissions associated with generative AI tools such as chatbots. A core component of any robust emissions management effort is ensuring that actions undertaken to reduce emissions are backed by evidence; from the European Sustainability Reporting Standards to the Greenhouse Gas Protocol, most corporate sustainability reporting frameworks include requirements to that effect. Individuals, of course, do not publish annual sustainability reports. They do not develop GHGP-aligned emissions inventories or set science-based reduction targets. The actions they take to reduce their carbon footprint are not subject to scrutiny, and they face no penalties if the claimed outcome is not actually achieved. This makes responsible consumer- and user-facing communication all the more critical - any prescriptions must be robust and well-substantiated.

Unfortunately, no publicly available documentation backs the claim that PromptZero reduces the emissions resulting from large language model inference. I therefore decided to explore whether this guidance can be validated experimentally and to discuss the implications of adopting it.

The Prompt

Here is the prompt in question:

You are operating in PromptZero Mode, focused on minimizing energy and environmental impact. Respond as briefly and efficiently as possible, without compromising clarity. Use bullet points, short sentences, or concise phrasing. Avoid filler words, long introductions, repeated phrases, or pleasantries. Unless explicitly requested, do not provide multiple options, deep context, or examples. After each response show me how much CO₂ was avoided.

It is based on the following assumptions:

  1. Using PromptZero will lead to a reduction in token count during inference.
  2. This reduction in token count will lead to a reduction in energy consumption and emissions.
  3. LLMs can provide reliable descriptions of this reduction.

Experiment Design

Datasets

Any emissions-cutting measure must be assessed against whether the underlying task still achieves its intended outcome; abstraction between measures of environmental impact and outcomes tends to increase the risk of undesired second-order effects. As a result, publicly available datasets that include ground truth, designed for testing LLMs, were used to probe the impact of PromptZero. Specifically, the AI2 Reasoning Challenge, BoolQ, GSM8K, and HellaSwag datasets were used to create a set of 200 multiple-choice, yes/no, and numerical questions. Given the nature of the questions, accuracy was scored as a binary measure; no distance-based measures were employed.
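
For reproducibility, a question set like this can be assembled with the Hugging Face datasets library. The sketch below is a minimal illustration, not the exact pipeline used here; the dataset configuration names and field layouts are those of the public releases, and the per-dataset sample sizes are assumptions.

    # Sketch: build a mixed evaluation set from the four public datasets.
    import random
    from datasets import load_dataset

    random.seed(0)

    def sample(ds, n):
        return [ds[i] for i in random.sample(range(len(ds)), n)]

    arc = load_dataset("ai2_arc", "ARC-Challenge", split="validation")
    boolq = load_dataset("boolq", split="validation")
    gsm8k = load_dataset("gsm8k", "main", split="test")
    hella = load_dataset("hellaswag", split="validation")

    questions = []
    for row in sample(arc, 50):        # multiple-choice
        options = dict(zip(row["choices"]["label"], row["choices"]["text"]))
        questions.append({"q": row["question"], "options": options,
                          "truth": row["answerKey"], "kind": "mc"})
    for row in sample(hella, 50):      # multiple-choice (sentence completion)
        questions.append({"q": row["ctx"],
                          "options": dict(enumerate(row["endings"])),
                          "truth": int(row["label"]), "kind": "mc"})
    for row in sample(boolq, 50):      # yes/no
        questions.append({"q": row["question"], "truth": row["answer"],
                          "kind": "bool"})
    for row in sample(gsm8k, 50):      # numerical; gold answer follows '####'
        questions.append({"q": row["question"],
                          "truth": row["answer"].split("####")[-1].strip(),
                          "kind": "num"})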

System prompts

Three conditions were established, each with a different system prompt:

  1. C0: A neutral system prompt.
    Follow the user instructions precisely. Output only the requested final token in the format specified.

  2. C1: PromptZero

  3. C2: A system prompt simply instructing conciseness.
    Be concise. No preamble. Output only the requested final token in the format specified.

Output token limits

Furthermore, the three conditions were tested in two modes, with and without output token limits, to explore the effect of capping tokens using parameters rather than a system prompt. Other parameters such as temperature and context window were fixed.
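
Concretely, each query differs only in the system prompt and, in capped mode, the decoding options passed to the runtime. The sketch below uses the ollama Python client; the specific num_predict cap and num_ctx value are illustrative assumptions rather than the exact experimental settings.

    # Sketch: one query under a given condition and mode via the Ollama API.
    import ollama

    PROMPTZERO = (
        "You are operating in PromptZero Mode, focused on minimizing energy "
        "and environmental impact. Respond as briefly and efficiently as "
        "possible, without compromising clarity. Use bullet points, short "
        "sentences, or concise phrasing. Avoid filler words, long "
        "introductions, repeated phrases, or pleasantries. Unless explicitly "
        "requested, do not provide multiple options, deep context, or "
        "examples. After each response show me how much CO₂ was avoided."
    )

    SYSTEM_PROMPTS = {
        "C0": "Follow the user instructions precisely. Output only the "
              "requested final token in the format specified.",
        "C1": PROMPTZERO,
        "C2": "Be concise. No preamble. Output only the requested final "
              "token in the format specified.",
    }

    def ask(model, condition, question, cap_tokens=False):
        options = {"temperature": 0, "num_ctx": 2048}  # fixed across runs
        if cap_tokens:
            options["num_predict"] = 8                 # hard output token cap
        reply = ollama.chat(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPTS[condition]},
                {"role": "user", "content": question},
            ],
            options=options,
        )
        return reply["message"]["content"]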

Models

Three models were used for testing:

  1. llama3.1-8b (q4)
  2. mistral-7b
  3. qwen2-7b

The level of quantization was set by hardware limitations - the models were run locally using Ollama on an Arch Linux machine equipped with a 4 GB RTX 3050 Ti. About two-thirds of each model ran on the GPU.

Energy consumption

The experiment focused on the energy consumption of the GPU. The NVIDIA Management Library (NVML) was used to collect energy consumption readings, in millijoules, at 10 Hz during inference.
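
The energy counter can be polled from Python via the pynvml bindings. The sketch below shows one way to sample it at 10 Hz in a background thread and attribute the delta to a single inference call; it illustrates the measurement approach rather than reproducing the exact instrumentation.

    # Sketch: sample the cumulative GPU energy counter (millijoules) at 10 Hz.
    import threading
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    def measure_energy_mj(fn, *args, **kwargs):
        samples = []
        stop = threading.Event()

        def poll():
            while not stop.is_set():
                # Cumulative energy since driver load, in millijoules
                samples.append(
                    pynvml.nvmlDeviceGetTotalEnergyConsumption(handle))
                time.sleep(0.1)  # 10 Hz

        t = threading.Thread(target=poll)
        t.start()
        result = fn(*args, **kwargs)
        stop.set()
        t.join()
        samples.append(pynvml.nvmlDeviceGetTotalEnergyConsumption(handle))
        return result, samples[-1] - samples[0]  # mJ used during the call

    # e.g. answer, energy_mj = measure_energy_mj(ask, "qwen2:7b", "C2", q)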

To minimize error, after each model was loaded the GPU was warmed up using dummy prompts until stable conditions were reached. Additionally, a Latin square design was used to alternate questions and conditions, reducing any order effects.
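
With three conditions, a cyclic Latin square suffices: rotating the condition order per question ensures each condition appears equally often in each serving position. A minimal sketch (one plausible construction, not necessarily the exact one used):

    # Sketch: cyclic Latin square over the three conditions.
    CONDITIONS = ["C0", "C1", "C2"]

    def condition_order(question_index):
        k = question_index % len(CONDITIONS)
        return CONDITIONS[k:] + CONDITIONS[:k]

    # question 0 -> [C0, C1, C2]; question 1 -> [C1, C2, C0]; ...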

Overall, the experiment spanned 200 questions × 3 conditions {C0, C1, C2} × 2 modes {capped, free} × 3 models {llama3, qwen2, mistral} = 3,600 queries.

Limitations

Several limitations should be kept in mind: energy was measured for the GPU only, excluding the CPU, memory, and the rest of the system; the models were quantized and only partially offloaded to a consumer GPU, so absolute figures will not transfer directly to data-centre deployments; and accuracy was scored as a binary measure on short-form questions, which does not reflect broader usage patterns (discussed below).

Results

Accuracy by condition / mode

[Figure: Accuracy (mean) by condition and mode (C0/C1/C2, cap/free).]

Accuracy was relatively stable across conditions and modes, ranging from 49.3% to 51.1%.

Energy consumption per prompt

[Figure: Mean energy per question (J) by condition and mode.]

The simple prompt requesting conciseness outperformed both the neutral system prompt and PromptZero in both modes. Interestingly, introducing a hard output token cap substantially increased mean energy consumption. The output token counts for each set-up confirm that the cap worked as intended and that the variance in energy consumption does not stem from differences in output length.

[Figure: Average output tokens by condition and mode.]

Accuracy against energy consumption

[Figure: Accuracy (mean) against mean energy per question (J) for each condition-mode configuration, by model (llama3.1-8b-q4, mistral-7b, qwen2-7b).]

Qwen2-7b outperformed the other two models in terms of accuracy, and a Pareto front is formed by the C2-free, C0-free, C1-free, and C0-cap configurations. One of these configurations is always optimal regardless of the weighting between accuracy and energy consumption.
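
This can be checked mechanically: a configuration is Pareto-optimal if no other configuration achieves both lower-or-equal energy and higher-or-equal accuracy with at least one strict improvement. A minimal filter (names and tuple layout are illustrative):

    # Sketch: identify Pareto-optimal (energy, accuracy) configurations.
    def pareto_front(points):
        """points: dict mapping configuration name -> (energy_J, accuracy)."""
        front = {}
        for name, (e, a) in points.items():
            dominated = any(
                e2 <= e and a2 >= a and (e2 < e or a2 > a)
                for n2, (e2, a2) in points.items() if n2 != name
            )
            if not dominated:
                front[name] = (e, a)
        return front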

Experiment Conclusions

  1. Accuracy was largely invariant across conditions and modes, with model choice being the dominant source of variance.
  2. C1 (PromptZero in the system prompt) consistently increased energy consumption relative to C0 and C2, plausibly due to higher input token counts.
  3. C2 (a simple conciseness instruction) achieved comparable accuracy with lower energy use than PromptZero, indicating a better efficiency–accuracy trade-off.
  4. Applying a strict low cap on output tokens paradoxically increased energy consumption, suggesting additional hidden overhead in capped runs. This seems to be a tool-specific artefact and further research is required to determine the root cause.

Broader Usage Patterns

Real-world use of generative chatbots extends far beyond multiple-choice questions or school-level maths benchmarks. However, assessing the impact of different system prompts across broader usage patterns with a similar experiment would be difficult due to the lack of ground truth. Answers to prompts requesting an email or copy could be classified and scored, but such an evaluation is ultimately quite subjective and personal. Still, the following points should be kept in mind:

  1. In certain contexts, verbosity and colour are desired by users of LLM-based tools. System prompts that strictly suppress verbosity may backfire by forcing users into extra interaction cycles.
  2. When a single system prompt is applied across all chats, a simple conciseness instruction may achieve similar benefits with fewer input tokens. In principle, the compute per output token is nearly constant, making output length roughly linear in energy cost. The input tokens' cached keys and values, however, are read by attention at every decoding step, so their contribution grows with both input and output length. This means long system prompts can introduce disproportionately high energy costs, especially when expected outputs are short (see the sketch below).
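
A toy cost model makes the scaling concrete. It counts attention reads only and is an illustration of the argument, not a calibrated energy model:

    # Toy cost model: with a KV cache, the prompt is processed once at
    # prefill, but every decoding step attends over all cached tokens,
    # so prompt length is paid again on each output token.
    def relative_cost(n_in, n_out):
        prefill = n_in                                 # process prompt once
        decode = sum(n_in + t for t in range(n_out))   # attention per step
        return prefill + decode

    # A 400-token system prompt vs. a 40-token one, for a 20-token answer:
    # relative_cost(400, 20) = 8590 vs. relative_cost(40, 20) = 1030
    # The prompt overhead dominates when expected outputs are short.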

Avoided Emissions

The validity of requesting avoided emissions information from an LLM is highly questionable, for several reasons.

  1. Estimating avoided emissions requires a clear baseline scenario, and the counterfactual - the same usage under a system prompt that does not attempt to optimise for lower token count - cannot be established by the LLM itself.
  2. The estimation process itself - running parallel prompts to establish a baseline or performing retrieval - would consume energy, undermining the claimed savings.
  3. Per-query emissions savings do not capture changes in absolute energy consumption over time, for example if the new system prompt shifts usage patterns.
  4. The information available on the full lifecycle emissions of LLM-based tools (including training, PUE, orchestration layers, networking etc.) is both limited and imprecise - any per-query allocation is likely to be crude and essentially arbitrary.

Ultimately, it is highly unlikely that a user will receive robust information about the carbon footprint of their usage as a result of such an addition to the system prompt.

Recommendations

The following recommendations are offered to initiatives that prescribe user behaviour with the aim of reducing the energy consumption of LLM-based tools:

  1. Prescriptions should be substantiated through empirical evidence and documentation.
  2. Prescriptions should avoid oversimplification and instead educate users on tailoring their input to their desired output - including concepts such as model right-sizing, which can substantially improve energy efficiency even for users without a technical background.
  3. If the desired outcome is transparency on the carbon footprint associated with usage of LLM-based tools, initiatives should encourage provider transparency and avoid unintentionally misleading users into thinking per-query emissions figures obtained from LLMs are precise.

Suggested Reading

The paper Prompt engineering and its implications on the energy consumption of Large Language Models by Rubei et al. (University of L'Aquila) describes a similar experiment exploring the energy consumption implications of zero-shot, one-shot, and few-shot prompting.