Flawed 'LCA' study on AI gives a false impression of the state of the art in environmental assessment

Socio-Economic Metabolism

Recently, a student pointed me to a publication that compares the environmental impact of AI-generated text and images with that of human-generated content.

With its headline finding, “AI systems emit between 130 and 1500 times less CO2e per page of text generated compared to human writers”, this study can easily be used to downplay the environmental impacts of large language models. It has already gained traction among AI crowds on Twitter (it was tweeted by Yann LeCun, a big figure in AI) and among popular-science figures (Sabine Hossenfelder).

The study I am referring to, “The carbon emissions of writing and illustrating are lower for AI than for humans” by Tomlinson et al., recently appeared in Nature Scientific Reports: https://www.nature.com/articles/s41598-024-54271-x

I have to say that I am surprised that this study made it into a Nature journal. Even the student, who is new to LCA, quickly figured out that the methodology appears flawed in a couple of ways, which is why he approached me.

The criticism is not directed at the results themselves (we do expect much better performance of the AI per unit of bits and bytes generated, simply because the AI is a highly efficient machine optimized to do just that) but at the quality of the science displayed here, which is, in my opinion, insufficient.

The study makes a per-page and per-image comparison, with the implicit assumption that human- and AI-generated material are of the same quality and applicability. This clearly holds for some, but not for all, text (with lower quality on both sides!). Moreover, there is massive scaling of AI-generated text because of its low cost: no single page of AI-generated text will replace a human-written page. Rather, multiple iterations/queries will produce a pile of AI samples, from which one is then selected to substitute the human creation. On the human side, by contrast, the authors scaled longer periods of human writing down to the time required for one page, and these longer periods already include corrections and iterations. The 1:1 comparison therefore seems flawed and should be corrected in an LCA-based comparison of pages of text or images; the sketch below illustrates how such an iteration multiplier would change the headline ratio. Beyond the multiple AI iterations required to substitute human work, there are of course rebound effects due to a massive potential scale-up of AI-generated material, which the authors correctly state, and which is (also correctly stated) beyond the scope of a simple attributional LCA. Our consequential LCA friends should stop reading here :)
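To make the multiplier argument concrete, here is a minimal back-of-envelope sketch in Python. All numbers are illustrative assumptions of mine, not values taken from Tomlinson et al.:

    # Back-of-envelope sketch: how an iteration multiplier k changes the headline ratio.
    # All numbers are illustrative assumptions, NOT taken from Tomlinson et al.

    human_g_per_page = 1000.0  # g CO2e attributed to one human-written page (assumed)
    ai_g_per_query = 2.0       # g CO2e per AI query (assumed)

    print(f"naive 1:1 ratio: {human_g_per_page / ai_g_per_query:.0f}x")

    # If k queries/iterations are needed to obtain one acceptable page,
    # the AI-side emissions scale by k and the ratio shrinks accordingly:
    for k in (1, 5, 20, 100):
        adjusted = human_g_per_page / (ai_g_per_query * k)
        print(f"k = {k:3d} queries per accepted page -> ratio {adjusted:.0f}x")

With these placeholder numbers, a naive 500x advantage shrinks to 5x once 100 queries are needed per accepted page. The point is not the specific values but that the ratio scales linearly with the number of iterations.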

Tomlinson et al. roughly follow the attributional LCA framework (without mentioning it) but omit two crucial steps: first, the proper definition of the product system, i.e. the part of the economy that delivers the product or service, and second, a proper justification that the AI-based and the human-based product systems are actually comparable. For both the human and the AI-based generation, one needs to list what is included and what is not, and this list needs to make both systems comparable and in line with existing conventions. The authors include some scope 3 impacts (via the hardware used), but the coverage is incomplete and insufficiently documented. Besides the GPUs, there are memory, network switches, cooling, buildings, etc., and all of this needs to be allocated to the single query; a simple amortization sketch follows below. The simple way the inventory analysis is done here would be OK for a student term paper due at the end of a two-week block course on LCA, but not for a scientific study. Sorry for being so direct here.
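For illustration, here is what a very simple per-query allocation of embodied (scope 3) hardware emissions could look like. Every figure below is a placeholder assumption of mine, not data from the study or from any real data centre:

    # Sketch: allocating embodied (scope 3) emissions of shared infrastructure
    # to a single query. All inputs are placeholder assumptions for illustration.

    embodied_kg = {                 # kg CO2e per component share serving the model (assumed)
        "GPUs": 150.0,
        "CPU and memory": 100.0,
        "network switches": 20.0,
        "cooling equipment": 50.0,
        "building": 80.0,
    }

    lifetime_years = 4.0            # assumed service life of the hardware
    queries_per_year = 50_000_000   # assumed query throughput of this hardware slice

    total_kg = sum(embodied_kg.values())
    per_query_g = total_kg * 1000.0 / (lifetime_years * queries_per_year)

    print(f"total embodied emissions: {total_kg:.0f} kg CO2e")
    print(f"allocated per query:      {per_query_g:.4f} g CO2e")

Whether such embodied shares turn out negligible or dominant depends entirely on the assumed throughput and service life, which is exactly why they need to be documented.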

The largest flaw, in my opinion, concerns the product system of the human-generated text: downscaling a person’s average annual footprint, on a time basis, to the one hour spent writing a page of text is not appropriate, as this footprint includes holiday travel, the heating and cooling of the home, the operation of public services, etc., many things that are clearly not attributable to the writing and painting process. This downscaling is not plausible and not common practice. A time-based allocation would work for the sustenance part of the footprint (food), but for the infrastructure, one would consider only the actual office space used; the sketch below contrasts the two approaches.
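Again purely for illustration, and with rough placeholder numbers of my own rather than the study's data, the difference between the two allocation approaches can be sketched as follows:

    # Sketch: naive time-based downscaling of a person's whole footprint vs. a
    # narrower allocation (food plus office space). All numbers are rough
    # placeholder assumptions, not values from the study.

    annual_footprint_kg = 10_000.0  # assumed total annual footprint of one person, kg CO2e
    hours_per_year = 8760.0
    writing_hours = 1.0             # one page assumed to take one hour to write

    # Naive approach (as in the study): scale the whole footprint by time share.
    naive_g = annual_footprint_kg * 1000.0 * writing_hours / hours_per_year

    # Narrower allocation: sustenance (food) plus the workspace actually used.
    food_kg_per_year = 2_000.0      # assumed annual food-related emissions
    office_kg_per_hour = 0.2        # assumed heating/lighting of the office, kg CO2e/h

    narrow_g = (food_kg_per_year * 1000.0 / hours_per_year
                + office_kg_per_hour * 1000.0) * writing_hours

    print(f"naive time-based downscaling: {naive_g:.0f} g CO2e per page")
    print(f"narrower allocation:          {narrow_g:.0f} g CO2e per page")

With these placeholder values, the naive approach more than doubles the human-side figure. The exact gap obviously depends on the assumptions, but it shows why the allocation choice must be justified rather than taken for granted.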

I am concerned that this study will give a false impression of the scientific level and quality of current environmental assessment/LCA. I therefore contacted the authors and the journal and posted my concerns on Twitter, mainly regarding the above-mentioned iteration multipliers: https://twitter.com/StefanPauliuk/status/1785215650006225182

I hope that our colleagues from the LCA community will take this up for further debate.

Best regards,

Stefan Pauliuk

[this blog post appears under the SEM section because that's the only drop-down option I can choose ;) ]