LLM Hallucinations in Practical Code Generation — Phenomena, Mechanism, and Mitigation


Authors: Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, Zibin Zheng
Published in: Proceedings of the ACM on Software Engineering, Volume 2, Issue ISSTA

How good are Large Language Models (LLMs) at generating code? This does not seem like a particularly novel question: benchmarks such as HumanEval and MBPP were published in 2021, before LLMs burst into public view and started the current AI boom. However, this article's authors point out that practical code generation is very seldom about isolated functions; generated code must fit coherently into the project or repository it is meant to be integrated with. By 2024 there were several benchmarks, such as CoderEval or EvoCodeBench, that evaluate code in its repository context, measuring the functional correctness of LLM-generated code by test case pass rates.
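
To make concrete what these benchmarks actually measure, here is a minimal sketch of test-based functional correctness scoring. The task, candidate solution, and test cases are hypothetical placeholders, not items from HumanEval, MBPP, or CoderEval.

    # Sketch of pass-rate scoring: a candidate solution counts as correct
    # only if every hidden test case succeeds. All inputs are illustrative.

    candidate_solution = """
    def add(a, b):
        return a + b
    """

    test_cases = [
        ("add(1, 2)", 3),
        ("add(-1, 1)", 0),
    ]

    def passes_all_tests(solution_src: str, tests) -> bool:
        namespace: dict = {}
        exec(solution_src, namespace)  # load the generated function
        return all(eval(expr, namespace) == expected for expr, expected in tests)

    generated_solutions = [candidate_solution]  # in practice, several samples per task
    pass_rate = sum(passes_all_tests(s, test_cases) for s in generated_solutions) / len(generated_solutions)
    print(f"pass rate: {pass_rate:.2f}")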

This article brings a new proposal to the table: evaluating LLM-generated, repository-level code by examining the hallucinations it contains. To do this, the authors begin by running the Python code generation tasks from the CoderEval benchmark against six code-generating LLMs, analyzing the results and building a taxonomy of code-related LLM hallucinations, with three types of conflicts (task requirement, factual knowledge, and project context) as first-level categories and eight subcategories within them. Second, the authors compare the results of each LLM per main hallucination category. Third, they try to find the root causes of the hallucinations.
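
As a rough illustration of how such annotations could be organized, the sketch below encodes the three first-level conflict categories and records hallucination instances per model. It is not the paper's tooling; the subcategory strings and task identifiers are placeholders, since the review does not reproduce the paper's full list of eight subcategories.

    # Illustrative sketch only: recording hallucination annotations under the
    # paper's three first-level conflict categories. Subcategory labels and
    # identifiers are hypothetical placeholders.

    from collections import Counter
    from dataclasses import dataclass
    from enum import Enum

    class ConflictType(Enum):
        TASK_REQUIREMENT = "task requirement conflict"
        FACTUAL_KNOWLEDGE = "factual knowledge conflict"
        PROJECT_CONTEXT = "project context conflict"

    @dataclass
    class HallucinationAnnotation:
        task_id: str            # CoderEval task being solved
        model: str              # which of the six LLMs produced the code
        conflict: ConflictType  # first-level category from the taxonomy
        subcategory: str        # one of the eight second-level labels (placeholder here)
        snippet: str            # the offending generated code

    annotations = [
        HallucinationAnnotation(
            task_id="example-task-001",  # hypothetical identifier
            model="model-A",
            conflict=ConflictType.PROJECT_CONTEXT,
            subcategory="uses an identifier absent from the repository",  # placeholder
            snippet="result = repo_utils.normalize(x)",
        ),
    ]

    # Per-model counts by first-level category, in the spirit of the RQ2 comparison
    counts = Counter((a.model, a.conflict) for a in annotations)
    print(counts)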

The article is structured very clearly, not only presenting the three Research Questions (RQs) but also referring back to them as needed to explain why and how each partial result is interpreted. RQ1 (establishing a hallucination taxonomy) is, in my opinion, the most thoroughly explored. RQ2 (the LLM comparison) is clear, although it mostly presents raw results without much accompanying analysis. RQ3 (the root cause discussion) is undoubtedly interesting, but I feel it is considerably more speculative and not directly tied to the analysis performed.

After tackling their research questions, the authors venture a possible mitigation to counter the effect of hallucinations: enhancing the evaluated LLMs with retrieval-augmented generation (RAG), so that the model better understands task requirements, factual knowledge, and project context, hopefully hallucinating less. They present results showing that all of the models are consistently, if modestly, improved by the proposed RAG-based mitigation.
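
Since the review does not describe the paper's RAG implementation, the following is only a generic sketch of the idea: retrieve repository snippets relevant to the task description and prepend them to the prompt, so the model sees the project context it would otherwise miss. The retrieval scoring and function names are hypothetical, not the authors' method.

    # Hypothetical, minimal RAG-for-code-generation sketch (not the paper's system).
    # Rank repository snippets by crude textual similarity and build an augmented prompt.

    from difflib import SequenceMatcher

    def retrieve_context(task_description: str, repo_snippets: list[str], k: int = 3) -> list[str]:
        """Return the k repository snippets most similar to the task description."""
        scored = sorted(
            repo_snippets,
            key=lambda s: SequenceMatcher(None, task_description, s).ratio(),
            reverse=True,
        )
        return scored[:k]

    def build_prompt(task_description: str, repo_snippets: list[str]) -> str:
        context = "\n\n".join(retrieve_context(task_description, repo_snippets))
        return (
            "You are completing code inside an existing repository.\n"
            f"Relevant project context:\n{context}\n\n"
            f"Task:\n{task_description}\n"
        )

    # The resulting prompt would then be sent to whichever LLM is being evaluated.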

The article is clearly written and easy to read. I would have liked the authors to dedicate more space to detailing their RAG implementation, which is only briefly mentioned here; I suppose it will appear in a follow-up article. The paper should provide its target audience, which is quite specialized but numerous nowadays, with interesting insights and discussion.