The effectiveness of AI coding assistants depends largely on the human in the driver's seat. Here are eight ways to keep AI hallucinations from infecting your code.
It turns out androids do dream, and their dreams are often strange. In the early days of generative AI, we got human hands with eight fingers and recipes for making pizza sauce from glue. Now, developers working with AI-assisted coding tools are also finding AI hallucinations in their code.
"AI hallucinations in coding tools occur due to the probabilistic nature of AI models, which generate outputs based on statistical likelihoods rather than deterministic logic," explains Mithilesh Ramaswamy, a senior engineer at Microsoft. And just like that glue pizza recipe, sometimes these hallucinations escape containment.
AI coding assistants are becoming ubiquitous, and usage is growing: in the May 2024 Stack Overflow developer survey, 62% of respondents said they were using AI coding tools. So how can you prevent AI hallucinations from ruining your code? We asked developers and tech leaders experienced with AI coding assistants for their tips.
How AI hallucinations infect code
Microsoft's Ramaswamy, who works every day with AI tools, keeps a list of the sorts of AI hallucinations he encounters: "Generated code that doesn't compile; code that is overly convoluted or inefficient; and functions or algorithms that contradict themselves or produce ambiguous behavior." Additionally, he says, "AI hallucinations sometimes just make up nonexistent functions" and "generated code may reference documentation, but the described behavior doesn't match what the code does."
Komninos Chatzipapas, founder of HeraHaven.ai, gives an example of a specific problem of this type. "On our JavaScript back-end, we had a function to deduct credit from a user based on their ID," he says. "The function expected an object containing an ID value as its parameter, but the coding assistant just put the ID as the parameter." He notes that in loosely typed languages like JavaScript, problems like these are more likely to slip past language parsers. The error Chatzipapas encountered "crashed our staging environment, but was fortunately caught before pushed to production."
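To make that mismatch concrete, here is a minimal TypeScript sketch of the kind of bug described above. The deductCredits function and its parameter shape are hypothetical reconstructions, not HeraHaven's actual code; the point is that a typed language flags the bad call at compile time, which is why it slips through more easily in plain JavaScript.

```typescript
// Hypothetical reconstruction of the mismatch described above.
// The back-end function expects an options object containing the user ID:
interface DeductCreditsOptions {
  userId: string;
  amount: number;
}

function deductCredits(options: DeductCreditsOptions): void {
  console.log(`Deducting ${options.amount} credits from user ${options.userId}`);
}

// What the assistant generated: passing the ID directly instead of an object.
// In plain JavaScript this runs and only fails at runtime; in TypeScript the
// compiler rejects it ("Argument of type 'string' is not assignable...").
// deductCredits("user-123");

// The correct call, matching the declared parameter shape:
deductCredits({ userId: "user-123", amount: 5 });
```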
How does code like this slip into production? Monojit Banerjee, a lead in the AI platform organization at Salesforce, describes the code output by many AI assistants as "plausible but incorrect or non-functional." Brett Smith, distinguished software developer at SAS, notes that less experienced developers are especially likely to be misled by the AI tool's confidence, "leading to flawed code."
The consequences of flawed AI code can be significant. Security holes and compliance issues are top of mind for many software companies, but some issues are less immediately obvious. Faulty AI-generated code adds to overall technical debt, and it can detract from the efficiency code assistants are intended to boost. "Hallucinated code often leads to inefficient designs or hacks that require rework, increasing long-term maintenance costs," says Microsoft's Ramaswamy.
Fortunately, the developers we spoke with had plenty of advice about how to ensure AI-generated code is correct and secure. There were two categories of tips: how to minimize the chance of code hallucinations, and how to catch hallucinations after the fact.
Reducing AI hallucinations in your code
The ideal would of course be to never encounter AI hallucinations at all. While that's unlikely with the current state of the art, the following precautions can help reduce issues in AI-generated code.
Write clear and detailed prompts
The adage "garbage in, garbage out" is as old as computer science, and it applies to LLMs as well, especially when you're generating code by prompting rather than using an autocomplete assistant. Many of the experts we spoke to urged developers to get their prompt engineering game on point. "It's best to ask bounded questions and critically examine the results," says Andrew Sellers, head of technology strategy at Confluent. "Usage data from these tools suggest that outputs tend to be more accurate for questions with a smaller scope, and most developers will be better at catching errors by frequently examining small blocks of code."
Ask for references
LLMs like ChatGPT are notorious for making up citations in school papers and legal briefs. But code-specific tools have made great strides in that area. "Many models are supporting citation features," says Salesforce's Banerjee. "A developer should ask for citations or API reference wherever possible to minimize hallucinations."
Make sure your AI tool has trained on the latest software
Most genAI chatbots can't tell you who won your home team's baseball game last night, and they have limitations keeping up with software tools and updates as well. "One of the ways you can predict whether a tool will hallucinate or provide biased outputs is by checking its knowledge cut-offs," says Stoyan Mitov, CEO of Dreamix and co-founder of the Citizens app. "If you plan on using the latest libraries or frameworks that the tool doesn't know about, the chances that the output will be flawed are high."
Train your model to do things your way
Travis Rehl, CTO at Innovative Solutions, says what generative AI tools need to work well is "context, context, context." You need to provide good examples of what you want and how you want it done, he says. "You should tell the LLM to maintain a certain pattern, or remind it to use a consistent method so it doesn't create something new or different." If you fail to do so, you can run into a subtle type of hallucination that injects anti-patterns into your code. "Maybe you always make an API call a particular way, but the LLM chooses a different method," he says. "While technically correct, it did not follow your pattern and thus deviated from what the norm needs to be."
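As an illustration of the kind of deviation Rehl describes, here is a hedged TypeScript sketch; the apiGet wrapper and endpoint are hypothetical stand-ins for whatever conventions your team actually has.

```typescript
// Hypothetical project convention: all HTTP calls go through a shared wrapper
// that handles the auth header and error checking in one place.
async function apiGet<T>(path: string): Promise<T> {
  const response = await fetch(`https://api.example.com${path}`, {
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  });
  if (!response.ok) throw new Error(`GET ${path} failed: ${response.status}`);
  return response.json() as Promise<T>;
}

// Following the house pattern:
const user = await apiGet<{ id: string; name: string }>("/users/42");

// The kind of "technically correct" deviation an assistant may produce if the
// prompt never mentions the wrapper: a raw fetch with no auth header and no
// error handling, quietly breaking the team's convention.
const raw = await (await fetch("https://api.example.com/users/42")).json();
```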
A concept that takes this idea to its logical conclusion is retrieval-augmented generation, or RAG, in which the model uses one or more designated "sources of truth" that contain code either specific to the user or at least vetted by them. "Grounding compares the AI's output to reliable data sources, reducing the likelihood of generating false information," says Mitov. RAG is "one of the most effective grounding methods," he says. "It improves LLM outputs by utilizing data from external sources, internal codebases, or API references in real time."
Many available coding assistants already integrate RAG features; the one in Cursor is called @codebase, for instance. If you want to create your own internal codebase for an LLM to draw from, you would need to store it in a vector database; Banerjee points to Chroma as one of the most popular options.
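For a rough idea of how that might look in practice, here is a minimal sketch using the chromadb JavaScript client against a locally running Chroma server with its default embedding setup; the collection name, snippets, and prompt are illustrative only.

```typescript
import { ChromaClient } from "chromadb";

// Index vetted internal code snippets, then pull the most relevant ones into
// the prompt so the model is grounded in your own codebase.
const chroma = new ChromaClient(); // assumes a Chroma server running locally
const snippets = await chroma.getOrCreateCollection({ name: "internal-code" });

await snippets.add({
  ids: ["billing-deduct", "billing-refund"],
  documents: [
    "function deductCredits(options: { userId: string; amount: number }) { /* ... */ }",
    "function refundCredits(options: { userId: string; amount: number }) { /* ... */ }",
  ],
  metadatas: [{ module: "billing" }, { module: "billing" }],
});

// Retrieve the closest matches for the task at hand and prepend them to the prompt.
const results = await snippets.query({
  queryTexts: ["deduct credits from a user"],
  nResults: 2,
});
const context = (results.documents[0] ?? []).join("\n");
const prompt = `Using only these vetted helpers:\n${context}\n\nWrite code that charges a user 5 credits.`;
```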
Catching AI hallucinations in your code
Even with all of these protective measures, AI coding assistants will sometimes make mistakes. The good news is that hallucinations are often easier to catch in code than in applications where the LLM is writing plain text. The difference is that code is executable and can be tested. "Coding is not subjective," as Innovative Solutions' Rehl points out. "Code simply won't work when it's wrong." Experts offered a few ways to spot mistakes in generated code.
Use AI to evaluate AI-generated code
Believe it or not, AI assistants can evaluate AI-generated code for hallucinations, often to good effect. For instance, Daniel Lynch, CEO of Empathy First Media, suggests "writing supporting documentation on the code so that you can have the AI evaluate the provided code in a new instance and determine if it satisfies the requirements of the intended use case."
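A sketch of that idea might look like the following; the callModel helper is a hypothetical stand-in for whichever chat-completion API your team uses.

```typescript
// Hypothetical helper standing in for whatever chat-completion API you use.
declare function callModel(prompt: string): Promise<string>;

// Ask a fresh model instance to judge generated code against its stated
// requirements, rather than trusting the instance that produced it.
async function reviewAgainstSpec(spec: string, code: string): Promise<string> {
  const prompt = [
    "You are reviewing code against its documented requirements.",
    `Requirements:\n${spec}`,
    `Code:\n${code}`,
    "List any requirement the code does not satisfy, or reply PASS.",
  ].join("\n\n");
  return callModel(prompt);
}
```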
HeraHaven's Chatzipapas suggests that AI tools can do far more in judging output from other tools. "Scaling test-time compute deals with the issue where, for the same input, an LLM can generate a variety of responses, all with different levels of quality," he explains. "There are many ways to make it work but the simplest one is to query the LLM multiple times and then use a smaller 'verifier' AI model to pick which answer is better to present to the end user. There are also more sophisticated ways where you can cluster the different answers you get and pick one from the largest cluster (since that one has received more implied 'votes')."
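Here is a minimal sketch of the simplest version Chatzipapas describes, best-of-N sampling with a verifier; the generate and score functions are placeholders for whatever generator and verifier models you wire in.

```typescript
// Placeholders for the generator model and a smaller "verifier" model that
// scores candidate answers; swap in whichever APIs your stack provides.
declare function generate(prompt: string): Promise<string>;
declare function score(prompt: string, candidate: string): Promise<number>;

// Sample several candidates for the same prompt and let the verifier pick
// the one it rates highest.
async function bestOfN(prompt: string, n = 5): Promise<string> {
  const candidates = await Promise.all(
    Array.from({ length: n }, () => generate(prompt))
  );
  const scores = await Promise.all(candidates.map((c) => score(prompt, c)));
  const best = scores.indexOf(Math.max(...scores));
  return candidates[best];
}
```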
Maintain human involvement and expertise
Even with machine assistance, most people we spoke to saw human beings as the last line of defense against AI hallucinations, and expected human involvement to remain crucial to the coding process for the foreseeable future. "Always use AI as a guide, not a source of truth," says Microsoft's Ramaswamy. "Treat AI-generated code as a suggestion, not a replacement for human expertise."
That expertise shouldn't just be about programming generally; you should stay intimately acquainted with the code that powers your applications. "It can sometimes be hard to spot a hallucination if you're unfamiliar with a codebase," says Rehl. Hands-on experience in the codebase is critical for spotting deviations in a specific method or in the overall code pattern.
Test and review your code
Fortunately, the tools and techniques most well-run shops use to catch human errors, from IDE tools to unit tests, can also catch AI hallucinations. "Teams should continue doing pull requests and code reviews just as if the code were written by humans," says Confluent's Sellers. "It's tempting for developers to use these tools to automate more in achieving continuous delivery. While laudable, it's incredibly important for developers to prioritize QA controls when increasing automation."
"I cannot stress enough the need to use good linting tools and SAST scanners throughout the development cycle," says SAS's Smith. "IDE plugins, integration into the CI, and pull requests are the bare minimum to ensure hallucinations do not make it to production."
"A mature devops pipeline is essential, where each line of code will be unit tested during the development lifecycle," adds Salesforce's Banerjee. "The pipeline will only promote the code to staging and production after tests and builds are passed. Moreover, continuous deployment is essential to roll back code as soon as possible to avoid a long tail of any outage."
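As a small illustration of how ordinary tests catch hallucinated code, here is a hedged Vitest-style example; the billing module and its deductCredits and getBalance functions are hypothetical, echoing the earlier JavaScript example.

```typescript
import { describe, expect, it } from "vitest";
import { deductCredits, getBalance } from "./billing"; // hypothetical module

// A test written against the intended behavior fails if generated code calls
// the function with the wrong parameter shape or implements the wrong logic.
describe("deductCredits", () => {
  it("deducts the requested amount from the user's balance", async () => {
    await deductCredits({ userId: "user-123", amount: 5 });
    expect(await getBalance("user-123")).toBe(95); // assumes a 100-credit fixture
  });

  it("rejects calls that would drive the balance negative", async () => {
    await expect(
      deductCredits({ userId: "user-123", amount: 10_000 })
    ).rejects.toThrow();
  });
});
```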
Highlight AI-generated code
Devansh Agarwal, a machine learning engineer at Amazon Web Services, recommends a technique that he calls "a little experiment of mine": Use the code review UI to call out parts of the codebase that are AI-generated. "I often see hundreds of lines of unit test code being approved without any comments from the reviewer," he says, "and these unit tests are one of the use cases where I and others often use AI. Once you mark that these are AI-generated, then people take more time in reviewing them."
This doesn't just help catch hallucinations, he says. "It's a great learning opportunity for everyone in the team. Sometimes it does an amazing job and we as humans want to replicate it!"
Keep both hands on the wheel
Generative AI is ultimately a tool, nothing more and nothing less. Like all other tools, it has quirks. While using AI changes some aspects of programming and makes individual programmers more productive, its tendency to hallucinate means that human developers must remain diligent in the driver's seat. "I'm finding that coding will slowly become a QA- and product definition-heavy job," says Rehl. As a developer, "your goal will be to understand patterns, understand testing methods, and be able to articulate the business goal you want the code to achieve."


