When the goal is accuracy, consistency, mastering a game, or finding the one right answer, reinforcement learning models beat generative AI.
The rise of large language models (LLMs) such as GPT-4, with their ability to generate highly fluent, confident text, has been remarkable, as I’ve written. Sadly, so has the hype: Microsoft researchers breathlessly described the Microsoft-funded OpenAI GPT-4 model as exhibiting “sparks of artificial general intelligence.” Sorry, Microsoft. No, it doesn’t.
Unless, of course, Microsoft meant the tendency to hallucinate (generating incorrect text that is confidently wrong), which is all too human. GPTs are also bad at playing games like chess and Go, quite iffy at math, and may write code with errors and subtle bugs. Join the club, right?
None of this means that LLMs/GPTs are all hype. Not at all. Instead, it means we need some perspective and far less exaggeration in the generative artificial intelligence (GenAI) conversation.
As detailed in an IEEE Spectrum article, some experts, such as Ilya Sutskever of OpenAI, believe that adding reinforcement learning from human feedback can eliminate LLM hallucinations. But others, such as Yann LeCun of Meta and Geoff Hinton (recently retired from Google), argue that a more fundamental flaw in large language models is at work. Both believe that large language models lack non-linguistic knowledge, which is critical for understanding the underlying reality that language describes.
In an interview, Diffblue CEO Mathew Lodge argues there’s a better way: “Small, fast, and cheap-to-run reinforcement learning models handily beat massive hundred-billion-parameter LLMs at all kinds of tasks, from playing games to writing code.”
Are we looking for AI gold in the wrong places?
Shall we play a game?
As Lodge related, generative AI definitely has its place, but we may be trying to force it into areas where reinforcement learning is much better. Take games, for example.
Levy Rozman, an International Master at chess, posted a video in which he plays against ChatGPT. The model makes a series of absurd and illegal moves, including capturing its own pieces. The best open source chess software (Stockfish, which doesn’t use a large language model) had ChatGPT resigning in less than 10 moves after the LLM could not find a legal move to play. It’s an excellent demonstration that LLMs fall far short of the hype of general AI, and this isn’t an isolated example.
Google DeepMind’s AlphaGo is currently the best Go-playing AI, and it’s driven by reinforcement learning. Reinforcement learning works by (smartly) generating different solutions to a problem, trying them out, using the results to improve the next suggestion, and then repeating that process thousands of times to find the best result.
In the case of AlphaGo, the AI tries different moves and generates a prediction of whether it’s a good move and whether it is likely to win the game from that position. It uses that feedback to “follow” promising move sequences and to generate other possible moves. The effect is to conduct a search of possible moves.
The process is called probabilistic search. You can’t try every move (there are too many), but you can spend time searching areas of the move space where the best moves are likely to be found. It’s incredibly effective for game playing. AlphaGo has beaten top Go professionals in the past. It is not infallible, but it performs better than the best LLMs today.
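To make that generate-try-improve loop concrete, here is a minimal, hypothetical sketch in Python. It is not AlphaGo’s actual algorithm (which pairs deep neural networks with Monte Carlo tree search); `legal_moves`, `simulate_game`, and the small exploration bonus are placeholder assumptions standing in for a real game engine and policy.

```python
import random

# Toy sketch of the generate-try-improve loop behind probabilistic search.
# NOT AlphaGo's real algorithm; `legal_moves` and `simulate_game` are
# hypothetical stand-ins for a real game engine.

def pick_move(position, legal_moves, simulate_game, n_iterations=10_000):
    stats = {move: {"wins": 0, "plays": 0} for move in legal_moves}
    for _ in range(n_iterations):
        # Prefer moves that have looked promising so far, but keep exploring a little.
        move = max(
            legal_moves,
            key=lambda m: (stats[m]["wins"] + 1) / (stats[m]["plays"] + 2)
                          + 0.1 * random.random(),
        )
        won = simulate_game(position, move)  # play the game out and see who wins
        stats[move]["plays"] += 1
        stats[move]["wins"] += int(won)
    # The move that kept attracting simulations (and wins) is the one to play.
    return max(legal_moves, key=lambda m: stats[m]["plays"])
```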
Probability versus accuracy
When faced with evidence that LLMs significantly underperform other types of AI, proponents argue that LLMs “will get better.” According to Lodge, however, “If we’re to go along with this argument, we need to understand why they will get better at these kinds of tasks.” This is where things get difficult, he continues, because no one can predict what GPT-4 will produce for a specific prompt. The model is not explainable by humans. It’s why, he argues, “‘prompt engineering’ is not a thing.” It’s also a struggle for AI researchers to prove that “emergent properties” of LLMs exist, much less predict them, he stresses.
Perhaps the strongest argument is induction. GPT-4 is better at some language tasks than GPT-3 because it is larger. Hence, even larger models will be better. Right? Well…
“The only problem is that GPT-4 continues to struggle with the same tasks that OpenAI noted were challenging for GPT-3,” Lodge argues. Math is one of those; GPT-4 is better than GPT-3 at performing addition but still struggles with multiplication and other mathematical operations.
Making language models bigger doesn’t magically solve these hard problems, and even OpenAI says that larger models are not the answer. The reason comes down to the fundamental nature of LLMs, as noted in an OpenAI forum: “Large language models are probabilistic in nature and operate by generating likely outputs based on patterns they have observed in the training data. In the case of mathematical and physical problems, there may be only one correct answer, and the likelihood of generating that answer may be very low.”
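A back-of-the-envelope illustration shows why “very low” is the right phrase. If a model has to emit a long numerical answer token by token, and each token is merely likely rather than certain, the chance of the whole answer being exactly right shrinks multiplicatively. The 90% per-digit figure below is an invented assumption for illustration, not a measurement of any real model.

```python
# Invented illustration: probability that an answer emitted digit by digit is
# exactly right, assuming each digit independently has a 90% chance of being
# correct. Real models differ, but the multiplicative decay is the point.
per_digit_probability = 0.9
for digits in (1, 5, 10, 20):
    print(digits, "digits ->", round(per_digit_probability ** digits, 3))
# 1 digits -> 0.9
# 5 digits -> 0.59
# 10 digits -> 0.349
# 20 digits -> 0.122
```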
By contrast, AI driven by reinforcement learning is much better at producing accurate results because it is a goal-seeking process. Reinforcement learning deliberately iterates toward the desired goal, aiming to produce the answer closest to it. LLMs, notes Lodge, “are not designed to iterate or goal-seek. They are designed to give a ‘good enough’ one-shot or few-shot answer.”
A “one-shot” answer is the first one the model produces, obtained by predicting a sequence of words from the prompt. In a “few-shot” approach, the model is given additional samples or hints to help it make a better prediction. LLMs also typically incorporate some randomness (i.e., they are “stochastic”) to increase the likelihood of a better response, so they will give different answers to the same questions.
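Here is a tiny sketch of that stochastic behavior. The candidate words, their scores, and the temperature value are invented for illustration; a real LLM scores tens of thousands of vocabulary tokens.

```python
import math
import random

# Toy illustration of temperature-based (stochastic) sampling for the
# hypothetical prompt "The sky is ...". Scores are made up.
scores = {"blue": 3.0, "clear": 2.0, "grey": 1.2, "falling": 0.4}

def sample_next_word(scores, temperature=0.8):
    # Higher temperature flattens the distribution and adds more randomness.
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores), weights=weights, k=1)[0]

# Run the "same prompt" several times and you may get different completions.
print([sample_next_word(scores) for _ in range(5)])
```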
Not that the LLM world neglects reinforcement learning. GPT-4 incorporates “reinforcement learning from human feedback” (RLHF). This means the core model is subsequently trained by human operators to prefer some answers over others, but fundamentally that does not change the answers the model generates in the first place. For example, Lodge says, an LLM might generate the following alternatives to complete the sentence “Wayne Gretzky likes ice …”:
- Wayne Gretzky likes ice cream.
- Wayne Gretzky likes ice hockey.
- Wayne Gretzky likes ice fishing.
- Wayne Gretzky likes ice skating.
- Wayne Gretzky likes ice wine.
The human operator ranks the answers and will probably think a legendary Canadian ice hockey player is more likely to like ice hockey and ice skating, despite ice cream’s broad appeal. The human ranking and many more human-written responses are used to train the model. Note that GPT-4 doesn’t pretend to know Wayne Gretzky’s preferences accurately, just the most likely completion given the prompt.
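As a rough, hypothetical sketch of how one such ranking becomes training signal (real RLHF pipelines train a separate reward model on many comparisons and then fine-tune the LLM against it; this only shows the shape of the data, with an assumed ranking order):

```python
# Hypothetical sketch only: turning a single human ranking into the pairwise
# preference data a reward model is trained on.
prompt = "Wayne Gretzky likes ice"
ranked = ["hockey", "skating", "fishing", "wine", "cream"]  # best first, per the (assumed) human rater

# Every (better, worse) pair becomes one comparison the reward model should respect.
preference_pairs = [
    (prompt, better, worse)
    for i, better in enumerate(ranked)
    for worse in ranked[i + 1:]
]
print(len(preference_pairs), "comparisons from one ranking")  # prints: 10 comparisons from one ranking

# Per the article's point, the base model's generation mechanism is unchanged:
# it still predicts likely next words; RLHF only nudges which answers it prefers.
```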
In the end, LLMs are not designed to be highly accurate or consistent; they trade accuracy and deterministic behavior for generality. All of which means, for Lodge, that reinforcement learning beats generative AI for applying AI at scale.
Applying reinforcement learning to software
What about software development? As I’ve written, GenAI is already having its moment with developers who have discovered improved productivity using tools like GitHub Copilot or Amazon CodeWhisperer. That’s not speculative; it’s already happening. These tools predict what code might come next based on the code before and after the insertion point in the integrated development environment.
Indeed, as David Ramel of Visual Studio Magazine suggests, the latest version of Copilot already generates 61% of Java code. For those worried this will eliminate software developer jobs, keep in mind that such tools require diligent human supervision to check the completions and edit them so the code compiles and runs correctly. Autocomplete has been an IDE staple since the earliest days of IDEs, and Copilot and other code generators make it much more useful. But it is not large-scale autonomous coding, which is what actually writing 61% of Java code would require.
Reinforcement learning, however, can do accurate large-scale autonomous coding, Lodge says. Of course, he has a vested interest in saying so: In 2019 his company, Diffblue, released its commercial reinforcement learning-based unit test-writing tool, Cover. Cover writes full suites of unit tests without human intervention, making it possible to automate complex, error-prone tasks at scale.
Is Lodge biased? Absolutely. But he also has a lot of experience to back up his belief that reinforcement learning can outperform GenAI in software development. Today, Diffblue uses reinforcement learning to search the space of all possible test methods, write the test code automatically for each method, and select the best test among those written. The reward function for reinforcement learning is based on various criteria, including test coverage and aesthetics, such as a coding style that looks as if a human had written it. The tool creates tests for each method in an average of one second.
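In spirit, that reward-guided search looks something like the hypothetical sketch below. It is not Diffblue Cover’s actual implementation; `mutate_test`, `run_test`, `measure_coverage`, and `style_score` are placeholder hooks into an imagined test harness, and the weighting is arbitrary.

```python
# Hypothetical sketch of reward-guided search over candidate unit tests.
# Not Diffblue Cover's real algorithm; the hooks below are assumed stand-ins.

def reward(test, run_test, measure_coverage, style_score):
    if not run_test(test):                 # a test that fails to compile or run earns nothing
        return 0.0
    # Coverage dominates the reward, with a smaller bonus for human-looking style.
    return measure_coverage(test) + 0.1 * style_score(test)

def search_for_test(seed_test, mutate_test, run_test, measure_coverage,
                    style_score, iterations=200):
    best = seed_test
    best_reward = reward(best, run_test, measure_coverage, style_score)
    for _ in range(iterations):
        candidate = mutate_test(best)      # propose a variation of the current best test
        r = reward(candidate, run_test, measure_coverage, style_score)
        if r > best_reward:                # keep improvements, discard the rest
            best, best_reward = candidate, r
    return best
```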
If the goal is to automate writing 10,000 unit tests for a program no single person understands, then reinforcement learning is the only real solution, Lodge contends. “LLMs can’t compete; there’s no way for humans to effectively supervise them and correct their code at that scale, and making models larger and more complicated doesn’t fix that.”
The takeaway: The most powerful thing about LLMs is that they are general language processors. They can do language tasks they have not been explicitly trained to do. This means they can be great at content generation (copywriting) and plenty of other things. “But that doesn’t make LLMs a substitute for AI models, often based on reinforcement learning,” Lodge stresses, “which are more accurate, more consistent, and work at scale.”


