When established technologies take up the most space in training data sets, what would prompt LLMs to recommend new technologies (even if they're better)?
We're living in a strange time for software development. On the one hand, AI-driven coding assistants have shaken up a hitherto calcified IDE market. As RedMonk cofounder James Governor puts it, "suddenly we're in a position where there is a surprising amount of turbulence in the market for editors," when "everything is in play" with "so much innovation happening." Ironically, that very innovation in genAI may be stifling innovation in the software those coding assistants increasingly recommend. As AWS developer advocate Nathan Peck highlights, "the brutal truth beneath the magic of AI coding assistants" is that "they're only as good as their training data, and that stifles new frameworks."
In other words, genAI-driven tools are creating powerful feedback loops that foster winner-takes-all markets, making it hard for innovative new technologies to take root.
No room for newbies
I've written before about genAI's tendency to undermine its sources of training data. In the software development world, ChatGPT, GitHub Copilot, and other large language models (LLMs) have had a profoundly negative effect on sites like Stack Overflow, even as they've had a profoundly positive impact on developer productivity. Why ask a question on Stack Overflow when you can ask Copilot? But every time a developer does that, one fewer question goes into the public repository that feeds LLM training data.
Just as bad, we don't know whether the training data is correct in the first place. As I recently noted, "The LLMs have trained on all sorts of good and bad data from the public Internet, so it's a bit of a crapshoot as to whether a developer will get good advice from a given tool." Presumably each LLM has a way of weighting certain sources of data as more authoritative, but if so, that weighting is completely opaque. AWS, for example, is probably the best source of information on how Amazon Aurora works, but it's unclear whether developers using Copilot will see documentation from AWS or a random Q&A thread on Stack Overflow. I'd hope the LLMs would privilege the creator of a technology as the best source of information about it, but who knows?
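To be clear about what "weighting sources" could even mean, here's a purely hypothetical sketch, mine, not anything any LLM vendor has documented, of ranking candidate snippets by source authority. Every domain, score, and field name is invented for illustration:

```javascript
// Hypothetical source weighting; no LLM vendor documents working this
// way. Domains and scores are invented for illustration.
const authority = new Map([
  ["docs.aws.amazon.com", 1.0], // the technology's creator
  ["stackoverflow.com", 0.6],   // community Q&A of variable quality
]);

// Blend how on-topic a snippet is with how trusted its source is.
function score(snippet) {
  return snippet.relevance * (authority.get(snippet.source) ?? 0.1);
}

const candidates = [
  { source: "stackoverflow.com", relevance: 0.9 },
  { source: "docs.aws.amazon.com", relevance: 0.7 },
];
candidates.sort((a, b) => score(b) - score(a));
console.log(candidates.map((c) => c.source));
// => ["docs.aws.amazon.com", "stackoverflow.com"]: the creator's docs
// win despite the lower raw-relevance score.
```

If something like this happens inside commercial assistants, nobody outside can inspect the weights, which is exactly the problem.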
And then there's the inescapable feedback loop that Peck points out. It's worth quoting him at length. Here's how he describes the loop (a toy simulation of the dynamic follows the list):
- Developers choose popular incumbent frameworks because AI recommends them
- This leads to more code being written in these frameworks
- Which provides more training data for AI models
- Making the AI even better at these frameworks, and even more biased toward recommending them
- Attracting even more developers to these incumbent technologies
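To make the compounding effect concrete, here's a minimal back-of-the-envelope simulation of that loop. It's my sketch, not Peck's; the numbers and the superlinear-bias assumption (that an assistant recommends a framework disproportionately more often as its share of the training data grows) are illustrative only:

```javascript
// Toy model of the AI-recommendation feedback loop. All numbers are
// made up; the superlinear bias (squaring the data share) is an
// assumption for illustration, not a measured property of any LLM.
let incumbent = 900; // units of training data for the established framework
let newcomer = 100;  // units for the innovative new framework

function assistantRecommendsIncumbent() {
  const share = incumbent / (incumbent + newcomer);
  const w = share ** 2; // more data => disproportionately more recommendations
  return Math.random() < w / (w + (1 - share) ** 2);
}

for (let year = 1; year <= 10; year++) {
  // 1,000 developers ask the assistant, follow its advice, and each
  // contribute one more unit of code to the chosen framework's corpus.
  for (let dev = 0; dev < 1000; dev++) {
    if (assistantRecommendsIncumbent()) incumbent++;
    else newcomer++;
  }
  const pct = (100 * newcomer) / (incumbent + newcomer);
  console.log(`year ${year}: newcomer share of training data: ${pct.toFixed(1)}%`);
}
```

Run under Node or Bun, the newcomer's share decays steadily toward zero: every recommendation the incumbent wins makes the next recommendation even more likely to go its way.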
He then describes how this affects him as a JavaScript developer. JavaScript has been a hotbed of innovation over the years, with a new framework seemingly emerging every other day. I wrote about this back in 2015, and that frenetic pace has continued for the past decade. It won't necessarily continue, though, as Peck details, because LLMs actively discourage developers from trying something new. Peck describes working with the new Bun runtime: "I've seen firsthand how LLM-based assistants try to push me away from using the Bun native API, back to vanilla JavaScript implementations that look like something I could have written 10 years ago."
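It's easy to picture what that push looks like. Here's a hypothetical, minimal contrast, my example, not Peck's code, between Bun's native file API and the Node-era pattern an assistant steeped in older training data tends to suggest instead. Both run under Bun today:

```javascript
// Illustrative only (my example, not Peck's). Two ways to read a file
// under Bun; both work today.
import { readFile } from "node:fs/promises";

// The vanilla, Node-era pattern an assistant trained mostly on
// pre-Bun code tends to suggest:
const legacyConfig = await readFile("config.json", "utf8");

// Bun's native API, which Peck says assistants steer him away from:
const config = await Bun.file("config.json").text();
```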
Why? Because that's what the volume of training data tells the LLMs to suggest. The rich get richer, in other words, and new options struggle to get noticed at all. That's always been somewhat true, of course, but now it's institutionalized by data-driven tools that don't listen to anything beyond sheer volume of data.
As Peck concludes, this "creates an uphill battle for innovation." It's always hard to launch or choose a new technology, but AI coding assistants make it that much harder. He offers a provocative but apt example: If ChatGPT had been "invented before Kubernetes reached mainstream adoption…, I don't think there would have ever been a Kubernetes." The LLMs would have pushed developers toward Mesos or other already available options rather than the new (but eventually superior) one. What to do?
Open it up
It's not clear how we resolve this looming problem. We're still in the "wow, this is cool!" phase of AI coding assistants, and rightly so. But at some point, the tax we're paying will become evident, and we'll need to figure out how to extricate ourselves from the hole we're digging.
One thing seems clear: As much as closed-source options may have worked in the past, it's hard to see how they can survive in the future. As Gergely Orosz posits, "LLMs will be better in languages they have more training on," and almost by definition, they'll have more access to open source technologies. "Open source code is high-quality training," he argues, and starving the LLMs of training data by locking up one's code, documentation, etc., is a terrible strategy.
So that's one good outcome of this seemingly inescapable LLM feedback loop: more open code. It doesn't solve the problem of LLMs being biased toward older, established code and thereby inhibiting innovation, but it at least pushes us in the right direction for software generally.