What do ChatGPT and other large language models owe to the human creators who provide the information they train on? What if creators stop making their insights publicly available?
In tech we are all, ultimately, parasites. As Drupal creator Dries Buytaert said years ago, we are all more "taker" than "maker." Buytaert was referring to common practice in open source communities: "Takers don't contribute back meaningfully to the open source project that they take from," hurting the projects upon which they depend. Even the most ardent open source contributor takes more than she contributes.
This same parasitic trend has played out for Google, Facebook, and Twitter (each dependent on others' content) and is arguably much more true of generative AI (GenAI) today. Sourcegraph developer Steve Yegge dramatically declares, "LLMs aren't just the biggest change since social, mobile, or cloud; they're the biggest thing since the World Wide Web," and he's likely correct. But those large language models (LLMs) are essentially parasitic in nature: They depend on scraping others' repositories of code (GitHub), technology answers (Stack Overflow), literature, and much more.
As has happened in open source, content creators and aggregators are starting to wall off LLM access to their content. In light of declining site traffic, for example, Stack Overflow has joined Reddit in demanding that LLM creators pay for the right to use their data for training, as detailed by Wired. It's a bold move, reminiscent of the licensing wars that have played out in open source and of the paywalls publishers imposed to ward off Google and Facebook. But will it work?
Overgrazing the commons
I'm sure the history of technology parasites predates open source, but that's when my career started, so I'll begin there. Since the earliest days of Linux and MySQL, there have been companies set up to profit from others' contributions. Most recently in Linux, for example, Rocky Linux and AlmaLinux both promise "bug for bug compatibility" with Red Hat Enterprise Linux (RHEL) while contributing nothing toward Red Hat's success. Indeed, the natural conclusion of these two RHEL clones' success would be to eliminate their host, leading to their own demise, which is why one person in the Linux space called them the "dirtbags" of open source.
Perhaps that's too colorful a phrase, but you see the point. It's the same criticism once lobbed at AWS (a "strip-mining" charge that loses relevance by the day), and it has motivated a number of closed source licensing permutations, business model contortions, and seemingly endless discussion about open source sustainability.
Open source, of course, has never been stronger. Individual open source projects, however, have varying degrees of health. Some projects (and project maintainers) have figured out how to manage "takers" within their communities; others have not. On the whole, though, open source keeps growing in importance and strength.
Draining the well
This brings us to the LLMs. Large enterprises such as JPMorgan Chase are spending billions of dollars and hiring more than 1,000 data scientists, machine learning engineers, and others to drive billion-dollar impact in personalization, analytics, and more. Although many enterprises have been skittish about publicly embracing things like ChatGPT, the reality is that their developers are already using LLMs to drive productivity gains.
The cost of those gains is only now becoming clear, and it falls on companies like Stack Overflow that have historically been the source of those productivity improvements.
For example, traffic to Stack Overflow has declined by an average of 6% every month since January 2022, and it dropped a precipitous 13.9% in March 2023, as detailed by Similarweb. It's likely an oversimplification to blame ChatGPT and other GenAI-driven tools for the entire decline, but it would also be naive to think they're not involved.
Just ask Peter Nixey, founder of Intentional.io and a top 2% user on Stack Overflow, with answers that have reached more than 1.7 million developers. Despite his prominence on Stack Overflow, Nixey says, "It's unlikely I'll ever write anything there again." Why? Because LLMs like ChatGPT threaten to drain the pool of knowledge on Stack Overflow.
"What happens when we stop pooling our knowledge with each other and instead pour it straight into The Machine?" Nixey asks. By "The Machine" he means GenAI tools such as ChatGPT. It's fantastic to get answers from an AI tool like GitHub Copilot, for example, which was trained on GitHub repositories, Stack Overflow Q&A, and more. But those questions, asked in private, build no public repository of information the way Stack Overflow does. "So while GPT-4 was trained on all of the questions asked before 2021 [on Stack Overflow], what will GPT-6 train on?" he asks.
One-way information highways
See the problem? It's not trivial, and it may be more serious than what we've haggled over in open source land. "If this pattern replicates elsewhere and the direction of our collective knowledge alters from outward to humanity to inward into the machine then we are dependent on it in a way that supersedes all of our prior machine dependencies," he suggests. To put it mildly, this is a problem. "Like a fast-growing COVID-19 variant, AI will become the dominant source of knowledge simply by virtue of growth," he stresses. "If we take the example of Stack Overflow, that pool of human knowledge that used to belong to us may be reduced down to a mere weighting inside the transformer."
There's a lot at stake, and not just the copious quantities of cash that keep flowing into AI. We also need to take stock of the relative worth of the information generated by things like ChatGPT. Stack Overflow, for example, banned ChatGPT-derived answers in December 2022 because they were text-rich and information-poor: "Because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking and looking for correct answers [emphasis in original]." Tools like ChatGPT aren't designed to yield correct information, just probabilistic output that fits patterns in their training data. In other words, open source might be filled with "dirtbags," but without a steady stream of good training data, LLMs may simply replenish themselves with garbage information, becoming less useful over time.
I'm not disparaging the promise of LLMs and GenAI generally. As with open source, news publishers, and more, we can be grateful for OpenAI and other companies that help us harness collectively produced information while still cheering on contributors like Reddit (itself an aggregator of individual contributions) for expecting payment for the parts they play. Open source had its licensing wars, and it looks like we're about to have something similar in the world of GenAI, but with bigger consequences.


