Apache Airflow is a great tool for building data pipelines as code, but having most of its contributors work for Astronomer is another example of a problem with open source.
Depending on your politics, trickle-down economics never worked all that well in the United States under President Ronald Reagan. In open source software, however, it seems to be doing just fine.
I'm not really talking about economic policies, of course, but rather about elite software engineering teams releasing code that ends up powering the not-so-elite mainstream. Take Lyft, for example, which released the popular Envoy project. Or Google, which gave the world Kubernetes (though, as I've argued, the goal wasn't charitable niceties, but rather corporate strategy to outflank the dominant AWS). Airbnb figured out a way to move beyond batch-oriented cron scheduling, gifting us Apache Airflow and data pipelines-as-code.
Today a wide array of mainstream enterprises depend on Airflow, from Walmart to Adobe to Marriott. Though its community includes developers from Snowflake, Cloudera, and more, a majority of the heavy lifting is done by engineers at Astronomer, which employs 16 of the top 25 committers. Astronomer puts this stewardship and expertise to good use, operating a fully managed Airflow service called Astro, but it's not the only one. Unsurprisingly, the clouds have been quick to create their own services without contributing commensurate code back, which raises concerns about sustainability.
That code isn't going to write itself if it can't pay for itself.
What's a data pipeline, anyway?
Today everyone is talking about large language models (LLMs), retrieval-augmented generation (RAG), and other generative AI (genAI) acronyms, just as 10 years ago we couldn't get enough of Apache Hadoop, MySQL, etc. The names change, but data remains, with the ever-present concern for how best to move that data between systems.
This is where Airflow comes in.
In some ways, Airflow is like a seriously upgraded cron job scheduler. Companies start with isolated systems, which eventually need to be stitched together. Or, rather, the data needs to flow between them. As an industry, we've invented all sorts of ways to manage these data pipelines, but as data increases, the systems to manage that data proliferate, not to mention the ever-increasing sophistication of the interactions between these components. It's a nightmare, as the Airbnb team wrote when open sourcing Airflow: "If you consider a fast-paced, medium-sized data team for a few years on an evolving data infrastructure and you have a massively complex network of computation jobs on your hands, this complexity can become a significant burden for the data teams to manage, or even comprehend."
Written in Python, Airflow naturally speaks the language of data. Think of it as connective tissue that gives developers a consistent way to plan, orchestrate, and understand how data flows between every system. A significant and growing swath of the Fortune 500 depends on Airflow for data pipeline orchestration, and the more they use it, the more valuable it becomes. Airflow is increasingly critical to enterprise data supply chains.
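The core idea behind pipelines-as-code is simple: declare tasks and their dependencies in ordinary Python, then let a scheduler run them in dependency order. Airflow's real API (DAGs, operators, scheduling intervals, retries, backfills) is far richer, but the essence can be sketched with nothing more than the standard library's `graphlib`; the `extract`/`transform`/`load` task names below are illustrative, not Airflow code:

```python
from graphlib import TopologicalSorter

# A toy "pipeline as code": each task is a plain function, and the DAG
# dict declares each task's upstream dependencies -- the same idea
# Airflow expresses with operators and its >> dependency syntax.
results = {}

def extract():
    results["extract"] = [1, 2, 3]  # stand-in for pulling rows from a source system

def transform():
    results["transform"] = [x * 10 for x in results["extract"]]

def load():
    results["load"] = sum(results["transform"])  # stand-in for writing to a warehouse

TASKS = {"extract": extract, "transform": transform, "load": load}

# Dependency graph: transform needs extract; load needs transform.
DAG = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_pipeline():
    # TopologicalSorter yields tasks in an order that respects the
    # dependency graph -- the heart of what a scheduler like Airflow
    # does, minus retries, backfills, intervals, and distribution.
    order = list(TopologicalSorter(DAG).static_order())
    for name in order:
        TASKS[name]()
    return order

if __name__ == "__main__":
    print(run_pipeline(), results["load"])
```

Because the pipeline is code rather than a crontab entry, it can be versioned, reviewed, and tested like any other software, which is a large part of Airflow's appeal.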
So letโs go back to the question of money.
Code isn't going to write itself
There's a solid community around Airflow, but perhaps 55% or more of the code is contributed by people who work for Astronomer. This puts the company in a great position to support Airflow in production for its customers (through its managed Astro service), but it also puts the project at risk. No, not from Astronomer exercising undue influence on the project. Apache Software Foundation projects are, by definition, never single-company projects. Rather, the risk comes from Astronomer potentially deciding that it can't financially justify its level of investment.
This is where the allegations of "open source rug pulling" lose their potency. As I've recently argued, we have a trillion-dollar free-rider problem in open source. We've always had some semblance of this issue. No company contributes out of charity; it's always about self-interest. One problem is that it can take a long time for companies to understand that their self-interest should compel them to contribute (as happened when Elastic changed its license and AWS discovered that it had to protect billions of dollars in revenue by forking Elasticsearch). This delayed recognition is exacerbated when someone else foots the bill for development.
It's just too easy to let someone else do the work while you skim the profit.
Consider Kubernetes. It's rightly considered a poster child for community, but look at how concentrated the community contributions are. Since inception, Google has contributed 28% of the code. The next largest contributor is Red Hat, with 11%, followed by VMware with 8%, then Microsoft at 5%. Everyone else is a relative rounding error, including AWS (1%), even though AWS dwarfs everyone else in revenue earned from Kubernetes. This is completely fair, as the license allows it. But what happens if Google decides it's not in the company's self-interest to keep doing so much development for others' gain?
One possibility (and the contributor data may support this conclusion) is that companies will recalibrate their investments. For example, over the past two years, Google's share of contributions fell to 20%, and Red Hat's dropped to 8%. Microsoft, for its part, increased its relative share of contributions to 8%, and AWS, while still relatively tiny, jumped to 2%. Maybe good communities are self-correcting?
Which brings us back to the question of data.
It's Python's world
Because Airflow is built in Python, and Python seems to be every developer's second language (if not their first), it's easy for developers to get started. More importantly, perhaps, it's also easy for them to stop thinking about data pipelines at all. Data engineers don't really want to maintain data pipelines. They want that plumbing to fade into the background, as it were.
How to make that happen isn't immediately obvious, particularly given the absolute chaos of today's data/AI landscape, as captured by FirstMark Capital. Airflow, particularly with a managed service like Astronomer's Astro, makes it straightforward to preserve optionality (lots of choices in that FirstMark chart) while streamlining the maintenance of pipelines between systems.
This is a big deal that will keep getting bigger as data sources proliferate. That "big deal" should show up more in the contributor table. Today Astronomer developers are the driving force behind Airflow releases. It would be great to see other companies up their contributions, too, commensurate with the revenue they'll no doubt derive from Airflow.


