Apache Airflow is a great tool for building data pipelines as code, but having most of its contributors work for Astronomer is another example of a problem with open source.
Depending on your politics, trickle-down economics never worked all that well in the United States under President Ronald Reagan. In open source software, however, it seems to be doing just fine.
I'm not really talking about economic policies, of course, but rather about elite software engineering teams releasing code that ends up powering the not-so-elite mainstream. Take Lyft, for example, which released the popular Envoy project. Or Google, which gave the world Kubernetes (though, as I've argued, the goal wasn't charitable niceties, but rather corporate strategy to outflank the dominant AWS). Airbnb figured out a way to move beyond batch-oriented cron scheduling, gifting us Apache Airflow and data pipelines-as-code.
Today a wide array of mainstream enterprises depend on Airflow, from Walmart to Adobe to Marriott. Though its community includes developers from Snowflake, Cloudera, and more, a majority of the heavy lifting is done by engineers at Astronomer, which employs 16 of the top 25 committers. Astronomer puts this stewardship and expertise to good use, operating a fully managed Airflow service called Astro, but it's not the only one. Unsurprisingly, the clouds have been quick to create their own services without contributing commensurate code back, which raises concerns about sustainability.
That code isn't going to write itself if it can't pay for itself.
What's a data pipeline, anyway?
Today everyone is talking about large language models (LLMs), retrieval-augmented generation (RAG), and other generative AI (genAI) acronyms, just as 10 years ago we couldn't get enough of Apache Hadoop, MySQL, etc. The names change, but data remains, with the ever-present concern for how best to move that data between systems.
This is where Airflow comes in.
In some ways, Airflow is like a seriously upgraded cron job scheduler. Companies start with isolated systems, which eventually need to be stitched together. Or, rather, the data needs to flow between them. As an industry, we've invented all sorts of ways to manage these data pipelines, but as data increases, the systems to manage that data proliferate, not to mention the ever-increasing sophistication of the interactions between these components. It's a nightmare, as the Airbnb team wrote when open sourcing Airflow: "If you consider a fast-paced, medium-sized data team for a few years on an evolving data infrastructure and you have a massively complex network of computation jobs on your hands, this complexity can become a significant burden for the data teams to manage, or even comprehend."
Written in Python, Airflow naturally speaks the language of data. Think of it as connective tissue that gives developers a consistent way to plan, orchestrate, and understand how data flows between every system. A significant and growing swath of the Fortune 500 depends on Airflow for data pipeline orchestration, and the more they use it, the more valuable it becomes. Airflow is increasingly critical to enterprise data supply chains.
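The core idea behind pipelines-as-code is simple: declare tasks and their dependencies in ordinary Python, then let a scheduler run them in dependency order. Airflow's real API (DAGs, operators, scheduling intervals, retries, backfills) is far richer, but the essence can be sketched with nothing more than the standard library's `graphlib`; the `extract`/`transform`/`load` task names below are illustrative, not Airflow code:

```python
from graphlib import TopologicalSorter

# A toy "pipeline as code": each task is a plain function, and the DAG
# dict declares each task's upstream dependencies -- the same idea
# Airflow expresses with operators and its >> dependency syntax.
results = {}

def extract():
    results["extract"] = [1, 2, 3]  # stand-in for pulling rows from a source system

def transform():
    results["transform"] = [x * 10 for x in results["extract"]]

def load():
    results["load"] = sum(results["transform"])  # stand-in for writing to a warehouse

TASKS = {"extract": extract, "transform": transform, "load": load}

# Dependency graph: transform needs extract; load needs transform.
DAG = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_pipeline():
    # TopologicalSorter yields tasks in an order that respects the
    # dependency graph -- the heart of what a scheduler like Airflow
    # does, minus retries, backfills, intervals, and distribution.
    order = list(TopologicalSorter(DAG).static_order())
    for name in order:
        TASKS[name]()
    return order

if __name__ == "__main__":
    print(run_pipeline(), results["load"])
```

Because the pipeline is code rather than a crontab entry, it can be versioned, reviewed, and tested like any other software, which is a large part of Airflow's appeal.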
So letโs go back to the question of money.
Code isn't going to write itself
There's a solid community around Airflow, but perhaps 55% or more of the code is contributed by people who work for Astronomer. This puts the company in a great position to support Airflow in production for its customers (through its managed Astro service), but it also puts the project at risk. No, not from Astronomer exercising undue influence on the project. Apache Software Foundation projects are, by definition, never single-company projects. Rather, the risk comes from Astronomer potentially deciding that it can't financially justify its level of investment.
This is where the allegations of "open source rug pulling" lose their potency. As I've recently argued, we have a trillion-dollar free-rider problem in open source. We've always had some semblance of this issue. No company contributes out of charity; it's always about self-interest. One problem is that it can take a long time for companies to understand that their self-interest should compel them to contribute (as happened when Elastic changed its license and AWS discovered that it had to protect billions of dollars in revenue by forking Elasticsearch). This delayed recognition is exacerbated when someone else foots the bill for development.
It's just too easy to let someone else do the work while you skim the profit.
Consider Kubernetes. It's rightly considered a poster child for community, but look at how concentrated the community contributions are. Since inception, Google has contributed 28% of the code. The next largest contributor is Red Hat, with 11%, followed by VMware with 8%, then Microsoft at 5%. Everyone else is a relative rounding error, including AWS (1%), even though AWS dwarfs everyone else in revenue earned from Kubernetes. This is completely fair, as the license allows it. But what happens if Google decides it's not in the company's self-interest to keep doing so much development for others' gain?
One possibility (and the contributor data may support this conclusion) is that companies will recalibrate their investments. For example, over the past two years, Google's share of contributions fell to 20%, and Red Hat's dropped to 8%. Microsoft, for its part, increased its relative share of contributions to 8%, and AWS, while still relatively tiny, jumped to 2%. Maybe good communities are self-correcting?
Which brings us back to the question of data.
It's Python's world
Because Airflow is built in Python, and Python seems to be every developer's second language (if not their first), it's easy for developers to get started. More importantly, perhaps, it's also easy for them to stop thinking about data pipelines at all. Data engineers don't really want to maintain data pipelines. They want that plumbing to fade into the background, as it were.
How to make that happen isn't immediately obvious, particularly given the absolute chaos of today's data/AI landscape, as captured by FirstMark Capital. Airflow, particularly with a managed service like Astronomer's Astro, makes it straightforward to preserve optionality (lots of choices in that FirstMark chart) while streamlining the maintenance of pipelines between systems.
This is a big deal that will keep getting bigger as data sources proliferate. That "big deal" should show up more in the contributor table. Today Astronomer developers are the driving force behind Airflow releases. It would be great to see other companies up their contributions, too, commensurate with the revenue they'll no doubt derive from Airflow.


