AI systems need humans to ground them in the enterprise's unique history, policies, changes, and institutional knowledge before they can produce real business value.
Here's the awkward truth about today's "smart" AI: It's great at syntax, mediocre at semantics, and really bad at business context. That last bit matters because most enterprise value hides in the seams: how your organization defines an active customer, which discount codes apply on Tuesdays, which SKU names changed after the acquisition, and why revenue means something different to the finance department than to the sales team.
Models can ace academic tests and even crank out reasonable SQL. Drop them behind the firewall inside a real company, however, and they stumble. Badly.
Tom Tunguz highlights a sharp example: The Spider 2.0 benchmarks test how well models translate natural language into SQL across realistic enterprise databases. Models peak around 59% exact-match accuracy and fall to roughly 40% once the tasks add transformation and code-generation complexity. These aren't toy data sets; they reflect the messy, sprawling schemas that real enterprises run in production. In other words, the closer we get to real business context, the more the AI struggles.
If you build enterprise software, this shouldn't surprise you. As I've noted, developers' primary issue with AI isn't whether it can spit out code; it's whether they can trust it, consistently, on their data and their rules. That's the "almost-right" tax: You spend time debugging and fact-checking what the model produced because it doesn't quite understand your specifics.
Why business context is hard for AI
Large models are mostly pattern engines trained on public text. Your business logic (how you calculate churn, the way your sales territories work, the subtle differences between two nearly identical product lines) isn't on the public web. That information lives in Jira tickets, PowerPoints, institutional knowledge, and databases whose schemas are artifacts of past decisions (and the key to enterprise AI's memory). Even the data model fights you: tables with a thousand columns, renamed fields, leaky dimensions, and terminology that drifts with each reorg.
Spider 2.0 measures that reality, which is why scores drop as tasks get closer to actual workflows: multi-step queries, joins across unfamiliar schemas, dialect differences, and transformations in DBT. Meanwhile, the enterprise is moving toward agentic models that can browse, run code, or query databases, which only magnifies the risk when the model's understanding is off.
Put differently: Business context isn't just data; it's policy plus process plus history. AI gets the shape of the problem but not the lived reality.
Can we fix this?
The good news is we don't need a philosophical breakthrough in understanding. We just need better engineering around the model: memory, grounding, governance, and feedback. I've made the case that AI doesn't need more parameters as much as it needs more memory: structured ways to keep track of what happened before and to retrieve the domain data and definitions that matter. Do that well and you narrow the trust gap.
Is the problem fully solvable? In bounded domains, yes. You can make an AI assistant that's reliable on your finance metrics, your customer tables, your DBT models, and your security policies. But business context is a moving target, and humans will keep changing the rules. That means you'll always want humans (including developers, of course) in the loop to clarify intent, adjudicate edge cases, and evolve the system to keep up with the business. The goal isn't to eliminate people; it's to turn them into context engineers who teach systems how the business actually works. Here's how to get there.
First, if you want reliable answers about your business, the model has to see your business. That starts with retrieval-augmented generation (RAG) that feeds the model the right slices of data and metadata (DDL, schema diagrams, DBT models, even a few representative row samples) before it answers. For text-to-SQL specifically, include table and column descriptions, lineage notes, and known join keys. Retrieval should draw on governed sources (catalogs, metric stores, lineage graphs), not just a vector soup of PDFs. Spider 2.0's results make a simple point: When models face unfamiliar schemas, they guess. So we need to reduce unfamiliarity for the models.
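To make that concrete, here's a minimal sketch in Python of what assembling that context might look like. The TableContext shape and the catalog and retriever mentioned at the end are illustrative stand-ins, not any specific product's API:

```python
# A minimal sketch of grounding a text-to-SQL prompt; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class TableContext:
    name: str
    ddl: str                          # CREATE TABLE statement
    description: str                  # curated table/column descriptions
    join_keys: list[str] = field(default_factory=list)
    sample_rows: list[dict] = field(default_factory=list)

def build_sql_prompt(question: str, contexts: list[TableContext]) -> str:
    """Assemble DDL, descriptions, join keys, and row samples into the prompt."""
    parts = ["You translate questions into SQL. Use ONLY the tables below."]
    for ctx in contexts:
        parts.append(ctx.ddl)
        parts.append(f"-- {ctx.description}")
        parts.append(f"-- Known join keys: {', '.join(ctx.join_keys)}")
        parts.append(f"-- Sample rows: {ctx.sample_rows[:3]}")
    parts.append(f"Question: {question}")
    parts.append("SQL:")
    return "\n".join(parts)

# In practice the contexts would come from governed sources, e.g. a hypothetical
#   contexts = [catalog.table_context(t) for t in retriever.top_k_tables(question, k=5)]
```

The exact prompt format matters less than the discipline: Every table the model sees arrives with its curated description and known join keys attached.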
Second, most AI apps are amnesiacs. They start fresh each request, unaware of what came before. You thus need to add layered memory (working, long-term, and episodic memory). The heart of this memory is the database. Databases, especially ones that can store embeddings, metadata, and event logs, are becoming critical to AI's "mind." Memory elevates the model from pattern-matching to context-carrying.
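A sketch of what that layering might look like, assuming a store that can hold embeddings, metadata, and an event log (the store's method names here are hypothetical):

```python
# Illustrative layered memory backed by a database; the store API is assumed.
import time

class LayeredMemory:
    def __init__(self, store):
        self.store = store    # e.g., a vector-capable database client
        self.working = []     # working memory: the current session's turns

    def remember_event(self, event: dict) -> None:
        # Episodic memory: an append-only log of what happened and was decided.
        self.store.append_event({**event, "ts": time.time()})

    def learn_fact(self, text: str, source: str) -> None:
        # Long-term memory: embed durable definitions so they can be retrieved
        # later, e.g., "active customers exclude status_code in (3, 5)".
        self.store.upsert_embedding(text, metadata={"source": source})

    def context_for(self, query: str, k: int = 5) -> list[str]:
        # Combine the current conversation with the k most relevant stored facts.
        facts = self.store.similar(query, k=k)
        return list(self.working) + [f.text for f in facts]
```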
Third, free-form text invites ambiguity; structured interfaces reduce it. For text-to-SQL, consider emitting an abstract syntax tree (AST) or a restricted SQL dialect that your execution layer validates and expands. Snap queries to known dimensions/measures in your semantic layer. Use function/tool calling, not just prose, so the model asks for get_metric('active_users', date_range='Q2') rather than guessing table names. The more you treat the model like a planner using reliable building blocks, the less it hallucinates.
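Here's a rough sketch of that pattern: a function-calling schema plus a validator that snaps requests to the semantic layer. The metric names and the executor are illustrative assumptions:

```python
# A sketch of a semantic-layer tool the model calls instead of writing raw SQL.
KNOWN_METRICS = {"active_users", "net_revenue", "churn_rate"}
KNOWN_RANGES = {"Q1", "Q2", "Q3", "Q4", "MTD", "YTD"}

# Function-calling schema handed to the model alongside the prompt.
GET_METRIC_TOOL = {
    "name": "get_metric",
    "description": "Fetch a governed metric from the semantic layer.",
    "parameters": {
        "type": "object",
        "properties": {
            "metric": {"type": "string", "enum": sorted(KNOWN_METRICS)},
            "date_range": {"type": "string", "enum": sorted(KNOWN_RANGES)},
        },
        "required": ["metric", "date_range"],
    },
}

def get_metric(metric: str, date_range: str):
    # Validate against the semantic layer before touching the warehouse, so the
    # model can never invent table or column names.
    if metric not in KNOWN_METRICS:
        raise ValueError(f"unknown metric: {metric}")
    if date_range not in KNOWN_RANGES:
        raise ValueError(f"unknown date range: {date_range}")
    return execute_metric_query(metric, date_range)

def execute_metric_query(metric: str, date_range: str):
    # Stand-in for the real semantic-layer executor.
    raise NotImplementedError
```

The enum constraints do double duty: They steer the model toward valid arguments, and the validator rejects anything that slips through.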
Fourth, humans shouldn't spend all day correcting commas in SQL. Build an approval flow that focuses attention where ambiguity is highest. For example, highlight risky joins, show previews with row-level diffs against known-good queries, capture structured feedback ("status_code in (3,5) should be excluded from active customers"), and push it back into memory and retrieval. Over time, your system becomes a better context learner because your experts are training it implicitly as they do their jobs.
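In code, that flow might look something like this sketch, which reuses the learn_fact idea from the memory example above; the risk threshold and review hook are hypothetical:

```python
# Illustrative approval flow: escalate only high-ambiguity queries and turn
# reviewer corrections into retrievable memory.
from dataclasses import dataclass

@dataclass
class ReviewDecision:
    approved: bool
    reviewer: str = ""
    correction: str = ""   # e.g., "status_code in (3,5) excluded from active customers"
    revised_sql: str = ""

RISK_THRESHOLD = 0.7  # tune to taste; risky joins and unfamiliar tables score high

def review_query(candidate_sql: str, risk_score: float, memory) -> str:
    if risk_score < RISK_THRESHOLD:
        return candidate_sql  # low ambiguity: no human attention needed

    decision = ask_reviewer(candidate_sql)
    if decision.approved:
        return candidate_sql

    # Push the structured correction back into memory and retrieval so the
    # system learns the rule, not just this one fix.
    memory.learn_fact(decision.correction, source=f"review:{decision.reviewer}")
    return decision.revised_sql

def ask_reviewer(sql: str) -> ReviewDecision:
    # Stand-in for your review UI; in practice this blocks on a human expert.
    raise NotImplementedError
```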
Fifth, measure what matters. Benchmarks are useful, but your KPI should be "helped the finance team close the quarter accurately," not "passed Spider 2.0 at 70%." Hence, you need to build task-specific assessments. Can the system produce the three canonical revenue queries? Does it respect access controls 100% of the time? Run these evaluations nightly. Spider 2.0 also shows that the more realistic the workflow (think Spider2-V's multi-step, GUI-spanning tasks), the more room there is to fail. Your evaluations should match that realism.
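A nightly evaluation harness can be as simple as the sketch below; the golden tasks, answers, and the assistant and warehouse interfaces are all placeholders for your own:

```python
# Illustrative task-specific nightly evaluation against known-good answers.
GOLDEN_TASKS = [
    {
        "question": "What was net revenue for Q2?",
        "expected": [("Q2", 1_250_000)],  # a known-good answer your team maintains
        "enforce_acl": True,              # must respect access controls every time
    },
    # ...the rest of your canonical revenue queries
]

def run_nightly_evals(assistant, warehouse) -> dict:
    results = {"passed": 0, "failed": []}
    for task in GOLDEN_TASKS:
        sql = assistant.answer(task["question"])
        rows = warehouse.execute(sql, enforce_acl=task["enforce_acl"])
        if rows == task["expected"]:
            results["passed"] += 1
        else:
            results["failed"].append(task["question"])
    return results  # page someone if "failed" is non-empty before the quarter closes
```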
People and machines
All this should make it clear that however sophisticated AI may get, we're still going to need people to make it work well. That's a feature, not a bug.
The business context problem is engineering-solvable within a scope. With the right grounding, memory, constraints, evaluations, and security, you can build systems that answer enterprise questions reliably most of the time. You'll shrink the "almost-right" tax significantly. But context is social. It's negotiated in quarterly business reviews and hallway conversations. New products launch, legal policies change, someone tweaks a definition, a merger redraws everything. That continual renegotiation guarantees you'll want human judgment in the loop.
The role of developers shifts accordingly. They go from code generators to context engineers: curators of semantic layers, authors of policy as code, designers of retrieval and memory, and stewards of the feedback loops that keep AI aligned with reality. That's also why developers remain indispensable even as AI gets better. The more we automate, the more valuable it is to have someone who understands both the machine and the business.
If youβre trying to make AI useful in your company, aim for a system that remembers, retrieves, and respects:
- Remembers what happened and what's been decided (layered memory)
- Retrieves the right internal truth at the right moment (governed grounding)
- Respects your policies, people, and processes (authorization that travels with the task)
Do that and your AI will feel less like a clever autocomplete and more like a colleague who actually gets it. Not because the model magically developed common sense, but because you engineered the surrounding system to supply it.
That's the real story behind Spider 2.0's sobering scores, which are not an indictment of AI but a blueprint for where to invest. If your model isn't delivering on business context, the fix isn't a different model so much as a different architecture, one that pairs the best of human intelligence with the best of artificial intelligence. In my experience, that partnership is not just inevitable. It's the point.


