Having a "vision" is not enough. Enterprises need clear objectives, solid data, and a design plan with built-in evaluations and humans in the loop.
Every day brings a new, better large language model (LLM) or a new approach to finding signal in all the AI noise. It's exhausting to try to keep up. But here's a comforting yet uncomfortable truth about enterprise AI: Most of what's loud today won't persist tomorrow. While models trend like memes, frameworks spawn like rabbits, and at any given moment a new "this time it's different" pattern elbows yesterday's breakthrough into irrelevance, the reality is you don't need to chase every shiny AI object. You just need to master a handful of durable skills and decisions that compound over time.
Think of these durable skills and decisions as the "operating system" of enterprise AI work: the core upon which everything else runs. Get those elements right and all the other stuff, whether agents, retrieval-augmented generation (RAG), memory, or whatever gets rebranded next, becomes a plug-in.
Focus on the job, not the model
The most consequential AI decision is figuring out what problem you're trying to solve in the first place. This sounds obvious, yet most AI projects still begin with, "We should use agents!" instead of, "We need to cut case resolution times by 30%." Most AI failures trace back to unclear objectives, lack of data readiness (more on that below), and lack of evaluation. Success starts with defining the business problem and establishing key performance indicators (KPIs). This seems ridiculously simple. You can't declare victory if you haven't established what victory looks like. However, this all-important first step is commonly overlooked, as I've noted.
Hence, it's critical to translate the business goal into a crisp task spec:
- Inputs: what the system actually receives (structured fields, PDFs, logs)
- Constraints: latency, accuracy thresholds, regulatory boundaries
- Success definition: the metric the business will celebrate (fewer escalations, faster cycle time, lower cost per ticket, etc.)
This task spec drives everything else: whether you even need generative AI (often you won't), which patterns fit, and how you'll prove value. It's also how you stop your project from growing into an unmaintainable "AI experience" that does many things poorly.
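To make that concrete, here is a minimal sketch of a task spec written down as code rather than as a slide. The field names, thresholds, and the ticket-triage example are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

# A hypothetical task spec for a support-ticket triage assistant.
# Field names and thresholds are illustrative, not a standard schema.
@dataclass
class TaskSpec:
    name: str
    inputs: list[str]            # what the system actually receives
    latency_budget_ms: int       # interactive ceiling for the whole flow
    accuracy_floor: float        # minimum acceptable eval score (0-1)
    regulated_fields: list[str]  # inputs that trigger policy review
    success_metric: str          # the number the business will celebrate

ticket_triage = TaskSpec(
    name="support-ticket-triage",
    inputs=["ticket_text", "customer_tier", "product_area"],
    latency_budget_ms=1500,
    accuracy_floor=0.92,
    regulated_fields=["customer_tier"],
    success_metric="case resolution time reduced by 30%",
)
```

Writing the spec this way also forces the awkward questions (Which fields are regulated? What latency will users tolerate?) before anyone argues about models.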
Make data clean, governed, and retrievable
Your enterprise's advantage is not your model; it's your data, but "we have a lot of data" is not a strategy. Useful AI depends on three things:
- Fitness for use: You want data that's clean enough, labeled enough, and recent enough for the task. Perfection is a tax you don't need to pay; fitness is what matters. Long before genAI became a thing, I wrote, "For years we've oversold the glamorous side of data science … while overlooking the simple reality that much of data science is cleaning and preparing data, and this aspect of data science is fundamental to doing data science well." That's never been more true.
- Governance: Know what data you can use, how you can use it, and under what policy.
- Retrievability: You need to get the right slice of data to the model at inference time. That's not a model problem; it's a data modeling and indexing problem.
Approaches to retrieval-augmented generation will continue to morph, but here's a principle that won't: The system can only be as good as the context you retrieve. As I've suggested, without organization-specific context such as policies, data, and workflows, even great models will miss the point. We therefore must invest in:
- Document normalization: Consistent formats and chunking should align with how your users ask questions.
- Indexing strategy: Hybrid search (lexical plus vector) is table stakes; tune for the tasks you actually run.
- Freshness pipelines: Your index is a dynamic asset, not a quarterly project. Memory is the "killer app" for AI, as I've written, but much of that memory must be kept fresh and recent to be useful, particularly for real-time applications.
- Meta-permissions: Retrieval must respect row/column/object-level access, not just "who can use the chatbot."
In other words, treat your retrieval layer like an API contract. Stability and clarity there outlast any particular RAG library.
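Here is a minimal sketch of what such a contract could look like in Python. The Passage, Retriever, and user_permissions names are hypothetical; the point is a stable boundary that any index or RAG library can sit behind, with access control enforced inside it.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Passage:
    doc_id: str
    text: str
    score: float
    source: str   # where the chunk came from, for citation and audit

class Retriever(Protocol):
    def retrieve(self, query: str, user_permissions: set[str],
                 top_k: int = 8) -> list[Passage]:
        """Return only passages the caller is allowed to see, best-scored first."""
        ...

def build_context(retriever: Retriever, query: str,
                  user_permissions: set[str]) -> str:
    # The model only sees what the contract returns; permission filtering
    # happens inside the retriever, not in the prompt.
    passages = retriever.retrieve(query, user_permissions)
    return "\n\n".join(f"[{p.source}] {p.text}" for p in passages)
```

Swap the index, the chunking scheme, or the RAG library behind this interface and nothing upstream has to change.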
Evaluation is software testing for AI (run it like CI)
If your "evaluation" is two PMs and a demo room, you don't have evaluation because LLMs fail gracefully right up until they don't. The way out is automated, repeatable, task-aligned evals. Great AI requires systematic, skeptical evaluation, not vibes-driven development. Hence, success depends on treating model behavior like crash-test engineering, not magic. This means the use of golden sets (representative prompts/inputs and expected outputs, ideally derived from real production traces), numeric- and rubric-based scoring, guardrail checks, and regression gates (no new model, prompt, or retrieval change ships without passing your evaluation suite).
Evaluations are how you get off the treadmill of endless prompt fiddling and onto a track where improvements are proven. They also enable developers to swap models in or out with confidence. You wouldn't ship back-end code without tests, so stop shipping AI that way.
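A minimal sketch of such a regression gate, assuming a golden set stored as JSONL and a generate() stand-in for your model-plus-retrieval pipeline (the file name, threshold, and scoring rule are all placeholders you would replace):

```python
import json
import pytest  # run in CI so an eval failure blocks the release

def load_golden_set(path: str = "golden_set.jsonl") -> list[dict]:
    # Each line: {"input": "...", "expected": "..."}, ideally from prod traces.
    with open(path) as f:
        return [json.loads(line) for line in f]

def score(output: str, expected: str) -> float:
    # Replace with your rubric: exact match, similarity, LLM-as-judge, etc.
    return 1.0 if expected.lower() in output.lower() else 0.0

def generate(prompt: str) -> str:
    # Stand-in for your real model-plus-retrieval pipeline.
    raise NotImplementedError("wire this to your inference gateway")

@pytest.mark.parametrize("case", load_golden_set())
def test_no_regression(case):
    output = generate(case["input"])
    assert score(output, case["expected"]) >= 0.9
```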
Design systems, not demos
The earliest wins in enterprise AI came from heroic demos. You know, the stuff you wade through on X all day. ("Wow, I can't believe I can create a full-length movie with a two-line prompt!") That hype-ware has its place, but truly great AI is dull, as I've noted. "Anyone who's pushed real software to production knows that getting code to compile, pass tests, and run reliably in the wild is a far tougher slog than generating the code in the first place."
Sustainable wins come from composable systems with boring interfaces:
- Inference gateways abstract model selection behind a stable API.
- Orchestration layers sequence tools: Retrieval → Reasoning → Action → Verification.
- State and memory are explicit: short-term (per task), session-level (per user), and durable (auditable).
- Observability from logs, traces, cost and latency telemetry, and drift detection.
"AI agents" will keep evolving, but they're just planners plus tools plus policies. In an enterprise, the policies (permissions, approvals, escalation paths) are the hard part. Build those in early.
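For illustration, here is a minimal sketch of a gateway that routes by task and applies a policy check before anything ships. The route table, model names, and requires_approval() rule are assumptions, not any particular product's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Request:
    task: str          # e.g., "summarize", "draft_refund"
    prompt: str
    user_role: str

# Route by task requirements, not by whichever model is trending this week.
ROUTES: dict[str, str] = {
    "summarize": "small-fast-model",
    "draft_refund": "larger-reasoning-model",
}

def requires_approval(task: str, user_role: str) -> bool:
    # Policies live as auditable code/data, not buried inside a prompt.
    return task == "draft_refund" and user_role != "support_manager"

def handle(request: Request, call_model: Callable[[str, str], str]) -> str:
    model = ROUTES.get(request.task, "small-fast-model")
    draft = call_model(model, request.prompt)
    if requires_approval(request.task, request.user_role):
        return "[PENDING HUMAN APPROVAL]\n" + draft
    return draft
```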
Latency, cost, and UX are product features
Enterprises don't abandon AI because it's "not smart enough." They abandon it because it's too slow, too expensive, or too weird for users. Here are a few examples:
- Latency: For interactive flows, aim under ~700ms for visible progress and under ~1.5s for a "feels instant" reply. This will have a huge impact on your customer experience. Use smaller or distilled models wherever you can and stage responses (e.g., quick summary first, deep analysis on demand).
- Cost: Track tokens like a P&L. Cache aggressively (semantic caching matters; see the sketch after this list), reuse embeddings, and pick models by task need, not ego. Most tasks don't need your largest model (or a model at all).
- UX: Users want predictability more than surprise. Offer controls ("cite sources," "show steps"), affordances to correct errors ("edit query," "thumbs down retrain"), and consistent failure modes.
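On the cost point, here is a minimal sketch of semantic caching. The embed(), call_model(), and cosine_similarity() callables and the 0.95 threshold are placeholders you would supply from your own stack.

```python
# (embedding, cached response) pairs; in production this would live in a
# vector store rather than an in-process list.
CACHE: list[tuple[list[float], str]] = []
SIMILARITY_THRESHOLD = 0.95

def cached_completion(prompt: str, embed, call_model, cosine_similarity) -> str:
    query_vec = embed(prompt)
    for vec, response in CACHE:
        if cosine_similarity(query_vec, vec) >= SIMILARITY_THRESHOLD:
            return response               # near-duplicate prompt: zero model cost
    response = call_model(prompt)         # cache miss: pay for tokens once
    CACHE.append((query_vec, response))
    return response
```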
AI doesn't change the laws of enterprise "physics." If you can show "we cut average handle time by 19% at $0.03 per interaction," your budget conversations around AI become easy, just like any other enterprise technology.
Security, privacy, and compliance are essential design inputs
Nothing kills momentum faster than a late-stage "Legal says no." Bring them in early and design with constraints as first-class requirements. Enough said. This is the shortest section but arguably the most important.
Keep people in the loop
The fastest way to production is rarely "full autonomy." It's human-in-the-loop: Assist → Suggest → Approve → Automate. You start with the AI doing the grunt work (drafts, summaries, extractions), and your people verify. Over time, your evals and telemetry make specific steps safe to auto-approve.
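One minimal sketch of that progression: record human review outcomes per step and promote a step to auto-approve only when the telemetry supports it. The step names, thresholds, and helper callables are illustrative assumptions.

```python
# Approval mode per pipeline step; everything starts with a human in the loop.
APPROVAL_MODE: dict[str, str] = {"extract_fields": "human", "draft_reply": "human"}

def maybe_promote(step: str, accepted: int, reviewed: int) -> None:
    # Promote only after enough human reviews show a very high accept rate.
    if reviewed >= 500 and accepted / reviewed >= 0.99:
        APPROVAL_MODE[step] = "auto"

def route_output(step: str, output: str, send_to_reviewer, ship) -> None:
    if APPROVAL_MODE.get(step, "human") == "auto":
        ship(output)                 # evals and telemetry earned this step autonomy
    else:
        send_to_reviewer(output)     # a person still catches the costly 1%
```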
There are at least two benefits to this approach. The first is quality: Humans catch the 1% that wrecks trust. The second is adoption: Your team feels augmented, not replaced. That matters if you want real usage rather than quiet revolt. It's also essential since the best approach to AI (in software development and beyond) augments skilled people with fast-but-unthinking AI.
Portability or "don't marry your model"
Andy Oliver is right: "The latest GPT, Claude, Gemini, and o-series models have different strengths and weaknesses, so it pays to mix and match." Not only that, but the models are in constant flux, as is their pricing and, very likely, your enterprise's risk posture. As such, you don't want to be hardwired to any particular model. If swapping a model means rewriting your app, you only built a demo, not a system. You also built a problem. Hence, successful deployments follow these principles:
- Abstract behind an inference layer with consistent request/response schemas (including tool call formats and safety signals).
- Keep prompts and policies versioned outside code so you can A/B and roll back without redeploying.
- Dual run during migrations: Send the same request to old and new models and compare via evaluation harness before cutting over.
Portability isn't just insurance; it's how you negotiate better with vendors and adopt improvements without fear.
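To make the dual-run idea concrete, here is a minimal sketch that replays a golden set against both models and compares scores before cutting over. The call_model() and score() callables and the model names are assumptions supplied by your own gateway and eval harness.

```python
def dual_run(golden_set: list[dict], call_model, score,
             old_model: str = "current-prod-model",
             new_model: str = "candidate-model") -> bool:
    old_total, new_total = 0.0, 0.0
    for case in golden_set:
        old_total += score(call_model(old_model, case["input"]), case["expected"])
        new_total += score(call_model(new_model, case["input"]), case["expected"])
    # Cut over only if the candidate is at least as good on your own traffic.
    return new_total >= old_total
```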
Things that matter less than you think
I've been talking about how to ensure success, yet surely some (many!) people who have read up to this point are thinking, "Sure, but really it's about prompt engineering." Or a better model. Or whatever. These are AI traps. Don't get carried away by:
- The perfect prompt. Good prompts help; great retrieval, evaluations, and UX help more.
- The biggest model. Most enterprise tasks thrive on right-sized models plus strong context. Context is the key.
- Tomorrow's acronym. Agents, RAG, memory: these are ingredients. Data, evaluation, and orchestration are what make it all work.
- A single vendor to rule them all. Consolidation is nice, but only if your abstractions keep you from being stuck.
These principles and pitfalls may sound sexy and new when applied to AI, but they're the same things that make or break enterprise applications, generally. Ultimately, the vendors and enterprises that win in AI will be those that deliver exceptional developer experience or that follow the principles I've laid out and avoid the pitfalls.


