Having a "vision" is not enough. Enterprises need clear objectives, solid data, and a design plan with built-in evaluations and humans in the loop.
Every day brings a new, better large language model (LLM) or a new approach to finding signal in all the AI noise. It's exhausting to try to keep up. But here's a comforting yet uncomfortable truth about enterprise AI: Most of what's loud today won't persist tomorrow. While models trend like memes, frameworks spawn like rabbits, and at any given moment a new "this time it's different" pattern elbows yesterday's breakthrough into irrelevance, the reality is you don't need to chase every shiny AI object. You just need to master a handful of durable skills and decisions that compound over time.
Think of these durable skills and decisions as the "operating system" of enterprise AI work: the core upon which everything else runs. Get those elements right and all the other stuff, whether agents, retrieval-augmented generation (RAG), memory, or whatever gets rebranded next, becomes a plug-in.
Focus on the job, not the model
The most consequential AI decision is figuring out what problem you're trying to solve in the first place. This sounds obvious, yet most AI projects still begin with, "We should use agents!" instead of, "We need to cut case resolution times by 30%." Most AI failures trace back to unclear objectives, lack of data readiness (more on that below), and lack of evaluation. Success starts with defining the business problem and establishing key performance indicators (KPIs). This seems ridiculously simple. You can't declare victory if you haven't established what victory looks like. However, this all-important first step is commonly overlooked, as I've noted.
Hence, it's critical to translate the business goal into a crisp task spec:
- Inputs: what the system actually receives (structured fields, PDFs, logs)
- Constraints: latency, accuracy thresholds, regulatory boundaries
- Success definition: the metric the business will celebrate (fewer escalations, faster cycle time, lower cost per ticket, etc.)
This task spec drives everything else: whether you even need generative AI (often you won't), which patterns fit, and how you'll prove value. It's also how you stop your project from growing into an unmaintainable "AI experience" that does many things poorly.
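To make that concrete, here is a minimal sketch of a task spec written down as code rather than as a slide. The field names, thresholds, and the ticket-triage example are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

# A hypothetical task spec for a support-ticket triage assistant.
# Field names and thresholds are illustrative, not a standard schema.
@dataclass
class TaskSpec:
    name: str
    inputs: list[str]            # what the system actually receives
    latency_budget_ms: int       # interactive ceiling for the whole flow
    accuracy_floor: float        # minimum acceptable eval score (0-1)
    regulated_fields: list[str]  # inputs that trigger policy review
    success_metric: str          # the number the business will celebrate

ticket_triage = TaskSpec(
    name="support-ticket-triage",
    inputs=["ticket_text", "customer_tier", "product_area"],
    latency_budget_ms=1500,
    accuracy_floor=0.92,
    regulated_fields=["customer_tier"],
    success_metric="case resolution time reduced by 30%",
)
```

Writing the spec this way also forces the awkward questions (Which fields are regulated? What latency will users tolerate?) before anyone argues about models.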
Make data clean, governed, and retrievable
Your enterprise's advantage is not your model; it's your data, but "we have a lot of data" is not a strategy. Useful AI depends on three things:
- Fitness for use: You want data that's clean enough, labeled enough, and recent enough for the task. Perfection is a tax you don't need to pay; fitness is what matters. Long before genAI became a thing, I wrote, "For years we've oversold the glamorous side of data science … while overlooking the simple reality that much of data science is cleaning and preparing data, and this aspect of data science is fundamental to doing data science well." That's never been more true.
- Governance: Know what data you can use, how you can use it, and under what policy.
- Retrievability: You need to get the right slice of data to the model at inference time. That's not a model problem; it's a data modeling and indexing problem.
Approaches to retrieval-augmented generation will continue to morph, but here's a principle that won't: The system can only be as good as the context you retrieve. As I've suggested, without organization-specific context such as policies, data, and workflows, even great models will miss the point. We therefore must invest in:
- Document normalization: Consistent formats and chunking should align with how your users ask questions.
- Indexing strategy: Hybrid search (lexical plus vector) is table stakes; tune for the tasks you actually run.
- Freshness pipelines: Your index is a dynamic asset, not a quarterly project. Memory is the "killer app" for AI, as I've written, but much of that memory must be kept fresh and recent to be useful, particularly for real-time applications.
- Meta-permissions: Retrieval must respect row/column/object-level access, not just "who can use the chatbot."
In other words, treat your retrieval layer like an API contract. Stability and clarity there outlast any particular RAG library.
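Here is a minimal sketch of what such a contract could look like in Python. The Passage, Retriever, and user_permissions names are hypothetical; the point is a stable boundary that any index or RAG library can sit behind, with access control enforced inside it.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Passage:
    doc_id: str
    text: str
    score: float
    source: str   # where the chunk came from, for citation and audit

class Retriever(Protocol):
    def retrieve(self, query: str, user_permissions: set[str],
                 top_k: int = 8) -> list[Passage]:
        """Return only passages the caller is allowed to see, best-scored first."""
        ...

def build_context(retriever: Retriever, query: str,
                  user_permissions: set[str]) -> str:
    # The model only sees what the contract returns; permission filtering
    # happens inside the retriever, not in the prompt.
    passages = retriever.retrieve(query, user_permissions)
    return "\n\n".join(f"[{p.source}] {p.text}" for p in passages)
```

Swap the index, the chunking scheme, or the RAG library behind this interface and nothing upstream has to change.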
Evaluation is software testing for AI (run it like CI)
If your "evaluation" is two PMs and a demo room, you don't have evaluation because LLMs fail gracefully right up until they don't. The way out is automated, repeatable, task-aligned evals. Great AI requires systematic, skeptical evaluation, not vibes-driven development. Hence, success depends on treating model behavior like crash-test engineering, not magic. This means the use of golden sets (representative prompts/inputs and expected outputs, ideally derived from real production traces), numeric- and rubric-based scoring, guardrail checks, and regression gates (no new model, prompt, or retrieval change ships without passing your evaluation suite).
Evaluations are how you get off the treadmill of endless prompt fiddling and onto a track where improvements are proven. They also enable developers to swap models in or out with confidence. You wouldn't ship back-end code without tests, so stop shipping AI that way.
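A minimal sketch of such a regression gate, assuming a golden set stored as JSONL and a generate() stand-in for your model-plus-retrieval pipeline (the file name, threshold, and scoring rule are all placeholders you would replace):

```python
import json
import pytest  # run in CI so an eval failure blocks the release

def load_golden_set(path: str = "golden_set.jsonl") -> list[dict]:
    # Each line: {"input": "...", "expected": "..."}, ideally from prod traces.
    with open(path) as f:
        return [json.loads(line) for line in f]

def score(output: str, expected: str) -> float:
    # Replace with your rubric: exact match, similarity, LLM-as-judge, etc.
    return 1.0 if expected.lower() in output.lower() else 0.0

def generate(prompt: str) -> str:
    # Stand-in for your real model-plus-retrieval pipeline.
    raise NotImplementedError("wire this to your inference gateway")

@pytest.mark.parametrize("case", load_golden_set())
def test_no_regression(case):
    output = generate(case["input"])
    assert score(output, case["expected"]) >= 0.9
```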
Design systems, not demos
The earliest wins in enterprise AI came from heroic demos. You know, the stuff you wade through on X all day. ("Wow, I can't believe I can create a full-length movie with a two-line prompt!") That hype-ware has its place, but truly great AI is dull, as I've noted. "Anyone who's pushed real software to production knows that getting code to compile, pass tests, and run reliably in the wild is a far tougher slog than generating the code in the first place."
Sustainable wins come from composable systems with boring interfaces:
- Inference gateways abstract model selection behind a stable API.
- Orchestration layers sequence tools: Retrieval → Reasoning → Action → Verification.
- State and memory are explicit: short-term (per task), session-level (per user), and durable (auditable).
- Observability from logs, traces, cost and latency telemetry, and drift detection.
"AI agents" will keep evolving, but they're just planners plus tools plus policies. In an enterprise, the policies (permissions, approvals, escalation paths) are the hard part. Build those in early.
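For illustration, here is a minimal sketch of a gateway that routes by task and applies a policy check before anything ships. The route table, model names, and requires_approval() rule are assumptions, not any particular product's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Request:
    task: str          # e.g., "summarize", "draft_refund"
    prompt: str
    user_role: str

# Route by task requirements, not by whichever model is trending this week.
ROUTES: dict[str, str] = {
    "summarize": "small-fast-model",
    "draft_refund": "larger-reasoning-model",
}

def requires_approval(task: str, user_role: str) -> bool:
    # Policies live as auditable code/data, not buried inside a prompt.
    return task == "draft_refund" and user_role != "support_manager"

def handle(request: Request, call_model: Callable[[str, str], str]) -> str:
    model = ROUTES.get(request.task, "small-fast-model")
    draft = call_model(model, request.prompt)
    if requires_approval(request.task, request.user_role):
        return "[PENDING HUMAN APPROVAL]\n" + draft
    return draft
```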
Latency, cost, and UX are product features
Enterprises don't abandon AI because it's "not smart enough." They abandon it because it's too slow, too expensive, or too weird for users. Here are a few examples:
- Latency: For interactive flows, aim under ~700ms for visible progress and under ~1.5s for a "feels instant" reply. This will have a huge impact on your customer experience. Use smaller or distilled models wherever you can and stage responses (e.g., quick summary first, deep analysis on demand).
- Cost: Track tokens like a P&L. Cache aggressively (semantic caching matters; see the sketch after this list), reuse embeddings, and pick models by task need, not ego. Most tasks don't need your largest model (or a model at all).
- UX: Users want predictability more than surprise. Offer controls ("cite sources," "show steps"), affordances to correct errors ("edit query," "thumbs down retrain"), and consistent failure modes.
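On the cost point, here is a minimal sketch of semantic caching. The embed(), call_model(), and cosine_similarity() callables and the 0.95 threshold are placeholders you would supply from your own stack.

```python
# (embedding, cached response) pairs; in production this would live in a
# vector store rather than an in-process list.
CACHE: list[tuple[list[float], str]] = []
SIMILARITY_THRESHOLD = 0.95

def cached_completion(prompt: str, embed, call_model, cosine_similarity) -> str:
    query_vec = embed(prompt)
    for vec, response in CACHE:
        if cosine_similarity(query_vec, vec) >= SIMILARITY_THRESHOLD:
            return response               # near-duplicate prompt: zero model cost
    response = call_model(prompt)         # cache miss: pay for tokens once
    CACHE.append((query_vec, response))
    return response
```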
AI doesn't change the laws of enterprise "physics." If you can show "we cut average handle time by 19% at $0.03 per interaction," your budget conversations around AI become easy, just like any other enterprise technology.
Security, privacy, and compliance are essential design inputs
Nothing kills momentum faster than a late-stage "Legal says no." Bring them in early and design with constraints as first-class requirements. Enough said. This is the shortest section but arguably the most important.
Keep people in the loop
The fastest way to production is rarely "full autonomy." It's human-in-the-loop: Assist → Suggest → Approve → Automate. You start with the AI doing the grunt work (drafts, summaries, extractions), and your people verify. Over time, your evals and telemetry make specific steps safe to auto-approve.
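One minimal sketch of that progression: record human review outcomes per step and promote a step to auto-approve only when the telemetry supports it. The step names, thresholds, and helper callables are illustrative assumptions.

```python
# Approval mode per pipeline step; everything starts with a human in the loop.
APPROVAL_MODE: dict[str, str] = {"extract_fields": "human", "draft_reply": "human"}

def maybe_promote(step: str, accepted: int, reviewed: int) -> None:
    # Promote only after enough human reviews show a very high accept rate.
    if reviewed >= 500 and accepted / reviewed >= 0.99:
        APPROVAL_MODE[step] = "auto"

def route_output(step: str, output: str, send_to_reviewer, ship) -> None:
    if APPROVAL_MODE.get(step, "human") == "auto":
        ship(output)                 # evals and telemetry earned this step autonomy
    else:
        send_to_reviewer(output)     # a person still catches the costly 1%
```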
There are at least two benefits to this approach. The first is quality: Humans catch the 1% that wrecks trust. The second is adoption: Your team feels augmented, not replaced. That matters if you want real usage rather than quiet revolt. It's also essential since the best approach to AI (in software development and beyond) augments skilled people with fast-but-unthinking AI.
Portability or "don't marry your model"
Andy Oliver is right: "The latest GPT, Claude, Gemini, and o-series models have different strengths and weaknesses, so it pays to mix and match." Not only that, but the models are in constant flux, as is their pricing and, very likely, your enterprise's risk posture. As such, you don't want to be hardwired to any particular model. If swapping a model means rewriting your app, you only built a demo, not a system. You also built a problem. Hence, successful deployments follow these principles:
- Abstract behind an inference layer with consistent request/response schemas (including tool call formats and safety signals).
- Keep prompts and policies versioned outside code so you can A/B and roll back without redeploying.
- Dual run during migrations: Send the same request to old and new models and compare via evaluation harness before cutting over.
Portability isn't just insurance; it's how you negotiate better with vendors and adopt improvements without fear.
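To make the dual-run idea concrete, here is a minimal sketch that replays a golden set against both models and compares scores before cutting over. The call_model() and score() callables and the model names are assumptions supplied by your own gateway and eval harness.

```python
def dual_run(golden_set: list[dict], call_model, score,
             old_model: str = "current-prod-model",
             new_model: str = "candidate-model") -> bool:
    old_total, new_total = 0.0, 0.0
    for case in golden_set:
        old_total += score(call_model(old_model, case["input"]), case["expected"])
        new_total += score(call_model(new_model, case["input"]), case["expected"])
    # Cut over only if the candidate is at least as good on your own traffic.
    return new_total >= old_total
```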
Things that matter less than you think
I've been talking about how to ensure success, yet surely some (many!) people who have read up to this point are thinking, "Sure, but really it's about prompt engineering." Or a better model. Or whatever. These are AI traps. Don't get carried away by:
- The perfect prompt. Good prompts help; great retrieval, evaluations, and UX help more.
- The biggest model. Most enterprise tasks thrive on right-sized models plus strong context. Context is the key.
- Tomorrow's acronym. Agents, RAG, memory: these are ingredients. Data, evaluation, and orchestration are what make it all work.
- A single vendor to rule them all. Consolidation is nice, but only if your abstractions keep you from being stuck.
These principles and pitfalls may sound sexy and new when applied to AI, but they're the same things that make or break enterprise applications, generally. Ultimately, the vendors and enterprises that win in AI will be those that deliver exceptional developer experience or that follow the principles I've laid out and avoid the pitfalls.


