Matt Asay
Contributing Writer

The tough task of making AI code production-ready

With AI introducing errors and security vulnerabilities as it writes code, humans still have a vital role in testing and evaluation. New AI-based review software hopes to help solve the problem.


Developers are increasingly turning to large language models (LLMs) to crank out code at astonishing volumes. As much as 41% of all code is now written by machines, totaling 256 billion lines in 2024 alone. Even Google, which employs some of the best and brightest developers in the industry, now relies on AI to write upwards of 25% of its code. If this sounds like the promised land of software development (more code, faster, while developers sip Mai Tais on the beach), the truth is not so rosy. After all, anyone who's pushed real software to production knows that getting code to compile, pass tests, and run reliably in the wild is a far tougher slog than generating the code in the first place. As I've noted, "LLM-generated code isn't magically bug-free or self-maintaining." Quite the opposite.

In fact, faster code creation may actually slow code readiness because of the increased need to clean, debug, and harden that code for production. As NativeLink CEO Marcus Eagan puts it, given that "agents have minds of their own," it becomes critical to be able to identify and contain "the behavioral drift between test environments and production environments." Indeed, the gap between code generation and production deployment is the elephant in the AI-dev room, prompting the question: Who will do the hard work of compiling, testing, and polishing all this new AI-written code?

People are people

Here's the uncomfortable truth: As much as we may want robots to do all our work for us, humans still own every hard, critical step that happens after the code is written. AI-generated code often uses incorrect libraries, violates build constraints, and overlooks subtle logic errors. According to a recent survey of 500 engineering leaders, AI models have a knack for introducing subtle bugs and vulnerabilities alongside the boilerplate they generate: 59% reported that AI-generated code introduced errors at least half the time, and 67% said they now spend more time debugging AI-written code than their own. Additionally, 68% of those surveyed said they now spend extra effort to fix security vulnerabilities injected by AI suggestions.

Catch that? Rather than eliminating developer work, AI often shifts the burden further downstream into QA and operations.

That downstream effort is potentially harder with AI as well, because instead of correcting their own mistakes, developers now need to tackle unfamiliar code. One developer spent 27 days letting an AI agent handle all code and fixes (1,700+ commits with almost no human edits). He found that simple bugs can become hour-long exercises in carefully prompting the AI to fix its own mistakes. "What would be a 5-minute fix for a human often turned into hours of guiding the AI," he reported, thanks to the AI's tendency to go off track or introduce new issues while trying to solve existing ones.

In other words, instead of replacing humans, AI is creating new roles and workflows for people. Developers increasingly serve as supervisors, mentors, and validators, reviewing AI-generated code, correcting its mistakes, and ensuring it integrates smoothly into existing systems. In short, the developer's job isn't going away; it's evolving, as I've said.

Using machines to fix machines

Companies and open source projects are emerging to address these gaps, automating code validation and testing to complement human oversight. Not surprisingly, many use AI tools to tackle AI deficiencies. A few examples:

  • AI-enhanced quality scanning: Tools like SonarQube and Snyk now use AI to detect bugs, security issues, and vulnerabilities specifically in AI-generated code. Sonar, for instance, introduced an AI-powered tool set to flag and even automatically fix common coding issues before they merge into your project.
  • Automated test generation: Diffblue Cover leverages AI to generate robust unit tests for Java code, dramatically speeding up the testing phase (up to 250 times faster) and reducing a major bottleneck for human developers. NativeLink, an open source build cache and remote execution server, helps companies streamline their build processes, cutting build times from days to hours. Tools like these become critical for staying ahead of the volume of AI-generated code.
  • AI-assisted code reviews: GitHub Copilot is previewing automated pull request reviews, flagging potential bugs and security flaws before human reviewers even look at the code. Amazonโ€™s CodeGuru and Sourcegraph Cody similarly offer AI-driven debugging and code analysis.
  • Agentic pipelines: Projects like Zencoder are pioneering multi-agent AI pipelines where specialized bots collaboratively produce, test, refine, and review code, significantly boosting the odds it's production-ready from the outset.
  • Secure runtime testing environments: E2B and other platforms provide secure sandbox environments that let AI-written code execute in isolation, automatically checking for compile-time or runtime issues before code reaches human hands.

Getting the most from AI

Even with these advancements, skilled developers remain essential to good software. There are good (and bad) ways to mix human ingenuity with the brute force of machine-written code. What can development teams do today to manage the deluge of AI-generated code and ensure it's production-ready? I'm glad you asked.

First, treat AI output as a first draft, not final code. Rather than taking AI-generated code as an unquestioned gift, it pays to cultivate a culture of skepticism. Just as you'd review a junior developer's work, so too should you mandate reviews for AI-generated code. Have senior engineers or code owners give it a thorough look and never, ever deploy AI-written code without reading and testing it.

Second, integrate quality checks into your pipeline. Static analysis, linting, and security scanning should be non-negotiable parts of continuous integration whenever AI code is introduced. Many continuous integration/continuous delivery (CI/CD) tools (Jenkins, GitHub Actions, GitLab CI, etc.) can run suites like SonarQube, ESLint, Bandit, or Snyk on each commit. Enable those checks for all code, especially AI-generated snippets, to catch bugs early. As Sonar's motto suggests, ensure "all code, regardless of origin, meets quality and security standards" before it merges.
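
To make that concrete, here is a minimal sketch of what such a gate can look like as a single step in your pipeline, written in Python so it runs the same way under Jenkins, GitHub Actions, or GitLab CI. The specific tools (ruff for linting, bandit for security scanning) and the file name are illustrative assumptions rather than a prescription; swap in whatever scanners your stack already uses.

# quality_gate.py -- illustrative CI gate; assumes ruff and bandit are installed.
# Run it as a CI step so every commit, AI-written or not, passes the same checks.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],          # static analysis / linting
    ["bandit", "-r", "src", "-q"],   # security scanning for common Python issues
]

def main() -> int:
    failures = 0
    for cmd in CHECKS:
        print(f"Running: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failures += 1
    if failures:
        print(f"{failures} check(s) failed; blocking the merge.")
        return 1
    print("All quality checks passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())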

Third, as covered above, you should start leveraging AI for testing, not just coding. AI can help write unit tests or even generate test data. For example, GitHub Copilot can assist in drafting unit tests for functions, and dedicated tools like Diffblue Cover can bulk-generate tests for legacy code. This saves time and also forces AI-generated code to prove itself. Adopt a mindset of "trust, but verify." If the AI writes a function, have it also supply a handful of test cases, then run them automatically.
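
As a sketch of what that loop looks like in practice: imagine the slugify function below came back from an AI assistant, and you (or the assistant itself, on request) supply pytest cases before it merges. Both the function and the tests are hypothetical, shown only to illustrate the verify-before-trust workflow.

# test_slugify.py -- hypothetical example: slugify() stands in for an AI-generated
# function, and the tests are what you ask the AI (or a test tool) to supply with it.
import re
import pytest

def slugify(title: str) -> str:
    """Turn a title into a URL slug (imagine this came from an AI assistant)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

@pytest.mark.parametrize(
    "title, expected",
    [
        ("Hello, World!", "hello-world"),
        ("  Multiple   Spaces  ", "multiple-spaces"),
        ("Already-a-slug", "already-a-slug"),
        ("", ""),  # edge case the generated code must handle, not just the happy path
    ],
)
def test_slugify(title, expected):
    assert slugify(title) == expected

Wiring a file like this into CI means the generated function has to earn its way in rather than being waved through on faith.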

Fourth, if your organization hasn't already, create a policy on how developers should (and shouldn't) use AI coding tools. Define acceptable use cases (boilerplate generation, examples) and forbidden ones (handling sensitive logic or secrets). Encourage developers to label or comment AI-generated code in pull requests. This helps reviewers know where extra scrutiny is needed. Also, consider licensing implications; make sure any AI-derived code complies with your code licensing policies to avoid legal headaches.

Fifth, as I've written, using AI effectively requires more, not less, developer skill in certain areas. As such, you need to upskill your team on reading and debugging code. Teach them secure coding practices so they can spot when the AI introduces a SQL injection or buffer overflow. Encourage a testing mindset. Developers should think in terms of writing the test before trusting the function that Copilot gave them. In short, focus on developing "AI literacy" among your programmers; they need to understand both the capabilities and the blind spots of these tools.
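
For example, here is the kind of injection bug reviewers should be trained to catch, shown with Python's built-in sqlite3 module. The vulnerable query is a plausible pattern an assistant might produce, not a quote from any particular model, and the fix is the standard parameterized query.

# sql_injection_demo.py -- illustrative only: the kind of bug reviewers need to catch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

user_input = "alice' OR '1'='1"  # attacker-controlled value

# Pattern an AI assistant might plausibly generate: string-built SQL (vulnerable).
rows = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'"
).fetchall()
print("vulnerable query returned:", rows)     # returns rows it never should

# What a reviewer should insist on: a parameterized query.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print("parameterized query returned:", rows)  # returns nothing, as expected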

Sixth, and perhaps most obviously, get started by piloting new AI-augmented tools. Perhaps it will feel most natural to start by enabling Copilot's automatic pull request review in a few repositories to see how it augments your human code reviews. Or maybe try an open source tool like E2B in a sandbox project to let an AI agent execute and test its own code. The goal is to find what actually reduces your team's burden versus what adds more noise.
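
If you want to feel out the sandboxing idea before committing to a platform, the sketch below shows the core concept (run the generated snippet in a separate, isolated process with a hard timeout) using only the Python standard library. To be clear, this is not E2B's API; a real sandbox adds containerization, network restrictions, and resource limits, and the function name here is purely illustrative.

# sandbox_run.py -- a bare-bones stand-in for the idea behind tools like E2B:
# execute AI-written code in isolation and capture the result, rather than
# trusting it inside your main environment.
import subprocess
import sys
import tempfile
from pathlib import Path

def run_untrusted(code: str, timeout: int = 10) -> subprocess.CompletedProcess:
    """Write the snippet to a scratch dir and run it in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as scratch:
        script = Path(scratch) / "snippet.py"
        script.write_text(code)
        return subprocess.run(
            [sys.executable, "-I", str(script)],  # -I: isolated mode, no user site-packages
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=scratch,
        )

if __name__ == "__main__":
    result = run_untrusted("print(sum(range(10)))")
    print("exit code:", result.returncode)
    print("stdout:", result.stdout.strip())

Even this crude version is enough to catch generated snippets that crash, hang, or print something unexpected before they ever reach a human reviewer.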

Looking ahead, the industry may evolve toward greater AI automation in the code validation process. Multi-agent AI systems that autonomously handle compiling, testing, debugging, and security scanning might become commonplace. AI could increasingly manage its own quality assurance, freeing developers to focus more on strategic oversight rather than tactical corrections. For now, however, people matter, and arguably always will. Tomorrow's developers might write fewer lines of direct code but will spend more time defining specifications, constraints, and acceptance criteria that AI-driven systems must follow.

Matt Asay

Matt Asay runs developer marketing at Oracle. Previously Asay ran developer relations at MongoDB, and before that he was a Principal at Amazon Web Services and Head of Developer Ecosystem for Adobe. Prior to Adobe, Asay held a range of roles at open source companies: VP of business development, marketing, and community at MongoDB; VP of business development at real-time analytics company Nodeable (acquired by Appcelerator); VP of business development and interim CEO at mobile HTML5 start-up Strobe (acquired by Facebook); COO at Canonical, the Ubuntu Linux company; and head of the Americas at Alfresco, a content management startup. Asay is an emeritus board member of the Open Source Initiative (OSI) and holds a JD from Stanford, where he focused on open source and other IP licensing issues. The views expressed in Matt's posts are Matt's, and don't represent the views of his employer.
