Modernizing legacy software (often massive, undocumented “brownfield” projects in languages like COBOL or even older RPG in all its beautiful, different versions) is one of the toughest disciplines in software engineering. The promise of “AI agents” is tantalizing: Can autonomous AI agents automate this exhausting modernization process?
I watched some videos on YouTube (see the end of this article) and reflected on them against my own experience. My answer, and the answer from the experts behind recent Stanford studies and leading AI engineering firms (such as OpenHands and HumanLayer), is a big YES, but not in the way most people think.
Simply unleashing AI agents on an old codebase and hoping for a miracle is a recipe for disaster. Successful Agentic Software Modernization requires a fundamental shift in modernization workflows: away from vibe coding towards disciplined preparation and execution.
Based on current findings from the field, here are the essential Do’s and Don’ts for deploying AI agents in software modernization.
The Core Problem: The Context Bottleneck
Before diving into the topic, we must understand the central constraint. AI models (LLMs) are “stateless.” They only know what exists in their current context window.
In complex legacy systems, it is impossible to cram the entire context (millions of lines of code, dependencies, business logic) into this window. When the window becomes too full (according to Dex Horthy of HumanLayer, often above ~40% utilization), the model enters the ‘Dumb Zone’ where response quality degrades rapidly and hallucinations increase.
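To make this constraint tangible, here is a tiny, illustrative guard you could put into an agent harness. The ~40% threshold is the rule of thumb quoted above; the function and parameter names are my own and not from any specific tool.

```python
def in_smart_zone(used_tokens: int, window_tokens: int, threshold: float = 0.40) -> bool:
    """Rough guard: stay below ~40% context utilization before quality degrades."""
    return used_tokens < threshold * window_tokens

# Example: a 200k-token window with 90k tokens already used
print(in_smart_zone(90_000, 200_000))  # False -> 45% used, time to compact the context
```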
While approaches like Retrieval-Augmented Generation (RAG) and agentic search (essentially smart use of glob and grep) help with larger datasets, they face a more fundamental problem identified by recent Stanford studies: Entropy. When AI agents work within existing low-quality codebases, the produced code mirrors the low standards of the existing environment (leading to a death spiral).
Consequently, more and more harmful code is produced in a short amount of time, effectively automating the creation of technical debt. The art of Agentic Software Modernization, therefore, must not be about generating more legacy code faster, but about surgically managing the AI agent’s access to the right context to understand and improve the system.
Emerging Practices for Agentic Modernization Workflows
1. Implement a RPI Workflow (Research, Plan, Implement)
The biggest trap is letting the agent code immediately. Instead, the process should be divided into strict phases:
- Phase 1: Research (Understanding): The agent analyzes only the existing codebase to understand how a feature works. The output is not code, but a summary (e.g., a Markdown document) explaining where the relevant logic resides. This is also where you as a developer can participate: data-driven approaches like Software Analytics help you apply rigorous data science practices when analyzing software systems at scale. You can also enrich the codebase or the summary beforehand to guide the agent through it (e.g., signaling where outdated parts and no-go areas are, and which code already fits the current vision of the system).
- Phase 2: Plan (Intent Compression): Based on the research, the agent creates a detailed plan of which files need to be changed and how. This plan represents the “compressed intent” of the modification. A personal tip: scope these activities down tightly. In the best case, you can even switch from agentic workloads to rule-based search-and-replace workloads, letting the agent craft change recipes that are then executed deterministically instead of non-deterministically.
- Phase 3: Implement (Coding): Only now does the agent change or write code, based strictly on the approved plan or on your deterministic transformation rules.
Why should you do this? If the plan is missing, wrong, or too vague, 1,000 lines of generated code are worthless, and reviewing the resulting sloppy changes is tedious. So invest human intelligence in reviewing the research and planning steps: create alignment between AI and human early on, not only at the end when reviewing the final code.
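To make the three phases concrete, here is a minimal sketch of an RPI driver. It assumes a placeholder `run_agent` function that you would wire up to your agent CLI or API of choice; the file names (`research.md`, `plan.md`), the `legacy/` path, and the example feature are purely illustrative.

```python
from pathlib import Path

def run_agent(prompt: str, context_files: list[Path]) -> str:
    """Placeholder: call your coding agent (CLI or API) with a prompt and a
    deliberately restricted set of context files, return its text output."""
    return f"[agent output for: {prompt[:60]}...]"  # replace with a real call

def research(feature: str, code_root: Path) -> Path:
    """Phase 1: produce a research summary, not code."""
    summary = run_agent(
        f"Explain where and how '{feature}' is implemented. "
        "Output a Markdown summary with file paths and line references. "
        "Do NOT write or change any code.",
        context_files=sorted(code_root.rglob("*.cbl")),  # e.g. COBOL sources
    )
    out = Path("research.md")
    out.write_text(summary)
    return out

def plan(research_doc: Path) -> Path:
    """Phase 2: compress the intent into a short, reviewable plan."""
    plan_text = run_agent(
        "Based only on this research, list the files to change and the exact "
        "changes per file. Keep it short enough for a human to review.",
        context_files=[research_doc],
    )
    out = Path("plan.md")
    out.write_text(plan_text)
    return out

def implement(plan_doc: Path) -> str:
    """Phase 3: change code strictly according to the approved plan."""
    return run_agent(
        "Apply exactly the changes described in this plan. Touch no other files.",
        context_files=[plan_doc],
    )

if __name__ == "__main__":
    research_doc = research("interest calculation", Path("legacy/"))
    input("Review research.md, then press Enter to continue... ")  # human gate
    plan_doc = plan(research_doc)
    input("Review plan.md, then press Enter to implement... ")     # human gate
    implement(plan_doc)
```

The two `input()` calls are the whole point: human review happens on the research and the plan, where a correction is still cheap.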
2. Use Iterative Refinement Techniques for Continuous Feedback
Never attempt a complex migration (e.g., a COBOL file to Java) in a single “One-Shot” prompt. The team at OpenHands demonstrated that this almost always leads to hallucinations.
Instead, use an iterative loop with specialized roles:
- Engineer Agent: Attempts to solve the task (e.g., migrating code).
- Critic Agent: A separate agent that only reads. It analyzes the generated code, runs tests, and provides harsh feedback (scores).
The process runs in loops: the Engineer delivers -> the Critic evaluates and returns feedback -> the Engineer improves -> repeat until a quality standard is met.
I’m personally interested in automating as much of this as possible by providing immediate feedback within the agentic loop: use signals like compilation errors, code duplication detection, and architectural violation checks as key levers to guide the agent mechanically.
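Here is a minimal sketch of such a loop, assuming placeholder `engineer` and `critic` agent calls; the only concrete part is the mechanical feedback step, which simply runs `javac` on the generated file and feeds compiler errors back into the loop.

```python
import subprocess
from pathlib import Path

MAX_ROUNDS = 5
SCORE_THRESHOLD = 8  # acceptance score (0-10); purely illustrative

def engineer(task: str, feedback: str) -> str:
    """Placeholder: the 'Engineer' agent produces or revises code."""
    raise NotImplementedError("wire this up to your agent of choice")

def critic(code: str, mechanical_report: str) -> tuple[int, str]:
    """Placeholder: the read-only 'Critic' agent scores the code and explains why."""
    raise NotImplementedError("wire this up to your agent of choice")

def mechanical_checks(java_file: Path) -> str:
    """Deterministic feedback: compiler errors are cheap, objective signals."""
    result = subprocess.run(["javac", str(java_file)], capture_output=True, text=True)
    return result.stderr or "compiles cleanly"

def refine(task: str, target: Path) -> None:
    feedback = ""
    for round_no in range(1, MAX_ROUNDS + 1):
        code = engineer(task, feedback)          # Engineer delivers
        target.write_text(code)
        report = mechanical_checks(target)       # deterministic signals
        score, feedback = critic(code, report)   # Critic evaluates
        print(f"round {round_no}: score={score}")
        if score >= SCORE_THRESHOLD and "compiles cleanly" in report:
            return                               # quality bar met
    raise RuntimeError("quality bar not reached; escalate to a human")
```

Duplication detectors or architecture fitness checks would slot into `mechanical_checks` the same way as the compiler call.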
3. Invest in “Codebase Hygiene” First
AI is not a magic wand that turns bad code into good code. The Stanford study by Yegor Denisov-Blanch shows a clear correlation: In clean environments (high test coverage, good modularity, typing), AI can autonomously drive a large share of sprint tasks.
In “dirty” environments (high entropy, technical debt), the AI struggles, produces more errors, and can actually accelerate technical debt (the “Rework” trap). Before scaling AI, you must clean up the foundation. This is what we developers have felt for decades: clean code amplified dev productivity, and now AI amplifies those gains.
Personally, I’m a big fan of enriching codebases with more semantic meaning. Renaming cryptic one-letter variables to reflect the actual technical or business domain is a high-leverage move and in many cases a no-brainer. Extracting higher-level concepts or refactoring towards well-known patterns and idioms is also something I’m very into. Most of these activities are safe refactorings, meaning they usually don’t break the code (unless you’re storing code or class names in the database or using reflection voodoo in Java).
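As a small illustration of the “find the cryptic names” part, here is a heuristic scan that flags single-letter local variable declarations in Java sources as rename candidates. The regex is my own assumption and deliberately crude; actual renames should go through an AST- or IDE-based refactoring so that all references are updated safely.

```python
import re
from pathlib import Path

# Heuristic: a typed declaration followed by a single-letter name.
# This only surfaces candidates; it does not rewrite anything.
DECL = re.compile(r"\b(?:int|long|double|boolean|String|var)\s+([a-z])\b\s*[=;]")

def rename_candidates(root: Path) -> list[tuple[Path, int, str]]:
    hits = []
    for java_file in root.rglob("*.java"):
        lines = java_file.read_text(errors="ignore").splitlines()
        for line_no, line in enumerate(lines, start=1):
            for match in DECL.finditer(line):
                hits.append((java_file, line_no, match.group(1)))
    return hits

if __name__ == "__main__":
    for path, line_no, name in rename_candidates(Path("src/")):  # path is illustrative
        print(f"{path}:{line_no}: cryptic variable '{name}'")
```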
4. Practice Active Context Compaction
When an agent strays off the path, the human impulse is often to correct it within the same chat (“No, do it differently,” “That was wrong”). This is a mistake. Every failed attempt clutters the context window with “noise.”
A better approach is active context compaction (or as I call it: context reset and starting over):
- Have the agent summarize the current state and findings into a compact file (state.md or the like).
- Start a completely new chat with a fresh context.
- Feed in only the summary as the starting point.
This keeps the agent in the “Smart Zone” of its context window.
I actually need to do this for some side projects. There I’m using SOTA models from DeepSeek, Minimax or Moonshot with Claude Code. I find them really refreshing, but they are more limited regarding the context window. So my workflow needs active context compaction and rigid, external management of the current state and the next steps to get good results from these LLMs.
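A minimal sketch of how this compaction step could look in code, assuming a hypothetical `ask_agent` callable, a `new_agent_session` factory, and a `session.send` method from whatever agent SDK or CLI wrapper you use; none of these are a real API. The file name `state.md` mirrors the example above.

```python
from pathlib import Path

STATE_FILE = Path("state.md")

def compact_context(ask_agent) -> None:
    """End of a (possibly derailed) session: keep the findings, drop the noise."""
    summary = ask_agent(
        "Summarize the verified findings, the current state of the task, and the "
        "concrete next steps as a short Markdown document. "
        "Leave out failed attempts and dead ends."
    )
    STATE_FILE.write_text(summary)

def start_fresh_session(new_agent_session):
    """New chat, empty history: only the compact summary goes back in."""
    session = new_agent_session()          # fresh, empty context window
    session.send(STATE_FILE.read_text())   # seed it with the compressed state only
    return session
```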
5. Maintain “Traceability Links”
When migrating legacy code (e.g., COBOL to Java), the connection to the original business logic must never be lost. OpenHands recommends that the agent insert comments in the new code that link exactly to the line numbers of the old code where that logic originated. This is essential for future debugging and audits.
When I use, e.g., graph analytics on the whole codebase or create flowcharts for interesting parts of the code, I also want to make sure those results are correct. For this, I have the generated outputs include plain line numbers, identifiers, or file names so that I can quickly check that they are not hallucinated and that I have something concrete to work with.
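Here is a sketch of what such traceability links and a cheap sanity check could look like; the comment format, the file name, and the line range are illustrative, not a prescribed convention from the OpenHands talk.

```python
from dataclasses import dataclass

@dataclass
class TraceLink:
    source_file: str   # original COBOL member
    start_line: int    # first line of the originating paragraph/section
    end_line: int

def trace_comment(link: TraceLink) -> str:
    """Render a traceability comment to put above the migrated Java method."""
    return (f"// MIGRATED-FROM: {link.source_file} "
            f"lines {link.start_line}-{link.end_line}")

def verify_link(link: TraceLink) -> bool:
    """Cheap hallucination check: does the referenced line range actually exist?"""
    try:
        with open(link.source_file, encoding="utf-8", errors="ignore") as f:
            total_lines = sum(1 for _ in f)
    except FileNotFoundError:
        return False
    return 1 <= link.start_line <= link.end_line <= total_lines

# Example (illustrative file name and line numbers):
link = TraceLink("PAYROLL.CBL", 1200, 1287)
print(trace_comment(link))
print("link is verifiable:", verify_link(link))
```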
Traps to Avoid
1. Falling into the “Vibe Coding” Trap
“Vibe Coding” describes the back-and-forth chatting with a model, guided more by feelings than by specifications (“Make that prettier,” “No, that feels wrong”). This leads to bloated context windows and confused models. AI Engineering in legacy system environments requires precision, not “vibes.”
So don’t get lost in trying to convince an AI agent to work on legacy code as if it were a greenfield project: the agent sees years of old habits manifested in code and needs different guidance.
2. Underestimating “Rework”
The Stanford studies (listed below) clearly show that while AI tools increase output (more Pull Requests), they often dramatically increase rework: the time developers spend repairing or rewriting AI-generated code. If you only look at speed/volume, you miss the massive cost of quality assurance.
As mentioned above, try to automate as much of this feedback as possible to reduce the manual rework.
3. Relying Blindly on Line-by-Line Code Reviews
In a world where an agent can generate 20,000 lines of TypeScript code in minutes, traditional human line-by-line review is no longer scalable.
Do not rely solely on reviewing the final product. The Hierarchy of Leverage from Dex Horthy states: 1 Bad Line of Plan == 100 Bad Lines of Code. Shift the focus of human review “left”, to the research results and the plan, before the code is even written.
I like to go even a step further: in the research stage, look at how to reach the refactoring spots systematically. If you can derive rule-based changes from that, most of the resulting edits are structurally identical across the whole codebase. This means you don’t have to review line by line, but change pattern by change pattern, which is very efficient to do in a short amount of time.
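Here is a sketch of that change-recipe idea: each recipe is defined once (possibly drafted by an agent during the research phase) and then applied deterministically across the whole codebase, with hits grouped per recipe so the review happens pattern by pattern. The two recipes shown are simplified illustrations of mine; real ones would also need import and semantics checks.

```python
import re
from collections import defaultdict
from pathlib import Path

# Illustrative recipes: name -> (pattern, replacement). In practice an agent
# would propose these once, and a human would approve them before rollout.
RECIPES = {
    "stringbuffer-to-stringbuilder": (re.compile(r"\bStringBuffer\b"), "StringBuilder"),
    "new-vector-to-arraylist": (re.compile(r"\bnew Vector<"), "new ArrayList<"),
}

def apply_recipes(root: Path, dry_run: bool = True) -> dict[str, list[str]]:
    """Apply every recipe to every file; report hits grouped per recipe so a
    reviewer can sign off change pattern by change pattern."""
    hits: dict[str, list[str]] = defaultdict(list)
    for java_file in root.rglob("*.java"):
        text = java_file.read_text(errors="ignore")
        new_text = text
        for name, (pattern, replacement) in RECIPES.items():
            if pattern.search(new_text):
                hits[name].append(str(java_file))
                new_text = pattern.sub(replacement, new_text)
        if not dry_run and new_text != text:
            java_file.write_text(new_text)
    return hits

if __name__ == "__main__":
    for recipe, files in apply_recipes(Path("src/")).items():  # dry run by default
        print(f"{recipe}: {len(files)} file(s) affected")
```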
4. Expecting Magic in Niche Languages
AI model performance depends heavily on training data. For popular languages (Python, Java, JS), they work excellently. For niche languages or very old dialects (specific COBOL variants, obscure DSLs, and my new favorite one: RPG), using AI can actually decrease productivity according to Stanford data, because the agent hallucinates and the human spends all their time correcting it.
To know beforehand how to approach a legacy modernization project, I like to look at the corresponding tags on StackOverflow and at the TIOBE programming language popularity index. Every language that has not been in the top 10 in recent years needs a different approach: maybe broader reverse engineering towards specs or tests is needed, or you may even want a more traditional transpiler that converts the niche language into a more popular one that an AI agent can then work with.
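As a toy illustration of that decision rule: the cut-off and the language set below are assumptions of mine, not actual TIOBE or StackOverflow data, so treat this as a placeholder for a real lookup.

```python
# Languages assumed to be well represented in model training data (illustrative).
MAINSTREAM = {"python", "java", "javascript", "typescript", "c", "c++", "c#", "go", "sql"}

def modernization_approach(language: str) -> str:
    """Toy decision rule: popular language -> agentic migration; niche language ->
    reverse-engineer specs/tests or transpile to a mainstream language first."""
    if language.lower() in MAINSTREAM:
        return "agentic migration (e.g., RPI workflow with iterative refinement)"
    return ("reverse-engineer specs/tests first, or transpile to a mainstream "
            "language that an agent can work with")

print(modernization_approach("RPG"))   # niche: take the indirect route
print(modernization_approach("Java"))  # mainstream: agents can work directly
```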
Conclusion: From Coder to Architect of Intent
Agentic Software Modernization works, but it requires discipline. The role of the human developer is shifting: we are becoming less the writers of syntax and more the architects of intent.
Here is how I like to look at the current situation in agentic coding: in greenfield projects, you could see AI agents as overmotivated junior developers. In legacy system environments, however, I think of AI agents as senior developers who are new to your company: they can do amazing things but need decent onboarding, with a step-by-step introduction to the system, the background of the existing code, and a careful walkthrough of the nasty parts over time.
I think that with this image of AI agents and the Do’s and Don’ts above, we can expect fewer complete disasters when using AI agents to tame complex legacy systems. And remember: those who just click “Refactor all this” will end up in chaos.
📚 Sources & Further Watching
- Calvin Smith / OpenHands: Refactoring COBOL to Java with Agentic AI with an Iterative Refinement Workflow
- Topics: Iterative Refinement, Critic Agents, Traceability Links
- Dex Horthy (HumanLayer): Context Engineering SF: Advanced Context Engineering for Agents
- Topics: Hierarchy of Leverage, 1 Bad Line of Plan vs Code
- Dex Horthy (HumanLayer): No Vibes Allowed: Solving Hard Problems in Complex Codebases
- Topics: RPI Workflow (Research, Plan, Implement), Context Compaction
- Yegor Denisov-Blanch (Stanford): Can you prove AI ROI in Software Eng? (Stanford 120k Devs Study)
- Topics: Measuring ROI, The danger of “Rework”, Entropy in codebases
- Yegor Denisov-Blanch (Stanford): Does AI Actually Boost Developer Productivity? (100k Devs Study)
- Topics: Productivity stats, Niche vs. Popular languages, Codebase Hygiene