Something changed in how I think about building with AI. It happened gradually, then all at once.
Last month, I watched a system I built spend forty minutes researching a topic, hitting dead ends, backtracking, trying different search queries, and eventually producing an analysis I genuinely couldn't have written faster myself. I didn't tell it how to recover from those dead ends. It figured that out.
That experience shifted something for me.
The Limits of Asking Nicely
For two years, most of us have been treating language models like very smart interns who need extremely detailed instructions. We craft a prompt. We send it. We get something back. If it's wrong, we rewrite the prompt and try again.
This works surprisingly well for straightforward tasks. Need a summary? Done. Want help with an email? Easy.
But I kept running into walls.
When I asked a model to "write a market analysis," I was really asking it to research the landscape, identify key players, find recent developments, synthesize patterns, and present conclusions coherently. That's five or six distinct cognitive tasks, compressed into one generation step. No wonder the results often felt shallow.
The model wasn't stupid. I was asking it to do too much at once.
What Changes When Models Can Loop
The fix seems obvious in retrospect: let the model work in stages.
Instead of asking for a finished analysis, I started asking for a research plan first. Then I'd let it execute that plan step by step, checking results as it went. When something didn't work, it could try a different approach.
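The staged approach above can be sketched as a small control loop. `ask_model` is a hypothetical stand-in for any model call (it is stubbed here so the orchestration logic runs on its own), and the retry logic is a simplified illustration, not a specific framework's API:

```python
def ask_model(prompt: str) -> str:
    """Placeholder for a real model call (any provider, any SDK)."""
    return f"[model response to: {prompt[:40]}]"

def run_staged(task: str, max_retries: int = 2) -> list[str]:
    # Stage 1: ask for a plan instead of a finished answer.
    plan = ask_model(f"List the research steps needed for: {task}")
    steps = [s for s in plan.splitlines() if s.strip()] or [plan]

    results = []
    for step in steps:
        # Stage 2: execute step by step, re-prompting when a step
        # comes back empty or unusable instead of giving up.
        for attempt in range(max_retries + 1):
            out = ask_model(f"Do this step: {step} (attempt {attempt})")
            if out:  # real code would validate the result, not just truth-check
                results.append(out)
                break
    return results
```

The point is structural: the plan becomes inspectable data, and failure handling lives in the loop rather than inside one giant prompt.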
This is what people mean by "agentic workflows." The model doesn't just respond. It pursues a goal across multiple steps, making decisions along the way about what to do next.
I've found it helpful to think about this as a spectrum:
Level one: The model decides what words to generate. This is standard prompting.
Level two: The model decides which tools to use and when. It might search the web, run some code, or query a database based on what it encounters.
Level three: The model decides how to structure the problem itself. It might break a task into subtasks, realize it needs information it doesn't have, or change its approach entirely when something isn't working.

Most of the interesting applications I've seen recently operate somewhere between levels two and three.
Four Patterns That Keep Showing Up
Talking to teams building these systems, I've noticed the same design patterns appearing over and over.
Reflection
The simplest pattern, and maybe the most underrated: after generating something, the model reviews its own work and revises it.
I tested this on a coding task last week. Single-pass generation got the answer right about 60% of the time. Adding a simple "check your work and fix any errors" step pushed that to nearly 85%. The model caught its own mistakes.
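A minimal sketch of that reflect-and-revise loop looks like this. `ask_model` is again a hypothetical placeholder, stubbed with canned responses so the loop itself is runnable; the "OK" convention for a clean critique is my own illustration:

```python
def ask_model(prompt: str) -> str:
    """Stub: critiques flag a problem until the draft has been revised."""
    if prompt.startswith("CRITIQUE"):
        return "OK" if "revised" in prompt else "Found an off-by-one error."
    return "revised draft" if prompt.startswith("Fix") else "first draft"

def generate_with_reflection(task: str, max_rounds: int = 2) -> str:
    draft = ask_model(task)
    for _ in range(max_rounds):
        # Ask the model to check its own work before anyone else sees it.
        critique = ask_model(f"CRITIQUE: check this work for errors:\n{draft}")
        if critique.strip() == "OK":
            break  # reviewer pass found nothing to fix
        # Feed the critique back in and regenerate.
        draft = ask_model(f"Fix these problems:\n{critique}\n\nOriginal:\n{draft}")
    return draft
```

Bounding the loop with `max_rounds` matters in practice: without it, a model that keeps finding nitpicks can revise forever.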
Andrew Ng has been talking about this publicly. His claim: the performance gains from adding iteration often exceed what you'd get from upgrading to a more capable model. My informal experiments support this, though I'd want to see more rigorous benchmarks.
Tool Use
Language models know a lot, but their knowledge has a cutoff date, and they can't verify claims against reality. Giving them access to search, databases, and code execution changes what's possible.
I've started thinking of tools as the model's connection to ground truth. Without them, it's reasoning in a vacuum. With them, it can check its assumptions.
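Level-two agency boils down to a dispatch step: the model names a tool, the harness executes it and feeds the result back. The tool names and the `name: argument` format below are illustrative, not any particular vendor's tool-calling API:

```python
def search_web(query: str) -> str:
    return f"search results for {query!r}"  # stub for a real search API

def run_python(code: str) -> str:
    # Toy executor for the sketch; a real system needs sandboxing,
    # never a bare eval on model-generated code.
    return str(eval(code))

TOOLS = {"search": search_web, "python": run_python}

def dispatch(model_action: str) -> str:
    # Expect the model to emit lines like "search: latest GPU prices"
    # or "python: 2 + 2"; unknown tools return an error the model can read.
    name, _, arg = model_action.partition(":")
    tool = TOOLS.get(name.strip())
    return tool(arg.strip()) if tool else f"unknown tool {name!r}"
```

Returning the error string to the model, rather than raising, is deliberate: it lets the model notice a bad tool choice and try something else.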
Planning
Complex tasks need decomposition. A planning step asks the model to think through how it will approach a problem before diving in.
This feels awkward at first. Why spend tokens on planning when you could just start working? But I've found that explicit planning catches structural problems early. The model notices when a task has dependencies, or when it needs information it doesn't have yet.
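One concrete payoff of an explicit plan: dependencies become visible data before any work starts. In this sketch a plan is just a mapping from each step to its prerequisites (step names are hypothetical), and the standard library's `graphlib` orders them:

```python
from graphlib import TopologicalSorter

# A market-analysis plan as step -> prerequisite steps.
plan = {
    "synthesize": {"find_players", "recent_news"},
    "find_players": {"scope_market"},
    "recent_news": {"scope_market"},
    "scope_market": set(),
}

# Prerequisites always come before the steps that need them;
# a cycle (a structurally impossible plan) raises CycleError here.
order = list(TopologicalSorter(plan).static_order())
```

This is the kind of structural problem L-three agency catches early: if the model's plan contains a cycle or a step with no available inputs, you find out before spending tokens executing it.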
Multi-Agent Collaboration
Some teams are experimenting with multiple specialized agents working together. One agent researches while another critiques. One generates while another evaluates.
I'm less convinced this is always necessary. For many tasks, a single agent with good tools and reflection seems sufficient. But for genuinely complex problems with competing considerations, the multi-agent setup can surface perspectives that a single model might miss.
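The generator/critic split can be sketched as two stubbed "agents" with different roles, run in a short loop. The roles, wording, and stopping rule here are illustrative, not a specific multi-agent framework's design:

```python
def generator(task: str, feedback: str = "") -> str:
    """One agent drafts; feedback from the critic triggers a revision."""
    return f"draft of {task}" + (" (revised)" if feedback else "")

def critic(draft: str) -> str:
    """A second agent evaluates; empty string means it is satisfied."""
    return "" if "(revised)" in draft else "too shallow"

def collaborate(task: str, rounds: int = 3) -> str:
    draft, feedback = generator(task), ""
    for _ in range(rounds):
        feedback = critic(draft)
        if not feedback:
            break  # critic has nothing left to object to
        draft = generator(task, feedback)
    return draft
```

Note how close this is to single-agent reflection: the difference is only that the critique comes from a separately prompted (or separately specialized) agent rather than the generator reviewing itself.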
The Uncomfortable Parts
I should be honest about what isn't working yet.
Debugging is genuinely hard. When a multi-step workflow produces a wrong answer, figuring out where things went wrong requires tracing through a sequence of reasoning steps. The "stack trace" is natural language, not code. I've spent hours on problems that would take minutes to diagnose in traditional software.
Costs add up fast. My agentic workflows typically use 5-8x more tokens than equivalent single-pass prompts. For prototyping, that's fine. For production applications with real volume, the economics get tricky.
Predictability drops. The same input might produce different outputs on different runs. The model might take different paths, use different tools, or make different intermediate decisions. For some applications, this variability is a feature. For others, it's a serious problem.
Evaluation is an open question. How do you measure whether an agentic system is "good"? Accuracy on the final answer captures some of it, but misses the quality of the reasoning process, the efficiency of the approach, and the robustness to edge cases.
Teams I trust recommend starting simple. Get reflection working before adding tools. Master single-agent patterns before attempting multi-agent setups. Build in observability from the start, because you will need to understand what these systems are actually doing.
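The observability advice can start very small: wrap every model and tool call so the full trace is captured as structured events you can inspect after a bad run. This is a minimal sketch with a stubbed `llm` call, not a substitute for a real tracing tool:

```python
import time

TRACE: list[dict] = []

def traced(kind: str):
    """Decorator that records each call's inputs, output, and latency."""
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.monotonic()
            out = fn(*args, **kwargs)
            TRACE.append({
                "kind": kind,
                "fn": fn.__name__,
                "args": [str(a)[:80] for a in args],   # truncated for log size
                "output": str(out)[:80],
                "ms": round((time.monotonic() - t0) * 1000, 2),
            })
            return out
        return inner
    return wrap

@traced("model")
def llm(prompt: str) -> str:
    return f"answer to {prompt}"  # stub for a real model call

llm("why did step 3 fail?")
```

Because the "stack trace" of an agentic system is natural language, capturing every intermediate prompt and output is often the only way to replay what the system was actually thinking.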
Where I Think This Is Heading
I don't want to make predictions I can't back up. But I'll share what I'm paying attention to.
The barrier to building these systems has dropped dramatically in the past six months. Frameworks like LangGraph and AutoGen handle orchestration plumbing that used to require custom engineering. What took weeks now takes days, sometimes hours.
This means more people are experimenting. More experiments means faster learning about what works.
I'm also noticing a shift in how I think about problems. I used to ask "how do I write a prompt that gets a good answer?" Now I ask "how do I design a process that reliably produces good answers?" The unit of design is expanding from the individual prompt to the overall workflow.
That feels like a meaningful change.
The tools are getting better. The real question is whether we're getting better at using them.