Mitigating Failures in AI Agents and Multiagent Systems
Probabilistic Outputs, Failure States, and Degrees of Trust
Eric Schmidt, former CEO of Google and now a ubiquitous AI policy talking head1, has given his predictions for AI in the coming (post-summer 2024) 18-24 months:
Infinite context windows: AI models that can have a continuous state that expands indefinitely, into which effectively unlimited tokens can be entered.2
Autonomous Agents: systems built on LLMs and other software that will be capable of going out and taking actions on behalf of users over the web.
Text to Action: “Can you imagine having programmers that actually do what you say you want?” AI systems like Claude 3.5 Sonnet and GPT-4o can already do this up to a point, but end-to-end software development and deployment is still not quite ready. Schmidt thinks this will come soon, with you able to describe a program and get a deployable system in minutes, if not seconds.
This is an essay focused on taking Schmidt’s prognostication seriously, under the rubric of AI agents and multiagent systems. My aim is to give readers a framework for thinking about AI agents - where they can be useful, and what issues arise in using them - and for experimenting to find the right level of evolving trust in these tools as they become more ubiquitous and more capable.
Mistakes
First of all, let’s get it out there: AI agents are going to make mistakes. Imagine these near-future scenarios:
Your AI personal assistant that you tasked with ordering a grocery delivery uses an old recipe you gave it when selecting ingredients, not the new one you prefer
A bidding agent a company tasks with placing bids for a government contract includes the wrong documentation in its bid
An AI radiology system misses a pre-cancerous mass in a patient’s MRI scan
A fully self-driving taxi changes lanes suddenly and has a head-on collision with a delivery truck.
A high-frequency stock trading system mistimes a transaction and loses millions of dollars on a trade
There’s a flippant answer to all this: human beings make mistakes too, and we seem to have survived that just fine.
These mistakes can happen for many reasons. There can be garbage-in, garbage-out problems, where a human makes a mistake and the AI does not have sufficient information, initiative, or permission to halt its action. There can be component problems, where some physical part of a process sends erroneous data, like a faulty sensor. There can even be architectural problems, where a multiagent system is configured incorrectly. And there can be adversarial problems, where hackers - whether motivated by profit or thrills - jailbreak AI systems to go against their programming.
But even if we had perfect means for dealing with all the above, we still have to reckon with the probabilistic nature of the deep learning models on which all our AI agents are going - in part, at least - to be based. Probability and randomness are at the heart of how deep learning systems work, especially in large language models (LLMs) and Transformers. If you have a technical background but don't consider yourself a maths expert, understanding this aspect is key to appreciating both the strengths and limitations of modern AI and the future AI systems that will be developed out of them.
Deep learning, particularly in LLMs, revolves around making educated guesses based on patterns found in enormous datasets. Unlike systems that rely on fixed, deterministic rules, these models operate on flexible, probability-based methods. Here's what that looks like in practice:
Training Process: Assuming you are familiar with neural networks, let's just jump to saying that the parameters (the weights of the ‘neurons’ in the network) are randomly initialized. This randomness allows the model to explore different possible solutions and avoid getting trapped in suboptimal ones. During training, a technique called stochastic gradient descent deliberately uses randomness to enhance learning. One good way to think about this technique is to compare it to packing a box with many differently shaped objects. You could carefully plan out and pack the box, or you could toss everything in and shake it. The shaking pushes the 'system' of items toward a reasonable optimum, and sufficiently vigorous shaking can make the fit better and better over time. You might not achieve the absolutely optimal packing, but you will find good configurations and save time.
Data Sampling: LLMs don’t process all the training data at once. Instead, they randomly sample batches of data in each iteration. This randomness helps the model generalize better, preventing it from overfitting to specific patterns. Handling the data in a fixed order could lead the model to overlearn the specific patterns in the data, rather than finding the hidden variables that determine the distribution of the information.
Token Prediction: When generating text, LLMs don't always choose the most likely next word. Instead, they sample from a probability distribution of possible next words, which adds an element of creativity and unpredictability, making the outputs more varied and human-like (see the sketch after this list for what that sampling looks like in code). This is part of what makes them such excellent explainers of concepts - they can explore multiple ways to explain the same concept, as correct explanations lie within the same distribution space as each other.
Attention Mechanisms: In Transformers, the attention mechanism decides which parts of the input are most relevant when generating an output. This decision is made through probabilistic weighting, which allows the model to focus on different aspects of the input with varying importance. Rather than an all-or-nothing judgement, the model can pay attention to how different parts of input are relevant to different degrees.
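To make the token-prediction point concrete, here is a minimal sketch in Python of temperature-based sampling. The candidate words and scores are made up, and real models work over vocabularies of tens of thousands of tokens, but the principle is the same: turn scores into probabilities, then draw from them rather than always taking the top choice.

```python
import math
import random

def sample_next_token(candidates, logits, temperature=0.8):
    """Toy illustration of probabilistic next-token selection.

    `logits` are the model's raw scores for each candidate token. Instead of
    always taking the highest-scoring token, we turn the scores into a
    probability distribution and draw from it.
    """
    scaled = [score / temperature for score in logits]  # temperature reshapes the distribution
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]      # softmax, shifted for numerical stability
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = random.choices(candidates, weights=probs, k=1)[0]
    return choice, probs

# Hypothetical scores for four candidate continuations of "The sky is ..."
candidates = ["blue", "clear", "falling", "banana"]
logits = [3.2, 2.9, 0.5, -1.0]
token, probs = sample_next_token(candidates, logits)
print([round(p, 3) for p in probs])  # most of the mass sits on the plausible words...
print(token)                         # ...but the pick is still a draw, so reruns can differ
```

Run it a few times and you will occasionally get a different word: that small, deliberate unpredictability is exactly the property the rest of this essay has to reckon with.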
This essentially probabilistic nature of deep learning models is both a strength and a limitation. It helps models handle ambiguity, produce creative responses, and generalize to unseen inputs. But it also means there's always some ineliminable chance of error or unexpected behavior. This uncertainty isn’t a flaw that can be entirely eliminated; it’s an intrinsic feature of how these systems function.
And it is something multiagent AI system designers, and users, are going to have to learn to cope with.
Stacking Probabilities
Any AI agent is going to carry out plans, either singly or together with other agents in a multiagent system. Given that they are irreducibly going to have an error rate, that means we’re going to have the problem of stacking probabilities. When each step has a probability of success of even 99% (only 1 error out of 100 trials), the probability of the whole sequence succeeding, without the need for oversight or correction, decreases at a decent clip.
Out to Step 22, the probability of success declines to ~80%. Good enough for government work, perhaps, but by Step 68 the probability is 50%, and it only gets worse from there.
Fortunately, there are a multitude of ways to tackle the problem of error when working with AI agents. We can use larger models, which have lower rates of error and better task adherence across multiple steps (if you can get to a 99.9999% success rate, it takes 700,000 steps for the probability of completion to drop to 50%). We can use larger context windows, which are capable of keeping more of the overall problem ‘in memory’ at each step. We can add retrieval augmented generation (RAG) to supply the models with additional info rather than asking them to generate purely from their initial setup (this pairs well with larger context windows). And we can use optimized system prompts that keep the system focused. In summary, there are multiple angles from which to attack the accuracy problem.
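If you want to check the arithmetic behind those figures - roughly 80% at step 22, 50% around step 68, and the ~700,000-step half-life at a 99.9999% success rate - it comes down to one line of math: the chance of a flawless run is the per-step success rate raised to the number of steps.

```python
import math

def success_after(steps, per_step_success):
    """Probability that an unsupervised chain of `steps` actions all succeed."""
    return per_step_success ** steps

def steps_to_half(per_step_success):
    """Number of steps before the chance of a flawless run drops to 50%."""
    return math.log(0.5) / math.log(per_step_success)

print(round(success_after(22, 0.99), 3))   # ~0.802 -> roughly 80% by step 22
print(round(success_after(68, 0.99), 3))   # ~0.505 -> roughly 50% by step 68
print(round(steps_to_half(0.999999)))      # ~693147 -> on the order of 700,000 steps
```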
But there will still be failures. This is where design choices come in, specifically designing systems to mitigate failures when they do occur. And here we can draw on a large literature of failsafe designs.
Failure Mitigation Strategies
To give you an idea of the diversity of options we can choose from, here are a few I brainstormed (with the help of a team of LLMs).3
Redundancy / Voting Systems: Implement multiple AI agents, potentially with different algorithms or trained on diverse data sets, to perform the same task independently. By introducing redundancy, if one agent fails or makes an error, others can compensate. The outputs from these agents are compared using a voting or consensus mechanism - such as majority voting, averaging, or selecting the result with the highest confidence score. The system proceeds with the consensus decision, reducing the impact of individual errors and enhancing overall reliability. Agents that were not part of the consensus can be replaced or restarted and given the current context of the consensus decision to start with for the next step. (A minimal sketch of this pattern, combined with confidence thresholds, follows this list.)
Graceful Degradation: Design systems to maintain core functionality even when some components fail, allowing for a graceful reduction in capabilities rather than a complete shutdown. Prioritize critical functions to ensure they continue operating, albeit with reduced performance if necessary. For NPC companions in a game, for instance, the failure of the verbal module could lead the character to fall silent temporarily, while the path-following and combat decision-making systems continue to operate, rather than crashing the whole game.
Confidence-Based Action Thresholds: Implement a mechanism where AI agents report their confidence levels for each decision or action. Only actions with confidence levels above a predefined threshold proceed automatically. If the confidence level falls below this threshold, the agent flags the decision for human review or escalates it to another system for verification. This approach ensures that uncertain or potentially risky decisions receive additional scrutiny.
Pre-Defined Guardrails and Constraints: Establish hard-coded constraints, guidelines, or ethical boundaries that AI agents must adhere to during training and operation. These guardrails define the limits within which the AI can function, preventing it from engaging in unsafe, unethical, or undesirable behaviors - something especially important in real-time or mission-critical environments. For example, an agent that negotiates could be given a preset budget and not allowed to make offers above that level. Even if the LLM generates an offer that is too high, a deterministic program could catch the error and stop the offer from being made.
Human-in-the-Loop Design: Integrate human oversight into the AI system to monitor performance and intervene when necessary. AI agents can collaborate with human experts, suggesting options but deferring final decisions to humans in complex or uncertain situations. Design clear handoff protocols to ensure smooth transitions between AI and human control, enhancing trust and reliability. Autonomous vehicle taxi companies like Waymo currently implement this design successfully.
Anomaly Detection and Error Correction: Implement monitoring systems to detect unusual patterns, behaviours, or anomalies in AI agent outputs, not just one-off out-of-bounds actions. These systems can trigger alerts, initiate fallback mechanisms, or prompt human intervention when potential failures are detected. They could also be set up to detect more subtle anomalies that manifest over time - like an AI agent that regularly fails at a particular kind of tool deployment or mathematical reasoning.
Hierarchical Control Structures: Organize AI agents within a hierarchical framework where higher-level agents oversee and coordinate the actions of lower-level agents. These higher-level agents break down complex tasks into smaller, manageable subtasks, assigning each to specialized agents. Hierarchical decision-making allows for oversight, error detection, and adjustment at each level. Higher-level agents can reallocate resources, modify strategies, or intervene when lower-level agents encounter problems, enhancing overall system reliability and adaptability.
Staged Decision Making: Rather than a single agent carrying out a workflow from beginning to end, have the first - perhaps most powerful - agent develop the plan, and then parcel the plan out to a sequence of specialized agents that operate sequentially, each handing off its output to the next one's input. The whole sequence can be monitored by a higher-level agent whose sole task is to ensure successful completion of the workflow.
Progressive Exploration: Implement an incremental approach to exploring new solutions, treating each action as an experiment. The AI agent measures performance at each step, pausing and recalibrating if it deviates from acceptable parameters, with the capacity to roll back to earlier stages, incorporating knowledge of the error. This method is particularly useful in high-stakes environments like drug discovery or scientific research, where controlled exploration can lead to innovative solutions, but where expensive failures need to be avoided.
Failure Handling Agents: Design specialized agents dedicated to handling specific types of failures. If the main AI system encounters a problem, a "recovery agent" steps in to diagnose the issue, attempt rerouting, or facilitate a handoff to a human operator. These failover agents are typically simpler but are crucial for restoring the system to a safe or known state, enhancing resilience.
Multi-Agent Negotiation and Check-In Systems: In collaborative environments involving multiple agents, implement systems where agents periodically negotiate their states and progress, and check-in with each other, similar to a Scrum session in Agile project management. If one agent identifies a problem, others can dynamically adjust their plans, redistribute tasks, or compensate for the failure. This cooperative strategy reduces the impact of individual agent failures and enhances overall system robustness.
Adversarial Testing: Set up adversary AI agents and multiagent networks, whose sole job is to test production systems, either before deployment or at random intervals during live operation.
Time-Boxed Operation: Define strict time limits for AI operations. If a task isn't completed within the allocated time frame, trigger a review or fallback mechanism. This prevents the system from getting stuck in unproductive states and ensures timely responses, which is critical in real-time applications.
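To make a couple of these concrete - the redundancy/voting idea and the confidence-based thresholds - here is a minimal sketch in Python. It is not any particular framework's API; the agents are stand-ins for real LLM calls and the escalation path is a placeholder. The control flow is the point: collect independent answers, vote, and only act automatically when both the vote and the confidence clear a bar.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentAnswer:
    value: str         # the agent's proposed action or answer
    confidence: float  # the agent's self-reported confidence, 0.0-1.0

def decide(agents: List[Callable[[str], AgentAnswer]],
           task: str,
           min_votes: int = 2,
           min_confidence: float = 0.7) -> str:
    """Run the same task past several independent agents, take the majority
    answer, and escalate to a human if the vote or confidence is too weak."""
    answers = [agent(task) for agent in agents]
    votes = Counter(a.value for a in answers)
    top_value, top_count = votes.most_common(1)[0]
    supporters = [a for a in answers if a.value == top_value]
    avg_conf = sum(a.confidence for a in supporters) / len(supporters)

    if top_count >= min_votes and avg_conf >= min_confidence:
        return top_value                     # consensus is strong enough to act on
    return escalate_to_human(task, answers)  # otherwise flag for human review

def escalate_to_human(task, answers):
    # Placeholder: in a real system this might open a ticket or page an operator.
    return f"NEEDS REVIEW: {task} ({[(a.value, a.confidence) for a in answers]})"

def stub(value, conf):
    """Stand-in for a real LLM-backed agent; always returns the same answer."""
    return lambda task: AgentAnswer(value, conf)

agents = [stub("approve", 0.9), stub("approve", 0.8), stub("reject", 0.6)]
print(decide(agents, "refund order #123"))  # -> "approve" (2 votes, avg confidence 0.85)
```

Note that the wrapper around the agents is plain, deterministic code: the probabilistic models propose, and ordinary software decides whether their proposal is allowed to proceed.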
And these are just a few of the ideas I brainstormed (with assistance from ChatGPT, Claude Sonnet, and Gemini). All of them can be mixed and matched. My point is that there is a wealth of options for system engineers - and you, personally - to use when coming up with agent and multiagent designs.
Which methods, or collections of methods, are appropriate for which contexts is a matter of human judgement. The fundamental thing to remember here, both for those designing systems and for those using them, is that “technological failure” is almost always organizational failure: some human judgment was made to deploy X technology in context Y. Whether something is ‘safe’ is a judgement call by a group of people, and that judgement can be right or wrong. For certain purposes, it is okay for a technology to be experimental and high-risk. For others, it's vital that the deployed technology be as close to fault-proof as it can be made.
I can’t give any kind of prior judgement on which of these techniques are appropriate for which contexts. It’s going to depend on what the precise AI technologies available at the time of development are, their strengths and weaknesses, and the nature of the task at hand. Certain highly robust failure mitigation techniques - like redundancy and confidence-based action thresholds - would be inappropriate for controlling a video game NPC but would be ideal for handling safety monitoring in a hydroelectric dam.
Why Explainability is Less Important than Reliability
One of the public talking points about deep learning-based AI models is that they are “black boxes” - that we really can’t know how they arrive at the outputs they do. This is in contrast to traditional software, where it is possible - if very painful and tedious (I speak from experience) - to trace out exactly how outcome Y followed from input X. This is due to the same features of deep learning models we listed above: the sheer size of the networks, and their probabilistic operations, make it unfeasible to get perfect answers to questions about why they operate as they do. And the further claim is that until these models can be made to ‘explain’ themselves, or the reasons for their outputs made interpretable, this is a hard stop on the adoption of AI systems.
It’s clear why people would want explainable AI: for trust and transparency, for regulatory compliance, for bias detection, and for model improvement. Of all of these, I think trust is the most important. But when you think hard about the concept of trust, it’s not clear that explainability is what you need for trust. We just don’t trust each other based on our ability to explain why we do what we do. We trust one another because of experience.
Mustafa Suleyman raised a good point about this at the Aspen Ideas Festival. He was discussing new radiology models that exceed human experts at interpreting MRI images, detecting cancers and other conditions that experienced professionals missed, with fewer false positives than those same human experts. Why would it matter, Suleyman asked, when it comes to trusting the system, exactly how it is doing what it is doing, when even the human experts cannot really tell you that? They cannot say what is going on in their own brains that helps them make the judgments they do, other than lots and lots of exposure to MRI images. And exposure to images is precisely what the deep learning model has - only the deep learning model has orders of magnitude more experience. What makes you trust someone in that position, Suleyman says, is their accuracy, not their ability to explain their reasoning. Especially because you are unlikely to be in a position to contest their explanation. Over time, Suleyman said, we will come to trust these systems as we see that they are highly accurate, and more accurate than humans.
I think Suleyman is right, and we shouldn’t let issues with explainability get in the way of deploying AI agents for valuable purposes, especially any area where the costs of failure are low or easily recoverable (think of getting a second opinion in medicine). Smart system designs, like those listed above, by breaking up tasks into smaller components, can also make it easier to diagnose where an error is happening and correct it. Again, we’re not talking about using a single model and trusting it completely, we’re talking about building networks of agents that operate together - by being modular, these systems will be more easily diagnosable and controllable. The precise reason for a failure in a particular node of the system is not that important if we can swap it for a more reliable part.
Levels of Trust
I want to build on Suleyman’s idea about trust gain over time. Trust in AI agents, like trust in people, is not a binary state; it evolves across levels as systems prove themselves in increasingly complex contexts.
I propose the following framework for developing trust in AI agents, as you use tools built by others or develop solutions of your own (a toy encoding of the levels in code follows the list):
Level 1: Human review of all outputs. This is the current state we are at with systems like ChatGPT or Claude. The models are capable of impressive feats of coding, writing, and reasoning, but users ought to check and verify their outputs before deploying them. If you’re asking ChatGPT to build you some code based on an API, you should run tests on that code rather than trusting the AI to check its own work.
Level 2: Trust in low-level outputs while reviewing higher-level syntheses. Imagine a unit tasked with monitoring an enemy country’s WMD program. They might trust their agents to run searches and reviews of each of their channels of info (ELINT, HUMINT, SIGINT, IMINT, GEOINT, OSINT, and so on) but not trust any syntheses the system made of the data. At this level of trust, it might be a good idea to have parallel processes: a human team or teams working on the data, and then checking their synthesis of the sources into a judgment against the AI system. Divergences of opinion by either side would at the least be good grounds for discussion and developing new questions to ask.
Level 3: Trust in syntheses but review the plans based on them. Here we would trust the syntheses of the results of lower-level agents - but want to approve any plans for actions in the real world. In drug discovery, this might be trusting the agents with the scientific literature review and candidate molecule generation steps, as well as the plans for how to synthesize the candidates economically, but not allowing it to execute complex chemical syntheses unsupervised. Trust it to have good ideas about what to build, in other words, but don’t trust it to have the best ideas about how to command the robots in the chem lab to carry out the synthesis.
Level 4: Trust the plans while monitoring their implementation. To take an example from logistics, a manager might trust AI agents and systems of them to handle optimizing warehouse layouts and logistic routes, as well as the plans for navigating autonomous trucks and operating cargo loading robots, but the running of the system would be monitored in real time by an operations room, with human operators ready to jump in if anything goes wrong or there are unexpected issues (a rainstorm that knocks out a bridge, a robot that spills marbles on the floor, etc.)
Level 5: Full trust in implementation, with audits for performance over time. Here we trust the agents to not only analyze and synthesize information, to develop plans and execute them, but we allow them to do so without our regular check-ins. Rather, we leave the fleets of agents to operate and review their performance at intervals. This would be the equivalent of the quarterly reviews that large organizations do of their operations.
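To show how mechanical this framework could be in practice, here is a toy encoding of the five levels as a review-policy table in Python. The stage names ("output", "synthesis", "plan", "execution") are my own illustrative labels, not any kind of standard; the point is simply that trust is granted stage by stage rather than all at once.

```python
# A toy encoding of the five trust levels as a human-review policy table.
TRUST_LEVELS = {
    1: {"human_reviews": ["output", "synthesis", "plan", "execution"]},
    2: {"human_reviews": ["synthesis", "plan", "execution"]},
    3: {"human_reviews": ["plan", "execution"]},
    4: {"human_reviews": ["execution"]},  # real-time monitoring of implementation only
    5: {"human_reviews": []},             # periodic audits instead of regular check-ins
}

def requires_human_review(level: int, stage: str) -> bool:
    """Does this stage of the workflow need a human sign-off at this trust level?"""
    return stage in TRUST_LEVELS[level]["human_reviews"]

print(requires_human_review(2, "output"))     # False: low-level outputs are trusted
print(requires_human_review(2, "synthesis"))  # True: syntheses still get checked
```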
Should we be cautious in adopting AI agents? Absolutely. Yet that caution should not translate into avoidance, or into setting some binary pass/fail test before we allow even the simplest, least powerful agents into our lives and our economy. We are in a multiplayer game - there are individuals and groups with the resources to absorb the risks of newly deployed agents and iterate rapidly, potentially gaining a monopoly on mastery over these systems. If we want to prevent the concentration of power, it is essential to democratize knowledge and usage of AI agents, ensuring that as many people as possible become proficient in their deployment and management.
Daniel Miessler has aptly noted that “AI is already becoming like reading.” By that, he drew attention to the divergence in how much people at the top and bottom of the economic scale read. Charlie Munger, partner of Warren Buffett at Berkshire Hathaway, said he and Buffett knew of no truly wise person who did not read (and write) all the time - always expanding their knowledge, always seeking to figure out how to be better, always looking for new tools and possibilities. The causal connection here is not the one the popular business press - and clickbait YouTube videos and social media posts - make it out to be. People like Bill Gates, Warren Buffett, and Jeff Bezos aren’t where they are because of reading - they read because they have the traits that make them voraciously seek out new information and possibilities. Growth mindset isn’t just pop psychology: it’s a real thing some people have and other people don’t, and maybe it is something that can be cultivated. At least, for the sake of most people, I hope it is.
Miessler’s point is that AI tools have passed the peak of public attention, becoming part of the background noise rather than something people are still investigating and learning about. He is worried that in the near future there will be two groups of people: a Group 1 who use AI systems only to carry out chores or find entertainment, and a Group 2 who deeply understand AI agents and multiagent systems, and use them to expand their own knowledge, influence, and power. This is going to create an even greater divergence in human capabilities than already exists: imagine the difference between a person who has a somewhat more capable Siri in their phone that can tell them jokes or book appointments for them, and someone who has 10,000 agents, with overseer agents supervising lower-level agents, carrying out vast ranges of tasks, from optimizing their financial portfolio to monitoring their health to advocating to government on behalf of the causes they care about (and many more things we can’t yet imagine).
This divergence in capability will have profound implications for power dynamics. The best way to counter this is by expanding access to AI knowledge. “Democratizing AI” is a pretty term that is getting a lot of traction in the circles I’m in, but it is desperately vague. When I press people on this, they mean something like making sure the voices of as many people as possible are heard in shaping AI regulations. If you’ve read Walter Lippmann, you know this is a fool’s errand. Also, it’s not even clear whether it makes any sense to regulate an industrial revolution that is only just starting to happen. I think informing people of what the tools that are now available can do, and will be able to do soon, is much more important.
To put it another way, you can be part of making the future by being in Group 2, or you can be in Group 1 and live in the future those in Group 2 make for you. There will of course be a horse-to-water problem - as I used to say when I was a teaching assistant during my graduate studies, “you can lead a student to knowledge, but you can’t make them think.” Many people, due to sloth or fear or sheer lack of intellectual ability, won’t be able to take advantage of these tools. I’ve previously written about how it’s unlikely many people would know from the start how to make full use of an AI personal assistant, but this is something that civil society groups could get a jump start on teaching people. We can at least ensure as few people as possible get left behind.
The closing message, then, is this: learn as much as you can about AI agents and start experimenting now. Understanding these systems and their potential failures will be key to harnessing their power effectively and ensuring that AI doesn’t just become a tool for the few but a transformative force for the many.
As opposed to an AI-generated talking head, which I’m sure we’re going to see more of in the future.
If you want a preview of what this is like, head to Google’s AI Studio, where you can play with their most advanced models (with a 2,000,000-token context window) for free. Seriously, the amount of free usage Google allows with their API is beyond insane. If you are doing any experiments on AI systems design and cognitive architectures that aren’t at enterprise scale, do yourself a favour and go check it out.
For those curious about the method: I first brainstormed on my own, then gave my list and a request for more ideas to ChatGPT, Claude Sonnet, and Gemini Pro 1.5. I then read over their outputs, combined them into a single document, and gave this document to o1-preview. o1-preview identified the shared concepts, synthesized the explanations of each model, and listed the unique ideas each model suggested. I then reviewed the full list and kept the ones I thought were promising.