The Tech Sales Newsletter #93: AI Agents 2.0
Source: “The Era of Experience” whitepaper
One of the interesting aspects of the AI revolution is that most of the groundbreaking work is happening in "research labs" rather than in conventional internal business units. This is driven by the fact that much of the required work is fundamental research (solving new scientific problems) rather than pure product development ("here are 3 new features").
AI is also somewhat unique in how much interdisciplinary collaboration it needs. The labs often employ researchers with significantly different areas of expertise: mathematicians, computer scientists, neuroscientists, linguists, and others.
While the "San Francisco consensus" today is that we are close to reaching AGI, the reality is that we probably need one more leap in scientific discovery to actually land there. Since the existing models have mostly exhausted the available training data, the AI community has been engaged in intense discussion and experimentation around "what's next".
Today I'll cover what that means for tech sales.
The key takeaway
For tech sales: The next disruption in AI will likely be less about efficiency (like the "DeepSeek" moment) and more about a big leap in capability. The most logical path right now is through implementing agents and expanding their scope. The cutting edge of research is pointing towards what we can think of as "super-agents" rather than minor automated features. The big question for any rep is, "how well is my company positioned to adapt to this shift?"
For investors: There are only a few players right now that will bring the next leap of capability in tech, and most aren't even available for investing. It's difficult to look past either Alphabet as the first player to get to self-training agents or NVIDIA as the key technology provider for OpenAI and xAI.
The Era of Experience
We stand on the threshold of a new era in artificial intelligence that promises to achieve an unprecedented level of ability. A new generation of agents will acquire superhuman capabilities by learning predominantly from experience.
"The Era of Experience" is a recently published white paper that builds upon some of the recent progress on AI agents and offers a new vision for how to scale them. It was written within DeepMind - the leading corporate laboratory that operates as part of Alphabet.
While imitating humans is enough to reproduce many human capabilities to a competent level, this approach in isolation has not and likely cannot achieve superhuman intelligence across many important topics and tasks. In key domains such as mathematics, coding, and science, the knowledge extracted from human data is rapidly approaching a limit. The majority of high-quality data sources - those that can actually improve a strong agent’s performance - have either already been, or soon will be, consumed. The pace of progress driven solely by supervised learning from human data is demonstrably slowing, signalling the need for a new approach. Furthermore, valuable new insights, such as new theorems, technologies or scientific breakthroughs, lie beyond the current boundaries of human understanding and cannot be captured by existing human data.
One of the primary complaints about the major new models is that they feel iterative rather than groundbreaking. In fact, for many users, the primary benefits of recent launches have been product-related (research functionality, web browsing, image generation) rather than "the model being smarter".
To progress significantly further, a new source of data is required. This data must be generated in a way that continually improves as the agent becomes stronger; any static procedure for synthetically generating data will quickly become outstripped. This can be achieved by allowing agents to learn continually from their own experience, i.e., data that is generated by the agent interacting with its environment. AI is at the cusp of a new period in which experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems.
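To make this concrete, here is a minimal sketch of the loop the paper is describing: the agent generates its own training data by acting in an environment and learning from the measured outcome, rather than from a static human-authored corpus. The `Environment` and `Agent` classes below are toy illustrations of mine, not anything from the paper:

```python
import random

class Environment:
    """Toy environment: the agent must find a hidden number.
    Stands in for any interactive world that returns grounded feedback."""

    def __init__(self):
        self.target = random.randint(0, 100)

    def step(self, action: int) -> float:
        # Grounded reward: measured closeness to the target,
        # not a human judgement of whether the guess "looked good".
        return -abs(self.target - action)

class Agent:
    """Trivial hill-climbing learner that adapts from its own experience."""

    def __init__(self):
        self.estimate = 50
        self.direction = 1
        self.step_size = 25

    def act(self) -> int:
        return self.estimate

    def learn(self, reward: float, prev_reward: float):
        # If the last move made things worse, reverse and narrow the search.
        if reward < prev_reward:
            self.direction = -self.direction
            self.step_size = max(1, self.step_size // 2)
        self.estimate += self.direction * self.step_size

# The experiential loop: every interaction produces fresh training data,
# and that data keeps improving as the agent improves.
env, agent = Environment(), Agent()
prev_reward = float("-inf")
for episode in range(20):
    action = agent.act()
    reward = env.step(action)
    agent.learn(reward, prev_reward)
    prev_reward = reward
    print(f"episode {episode}: guess={action}, reward={reward}")
```

The point of the toy is the data flow: nothing here comes from a fixed dataset, so the stream of training signal never runs out.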
One of the primary arguments for the Nvidia moat is the need for large amounts of synthetic data, particularly for training models on "real-world" scenarios such as autonomous driving. This, however, does not actually mean that the models improve cognitively. As we all know, you don't have to be extremely intelligent and capable in all realms to drive a car at the highest level.
AlphaProof recently became the first program to achieve a medal in the International Mathematical Olympiad, eclipsing the performance of human-centric approaches. Initially exposed to around a hundred thousand formal proofs, created over many years by human mathematicians, AlphaProof’s reinforcement learning (RL) algorithm subsequently generated a hundred million more through continual interaction with a formal proving system. This focus on interactive experience allowed AlphaProof to explore mathematical possibilities beyond the confines of pre-existing formal proofs, so as to discover solutions to novel and challenging problems. Informal mathematics has also achieved success by replacing expert generated data with self-generated data; for example, recent work from DeepSeek “underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.”
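The AlphaProof pattern is worth sketching: propose candidate proofs, check them with a formal verifier, and keep only what the verifier accepts as new training data. The functions below are placeholders I'm inventing to show the shape of the loop (the real system uses the Lean proof environment and a learned policy):

```python
import random

def formal_verifier(conjecture: str, proof: str) -> bool:
    """Placeholder for a formal proving system (AlphaProof used Lean).
    Acceptance by the verifier is the grounded reward signal."""
    return proof.endswith("qed")  # toy stand-in for real proof checking

def propose_proof(conjecture: str) -> str:
    """Placeholder for a learned policy proposing candidate proofs;
    here we just sample from a fixed set so the loop is runnable."""
    return random.choice(["intro h; exact h; qed", "bad step", "simp; qed"])

# Self-generated data loop: proofs that pass the verifier become new
# training examples, letting the system grow far beyond its initial
# corpus of ~100k human-written proofs.
training_data = []
for attempt in range(1_000):
    conjecture = f"theorem_{attempt % 10}"
    proof = propose_proof(conjecture)
    if formal_verifier(conjecture, proof):
        training_data.append((conjecture, proof))  # keep only verified proofs

print(f"collected {len(training_data)} machine-verified examples")
```

The design choice that matters: the verifier, not a human rater, decides what counts as success, so the system can scale its own curriculum indefinitely.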
Our contention is that incredible new capabilities will arise once the full potential of experiential learning is harnessed. This era of experience will likely be characterized by agents and environments that, in addition to learning from vast quantities of experiential data, will break through the limitations of human-centric AI systems in several further dimensions:
• Agents will inhabit streams of experience, rather than short snippets of interaction.
• Their actions and observations will be richly grounded in the environment, rather than interacting via human dialogue alone.
• Their rewards will be grounded in their experience of the environment, rather than coming from human prejudgement.
• They will plan and/or reason about experience, rather than reasoning solely in human terms.
It's widely accepted that the most powerful Enterprise application of LLMs is coding. This was not "obvious" two years ago. LLMs built on the transformer architecture are about processing and predicting language, not computer interaction. Their coding capabilities were initially a byproduct of being trained on human data (with code making up a significant share of that data). Coding later became a business case, but also an experimental research path.
When we fed GPT-3.5 or Claude all of the human knowledge available on the internet, they did not suddenly become able to think. That suggested that teaching a model to analyze and interpret language does not, by itself, produce logical thought. So the next step was to teach them math, through the "applied math language" of code.
What the DeepMind team is exploring here is how, if we want to take the next step in AI, we essentially want to let agents learn through the experience of interacting with both humans and computer code.
We believe that today’s technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation to achieve these breakthroughs. Furthermore, the pursuit of this agenda by the AI community will spur new innovations in these directions that rapidly progress AI towards truly superhuman agents.
AGI/ASI has to incorporate agentic behavior by definition. So getting to that level means that we need to build better AI agents.
An experiential agent can continue to learn throughout a lifetime. In the era of human data, language-based AI has largely focused on short interaction episodes: e.g., a user asks a question and (perhaps after a few thinking steps or tool-use actions) the agent responds.
Typically, little or no information carries over from one episode to the next, precluding any adaptation over time. Furthermore, the agent aims exclusively for outcomes within the current episode, such as directly answering a user’s question. In contrast, humans (and other animals) exist in an ongoing stream of actions and observations that continues for many years. Information is carried across the entire stream, and their behaviour adapts from past experiences to self-correct and improve.
Furthermore, goals may be specified in terms of actions and observations that stretch far into the future of the stream. For example, humans may select actions to achieve long-term goals like improving their health, learning a language, or achieving a scientific breakthrough.
Powerful agents should have their own stream of experience that progresses, like humans, over a long time-scale. This will allow agents to take actions to achieve future goals, and to continuously adapt over time to new patterns of behaviour. For example, a health and wellness agent connected to a user’s wearables could monitor sleep patterns, activity levels, and dietary habits over many months. It could then provide personalized recommendations, encouragement, and adjust its guidance based on long-term trends and the user’s specific health goals.
This touches on the concept of a "context window", essentially the length of the conversation you can have with a model. Each new statement needs to account for the historical conversation, which is compute-intensive, and most models start to lose track of earlier details well before the "end" of that context window.
A defining technical capability of these agents would be an effectively unlimited context window. Once created, an agent would keep accumulating all interactions and conversations in a single stream, always building on top of it.
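In practice, "unlimited context" would more likely be approximated than literal: keep recent interactions verbatim and compress older ones into summaries that persist across sessions. A rough sketch, with a class of my own invention standing in for whatever the real mechanism turns out to be:

```python
class LifelongMemory:
    """Sketch of approximating an 'unlimited' context window:
    recent turns stay verbatim, older ones get compressed into
    a running summary. (My own illustration, not a real product.)"""

    def __init__(self, window: int = 8):
        self.window = window          # how many recent turns stay verbatim
        self.recent: list[str] = []   # full-fidelity recent interactions
        self.summary = ""             # compressed long-term history

    def add(self, turn: str):
        self.recent.append(turn)
        if len(self.recent) > self.window:
            oldest = self.recent.pop(0)
            # Stand-in for a real summarizer (e.g. an LLM call).
            self.summary += f" [{oldest[:30]}...]"

    def context(self) -> str:
        # Everything the model sees: long-term summary plus recent detail.
        return f"SUMMARY:{self.summary}\nRECENT:\n" + "\n".join(self.recent)

memory = LifelongMemory(window=2)
for i in range(5):
    memory.add(f"turn {i}: user said something important")
print(memory.context())
```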
Similarly, a personalized education agent could track a user’s progress in learning a new language, identify knowledge gaps, adapt to their learning style, and adjust its teaching methods over months or even years. Furthermore, a science agent could pursue ambitious goals, such as discovering a new material or reducing carbon dioxide levels. Such an agent could analyse real-world observations over an extended period, develop and run simulations, and suggest real-world experiments or interventions.
In each case, the agent takes a sequence of steps so as to maximise long-term success with respect to the specified goal. An individual step may not provide any immediate benefit, or may even be detrimental in the short term, but may nevertheless contribute in aggregate to longer term success. This contrasts strongly with current AI systems that provide immediate responses to requests, without any ability to measure or optimise the future consequences of their actions on the environment.
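The formal version of "an individual step may be detrimental in the short term but contribute in aggregate to longer-term success" is the discounted return from reinforcement learning: the agent optimizes the sum of future rewards rather than the next one. A quick worked example (my own illustration):

```python
def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    """Sum of future rewards, each discounted by gamma per step:
    G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A plan with short-term cost but a large delayed payoff...
invest = [-1.0, -1.0, -1.0, +10.0]
# ...versus one that grabs a small immediate reward and stops there.
greedy = [+1.0, 0.0, 0.0, 0.0]

print(discounted_return(invest))  # ~5.72: wins despite three losing steps
print(discounted_return(greedy))  # 1.00
```

The "invest" plan wins despite losing for three steps in a row, which is exactly the kind of behavior a request-response system that only scores its next answer cannot express.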
It's easy to see how this would be incredibly useful for the end user. We are already using our phones like this (all our conversation history and pictures, stored in a single place). But imagine being able to get constant new insights and ideas, building on top of all the information the AI agent knows about you.
Agents in the era of experience will act autonomously in the real world. LLMs in the era of human data focused primarily on human-privileged actions and observations that output text to a user, and input text from the user back into the agent. This differs markedly from natural intelligence, in which an animal interacts with its environment through motor control and sensors. While animals, and most notably humans, may communicate with other animals, this occurs through the same interface as other sensorimotor control rather than a privileged channel.
It has long been recognised that LLMs may also invoke actions in the digital world, for example by calling APIs. Initially, these capabilities came largely from human examples of tool-use, rather than from the experience of the agent. However, coding and tool-use capabilities have built increasingly upon execution feedback, where the agent actually runs code and observes what happens. Recently, a new wave of prototype agents have started to interact with computers in an even more general manner, by using the same interface that humans use to operate a computer. These changes herald a transition from exclusively human-privileged communication, to much more autonomous interactions where the agent is able to act independently in the world. Such agents will be able to actively explore the world, adapt to changing environments, and discover strategies that might never occur to a human.
These richer interactions will provide a means to autonomously understand and control the digital world. The agent may use ‘human friendly’ actions and observations such as user interfaces, that naturally facilitate communication and collaboration with the user. The agent may also take ‘machine-friendly’ actions that execute code and call APIs, allowing the agent to act autonomously in service of its goals. In the era of experience, agents will also interact with the real world via digital interfaces. For example, a scientific agent could monitor environmental sensors, remotely operate a telescope, or control a robotic arm in a laboratory to autonomously conduct experiments.
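The execution-feedback idea from the passage above is easy to sketch: instead of a human rating the agent's code, the code actually runs and the observed output drives the next attempt. `generate_candidate` below is a stand-in for a model call, hard-coded so the loop runs end to end:

```python
import subprocess, sys, tempfile, textwrap

def generate_candidate(feedback: str) -> str:
    """Stand-in for an LLM proposing code. Hard-coded: first a buggy
    attempt, then a corrected one once the error message comes back."""
    if "NameError" in feedback:
        return "x = 21\nprint(x * 2)"  # corrected attempt
    return "print(x * 2)"              # first attempt: x is undefined

def run(code: str) -> str:
    """Execute the candidate in a subprocess and capture what happened.
    The observed outcome, not a human rating, is the learning signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=10)
    return result.stdout + result.stderr

feedback = ""
for attempt in range(3):
    code = generate_candidate(feedback)
    feedback = run(code)
    print(f"attempt {attempt}:\n{textwrap.indent(feedback, '  ')}")
    if "Error" not in feedback:
        break  # the program ran cleanly: a grounded success signal
```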
We are currently thinking about AI agents as potential "employees", able to be deployed to perform a variety of tasks on our behalf (and sometimes replacing us fully). This is still very much a "human-centric" approach to AI, which might or might not lead to the breakthrough we expect. If being able to access, create, and run code is a superpower that AI agents have, then letting them use that as part of their learning process opens up a completely different learning path.
Human-centric LLMs typically optimise for rewards based on human prejudgement: an expert observes the agent’s action and decides whether it is a good action, or picks the best agent action among multiple alternatives. For example, an expert may judge a health agent’s advice, an educational assistant’s teaching, or a scientist agent’s suggested experiment. The fact that these rewards or preferences are determined by humans in the absence of their consequences, rather than by measuring the effect of those actions on the environment, means that they are not directly grounded in the reality of the world. Relying on human prejudgement in this manner usually leads to an impenetrable ceiling on the agent’s performance: the agent cannot discover better strategies that are under-appreciated by the human rater. To discover new ideas that go far beyond existing human knowledge, it is instead necessary to use grounded rewards: signals that arise from the environment itself.
For example, a health assistant could ground the user’s health goals into a reward based on a combination of signals such as their resting heart rate, sleep duration, and activity levels, while an educational assistant could use exam results to provide a grounded reward for language learning. Similarly, a science agent with a goal to reduce global warming might use a reward based on empirical observations of carbon dioxide levels, while a goal to discover a stronger material might be grounded in a combination of measurements from a materials simulator, such as tensile strength or Young’s modulus.
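A grounded reward like this is just a function of measured signals. Here's a toy sketch of the health example; the target values and weights are illustrative assumptions of mine, not anything specified in the paper:

```python
def health_reward(resting_hr: float, sleep_hours: float,
                  active_minutes: float) -> float:
    """A grounded reward built from measured wearable signals rather
    than a human rating. Targets and weights are illustrative only."""
    hr_score = max(0.0, 1 - abs(resting_hr - 60) / 40)    # target ~60 bpm
    sleep_score = max(0.0, 1 - abs(sleep_hours - 8) / 4)  # target ~8 hours
    activity_score = min(active_minutes / 30, 1.0)        # cap at 30 min/day
    return 0.4 * hr_score + 0.4 * sleep_score + 0.2 * activity_score

# The agent's recommendations get scored by what actually happened to
# the user's measurements, not by whether the advice sounded plausible.
print(health_reward(resting_hr=58, sleep_hours=7.5, active_minutes=20))
```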
Grounded rewards may arise from humans that are part of the agent’s environment. For example, a human user could report whether they found a cake tasty, how fatigued they are after exercising, or the level of pain from a headache, enabling an assistant agent to provide better recipes, refine its fitness suggestions, or improve its recommended medication. Such rewards measure the consequence of the agent’s actions within their environment, and should ultimately lead to better assistance than a human expert that prejudges a proposed cake recipe, exercise program, or treatment program.
One of the common concerns with developing AI is the fear that it will “go rogue” and hurt us. What happens if we design AI agents obsessed with helping us win?
Where do rewards come from, if not from human data? Once agents become connected to the world through rich action and observation spaces, there will be no shortage of grounded signals to provide a basis for reward. In fact, the world abounds with quantities such as cost, error rates, hunger, productivity, health metrics, climate metrics, profit, sales, exam results, success, visits, yields, stocks, likes, income, pleasure/pain, economic indicators, accuracy, power, distance, speed, efficiency, or energy consumption. In addition there are innumerable additional signals arising from the occurrence of specific events, or from features derived from raw sequences of observations and actions.
One could in principle create a variety of distinct agents, each optimising for one grounded signal as its reward. There is an argument that even a single such reward signal, optimised with great effectiveness, may be sufficient to induce broadly capable intelligence. This is because the achievement of a simple goal in a complex environment may often require a wide variety of skills to be mastered.
Probably one of the most confusing parts of training and building new models relates to "rewards". The idea is that, in order to nudge more "human-like" behavior, we want to give AI a similarly motivation-driven experience. By letting agents "experience" the world through interaction, we create an infinite loop of potential incentives, not limited by our own imagination or capability.
Will the era of experience change the way that agents plan and reason? Recently, there has been significant progress using LLMs that can reason, or “think” with language, by following a chain of thought before outputting a response. Conceptually, LLMs can act as a universal computer: an LLM can append tokens into its own context, allowing it to execute arbitrary algorithms before outputting a final result.
In the era of human data, these reasoning methods have been explicitly designed to imitate human thought processes. For example, LLMs have been prompted to emit human-like chains of thought, imitate traces of human thinking, or to reinforce steps of thinking that match human examples. The reasoning process may be fine-tuned further to produce thinking traces that match the correct answer, as determined by human experts.
However, it is highly unlikely that human language provides the optimal instance of a universal computer. More efficient mechanisms of thought surely exist, using non-human languages that may for example utilise symbolic, distributed, continuous, or differentiable computations. A self-learning system can in principle discover or improve such approaches by learning how to think from experience. For example, AlphaProof learned to formally prove complex theorems in a manner quite different to human mathematicians.
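The "universal computer" framing from above boils down to a simple control flow: the model appends intermediate "thought" tokens to its own context, then emits a final answer. A sketch, with a canned stand-in for the model call:

```python
def llm(prompt: str) -> str:
    """Placeholder for a model call; returns canned 'thoughts' and then
    an answer so the control flow is runnable."""
    if "Thought 2" in prompt:
        return "ANSWER: 42"
    return "break the problem into smaller steps"

# The 'universal computer' pattern: the model extends its own context
# with intermediate tokens before producing a final result.
context = "Question: what is 6 * 7?\n"
for step in range(1, 5):
    output = llm(context)
    if output.startswith("ANSWER:"):
        print(output)
        break
    context += f"Thought {step}: {output}\n"  # append reasoning to context
```

Nothing in this loop forces the appended tokens to be human-readable language; that is exactly the degree of freedom the researchers argue experience-driven systems will exploit.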
This is a rather provocative statement from the researchers, but the basic idea is that recent reasoning models such as o3 might be counterproductive to developing smarter AI agents in the long term. We are essentially making them “ponder” a question the way we would approach it, rather than letting the model figure out an alternative solution by itself. When we do test this, we find models reaching improved outcomes from surprising angles.
An agent trained to imitate human thoughts or even to match human expert answers may inherit fallacious methods of thought deeply embedded within that data, such as flawed assumptions or inherent biases.
For example, if an agent had been trained to reason using human thoughts and expert answers from 5,000 years ago it may have reasoned about a physical problem in terms of animism; 1,000 years ago it may have reasoned in theistic terms; 300 years ago it may have reasoned in terms of Newtonian mechanics; and 50 years ago in terms of quantum mechanics. Progressing beyond each method of thought required interaction with the real world: making hypotheses, running experiments, observing results, and updating principles accordingly. Similarly, an agent must be grounded in real-world data in order to overturn fallacious methods of thought.
This grounding provides a feedback loop, allowing the agent to test its inherited assumptions against reality and discover new principles that are not limited by current, dominant modes of human thought. Without this grounding, an agent, no matter how sophisticated, will become an echo chamber of existing human knowledge. To move beyond this, agents must actively engage with the world, collect observational data, and use that data to iteratively refine their understanding, mirroring in many ways the process that has driven human scientific progress.
Again, a provocative insight. Can we trust ourselves to train models to groundbreaking capability if we limit our “fact-checking” to our own understanding of the world today?
The era of human data offered an appealing solution. Massive corpuses of human data contain examples of natural language for a huge diversity of tasks. Agents trained on this data achieved a wide range of competencies compared to the more narrow successes of the era of simulation. Consequently, the methodology of experiential RL was largely discarded in favour of more general-purpose agents, resulting in a widespread transition to human-centric AI.
However, something was lost in this transition: an agent’s ability to self-discover its own knowledge. For example, AlphaZero discovered fundamentally new strategies for chess and Go, changing the way that humans play these games. The era of experience will reconcile this ability with the level of task-generality achieved in the era of human data. This will become possible, as outlined above, when agents are able to autonomously act and observe in streams of real-world experience, and where the rewards may be flexibly connected to any of an abundance of grounded, real-world signals.
The origins of the recent thinking models lie in specialized, simulation-era training. While that approach was not general enough to create a modern LLM, it did lead to breakthroughs, like AlphaZero's novel strategies, that current models are not able to replicate.
The advent of the era of experience, where AI agents learn from their interactions with the world, promises a future profoundly different from anything we have seen before. This new paradigm, while offering immense potential, also presents important risks and challenges that demand careful consideration, including but not limited to the following points.
On the positive side, experiential learning will unlock unprecedented capabilities. In everyday life, personalized assistants will leverage continuous streams of experience to adapt to individuals’ health, educational, or professional needs towards long-term goals over the course of months or years. Perhaps most transformative will be the acceleration of scientific discovery. AI agents will autonomously design and conduct experiments in fields like materials science, medicine, or hardware design. By continuously learning from the results of their own experiments, these agents could rapidly explore new frontiers of knowledge, leading to the development of novel materials, drugs, and technologies at an unprecedented pace.
However, this new era also presents significant and novel challenges. While the automation of human capabilities promises to boost productivity, these improvements could also lead to job displacement. Agents may even be able to exhibit capabilities previously considered the exclusive realm of humanity, such as long-term problem solving, innovation, and a deep understanding of real world consequences.
Furthermore, whilst general concerns exist around the potential misuse of any AI, heightened risks may arise from agents that can autonomously interact with the world over extended periods of time to achieve long-term goals. By default, this provides fewer opportunities for humans to intervene and mediate the agent’s actions, and therefore requires a high bar of trust and responsibility. Moving away from human data and human modes of thinking may also make future AI systems harder to interpret.
The obvious consequence and risk of going in this direction is that our ability to supervise and control outcomes will gradually become more limited over time. This would be particularly true if the models reach a level of intelligence that surpasses our own. AI agents will be something completely different from any creature we've ever known.