AI Agents as Universal Task Solvers: When Verification Changes Everything
Imagine you're building a web scraper, and you have a comprehensive test suite that can instantly tell you whether your code works correctly on any input. You could, in theory, generate random code until something passes all the tests. It would take an absurdly long time, but it would eventually work.
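A toy version of this brute-force loop, with tiny linear functions standing in for code and a hard-coded test suite as the verifier (the task and all names here are invented for illustration):

```python
import random

# Toy "test suite": the verifier accepts a candidate iff it matches 2*x + 1.
def passes_tests(f):
    return all(f(x) == 2 * x + 1 for x in (0, 1, 4))

# Blind generation: sample random candidate "programs" (pairs (a, b) read as
# the function a*x + b) until one happens to pass every test.
def generate_until_verified(seed=0):
    rng = random.Random(seed)
    attempts = 0
    while True:
        attempts += 1
        a, b = rng.randrange(10), rng.randrange(10)
        if passes_tests(lambda x, a=a, b=b: a * x + b):
            return (a, b), attempts

solution, attempts = generate_until_verified()
print(solution)  # (2, 1) -- the only candidate that passes all three tests
```

Correctness is guaranteed eventually; the only question is how many wasted attempts it takes, which is exactly the quantity the rest of the article is about.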
Now imagine you've built many scrapers before. Each time you write a new one, you draw on patterns from previous projects — how to handle HTTP errors, parse HTML, manage rate limits. You're not just generating random code anymore; you're navigating toward solutions using experience.
This seemingly simple scenario captures something profound about how AI systems actually work, and why a new theoretical framework suggests that in domains where solutions can be quickly verified, the central problem is not whether a correct answer exists but how quickly an agent can find one.
The Verifiable World: Where Everything Changes
The story begins with recognizing a crucial boundary condition that shapes the entire analysis: this framework only applies to verifiable settings — problems where you can quickly check if a proposed solution is correct.
Consider our coding example:
- The problem: Build a function that extracts product prices from e-commerce pages
- The verifier: A test suite with sample pages and expected outputs
- The challenge: Find code that passes all tests as quickly as possible
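In code, the verifier for this hypothetical task can be as small as a list of sample pages with expected outputs (the HTML snippets, regex, and function names below are all invented for illustration):

```python
import re

# Hypothetical sample pages and expected outputs: the entire "test suite".
SAMPLES = [
    ('<span class="price">$19.99</span>', ["19.99"]),
    ('<div>no price here</div>', []),
]

def verify(extract_prices):
    """Fast check: does a candidate solution pass every sample page?"""
    return all(extract_prices(html) == want for html, want in SAMPLES)

# One candidate solution among the many the verifier would accept.
def extract_prices(html):
    return re.findall(r'\$(\d+\.\d{2})', html)

print(verify(extract_prices))  # True
```

Note that `verify` runs in milliseconds regardless of how long it took to produce the candidate, which is what makes the domain "verifiable" in the article's sense.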
This pattern appears everywhere AI has shown remarkable success:
- Chess: The rules determine if a move is legal and who wins
- Mathematical proofs: Proof checkers can verify correctness in seconds
- Protein folding: Simulations can evaluate stability and function
- Circuit design: You can test whether designs meet specifications
In these domains, the fundamental problem shifts. We're not asking "Can this be solved?" but "How fast can we find a solution?" This distinction turns out to be mathematically profound.
From Generalization to Navigation: The Transductive Turn
Traditional machine learning works like studying for a broad exam. A system observes patterns across many examples, extracts general rules, then applies those rules to new situations. This is inductive learning — from specific examples to general principles.
But there's a fundamentally different approach: transductive learning. Instead of learning universal principles, the system learns to navigate efficiently toward solutions for specific problems using verification feedback.
Back to our coding example: Instead of learning general programming principles that work everywhere, you learn to use test failures as navigation signals. Each failed test tells you something about the specific problem structure. You adjust your approach based on which tests fail and how, gradually steering toward passing code.
The key insight: In verifiable settings, verification signals become central to the learning process itself. You're not learning to generalize — you're learning to search efficiently.
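A stripped-down picture of search-with-feedback: when each failure reports a direction rather than a bare pass/fail, search becomes logarithmic instead of linear. The "hidden rate limit" task and its bounds are invented for illustration:

```python
TRUE_LIMIT = 37  # hidden value the agent must discover (illustrative)

def verifier(guess):
    """Returns 'pass', or a failure signal that carries direction information."""
    if guess == TRUE_LIMIT:
        return "pass"
    return "too_high" if guess > TRUE_LIMIT else "too_low"

def navigate(lo=0, hi=100):
    """Use each failure as a navigation signal: classic binary search."""
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        result = verifier(mid)
        if result == "pass":
            return mid, steps
        if result == "too_low":
            lo = mid + 1
        else:
            hi = mid - 1
    return None, steps

print(navigate())  # finds 37 in a handful of steps, not ~100 blind guesses
```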
The Universal Solver Dream (and Its Catch)
Here's where the theory gets striking. Computer science has proven that a universal problem-solving algorithm exists — one that can tackle any computational task with only a constant factor slowdown compared to the best possible specialized algorithm.
For our coding problem, this means there's an algorithm that could write your scraper within a factor of:
T_universal(x) ≤ 2^ℓ(A) · T_A(x)
where T_A(x) is the time the best specialized algorithm A takes and ℓ(A) is how many bits it takes to describe that algorithm. The multiplicative constant depends only on the description length of the best algorithm, not on the particular problem instance.
This sounds like the holy grail! But there's a devastating catch: that constant factor 2^ℓ(A) can be astronomically large. If the optimal coding strategy takes 1,000 bits to describe, you're looking at a slowdown factor of 2^1000 — larger than the number of atoms in the universe.
So the universal solver would take longer than the age of the cosmos to write code that an expert could craft in hours.
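The arithmetic behind that claim is easy to check directly (10^80 is a common order-of-magnitude estimate for the number of atoms in the observable universe):

```python
# A 1,000-bit optimal strategy gives a worst-case slowdown factor of 2**1000.
ATOMS_IN_UNIVERSE = 10 ** 80       # rough order-of-magnitude estimate
slowdown = 2 ** 1000

print(slowdown > ATOMS_IN_UNIVERSE)   # True
print(len(str(slowdown)))             # 302 digits, i.e. roughly 10**301
```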
Learning as Speed: The Mathematical Connection
Here's where learning transforms everything. There's a precise relationship between learning and computational speedup:
log(speedup) = I(h : D)
where I(h : D) is the algorithmic mutual information between the hypothesis space and the training data. This is completely different from Shannon information.
In our coding scenario: Shannon-style uncertainty reduction is not the central issue in this idealized setting. Whether your scraper will work is ultimately verifiable by running the tests — correctness is always achievable given enough time. But if your training data contains thousands of similar scraping problems and solutions, it shares algorithmic structure with your current task. That shared structure lets you navigate to working code exponentially faster.
The core insight: Learning doesn't primarily buy you correctness (the tests guarantee that, given time). Learning buys you speed.
If you've built scrapers for news sites, product catalogs, and social media, you've learned reusable patterns: how to handle JavaScript rendering, common CSS selectors, rate limiting strategies. When faced with a new scraping task, you don't start from scratch — you navigate using algorithmic knowledge compressed from previous experience.
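Algorithmic mutual information is uncomputable, but compression gives a crude, practical proxy in the spirit of the normalized compression distance: I(x : y) ≈ C(x) + C(y) − C(xy), with a real compressor standing in for Kolmogorov complexity. The snippets below are invented stand-ins for "past scraper code" and "the new task":

```python
import zlib

def c(data: bytes) -> int:
    """Compressed length as a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

old_scrapers = b"resp = get(url); soup = parse(resp); soup.select('td.price')" * 3
new_task     = b"resp = get(url); soup = parse(resp); soup.select('span.cost')" * 3
unrelated    = bytes(range(256))  # filler sharing no text structure with the code

# Proxy for shared algorithmic structure: C(x) + C(y) - C(xy).
shared      = c(old_scrapers) + c(new_task) - c(old_scrapers + new_task)
junk_shared = c(old_scrapers) + c(unrelated) - c(old_scrapers + unrelated)

print(shared > junk_shared)  # past scraping code shares far more structure
```

This is only an analogy for the theorem's quantity, but it makes the intuition concrete: the compressor "reuses" structure from the old scrapers when encoding the new task, just as an experienced engineer does.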
This connects to empirical observations through a generalized Hilberg-style scaling assumption — I(Xn : Yn) ∝ n^β — used to derive time-scaling laws. If this assumption holds for the data-generating process, it provides theoretical grounding for the power-law scaling behaviors observed in practice, though empirical validation remains an active research question.
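As a sanity check on what such a power law implies: on a log-log plot, I(n) ∝ n^β is a straight line with slope β, so β can be read off from two measurements (the numbers below are synthetic, generated from an assumed β = 0.6):

```python
import math

def estimate_beta(n1, i1, n2, i2):
    """Slope of the log-log line through two (n, I(n)) measurements."""
    return (math.log(i2) - math.log(i1)) / (math.log(n2) - math.log(n1))

# Synthetic measurements drawn from I(n) = 3 * n**0.6 (an assumed process):
beta = estimate_beta(100, 3 * 100 ** 0.6, 10_000, 3 * 10_000 ** 0.6)
print(round(beta, 3))  # 0.6
```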
The Complexity Ceiling: When Simple Problems Limit Learning
Now for a counterintuitive result that challenges conventional wisdom. The maximum possible speedup from learning is fundamentally bounded by K(q), the Kolmogorov complexity of the underlying problem-generating process:
log(speedup) ≤ K(q)
In our coding context: If the types of scraping problems you encounter have low algorithmic complexity — say, they're all variations of "extract text from simple HTML tables" — there's a ceiling on how much your experience can speed up future solutions. The simpler the underlying pattern, the less room there is for learning to help.
This explains why some engineering domains plateau despite massive data collection. It's not always about finding better algorithms or gathering more examples. Sometimes the problem structure itself doesn't contain enough complexity to support further speedup.
The extreme case: If scraping problems were generated by Solomonoff's Universal Prior (the simplest possible information source), there would be zero algorithmic structure to learn — no speedup possible at all.
This challenges the conventional wisdom that simplicity aids learning. Simple data-generating processes are easier to model, yes — but they also offer less room for the kind of speedup that matters in verifiable settings.
The Savant Trap: When Memorization Masquerades as Reasoning
A concerning failure mode exists: with unbounded model size and compute, systems that have access to a reward signal can achieve high performance through brute computational force rather than genuine problem-solving.
Consider two approaches to becoming a coding expert:
Approach A (Savant): Memorize millions of code examples and match new problems to the closest stored solution using massive computational resources.
Approach B (Reasoner): Learn efficient problem-solving strategies that generate solutions with computational effort proportional to problem complexity.
Approach A might pass more tests initially, but it's fundamentally brittle. It hasn't learned to code — it's learned to pattern-match at enormous computational cost. Give it a problem that doesn't resemble anything in its memory, and it falls apart.
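The contrast in miniature (both "experts" below are toy functions, and the task and table size are invented for illustration):

```python
# Savant: a huge lookup table memorizing input -> output pairs.
MEMORY = {x: 2 * x + 1 for x in range(100_000)}

def savant(x):
    return MEMORY.get(x)      # silently fails outside its memorized range

# Reasoner: a constant-size strategy that captured the rule itself.
def reasoner(x):
    return 2 * x + 1

print(savant(10), reasoner(10))            # both pass in-distribution
print(savant(10 ** 9), reasoner(10 ** 9))  # only the reasoner survives
```

On any benchmark drawn from the memorized range the two are indistinguishable by accuracy alone; they differ in storage, in computational profile, and in what happens off-distribution.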
This is a possible failure mode of naïve scaling, not an inevitable outcome. But it suggests that measuring only accuracy is insufficient — we should also track computational efficiency relative to problem complexity as a guard against savant behavior.
Proper Time: Measuring Computation in Stochastic Systems
One of the most important technical contributions here is a new way to measure computational effort for AI systems.
Traditional computer science counts deterministic steps — how many operations a program executes. But LLMs don't work this way. They generate tokens probabilistically, taking different reasoning paths each time. The same model might solve the same problem quickly on one attempt and slowly on the next.
Why can't we just count expected tokens? Because that gives degenerate results. Consider a model that outputs random tokens until it accidentally produces a correct solution. Its "expected reasoning length" could be low (it occasionally gets lucky fast), but it clearly isn't reasoning. The expected value doesn't capture what's actually happening.
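A toy simulation of why a simple average is the wrong summary for a stochastic solver: a blind guesser's solve times are wildly dispersed, and its occasional one-attempt successes are pure luck, not reasoning. (The setup is illustrative; the paper's actual construction of proper time is more careful.)

```python
import random

def random_guesser(target, k, rng):
    """Guess uniformly among k answers until the verifier accepts."""
    attempts = 0
    while True:
        attempts += 1
        if rng.randrange(k) == target:
            return attempts

rng = random.Random(0)
runs = [random_guesser(7, 10, rng) for _ in range(10_000)]

# Lucky runs succeed on the very first attempt; the mean hovers near k = 10.
print(min(runs), round(sum(runs) / len(runs), 1))
```

A cost measure that can be gamed by such lucky paths would certify guessing as cheap reasoning; proper time is defined precisely to rule this degenerate behavior out.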
The solution is proper time (τ) — a formally defined measure that captures computational effort in stochastic systems without these degenerate cases. This is what makes the entire theoretical framework applicable to real AI systems rather than idealized deterministic machines.
For our coding example: Proper time captures how long the AI system genuinely works to produce passing code, accounting for the stochastic nature of token generation without being fooled by lucky random outputs.
Stochastic Universality: Why LLM-Powered Agents Can (In Principle) Code Anything
Perhaps the most surprising result: LLM-powered agents, modeled as stochastic dynamical systems and paired with suitable search procedures and verification, can in principle act as universal solvers in verifiable domains.
This doesn't mean a plain LLM with chain-of-thought equals a universal solver. The system needs:
- The LLM component for candidate generation
- Search procedures for systematic exploration
- Verification mechanisms for feedback
- Proper integration across these components
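Wired together at toy scale, those ingredients look like the loop below: the "generator" is a biased sampler standing in for an LLM, and the small experiment shows that a generator shaped by experience reaches a verified solution in fewer attempts. Everything here is an invented stand-in, not the paper's construction:

```python
import random

def verifier(f):                       # fast correctness check
    return all(f(x) == 2 * x + 1 for x in (0, 1, 4))

def make_generator(experienced, rng):  # stand-in for an LLM's sampler
    def propose():
        if experienced and rng.random() < 0.5:
            return 2, rng.randrange(5)          # "experience": slope is usually 2
        return rng.randrange(5), rng.randrange(5)
    return propose

def search(propose, max_tries=100_000):
    """Systematic exploration: propose, verify, repeat."""
    for t in range(1, max_tries + 1):
        a, b = propose()
        if verifier(lambda x, a=a, b=b: a * x + b):
            return (a, b), t           # t plays the role of proper time
    return None, max_tries

def avg_tries(experienced, n_seeds=200):
    return sum(search(make_generator(experienced, random.Random(s)))[1]
               for s in range(n_seeds)) / n_seeds

print(avg_tries(False), avg_tries(True))  # the experienced generator is faster
```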
But the result is still remarkable. LLMs don't fit any traditional computational model. They're not Turing machines executing deterministic steps. Their computational mechanism — generating tokens one at a time via chain-of-thought — doesn't map to any standard paradigm. Proving universality for such systems required entirely new mathematical techniques.
For our coding example: An LLM-based coding agent (with test runner and search) could theoretically write any program that can be verified by tests, even though it might take a completely different reasoning path each time. The stochasticity doesn't prevent universality — but it requires proper time to measure what's happening.
A Note on Conceptual Depth
Building on Charles Bennett's notion of logical depth — the idea that the value of information lies not in how compressible it is, but in how much computation is required to generate it from its shortest description — researchers have begun exploring an analogous concept for trained models called conceptual depth.
Where logical depth measures computation time for a Turing machine to generate data, conceptual depth measures loss plus proper time for a trained model solving tasks. A model with high conceptual depth solves hard problems efficiently; a savant achieves low loss but at disproportionate computational cost.
(Conceptual depth appears primarily in commentary surrounding the paper. The core contribution is proper time, algorithmic information, and the speedup theorems described above.)
Implications for AI Development
These results suggest concrete shifts in how we build and evaluate AI systems:
Focus on Verification-Rich Domains
The entire framework applies only where solutions can be quickly checked. This suggests prioritizing domains with strong verification mechanisms and investing in better verification tools where they're lacking.
Optimize for Efficiency, Not Just Accuracy
Instead of asking "How accurate is the system?" ask "How efficiently does it find verified solutions relative to problem complexity?" This challenges pure scaling approaches.
Monitor for Savant Behavior
High accuracy doesn't automatically mean genuine problem-solving. Systems might be memorizing patterns instead of learning efficient reasoning strategies.
Structure Training Data Algorithmically
Since speedup depends on shared algorithmic structure between training and target problems, the organization of training data matters more than just volume.
The Bigger Question
If this framework is right, it reframes how we think about building AI systems:
- Not "How can we make models more accurate?" but "How can we make them more efficient at finding verified solutions?"
- Not "How much data do we need?" but "What algorithmic structure does our data contain?"
- Not "How big should our models be?" but "What's the minimum computational effort needed for this level of performance?"
To return to our coding example: The most capable system isn't necessarily the one that writes perfect code on every attempt. It's the one that writes working code quickly, using computational effort proportional to the problem's true complexity — learning from experience not to memorize solutions, but to navigate efficiently through the space of possible programs.
Whether this framework generalizes beyond verifiable domains remains an open question. But within its scope, the implications are clear: in the age of AI agents, time might matter more than we thought.
Based on "AI Agents as Universal Task Solvers" by Achille & Soatto (2026), published in Entropy. The theoretical foundations build on Bennett's notion of logical depth, Levin's universal search, Solomonoff's theory of inductive inference, and Kolmogorov complexity.