Thursday, May 21, 2026 — front page

scaling limits, adaptation frontier

Sara Hooker: scaling is hitting limits; adaptation and post-training are the next frontier (Hugging Face)
TL;DW
Scaling model size shows diminishing returns; the GPT-4.5, Llama 4, and Mixtral releases failed to justify their computational costs despite their larger sizes.
Small models now frequently outperform large ones on benchmarks; most neural network weights are redundant and can be removed after training with minimal performance loss.
Post-training, test-time scaling, and adaptive compute now offer better returns than pre-training compute; frontier labs are unlikely to quadruple model size again this year.
Adaptation and continuous learning emerge as the frontier; efficiency matters most because speed of learning from new information determines competitive advantage.
Optimization in data space is now cheaper than ever; targeted data curation and generation can steer model behavior toward rare parts of distributions without massive pre-training costs.
Auto Scientist automates end-to-end fine-tuning and outperforms human researchers at hyperparameter configuration by searching wider model families than domain experts typically optimize.
Small labs with strong data and training strategies can now compete; test-time compute doesn't require collocated infrastructure like pre-training, enabling distributed innovation.
Transformers are saturated architectures; hardware is overfit to matrix multiplication, so alternative architectures (capsule networks, sparse models) struggle to succeed in practice despite their theoretical merit.
Pre-training, post-training, and test-time scaling serve different functions; keep data fresh across stages by injecting new information rather than repeating, and reserve parametric capacity for skills while using retrieval for facts.
Adaptive interfaces matter as much as models; code and design enable rich feedback loops absent in chat-interface thumbs-up systems—future interfaces should enable human-AI collaboration, not just mimic human behavior.

Hooker presents evidence that smaller models now outperform larger ones, model weights carry severe redundancy, and recent releases like GPT-4.5 and Llama 4 showed returns too poor to justify serving costs. The talk covers three vectors: post-training optimization, test-time compute on high-uncertainty examples, and continuous learning — illustrated by Auto Scientist, which outperformed human researchers on fine-tuning configuration search.
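
The weight-redundancy claim can be made concrete with a small sketch. The summary does not say which pruning method the talk refers to, so this assumes plain magnitude pruning (zeroing the smallest weights after training); `magnitude_prune` and the toy `layer` values are illustrative names and numbers, not anything from the talk.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights: flat list of floats (a stand-in for one layer's parameters).
    sparsity: fraction in [0, 1] of weights to remove.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_remove = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_remove = set(order[:n_remove])
    return [0.0 if i in to_remove else w for i, w in enumerate(weights)]


# Toy layer (made-up values): a few large weights and several near-zero ones.
layer = [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.005, -0.6]
pruned = magnitude_prune(layer, sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, -0.6]
```

Half the weights are removed while every large-magnitude weight survives; whether accuracy holds at a given sparsity is an empirical question this sketch does not answer.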

context rot vs agentic retrieval

Chroma's Context One hits SOTA retrieval F1 at 75x the speed of Claude Opus (DeepLearning.AI)
TL;DW
Agentic search—iterative loops where models use retrieval tools and decide when to stop—outperforms long-context models because language models degrade in accuracy beyond ~40k-100k tokens despite marketed million-token windows (context rot).
Most AI failures today are context failures, not reasoning failures; context engineering (curating information to fit token budgets) is now more critical than improving model reasoning.
Chroma's Context One, a 20B-parameter model, achieves state-of-the-art agentic search performance at 3,000 tokens/second on Cerebras—50× smaller and 25× cheaper than Opus while matching or exceeding accuracy.
Agents need agentic search on both read (what information to retrieve) and write (where to store learned information) paths to maintain consistent knowledge across operations.
A retrieval system must handle all four quadrants: simple queries on small datasets, simple queries on large datasets, complex queries on small datasets, and complex queries on large datasets.
Speed is an underrated secular trend: faster inference (15k-20k tokens/second coming soon) enables pushing search compute to the data layer, reducing network costs and rethinking system architecture.
Small language models trained for agentic search, not frontier models, will become the dominant tool for context retrieval and writing tasks in production systems.
Continual learning for agents over the next 1-3 years will occur at the context layer—adding knowledge to retrieval systems and fine-tuning cheap small models—not by updating reasoning model weights.
Context engineering is analogous to System 2 thinking in the brain: narrow, expensive reasoning requires a fast, cheap context layer to surface relevant information for focus.

Jeff Huber argues that context windows suffer 'context rot' beyond ~40K tokens, making naive stuffing ineffective. Chroma's 20B-parameter Context One model uses agentic search loops (hybrid search, regex, document fetching) to hit state-of-the-art retrieval F1 at 3,000 tokens/sec versus 40 tokens/sec for Opus, at 1/25th the cost.
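
A rough sketch of what an agentic search loop looks like, and how retrieval F1 is scored, may help here. This is not Chroma's implementation: `CORPUS`, `toy_policy`, `keyword_search`, and `regex_search` are hypothetical stand-ins, and the fixed plan inside `toy_policy` replaces the model that would normally choose each tool call and decide when to stop.

```python
import re

# Hypothetical three-document corpus, used only for illustration.
CORPUS = {
    "doc1": "Invoice 4417 was issued to Acme on 2024-03-02.",
    "doc2": "Acme's refund policy allows returns within 30 days.",
    "doc3": "Invoice 4417 was refunded in full on 2024-03-20.",
}


def keyword_search(query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    return [d for d, text in CORPUS.items()
            if all(t in text.lower() for t in terms)]


def regex_search(pattern):
    """Return ids of documents matching a regular expression."""
    return [d for d, text in CORPUS.items() if re.search(pattern, text)]


def toy_policy(question, found, step):
    """Stand-in for the search model: pick the next tool call or stop."""
    plan = [("keyword", "invoice 4417"), ("regex", r"refund(ed)?\b")]
    if step >= len(plan):
        return ("stop", None)
    return plan[step]


def agentic_search(question, policy, max_steps=5):
    """Loop: ask the policy for a tool call, run it, collect new hits."""
    found = []
    for step in range(max_steps):
        action, arg = policy(question, found, step)
        if action == "stop":
            break
        hits = keyword_search(arg) if action == "keyword" else regex_search(arg)
        for doc_id in hits:
            if doc_id not in found:
                found.append(doc_id)
    return found


def f1_score(retrieved, relevant):
    """Harmonic mean of retrieval precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(retrieved)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)


found = agentic_search("Was invoice 4417 refunded?", toy_policy)
score = f1_score(found, relevant=["doc1", "doc3"])
print(found, round(score, 2))  # ['doc1', 'doc3', 'doc2'] 0.8
```

The regex step also pulls in the refund-policy document, which costs precision: recall is 1.0 but F1 drops to 0.8. That is the trade-off a stopping decision has to manage.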

AI value chain economics

Stanford: 75% of AI revenue flows to chips, leaving applications structurally unprofitable (Stanford Online)
TL;DW
Economic value in AI is concentrating in chips ($300B+ revenue), not applications—Nvidia's data center business earns ~75% gross margins while app-layer profitability ranges 0-30%.
The AI value triangle resembles cloud economics in 2004; in the AWS era it took eight years (2004-2012) for dominance to flip from infrastructure to applications, and the AI inversion may take longer because of substrate complexity.
Inference workloads account for ~40% of Nvidia GPU fleet utilization and training for ~60%; the inference share is expected to grow as applications scale, unlocking profitability.
ChatGPT reached ~1 billion users, monetized at $10/user/year; by comparison, Alphabet monetizes 4 billion users at $100/user/year. Growth requires expanding beyond knowledge work into a mandatory daily utility.
Hyperscaler ASICs (Google TPU, Meta MTIA, Amazon, OpenAI efforts) represent the likeliest repricing catalyst for the semis layer; breakthrough success would reshape dominance.
Vertical integration winners: Google (internet), Apple (mobile), Meta (social), but cloud remains heterogeneous—full vertical integration may be necessary for AI super-cycle dominance.
Revenue concentration risk: the last two years added $350B to the AI ecosystem; 75% accrued to semis, and 90% of applications revenue is split between two companies (OpenAI, Anthropic/Google).
AI application monetization will likely shift from subscriptions ($10/user) to intent-based advertising with better attribution and pricing than mobile ads—no model yet proven on phone-scale consumer AI.
Infrastructure quality is the bottleneck for the apps layer: the hardest part of the AI ecosystem is getting the substrate right, and until that is solved, profitability remains trapped in chips.
Feature versus platform distinction critical for infrastructure startups: companies solving narrow problems risk becoming AWS features rather than standalone businesses.

Maps the generative AI value chain across semiconductors, infrastructure, and applications, showing $350B in new revenue concentrated at Nvidia despite 10x application growth over two years. Covers why near-zero marginal cost breaks down when serving users burns GPU compute, and what conditions—custom ASICs, inference dominance, hyperscaler integration—could reprice the stack.
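
The monetization gap quoted above is easier to see as arithmetic. The sketch below uses only the round figures given in this summary (user counts, per-user revenue, the $350B and 75% numbers); they are the talk's approximations, not audited financials, and `annual_revenue` is just an illustrative helper.

```python
def annual_revenue(users, revenue_per_user):
    """Implied annual revenue in dollars from users and per-user monetization."""
    return users * revenue_per_user


# Round figures as quoted in the summary.
chatgpt = annual_revenue(1_000_000_000, 10)    # ~1B users at $10/user/year
alphabet = annual_revenue(4_000_000_000, 100)  # 4B users at $100/user/year

gap = alphabet / chatgpt  # 4x from user count times 10x from per-user revenue

# Share of the $350B in new ecosystem revenue that accrued to semis (75%).
semis_billions = 0.75 * 350

print(f"ChatGPT:  ${chatgpt / 1e9:.0f}B per year")   # $10B
print(f"Alphabet: ${alphabet / 1e9:.0f}B per year")  # $400B
print(f"Gap: {gap:.0f}x")                            # 40x
print(f"Semis:    ${semis_billions}B of $350B")      # $262.5B
```

On these numbers the 40x gap is a 4x difference in reach multiplied by a 10x difference in revenue per user, so most of it comes from monetization depth.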

parametric vs in-context generalization

Stanford study finds LLMs reverse and compose facts in-context but fail when fine-tuned (Stanford Online)
TL;DW
Language models generalize better from in-context information than from fine-tuned parameters—reversals and syllogisms hit 99% accuracy in context but near chance after fine-tuning.
The reversal curse persists even when training models from scratch on synthetic data, showing it's a fundamental limitation of parametric learning, not just fine-tuning.
In-context learning succeeds because structural patterns (reversals, logical implications) appear frequently in natural training data, making them learnable as flexible procedures.
Parametric learning ties knowledge to explicit surface forms in training data, while in-context learning preserves richer detail that enables flexible reuse.
Offline data augmentation—using the model's in-context reasoning to generate latent inferences and adding them to training data—matches or exceeds pure in-context performance.
Test-time episodic retrieval (bringing learned documents back into context via oracle memory) restores generalization on reversal and other latent-structure tasks.
Reinforcement learning can train models to regenerate needed information via chain-of-thought, generalizing reasoning patterns to new domains, though it struggles with symmetry-breaking tasks like reversals.
These three methods trade off compute cost: offline augmentation is expensive to train but cheap at test time; test-time retrieval is cheap to train but expensive at inference.
Hippocampus-like episodic memory and cortex-like parametric learning are complementary in humans—fast episodic storage plus slow parametric integration prevents interference while enabling rapid single-trial learning.
Parametric systems need statistical structure to constrain inference efficiently; pure symbolic reasoning is not feasible for real-world generalization, so models must learn content-sensitive reasoning patterns.

Controlled experiments on facts, syllogisms, and encodings show fine-tuned models fail to reverse relations or compose logical chains, while the same models nearly ace both tasks given the data in context. Three mitigations tested: offline data augmentation, episodic retrieval at inference time, and RL-driven regeneration, each trading training cost for inference cost.
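
To illustrate the offline augmentation idea, here is a minimal sketch that writes each fact's reversal into the training set so the reverse direction appears as an explicit surface form. In the study the model's own in-context reasoning generates these inferences; the fixed `INVERSE` table below is a rule-based substitute, and the names and example fact are invented for illustration.

```python
# Hypothetical inverse-relation table; a real pipeline would prompt the model
# in context to produce the reversed or implied statements instead.
INVERSE = {
    "is the parent of": "is the child of",
    "is the author of": "was written by",
}


def augment_with_reversals(facts):
    """Return training sentences: each fact plus its reversal when known.

    facts: list of (subject, relation, object) triples.
    """
    augmented = []
    for subj, rel, obj in facts:
        augmented.append(f"{subj} {rel} {obj}.")
        inverse = INVERSE.get(rel)
        if inverse is not None:
            # Add the reverse direction explicitly, since fine-tuning on the
            # forward statement alone does not teach it.
            augmented.append(f"{obj} {inverse} {subj}.")
    return augmented


training_set = augment_with_reversals([("Ana", "is the parent of", "Ben")])
print(training_set)
# ['Ana is the parent of Ben.', 'Ben is the child of Ana.']
```

The extra sentences are paid for once at training time, which matches the trade-off described above: more expensive to train, no extra cost at inference.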

agent reliability vs capability

AWS bets enterprise agentic AI is gated on defect rates, not model capability (DeepLearning.AI)
TL;DW
AWS believes defect rate improvements, not frontier model advances, will unlock enterprise agentic AI adoption and value creation.
Low-frequency, low-consequence defects represent the key opportunity; high-consequence defects require expert-only fixes and severely limit scale.
Hydro (Rust framework) enables agents to build correct-by-construction distributed systems, addressing models' weakness in concurrency and failure reasoning.
Cedar policy language uses formal reasoning to make authorization correct-by-construction, reducing defects in critical control systems.
Auto-formalization converts natural language specifications into mathematically precise Cedar or Lean code through interactive conversation with customers.
Deterministic agent steering via pre- and post-conditions on tool calls balances model creativity with mathematically precise behavioral constraints.
Benchmarks should measure failure severity and customer impact, not just failure density; replace metrics like pass@10.
Industry needs end-to-end evaluation including operational properties—performance, cost, durability, availability—not just raw model capability.
AWS open-sourced Trusted Remote Execution, constraining agent-built cloud operation scripts with formal Cedar policies in production.
Investing in failure understanding and mitigation is as important as improving best-case performance; culture must treat worst days seriously.

Marc Brooker maps agent failures into four quadrants by frequency and severity, argues only low-frequency, low-consequence errors have real enterprise TAM, and outlines AWS investments in correct-by-construction frameworks (Hydro, Cedar), automated reasoning, and deterministic agent steering to get there.
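
Deterministic steering with pre- and post-conditions on tool calls can be sketched as a wrapper around each tool. This is an assumption-laden illustration, not AWS's mechanism and not Cedar: `guarded`, `SteeringViolation`, and the `stop_instances` tool are hypothetical names, and the conditions are ordinary Python predicates rather than formally verified policies.

```python
import functools


class SteeringViolation(Exception):
    """Raised when a tool call breaks a declared pre- or post-condition."""


def guarded(pre, post):
    """Wrap a tool so that every call is checked before and after it runs."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise SteeringViolation(f"precondition failed for {tool.__name__}")
            result = tool(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise SteeringViolation(f"postcondition failed for {tool.__name__}")
            return result
        return wrapper
    return decorator


@guarded(
    # Pre: the agent may stop at most two instances per call.
    pre=lambda instances, count: 0 < count <= 2,
    # Post: at least one instance must still be running afterwards.
    post=lambda result, instances, count: len(result) >= 1,
)
def stop_instances(instances, count):
    """Hypothetical tool: stop `count` instances, return those still running."""
    return instances[count:]


print(stop_instances(["a", "b", "c"], 1))  # ['b', 'c']
```

The agent stays free to choose which tool to call and with what arguments, but any call outside the declared bounds fails deterministically. Because the post-condition here is checked after the tool has run, a real system would presumably need the tool to be a dry run or reversible for the check to prevent harm.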

agentic AI compressing development cycles

AMD uses AI agents to ship a GPU instruction translator in 48 hours instead of years (DeepLearning.AI)
TL;DW
K-shaped future of software: top arm (systems thinking, problem framing) accelerates 100x; bottom arm (language syntax, specific coding knowledge) commoditizes via AI agents.
Intent velocity—speed from idea to production—is the core metric, not lines of code. Measure business outcomes, not throughput.
Winners operate agents in parallel autonomously (nights, during meetings); set intent once, unlock multiple agents to run simultaneously.
GEEK agent autonomously optimizes customer software performance non-stop; customers serve tokens faster without manual intervention.
GPU-to-GPU instruction translation (Rosetta) took 48 hours with AI vs. 4-5 years and 200-300 engineers traditionally; now shipping in production.
The llama.cpp runtime now moves tensors between CPU, GPU, and NPU with zero overhead, enabling full silicon utilization on laptops.
AMD built the world's fastest open-source tokenizer (200K lines, one engineer); its code becomes pre-training data for future models, creating a compounding flywheel.
Agents now handle continuous monitoring: auto-recreate bugs, file PRs, fix code, validate tests, commit if CI passes—no human involvement needed.
Frame problems clearly and guide AI agents; the limit is no longer coding capacity but the ability to think forward from first principles.
AI transitions happen in weeks/months, not years—speed and adaptability now essential; leaders must upskill on agentic AI while helping teams transition.

Anush Elangovan details four AMD projects where agentic AI collapsed multi-year cycles: a Rosetta-like GPU instruction translator built in 48 hours, an autonomous performance optimizer, seamless CPU-GPU-NPU tensor movement, and a high-speed tokenizer. The competitive shift moves from syntax knowledge to systems thinking and intent velocity.