Sunday, May 24, 2026 — front page

heterogeneous agent economics

Callosum beats GPT-4 vision benchmarks by 18-25% with heterogeneous agents at 18x lower cost AI Engineer
TL;DW
Heterogeneous agent orchestration (mixing different model sizes and architectures) outperforms GPT-4 and Gemini 2.5 on visual web navigation by 18–25% while being 18× cheaper.
Heterogeneous recursion maps sub-context to different models/chips instead of recursive calls on identical hardware, achieving 7–12× cost reduction and 3–5× speedup vs. frontier models on long-context tasks.
Real-world problems decompose into sub-problems that need very different kinds of intelligence, so scaling a single model is inefficient; the talk argues that heterogeneous multi-agent systems are mathematically proven superior, citing neuroscience, economics, and ecology.
Task decomposition enables large efficiency gains: zooming subtasks run 11× faster and 43× cheaper on lightweight models than on ChatGPT, which adds up to 3.7× overall cost savings.
Automation layer now detects task complexity and predicts optimal model/hardware pairing rather than hardcoded mappings, enabling dynamic heterogeneous routing.
New silicon (Cerebras, SambaNova) lacks a unifying interface to current compute stacks; heterogeneous orchestration addresses this by mapping workloads to the best available hardware.
Third era of compute scales heterogeneously across models, workflows, and silicon co-evolution—replacing CPU acceleration and GPU parallelization paradigms entirely.
Mixture-of-experts architectures, multi-agent workflows, and pre-fill/decode disaggregation represent mild heterogeneity; full heterogeneity requires vertical integration of intelligence and hardware.
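The per-subtask and overall cost figures in the list above can be related with Amdahl-style arithmetic. The sketch below assumes, purely for illustration, that zooming subtasks are the only part of the workload moved to a cheaper model; under that assumption, the 43× and 3.7× figures imply what share of the original cost those subtasks carried. The function names are hypothetical and the talk may account for costs differently.

```python
def overall_saving(offloaded_share: float, subtask_saving: float) -> float:
    """Overall cost reduction when a share of the cost becomes subtask_saving times cheaper."""
    remaining = (1.0 - offloaded_share) + offloaded_share / subtask_saving
    return 1.0 / remaining


def implied_share(overall: float, subtask_saving: float) -> float:
    """Invert overall_saving: the share of original cost that must have been offloaded."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / subtask_saving)


# Figures quoted in the summary: 43x cheaper subtasks, 3.7x overall saving.
share = implied_share(overall=3.7, subtask_saving=43.0)
print(f"implied share of cost in zooming subtasks: {share:.0%}")
print(f"check: overall saving = {overall_saving(share, 43.0):.1f}x")
```

Under this single-subtask assumption, roughly three quarters of the original spend would have been on zooming, which is why a large per-subtask saving translates into a much smaller overall one.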

Adrian Bertagnoli demos two systems: heterogeneous recursion maps LLM calls to different models and chips for 7-12x cost reduction on long-context tasks; visual web navigation mixes video-action-language models to outperform GPT-4 by 18% and Gemini 2.5 by 25%, routing simpler subtasks like zooming to smaller models for an 11x speedup.
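As a rough illustration of routing simpler subtasks to smaller models, here is a minimal sketch that picks the cheapest option able to handle a subtask. Everything in it is an assumption for the example: the model labels, the capability levels, the relative costs, and the lookup table standing in for the complexity predictor the talk describes. It is not Callosum's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str             # hypothetical label, not a real product
    capability: int       # highest complexity level this option handles (1 to 3)
    relative_cost: float  # cost per call relative to the largest model


# Illustrative catalogue; names and numbers are made up for the sketch.
CATALOGUE = [
    ModelOption("small-vision", capability=1, relative_cost=0.02),
    ModelOption("mid-multimodal", capability=2, relative_cost=0.2),
    ModelOption("frontier", capability=3, relative_cost=1.0),
]

# Toy lookup standing in for a learned task-complexity predictor.
SUBTASK_COMPLEXITY = {"zoom": 1, "click": 1, "read_page": 2, "plan": 3}


def route(subtask: str) -> ModelOption:
    """Pick the cheapest option whose capability covers the subtask."""
    level = SUBTASK_COMPLEXITY.get(subtask, 3)  # unknown subtasks go to the top tier
    eligible = [m for m in CATALOGUE if m.capability >= level]
    return min(eligible, key=lambda m: m.relative_cost)


for task in ["zoom", "read_page", "plan", "unfamiliar_task"]:
    print(task, "->", route(task).name)
```

Sending unknown subtasks to the most capable tier is a conservative default chosen for the sketch; a production router would presumably learn this mapping rather than hardcode it.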

production agent trust and precision

Google's on-call LLM agents optimize for precision over coverage to earn operator trust DevOpsDays Zurich
TL;DW
Prioritize precision over coverage in agent-based ticket triage—teams request more coverage only after trusting high-precision automation, not before.
Run agent actions in dry-run mode for extended periods before production deployment to build operator confidence and avoid rogue comments that worsen workload.
Overfitting is a major risk when autogenerating skills from only a few tickets; keep humans in the loop during skill creation and maintain continuous feedback loops from live handling.
Use cron jobs aligned with on-call shift starts so engineers see pre-filtered, relevant ticket queues rather than accumulated noise from the previous shift.
Agents should only read production data (logs, monitoring) and create change lists—never mutate infrastructure without human oversight and monitoring.
Cultural shift needed: on-callers must validate that agent responses are adequate, not just fix alerts; responsibility expands from alert management to response quality.
Frame ticket automation as a temporary band-aid that frees engineering capacity to fix root causes, not as a permanent noise-handling solution.
Start slow, iterate quickly on team feedback, and continuously deliver small wins to maintain adoption momentum without losing trust by moving too fast or too slowly.
Prepare for adoption success early with a self-service approach and canned response templates; without that preparation, four engineers spent their mornings answering adoption requests.
Collaboration across teams and alignment on shared values (eliminating soul-crushing work) drives better results than individual "winning" in AI implementation space.
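The precision-versus-coverage trade-off above can be made concrete with a small calculation over dry-run records. This is a sketch under stated assumptions: the record fields, the sample data, and the 0.95 go-live threshold are all hypothetical, and nothing here reflects Google's internal tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DryRunRecord:
    ticket_id: str
    agent_acted: bool     # did the agent propose an action during dry-run?
    action_correct: bool  # operator verdict; only meaningful when agent_acted is True


def precision_and_coverage(records: List[DryRunRecord]) -> Tuple[float, float]:
    """Precision: correct actions / actions taken. Coverage: actions taken / all tickets."""
    acted = [r for r in records if r.agent_acted]
    correct = [r for r in acted if r.action_correct]
    precision = len(correct) / len(acted) if acted else 0.0
    coverage = len(acted) / len(records) if records else 0.0
    return precision, coverage


# Hypothetical dry-run log: the agent touched 4 of 10 tickets and was right each time.
log = [DryRunRecord(f"T{i}", agent_acted=i < 4, action_correct=i < 4) for i in range(10)]
precision, coverage = precision_and_coverage(log)

MIN_PRECISION = 0.95  # assumed threshold for leaving dry-run mode
print(f"precision={precision:.2f} coverage={coverage:.2f}")
print("ready to leave dry-run:", precision >= MIN_PRECISION)
```

In this toy log the agent covers only 40% of tickets yet passes the gate, which mirrors the advice to earn trust on a narrow slice before widening it.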

Maria Henrika Peetz details how Google automated repetitive ticket triage by targeting only well-understood ticket types where high precision is achievable—fetching logs, checking monitoring—while ignoring the rest. Dry-run periods showed premature agent actions eroded trust, making precision the primary metric over speed or coverage.
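A minimal sketch of the "well-understood ticket types only, dry-run first" pattern described above. The ticket types, field names, and comment text are hypothetical; the point is only that unknown types are ignored and that dry-run mode records what the agent would have done without posting anything.

```python
# Hypothetical allowlist of ticket types the agent is trusted to handle.
HANDLED_TYPES = {"disk_full", "cert_expiry"}


def triage(ticket: dict, dry_run: bool = True) -> dict:
    """Return the agent's decision for one ticket; never mutates infrastructure."""
    if ticket["type"] not in HANDLED_TYPES:
        # Outside the well-understood set: leave the ticket for a human.
        return {"ticket": ticket["id"], "action": "ignore"}
    comment = f"Collected logs and monitoring links for {ticket['id']}"
    if dry_run:
        # Record the intended comment for operator review instead of posting it.
        return {"ticket": ticket["id"], "action": "would_comment", "comment": comment}
    return {"ticket": ticket["id"], "action": "comment", "comment": comment}


queue = [
    {"id": "T1", "type": "disk_full"},
    {"id": "T2", "type": "mystery_latency"},
]
for decision in (triage(t) for t in queue):
    print(decision)
```

Defaulting `dry_run` to `True` means a caller has to opt in to real comments, which fits the talk's emphasis on avoiding premature agent actions.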

AI productivity claims vs evidence

Independent research finds AI coding tools deliver 4% productivity gain, not 55% GOTO Conferences
TL;DW
Study claiming 55.8% AI productivity gains lacks credibility; follow-up research found only 4% boost and zero significant labor market impact on earnings or hours.
Reasoning models perform worse on high-complexity tasks, taking orders of magnitude longer; agents excel only in low-medium complexity tasks in well-tested, debt-free codebases.
57% of code written with AI copilot tools is involved in bugs; code churn, duplication, and refactoring activity all increased significantly since AI adoption.
AI-generated "workslop" that masquerades as quality work reduces trust: 53% report annoyance at receiving it, and 50% view colleagues who send it as less creative, capable, and trustworthy.
Writing a 100-word email with AI consumes 140 watt-hours of energy (seven phone charges); training GPT-4 used 50 gigawatt-hours—equivalent to 6,000 US homes' annual consumption.
Stop automating broken processes with AI; eliminate them instead. Adding AI to dysfunctional workflows creates insatiable demand for more reports, not solutions.
No customer is asking for AI chatbots, AI emails, or AI interaction; talk directly to users about what's actually hard, slow, and painful before building anything.
Organizations need pioneers (ideators), settlers (productizers), and town planners (commoditizers), but asking one person to fill all three roles guarantees failure.
Context-switching across multiple projects kills shipping; the best way to fail at inventing something is making it a part-time job alongside existing responsibilities.
Build small, ship fast to production, measure actual user behavior, and roll back quickly—then market working solutions as AI-powered to capitalize on hype without chasing false productivity claims.

Rasmus Lystrøm contrasts vendor-cited efficiency claims against recent independent studies showing only 4% improvement, 57% of AI-assisted code involving bugs, and reasoning models performing worse on complex tasks. Also covers trust erosion from code quality degradation and GPT-4 training consuming energy equivalent to 6,000 US homes.
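For a sense of scale, the energy figures quoted in the talk summary can be unpacked with simple unit arithmetic. The sketch takes the cited numbers at face value and does not verify them; it only shows what per-charge and per-home values they imply.

```python
# Figures as quoted in the summary above.
EMAIL_WH = 140        # watt-hours for one 100-word AI-written email
PHONE_CHARGES = 7     # phone charges the email is said to equal
TRAINING_GWH = 50     # gigawatt-hours for training GPT-4
HOMES = 6_000         # US homes' annual consumption it is compared to

WH_PER_GWH = 1_000_000_000
KWH_PER_GWH = 1_000_000

wh_per_phone_charge = EMAIL_WH / PHONE_CHARGES
kwh_per_home_year = TRAINING_GWH * KWH_PER_GWH / HOMES
emails_per_training_run = TRAINING_GWH * WH_PER_GWH / EMAIL_WH

print(f"implied energy per phone charge: {wh_per_phone_charge:.0f} Wh")
print(f"implied annual use per home: {kwh_per_home_year:,.0f} kWh")
print(f"training energy in 100-word emails: {emails_per_training_run:,.0f}")
```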

agentic coding quality debt

Agentic AI velocity gains vanish within 2 months without code health above 9.5 JFokus
TL;DW
AI coding delivers 2-3x task speedup, but initial velocity gains disappear after 2 months due to AI-induced code complexity if code health isn't maintained.
Healthy code (code health score 10) reduces AI defect rates dramatically; unhealthy code (below 9) causes AI break rates to escalate beyond acceptable levels and increases defects by 60%.
Average enterprise codebase has code health of 5.15—far below the 9.5 minimum needed for AI safety; legacy code will bottleneck agentic adoption without uplift.
AI frequently generates code with low modularity, deep nesting, missing error handling, and poor structure—unhealthy code it cannot reliably maintain or extend itself.
Use MCP servers integrated with AI assistants to enforce code health checks automatically; with feedback loops, AI fixed 90-100% of code health issues versus only 50-55% without guidance.
Require 100% code coverage on new and modified code as well as the existing codebase, both to prevent AI from deleting failing tests and to ensure changes are verified; coverage became one of the speaker's most important KPIs.
Focus manual code review on tests, not implementation; define specifications as executable test code first, then trust automated safeguards (MCP, linting) for implementation verification.
Healthy code reduces token consumption by 29-50% compared to unhealthy code for identical tasks; as token pricing increases, code health becomes a financial imperative.
Architectural design principles (CLEAR framework) must complement code health to limit blast radius during evolution and enable safe agentic architecture at scale—still largely unsolved.
The majority of software costs (up to 95%) occur after first release during evolution and maintenance, where code quality and architecture determine success with agentic tools.
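One way to act on the 9.5 threshold from the list above is to gate agentic work per file. The sketch below is an assumption-laden illustration: the file paths and scores are invented, and the scores would have to come from whatever code health tool a team already uses.

```python
from typing import Dict, List

MIN_HEALTH_FOR_AGENTS = 9.5  # threshold quoted in the talk summary


def plan_work(file_health: Dict[str, float]) -> Dict[str, List[str]]:
    """Split files into those safe for agentic edits and those needing uplift first."""
    plan: Dict[str, List[str]] = {"agent_ok": [], "uplift_first": []}
    for path, score in sorted(file_health.items()):
        bucket = "agent_ok" if score >= MIN_HEALTH_FOR_AGENTS else "uplift_first"
        plan[bucket].append(path)
    return plan


# Hypothetical scores; 5.15 matches the enterprise average cited above.
scores = {"billing/invoice.py": 5.15, "auth/token.py": 9.8, "api/routes.py": 9.5}
plan = plan_work(scores)
print("agents may edit:", plan["agent_ok"])
print("refactor before agent use:", plan["uplift_first"])
```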

Adam Tornhill presents research showing that 2-3x task speed gains evaporate within about two months as AI-induced complexity accumulates. Covers three mitigations: MCP server health enforcement, mandatory 100% test coverage, and CLEAR architectural principles, plus evidence that healthy code cuts token consumption by 29-50%.
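The feedback-loop mitigation can be pictured as a check-and-revise cycle: run a health check, hand the findings back to the assistant, and repeat until the check is clean or a round limit is hit. The sketch below uses toy stand-ins for both the checker and the assistant; it does not use any MCP API, and the function names, issue labels, and round limit are all hypothetical.

```python
from typing import Callable, List, Tuple


def fix_with_feedback(
    code: str,
    check: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str], int]:
    """Revise code until check reports no issues or max_rounds is reached."""
    issues = check(code)
    rounds = 0
    while issues and rounds < max_rounds:
        code = revise(code, issues)
        rounds += 1
        issues = check(code)
    return code, issues, rounds


def toy_check(code: str) -> List[str]:
    """Stand-in for a code health tool: flags two simple smells."""
    issues = []
    if "except: pass" in code:
        issues.append("swallowed exception")
    if "\t" in code:
        issues.append("tab indentation")
    return issues


def toy_revise(code: str, issues: List[str]) -> str:
    """Stand-in for the AI assistant: fixes only the first reported issue per round."""
    if issues[0] == "swallowed exception":
        return code.replace("except: pass", "except ValueError: raise")
    return code.replace("\t", "    ")


original = "try:\n\tx = int(s)\nexcept: pass\n"
fixed, remaining, rounds = fix_with_feedback(original, toy_check, toy_revise)
print(f"rounds used: {rounds}, remaining issues: {remaining}")
```

Returning the remaining issues alongside the code keeps the loop honest: if the round limit is hit first, the caller can see what is still unresolved and escalate to a human.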