Callosum beats GPT-4 and Gemini 2.5 on visual web navigation by 18-25% with heterogeneous agents at 18x lower cost

AI Engineer

Adrian Bertagnoli demos two systems. Heterogeneous recursion maps LLM calls to different models and chips for a 7-12x cost reduction on long-context tasks. Visual web navigation mixes video-action-language models to outperform GPT-4 by 18% and Gemini 2.5 by 25%, routing simpler subtasks such as zooming to smaller models for an 11x speedup.
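
The routing idea in this talk can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the tier names, the relative costs, and the set of "simple" subtasks are hypothetical and are not taken from the talk or from Callosum's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real model ID
    cost_per_call: float  # illustrative relative cost units


# Assumed tiers: a cheap small model and a large model costing 10x more.
SMALL = ModelTier("small-model", 1.0)
LARGE = ModelTier("large-model", 10.0)

# Assumed split between simple and complex subtasks.
SIMPLE_SUBTASKS = {"zoom", "scroll", "click"}


def route(subtask: str) -> ModelTier:
    """Send simple subtasks to the small tier, everything else to the large one."""
    return SMALL if subtask in SIMPLE_SUBTASKS else LARGE


def plan_cost(subtasks: list[str]) -> float:
    """Total relative cost of a plan when each step is routed by type."""
    return sum(route(s).cost_per_call for s in subtasks)


plan = ["zoom", "scroll", "read_page", "click", "decide_next_step"]
routed_cost = plan_cost(plan)                     # mixed tiers
baseline_cost = LARGE.cost_per_call * len(plan)   # every step on the large tier
print(f"routed={routed_cost}, baseline={baseline_cost}")
```

The saving in this toy plan comes entirely from how many steps count as simple, so the ratio printed here says nothing about the figures reported in the talk.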

Google's on-call LLM agents optimize for precision over coverage to earn operator trust

DevOpsDays Zurich

Maria Henrika Peetz details how Google automated repetitive ticket triage by targeting only well-understood ticket types where high precision is achievable, with steps such as fetching logs and checking monitoring, and ignoring the rest. Dry-run periods showed that premature agent actions eroded trust, which made precision the primary metric over speed or coverage.
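
The precision-versus-coverage trade-off can be made concrete with a small Python sketch. The ticket counts below are invented for illustration and are not from the talk; `precision` and `coverage` are hypothetical helper names.

```python
def precision(correct_actions: int, attempted: int) -> float:
    """Share of the agent's actions that were correct."""
    return correct_actions / attempted if attempted else 0.0


def coverage(attempted: int, total_tickets: int) -> float:
    """Share of all incoming tickets the agent acted on at all."""
    return attempted / total_tickets if total_tickets else 0.0


TOTAL_TICKETS = 1000  # illustrative dry-run volume

# Narrow agent: acts only on well-understood ticket types.
narrow = {"attempted": 200, "correct": 196}
# Broad agent: acts on most tickets, including poorly understood ones.
broad = {"attempted": 800, "correct": 640}

for label, run in (("narrow", narrow), ("broad", broad)):
    p = precision(run["correct"], run["attempted"])
    c = coverage(run["attempted"], TOTAL_TICKETS)
    wrong = run["attempted"] - run["correct"]
    print(f"{label}: precision={p:.2f} coverage={c:.2f} wrong_actions={wrong}")
```

In this made-up dry run the broad agent handles four times as many tickets but produces 160 wrong actions against 4, which is the kind of outcome that costs operator trust.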

In case you missed them

Independent research finds AI coding tools deliver 4% productivity gain, not 55%

GOTO Conferences

Rasmus Lystrøm contrasts vendor-cited efficiency claims with recent independent studies showing only a 4% improvement, 57% of AI-assisted code containing bugs, and reasoning models performing worse on complex tasks. Also covers trust erosion from code quality degradation and GPT-4 training consuming energy equivalent to 6,000 US homes.

Agentic AI velocity gains vanish within 2 months without code health above 9.5

JFokus

Adam Tornhill presents research showing that 2-3x task speed gains evaporate in weeks as AI-induced complexity accumulates. Covers three mitigations: MCP server health enforcement, mandatory 100% test coverage, and CLEAR architectural principles, plus evidence that healthy code cuts token consumption by 29-50%.