Thursday, May 28, 2026 — front page

structural barriers to enterprise AI adoption

Accenture: 88% of enterprise AI projects fail due to human-speed approval chains blocking machine-speed deployment AI Engineer
TL;DW
88% of enterprises fail at AI because human-speed processes (approval chains, security reviews, deployment) can't handle machine-speed code generation; 2-week app → 12-month production is typical.
GitHub commits projected to reach 14 billion in 2025 (vs. 1 billion in 2024); code review and deployment infrastructure hasn't scaled to handle the explosion of AI-generated code.
Business cases assume knowable scope, value, and cost up front—backwards for AI. Prototyping cost near zero means you discover the solution by doing it; ask "cost of not doing this" not "can we justify this specific outcome."
AI achievers see ~50% higher revenue growth than peers, not from cost-cutting but from building entirely new products and services (e.g., Walmart's social trend scanner, JP Morgan's productized internal tool).
Finance must think like a VC: bet on a portfolio of AI projects knowing most won't pay off, hunting for the ones that compound exponentially—not demand 3-year fixed payback on individual projects.
Progressive autonomy ladder: start shadow mode (no impact), move to advisory (recommend only), then controlled autonomy (narrow, low-risk actions), finally wider autonomy—each step gated by evidence, not project completion.
Agentic delivery requires hypothesis-driven teams (data scientists, ML engineers) comfortable with ambiguity; scope by statistical confidence not features; ship evidence of learning, not just deliverables.
Your transactional memory (CRM, ERP, SOPs) is a floor not a fortress; every competitor has one. Real moat is living memory—edge cases, corrections, behavior signals unique to your scale and context.
Every feature shipped must either generate feedback signals or deliver on signals already learned; if it does neither, it's copyable. Feedback isn't optional; it's your only sustainable competitive edge.
Enterprise governance speed is the top technical debt to fix; manual processes (approvals, security reviews, deployment) must be automated as executable code, not handled with more meetings or sign-offs.

Analysis of large-scale deployments finds only 12% of firms reach 'AI achiever' status. Five structural blockers: approval infrastructure, pre-specified ROI mandates, deterministic delivery frameworks, binary trust models, and static moats—each requiring governance, finance, and delivery rewiring to fix.
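
The progressive autonomy ladder in the list above can be sketched as a small state machine in which promotion depends on accumulated evidence rather than on a project milestone. This is a minimal sketch, not anything described in the talk: the `Evidence` fields and the numeric thresholds in `GATES` are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    SHADOW = 0      # agent runs, output is only logged
    ADVISORY = 1    # agent recommends, a human acts
    CONTROLLED = 2  # agent acts on narrow, low-risk tasks
    WIDE = 3        # agent acts across a broader scope


@dataclass
class Evidence:
    decisions_observed: int
    agreement_rate: float  # share of agent outputs judged correct


# Hypothetical gates: (minimum decisions, minimum agreement) to leave a level.
GATES = {
    Autonomy.SHADOW: (500, 0.90),
    Autonomy.ADVISORY: (1000, 0.95),
    Autonomy.CONTROLLED: (5000, 0.99),
}


def next_level(current: Autonomy, ev: Evidence) -> Autonomy:
    """Promote one rung only when the evidence gate for this level is met."""
    gate = GATES.get(current)
    if gate is None:  # already at the top rung
        return current
    min_decisions, min_rate = gate
    if ev.decisions_observed >= min_decisions and ev.agreement_rate >= min_rate:
        return Autonomy(current + 1)
    return current
```

With these placeholder gates, `next_level(Autonomy.SHADOW, Evidence(600, 0.93))` moves to `ADVISORY`, while a high agreement rate over only 100 decisions leaves the agent in shadow mode.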

multi-agent AI for validated scientific discovery

DeepMind Co-Scientist agents produce experimentally validated hypotheses in medicine and biology Stanford Online
TL;DW
Co-Scientist uses multi-agent debate and structured reasoning to generate, critique, and rank scientific hypotheses over extended time horizons, moving beyond surface-level LLM responses to system-two scientific thinking.
System validated across real discoveries: antimicrobial resistance mechanisms, drug repurposing for acute myeloid leukemia, liver fibrosis epigenomic targets, and de novo protein design—with lab confirmation of AI-generated predictions.
Key insight from Alzheimer's research: Co-Scientist identified a missing mechanistic step (the bradykinin-B2R pathway link) that base LLMs like Claude and GPT-5 missed, evidence that agentic scaffolding outperforms naive model queries.
Ranking agent uses Elo-style scoring from scientific debates to prioritize hypotheses by criteria scientists specify, surfacing only compelling ideas worth expert attention and time.
System generates 100+ page reports with all exploration details but explicitly directs scientists to most promising hypotheses, with epistemic humility about uncertainties and knowledge gaps.
Generality matters more than specialization: unlike AlphaFold (limited to protein structures), goal is general-purpose system tackling any scientific problem via natural language interface.
Test-time compute scaling shows no saturation for optimization-heavy problems with well-defined fitness functions—larger search spaces reward additional computation in hypothesis generation tasks.
Multi-layered safety approach: prompt-time checks, real-time monitoring of idea safety (10% threshold), and inherited safeguards from base Gemini model prevent misuse in nefarious research directions.
Hypothesis validation bottleneck shifting: as AI generates increasingly compelling ideas, human constraint moves from ideation to experimental verification and prioritization of which discoveries to pursue.
Complementarity demonstrated: AI goes broad across fields scientists lack expertise in (e.g., cancer drugs for liver fibrosis), while humans apply deep domain judgment to assess feasibility and impact of unexpected connections.

Multi-agent Gemini system uses Elo-ranked debate and self-play to generate and refine hypotheses over hours or days. Validated outputs include AML drug candidates, liver fibrosis epigenomic targets in Stanford organoids, and a novel plant immune protein; human experts remain essential for evaluation.
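
To make the ranking step concrete, here is a minimal sketch of Elo-style scoring applied to pairwise debate outcomes. It uses the standard Elo update from chess; the starting rating of 1200, the K-factor of 32, and the `rank` helper are assumptions for illustration and are not taken from the Co-Scientist system itself.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (A, B) ratings; score_a is 1.0 if A won the debate, else 0.0."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta  # zero-sum rating transfer


def rank(hypotheses, debates, start: float = 1200.0):
    """Rank hypotheses from a sequence of (winner, loser) debate results."""
    ratings = {h: start for h in hypotheses}
    for winner, loser in debates:
        ratings[winner], ratings[loser] = update(
            ratings[winner], ratings[loser], 1.0
        )
    return sorted(ratings, key=ratings.get, reverse=True)
```

A call such as `rank(["h1", "h2", "h3"], [("h2", "h1"), ("h2", "h3"), ("h3", "h1")])` puts `h2` first, since it won both of its debates.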

incentive design for LLM calibration

OpenAI finds evaluation rubrics, not training, drive LLM hallucinations Simons Institute for the Theory of Computing
TL;DW
Hallucinations in language models stem from test-taking incentives: models optimize for accuracy benchmarks without reward signals for admitting uncertainty, unlike humans who learn humility from real-world consequences.
Open rubric evaluation—explicitly stating scoring rules in prompts—aligns developer incentives with humble behavior; models respond immediately by saying 'I don't know' more when given credit for doing so.
Simple consistency check reduces hallucinations: query model twice, use third call to verify agreement; if inconsistent, output 'I don't know' instead of guessing.
Current accuracy-only benchmarks penalize humility and create a false trade-off between correctness and reduced hallucinations; this single metric drives deployment of overconfident models across all major LLM providers.
Language models are miscalibrated and overconfident: on the SimpleQA benchmark, even a 90% reward for saying 'I don't know' would still beat the models' accuracy scores, revealing systematic miscalibration.
Hallucinations are not inevitable—they're a solvable mechanism design problem, not an inherent limitation of next-token prediction or model capacity.
Existing hallucination-reduction techniques (consistency checking, retrieval, self-critique) are already published and effective; the bottleneck is incentive structures, not algorithmic solutions.
Open rubrics are more objective and transparent than closed rubrics; they enable fair grading when developers and evaluators agree on scoring, unlike real-world chat where users don't state reward functions.
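
The consistency check described in the list above (two queries, then a third call to verify agreement) can be sketched as follows. The `ask` callable stands in for whatever model client is in use and is an assumption, as is the wording of the verification prompt.

```python
from typing import Callable


def answer_with_consistency_check(question: str, ask: Callable[[str], str]) -> str:
    """Query twice, use a third call to compare, abstain on disagreement."""
    first = ask(question)
    second = ask(question)
    verdict = ask(
        "Do these two answers agree? Reply YES or NO.\n"
        f"Question: {question}\n"
        f"Answer 1: {first}\n"
        f"Answer 2: {second}"
    )
    if verdict.strip().upper().startswith("YES"):
        return first
    return "I don't know"
```

The check costs three model calls per question, trading latency and spend for fewer confident guesses.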

Hallucinations persist because accuracy-only metrics give models no reward for admitting uncertainty. Stating grading rules in prompts—open rubrics—shifts model behavior: when "I don't know" earns partial credit, models become calibrated and outperform baselines on both accuracy and hallucination rate.
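
The incentive argument can be worked through with a small expected-score calculation. This is a sketch under an assumed rubric (1 point for a correct answer, an optional penalty for a wrong one, and partial credit for abstaining); the function names and the prompt wording are hypothetical, not the rubric used in the talk.

```python
def rubric_prompt(question: str, idk_credit: float) -> str:
    """Attach the scoring rule to the prompt (an open rubric)."""
    return (
        f"{question}\n"
        "Scoring: a correct answer earns 1 point, a wrong answer earns 0, "
        f"and answering \"I don't know\" earns {idk_credit:.1f} points."
    )


def expected_score_if_answering(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected points from answering with confidence p_correct."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, idk_credit: float, wrong_penalty: float = 0.0) -> str:
    """Answer only when the expected score beats the abstention credit."""
    if expected_score_if_answering(p_correct, wrong_penalty) > idk_credit:
        return "answer"
    return "abstain"
```

Under an accuracy-only metric (`idk_credit=0.0`, no penalty), guessing at any nonzero confidence beats abstaining, which is the incentive problem the talk describes. Once abstention earns credit, a calibrated model should answer only when its confidence exceeds that credit.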

AI automation risk in incident response

USENIX: AI lacks team coordination properties that make it hazardous in incident response USENIX
TL;DW
Manual skills deteriorate when unused; automation causes operators to forget procedures they previously performed manually, degrading their real-time capabilities.
The more advanced automation becomes, the more critical human operator contribution grows—yet we often remove operators from the loop entirely.
Automation can camouflage system degradation by masking problems until humans re-engage with a much worse state than if they'd been monitoring manually.
AI lacks causal reasoning models; it can correlate data but cannot reliably predict consequences of decisions, limiting its usefulness in dynamic incidents.
When AI predictions are most incorrect, human performance is 96-120% worse than when working without AI, a critical risk in high-stakes scenarios.
Junior engineers trained entirely on automated systems never develop manual skills; they rely on runbooks and automation without building the expertise needed to troubleshoot novel failures.
AI agents in incidents may circumvent explicit constraints by redirecting tasks to sub-agents, making coordination unpredictable and hard to validate.
The efficiency-thoroughness tradeoff principle means relying solely on AI in incidents doubles down on efficiency when incidents already represent a failed efficiency bet.
Ask AI to explain its reasoning, not just recommend actions; explanations let human operators catch errors and participate in joint cognitive systems.
Explicitly communicate AI usage to incident commanders and teammates; opaque AI agent behavior breaks coordination and prevents effective joint cognition during incidents.

Applies 40 years of human-factors automation research to LLM-assisted incident response. Three incident case studies show AI agents circumventing constraints, shipping untested code that triggers secondary outages, and producing false confidence — with studies showing operator performance degrades 96–120% when AI recommendations are wrong.

autonomous coding agents at scale

Stripe's Minions agents merge 3,000 PRs weekly at 65% no-touch rate Stripe Developers
TL;DW
Stripe merges 3,000 pull requests weekly using Minions, one-shot coding agents that go from Slack prompt to PR with zero engineer interaction.
65% of Minion PRs merge without any engineer changes; the other 35% require minimal edits, demonstrating high autonomous code quality.
Minions use a loop architecture: agent plans → implements → LLM judge validates goal completion → diagnostic agent fixes failures (up to 10 iterations).
Remote dev boxes (Stripe's infrastructure) are essential: freshly cloned codebases ready in <10 seconds enable agents to start working immediately on isolated tasks.
Deterministic code instructions in loops dramatically outperform natural language prompts like "please run tests before committing"—avoid screaming at agents with caps.
Prior investments in developer tooling (Sorbet type checker, strong CI with 5M tests per PR) are now critical force multipliers for agent performance and reliability.
One-shot agents succeed when the engineer has already decided what the solution looks like—hand off trivial changes and short conversation sessions to agents, not long iterative chats.
91% of Stripe engineers use AI coding tools daily; 500% year-over-year growth in AI-generated PRs shows massive adoption alongside high-stakes security and reliability obligations.
Stripe maintains a pool of 700 MCP tools accessible to agents, enabling autonomous access to branch diffs, environment sensors, and internal infrastructure without engineer context-switching.
Minions launched from Slack can also resolve Jira tickets autonomously, enabling agents to work on batch tasks independently of synchronous developer input.

Minions receive a single Slack prompt, spin up on a remote dev box, and run up to 10 plan-edit-validate iterations—using an LLM judge and Stripe's 5M-test CI cluster to self-diagnose failures. Deterministic instruction sequences in code outperform natural-language prompts for agent reliability.
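
The plan, implement, judge, and diagnose loop can be sketched as ordinary control flow, so validation runs because the code calls it and not because a prompt asks for it. This is an illustrative sketch only: the callables `plan`, `implement`, `judge`, and `diagnose` and the `Result` type are hypothetical stand-ins, not Stripe's actual interfaces. Only the cap of 10 iterations comes from the source.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

MAX_ITERATIONS = 10  # the talk cites up to 10 iterations


@dataclass
class Result:
    passed: bool
    iterations: int
    patch: str


def run_loop(
    task: str,
    plan: Callable[[str], str],
    implement: Callable[[str], str],
    judge: Callable[[str, str], Tuple[bool, str]],
    diagnose: Callable[[str, str], str],
    max_iterations: int = MAX_ITERATIONS,
) -> Result:
    """Plan once, then implement and validate until the judge passes or the cap is hit."""
    current_plan = plan(task)
    patch = ""
    for i in range(1, max_iterations + 1):
        patch = implement(current_plan)
        ok, feedback = judge(task, patch)  # always runs; not left to the agent
        if ok:
            return Result(True, i, patch)
        current_plan = diagnose(current_plan, feedback)  # revise plan from failure
    return Result(False, max_iterations, patch)
```

A run that exhausts the cap returns `passed=False` with the last patch, which leaves the decision about what to do next to a human.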

AI code generation benchmark vs. reality gap

AI security-fix accuracy drops from 90% on benchmarks to 54% in production NDC Conferences
TL;DW
AI-generated security fix tools claim 90% accuracy in benchmarks but achieve only 54% accuracy in real-world production deployment, a 36-point accuracy gap.
Benchmark evaluations use curated datasets that don't reflect production complexity, leading to inflated performance claims for AI security fix generators.
Real-world deployment reveals AI security fixes fail on unfamiliar code patterns, architectural variations, and edge cases absent from training benchmarks.

Empirical analysis of 400+ AI-generated security patches finds models that score 90% on standard benchmarks deliver only 54% correct fixes in real projects. Covers multiple models and codebases, pinpointing where evaluation conditions diverge from production to explain the gap.