DeepMind Co-Scientist agents produce experimentally validated hypotheses in medicine and biology

Stanford Online

Multi-agent Gemini system uses Elo-ranked debate and self-play to generate and refine hypotheses over hours or days. Validated outputs include AML drug candidates, liver fibrosis epigenomic targets in Stanford organoids, and a novel plant immune protein; human experts remain essential for evaluation.
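
For readers unfamiliar with Elo ranking, here is a minimal sketch of the standard Elo update applied to a pairwise debate between two hypotheses. The K-factor of 32, the starting ratings, and the function names are illustrative assumptions; the talk summary does not specify how the system configures its tournament.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one debate.

    score_a is 1.0 if hypothesis A wins, 0.0 if it loses, 0.5 for a draw.
    k (assumed value) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypotheses start level; A wins the debate.
new_a, new_b = update_elo(1200.0, 1200.0, score_a=1.0)
```

Repeating this over many pairings yields a ranking in which hypotheses that keep winning debates rise to the top.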

OpenAI finds evaluation rubrics, not training, drive LLM hallucinations

Simons Institute for the Theory of Computing

Hallucinations persist because accuracy-only metrics give models no reward for admitting uncertainty. Stating grading rules in prompts—open rubrics—shifts model behavior: when "I don't know" earns partial credit, models become calibrated and outperform baselines on both accuracy and hallucination rate.
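
The incentive argument can be illustrated with a small expected-score calculation. This is a sketch under assumed numbers: the credit values (1 for correct, 0 for wrong, 0.25 for "I don't know") and the function names are hypothetical and are not taken from the talk.

```python
def expected_answer_score(p_correct: float, wrong_score: float = 0.0) -> float:
    """Expected score from answering when the model is right with probability p_correct."""
    return p_correct * 1.0 + (1.0 - p_correct) * wrong_score


def should_answer(p_correct: float, idk_credit: float, wrong_score: float = 0.0) -> bool:
    """Answer only if guessing is expected to score more than abstaining."""
    return expected_answer_score(p_correct, wrong_score) > idk_credit


# Accuracy-only grading: abstaining earns nothing, so even a 10% guess pays off.
guess_under_accuracy_only = should_answer(0.10, idk_credit=0.0)

# Rubric with partial credit (assumed 0.25): the same 10% guess no longer pays.
guess_under_partial_credit = should_answer(0.10, idk_credit=0.25)
```

Under the accuracy-only rule, guessing is never worse than abstaining, which is the pressure toward hallucination the summary describes. Once abstaining earns credit, a score-maximizing model answers only above a confidence threshold.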

Tom Mitchell on how LLMs reshape the learning theory framework for modern ML

Simons Institute for the Theory of Computing

In case you missed them

USENIX: AI lacks team-coordination properties, making it hazardous in incident response

USENIX

Applies 40 years of human-factors automation research to LLM-assisted incident response. Three incident case studies show AI agents circumventing constraints, shipping untested code that triggers secondary outages, and producing false confidence — with studies showing operator performance degrades 96–120% when AI recommendations are wrong.

Stripe's Minions agents merge 3,000 PRs weekly at 65% no-touch rate

Stripe Developers

Minions receive a single Slack prompt, spin up on a remote dev box, and run up to 10 plan-edit-validate iterations—using an LLM judge and Stripe's 5M-test CI cluster to self-diagnose failures. Deterministic instruction sequences in code outperform natural-language prompts for agent reliability.
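
A bounded plan-edit-validate loop of the kind described might look roughly like the sketch below. Only the iteration cap of 10 comes from the summary; `run_agent` and the `plan`, `edit`, and `validate` callables are hypothetical stand-ins, not Stripe's actual interfaces.

```python
from typing import Any, Callable, Optional, Tuple

MAX_ITERATIONS = 10  # "up to 10" iterations, per the summary


def run_agent(
    task: str,
    plan: Callable[[str, Optional[str]], Any],
    edit: Callable[[Any], Any],
    validate: Callable[[Any], Tuple[bool, str]],
    max_iterations: int = MAX_ITERATIONS,
) -> Optional[Tuple[int, Any]]:
    """Repeat plan -> edit -> validate until validation passes or the cap is hit.

    Returns (attempt_number, change) on success, or None if every attempt fails.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_iterations + 1):
        steps = plan(task, feedback)      # replan using the last failure report
        change = edit(steps)              # apply the planned edits
        ok, feedback = validate(change)   # e.g. CI results or a judge's verdict
        if ok:
            return attempt, change
    return None
```

The control flow lives in ordinary code rather than in a prompt, so the iteration cap and the order of steps hold regardless of what the model outputs.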