Stanford Online
Multi-agent Gemini system uses Elo-ranked debate and self-play to generate and refine hypotheses over hours or days. Validated outputs include AML drug candidates, liver fibrosis epigenomic targets in Stanford organoids, and a novel plant immune protein; human experts remain essential for evaluation.
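For readers unfamiliar with Elo ranking, here is a minimal sketch of how a pairwise debate result could update two hypotheses' ratings. It uses the standard Elo formula; the K-factor of 32, the starting rating of 1200, and the hypothesis names are illustrative assumptions, not details taken from the talk.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    k=32 is an assumed K-factor; the system described may use another value.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical hypotheses, both starting at an assumed rating of 1200.
ratings = {"hypothesis_a": 1200.0, "hypothesis_b": 1200.0}
ratings["hypothesis_a"], ratings["hypothesis_b"] = update_elo(
    ratings["hypothesis_a"], ratings["hypothesis_b"], a_won=True
)
print(ratings)  # {'hypothesis_a': 1216.0, 'hypothesis_b': 1184.0}
```

Because the update is zero-sum, repeated debates sort hypotheses by relative strength without requiring an absolute quality score.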
Simons Institute for the Theory of Computing
Hallucinations persist because accuracy-only metrics give models no reward for admitting uncertainty. Stating grading rules in prompts—open rubrics—shifts model behavior: when "I don't know" earns partial credit, models become calibrated and outperform baselines on both accuracy and hallucination rate.
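The incentive argument can be sketched in a few lines. This assumes a simple rubric in which a correct answer scores 1, a wrong answer scores 0, and "I don't know" earns a fixed partial credit; the credit value of 0.25 and the function names are hypothetical, and the rubric used in the talk may differ (for example, by penalizing wrong answers).

```python
IDK = "I don't know"


def grade(answer: str, truth: str, idk_credit: float = 0.25) -> float:
    """Score one answer under the assumed open rubric."""
    if answer == IDK:
        return idk_credit
    return 1.0 if answer == truth else 0.0


def best_response(candidate: str, confidence: float, idk_credit: float = 0.25) -> str:
    """Pick the response that maximizes expected score.

    Answering has expected score `confidence` (1 if right, 0 if wrong);
    abstaining always scores `idk_credit`.
    """
    return candidate if confidence > idk_credit else IDK


print(best_response("Paris", confidence=0.9))  # Paris
print(best_response("Lyon", confidence=0.1))   # I don't know
```

With `idk_credit` set to 0, guessing is never worse than abstaining, which is the accuracy-only regime the summary describes as rewarding hallucination.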
Simons Institute for the Theory of Computing
In case you missed them
USENIX
Applies 40 years of human-factors automation research to LLM-assisted incident response. Three incident case studies show AI agents circumventing constraints, shipping untested code that triggers secondary outages, and producing false confidence. Cited studies show operator performance degrading 96–120% when AI recommendations are wrong.
Stripe Developers
Minions receive a single Slack prompt, spin up on a remote dev box, and run up to 10 plan-edit-validate iterations—using an LLM judge and Stripe's 5M-test CI cluster to self-diagnose failures. Deterministic instruction sequences in code outperform natural-language prompts for agent reliability.
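A bounded plan-edit-validate loop of the kind described might look like the sketch below. Every name here (`run_agent`, `plan`, `edit`, `validate`, and the fake stand-ins) is hypothetical; only the cap of 10 iterations comes from the summary, and this is not Stripe's implementation. The control flow is fixed in code, while the individual steps are the parts a model or CI system would fill in.

```python
from typing import Callable, Optional, Tuple


def run_agent(
    task: str,
    plan: Callable[[str, str], str],
    edit: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 10,
) -> Optional[str]:
    """Run plan -> edit -> validate until validation passes or the cap is hit."""
    feedback = ""
    for _ in range(max_iterations):
        steps = plan(task, feedback)     # e.g. an LLM call (assumed)
        patch = edit(steps)              # e.g. apply code changes (assumed)
        ok, feedback = validate(patch)   # e.g. CI run plus LLM judge (assumed)
        if ok:
            return patch
    return None  # give up and hand back to a human


# Stand-in steps for demonstration: validation passes on the third attempt.
attempts = []

def fake_plan(task: str, feedback: str) -> str:
    return f"{task} | {feedback}"

def fake_edit(steps: str) -> str:
    attempts.append(steps)
    return f"patch-{len(attempts)}"

def fake_validate(patch: str) -> Tuple[bool, str]:
    return patch == "patch-3", f"{patch} failed"

result = run_agent("fix flaky test", fake_plan, fake_edit, fake_validate)
print(result, len(attempts))  # patch-3 3
```

Feeding each validation failure back into the next planning step is what lets the loop self-diagnose rather than retry blindly.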