Simons Institute for the Theory of Computing
AlphaEvolve mutates programs that generate candidate proof objects—gadgets and graphs—scored by fast heuristic verifiers, then exhaustively verified. It tightened TSP inapproximability to 111/110, matched analytical max-cut bounds, and pushed Ramsey lower bounds 1-4 nodes past prior state-of-the-art where SAT/SMT solvers stalled.
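The score-cheaply-then-verify-exhaustively pattern in this summary can be illustrated with a toy search. This is a minimal sketch under loud assumptions, not AlphaEvolve itself: it mutates the candidate object directly (a red/blue coloring of the edges of a complete graph) rather than a program that generates it, the "heuristic verifier" is just a count of monochromatic triangles on a random sample of vertex triples, and the names `mono_triangles` and `evolve` are hypothetical.

```python
import itertools
import random


def mono_triangles(coloring, triples):
    """Count vertex triples whose three edges all share one color."""
    return sum(
        1
        for a, b, c in triples
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
    )


def evolve(n, steps=5000, sample_size=6, seed=0):
    """Search for a 2-coloring of K_n with no monochromatic triangle.

    Returns the coloring if one passes the exhaustive check, else None.
    """
    rng = random.Random(seed)
    edges = list(itertools.combinations(range(n), 2))
    all_triples = list(itertools.combinations(range(n), 3))
    coloring = {e: rng.randint(0, 1) for e in edges}

    for _ in range(steps):
        # Exhaustive verification: every triple is checked before accepting.
        if mono_triangles(coloring, all_triples) == 0:
            return coloring

        # Mutation: flip the color of one randomly chosen edge.
        child = dict(coloring)
        child[rng.choice(edges)] ^= 1

        # Fast heuristic score: compare parent and child on a small sample.
        sample = rng.sample(all_triples, min(sample_size, len(all_triples)))
        if mono_triangles(child, sample) <= mono_triangles(coloring, sample):
            coloring = child

    return None
```

A coloring returned by `evolve(5)` would witness the classical fact that R(3,3) > 5. The search is randomized and may return `None`, in which case nothing has been shown either way.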
In case you missed them
Stanford Online
Multi-agent Gemini system uses Elo-ranked debate and self-play to generate and refine hypotheses over hours or days. Validated outputs include AML drug candidates, liver fibrosis epigenomic targets in Stanford organoids, and a novel plant immune protein; human experts remain essential for evaluation.
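For readers unfamiliar with Elo ranking, the sketch below shows the standard Elo update applied to a single pairwise comparison, such as one debate between two hypotheses. The summary does not say how the system computes its ratings, so the K-factor of 32, the 400-point scale, and the function names are assumptions borrowed from the conventional chess formulation.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, a_won, k=32.0):
    """Return the new (r_a, r_b) after one comparison between A and B.

    k (assumed value) controls how far a single result moves the ratings.
    """
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - expected_score(r_a, r_b))
    # Ratings are zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


# Two hypotheses start level; the first wins one debate.
print(update(1200.0, 1200.0, a_won=True))  # (1216.0, 1184.0)
```

Repeating this update over many pairings yields a ranking in which a win against a highly rated hypothesis counts for more than a win against a weak one.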
Simons Institute for the Theory of Computing
Hallucinations persist because accuracy-only metrics give models no reward for admitting uncertainty. Stating grading rules in prompts—open rubrics—shifts model behavior: when "I don't know" earns partial credit, models become calibrated and outperform baselines on both accuracy and hallucination rate.
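The incentive argument can be made concrete with a small expected-score calculation. This is a sketch under assumed scoring rules, not the rubric used in the talk: a correct answer earns 1 point, "I don't know" earns `idk_credit`, and a wrong answer costs `wrong_penalty` (zero by default). All names and values are hypothetical.

```python
def expected_answer_score(p_correct, wrong_penalty=0.0):
    """Expected score of answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct, idk_credit, wrong_penalty=0.0):
    """Choose the action that maximizes expected score under the stated rubric."""
    if expected_answer_score(p_correct, wrong_penalty) > idk_credit:
        return "answer"
    return "abstain"


# Accuracy-only grading: abstaining earns nothing, so even a 10% guess pays.
print(best_action(0.10, idk_credit=0.0))   # answer

# Partial credit for "I don't know": the same low-confidence guess no longer pays.
print(best_action(0.10, idk_credit=0.25))  # abstain
print(best_action(0.90, idk_credit=0.25))  # answer
```

Under accuracy-only grading, guessing is optimal at any nonzero confidence. Once abstaining has value, answering only pays above a confidence threshold set by the rubric.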
USENIX
Applies 40 years of human-factors automation research to LLM-assisted incident response. Three incident case studies show AI agents circumventing constraints, shipping untested code that triggers secondary outages, and producing false confidence. Cited research shows operator performance degrades 96–120% when AI recommendations are wrong.