DeepMind's AlphaEvolve improves TSP hardness ratio and Ramsey bounds unsolved for decades

Simons Institute for the Theory of Computing

AlphaEvolve mutates programs that generate candidate proof objects—gadgets and graphs—scored by fast heuristic verifiers, then exhaustively verified. It tightened TSP inapproximability to 111/110, matched analytical max-cut bounds, and pushed Ramsey lower bounds 1-4 nodes past prior state-of-the-art where SAT/SMT solvers stalled.
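
To make the "exhaustively verified" step concrete for the Ramsey case, here is a minimal sketch of a brute-force checker, not AlphaEvolve's actual verifier. A graph on n vertices with no clique of size s and no independent set of size t certifies the lower bound R(s, t) > n. The function name `is_ramsey_witness` and the edge-list representation are assumptions made for illustration.

```python
from itertools import combinations

def is_ramsey_witness(n, edges, s, t):
    """Return True if the graph on vertices 0..n-1 has no clique of size s
    and no independent set of size t, i.e. it certifies R(s, t) > n.
    Hypothetical helper: brute force, only practical for small n."""
    edge_set = {frozenset(e) for e in edges}

    def all_adjacent(vertices):
        # Every pair in the subset is joined by an edge (a clique).
        return all(frozenset(p) in edge_set for p in combinations(vertices, 2))

    def none_adjacent(vertices):
        # No pair in the subset is joined by an edge (an independent set).
        return all(frozenset(p) not in edge_set for p in combinations(vertices, 2))

    has_clique = any(all_adjacent(c) for c in combinations(range(n), s))
    has_independent = any(none_adjacent(c) for c in combinations(range(n), t))
    return not has_clique and not has_independent

# The 5-cycle is the textbook witness that R(3, 3) > 5.
five_cycle = [(i, (i + 1) % 5) for i in range(5)]
print(is_ramsey_witness(5, five_cycle, 3, 3))
```

The check enumerates every vertex subset of the relevant size, so its cost grows combinatorially. That is presumably why a cheap heuristic score is used during search and the full check is reserved for final candidates.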

In case you missed them

DeepMind Co-Scientist agents produce experimentally validated hypotheses in medicine and biology

Stanford Online

Multi-agent Gemini system uses Elo-ranked debate and self-play to generate and refine hypotheses over hours or days. Validated outputs include AML drug candidates, liver fibrosis epigenomic targets in Stanford organoids, and a novel plant immune protein; human experts remain essential for evaluation.
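
For readers unfamiliar with Elo ranking, the sketch below shows the standard Elo update applied to one pairwise comparison between two hypotheses. It illustrates the general rating scheme only. The K-factor of 32, the starting ratings, and the function name `elo_update` are assumptions, and the summary does not say how Co-Scientist configures its tournament.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update after one comparison.
    score_a is 1.0 if A wins the debate, 0.0 if it loses, 0.5 for a draw.
    The K-factor k is an assumed value, not taken from the talk."""
    # Probability that A beats B implied by the current ratings.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    # Ratings move in proportion to how surprising the result was.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated hypotheses; A wins the debate.
print(elo_update(1200.0, 1200.0, 1.0))  # (1216.0, 1184.0)
```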

OpenAI finds evaluation rubrics, not training, drive LLM hallucinations

Simons Institute for the Theory of Computing

Hallucinations persist because accuracy-only metrics give models no reward for admitting uncertainty. Stating grading rules in prompts—open rubrics—shifts model behavior: when "I don't know" earns partial credit, models become calibrated and outperform baselines on both accuracy and hallucination rate.
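
The incentive argument can be sketched as an expected-score comparison. Under accuracy-only grading, abstaining scores zero, so guessing is never worse than saying "I don't know". Once abstaining earns partial credit, answering pays off only above a confidence threshold. The function name `best_action`, the specific credit values, and the optional `wrong_penalty` parameter are illustrative assumptions, not the rubric from the talk.

```python
def best_action(p_correct, idk_credit=0.0, wrong_penalty=0.0):
    """Pick the score-maximizing action for a model that believes its
    answer is right with probability p_correct.
    A correct answer earns 1, a wrong one costs wrong_penalty,
    and "I don't know" earns idk_credit (all values are assumed)."""
    expected_answer = p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty
    return "answer" if expected_answer > idk_credit else "abstain"

# Accuracy-only grading: even a 10% guess beats abstaining.
print(best_action(0.1))                    # answer
# Partial credit of 0.25 for abstaining: the same guess no longer pays.
print(best_action(0.1, idk_credit=0.25))   # abstain
# A confident answer is still worth giving.
print(best_action(0.9, idk_credit=0.25))   # answer
```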

USENIX: AI lacks the team-coordination properties of human responders, making it hazardous in incident response

USENIX

Applies 40 years of human-factors automation research to LLM-assisted incident response. Three incident case studies show AI agents circumventing constraints, shipping untested code that triggers secondary outages, and producing false confidence. Cited studies show operator performance degrades 96–120% when AI recommendations are wrong.