Friday, May 29, 2026 — front page

PyPI phishing via typo-squatted domain and MitM proxy

Typo-squatted domain + MitM proxy nets four PyPI accounts, injects malware into 30M-download package OpenSSF
TL;DW
Attackers registered a domain one letter off from pypi.org and built a man-in-the-middle proxy to phish PyPI maintainers; only four clicked through, but their credentials were compromised.
Phishing-resistant WebAuthn (passkeys/hardware keys) defeats proxy attacks because the browser binds each credential to the domain it was registered for and will not use it on a lookalike domain; TOTP codes, by contrast, can be captured and reused by the proxy.
Attackers targeted num-to-words, a transitive dependency of Hugging Face Transformers (30M daily downloads), banking on unpinned dependencies to distribute Scavenger Loader malware at scale.
PyPI processes 13 billion requests per day with 900+ new packages daily; only one full-time security engineer handles incident response because volunteer staff cannot provide 24/7 coverage.
Trusted publishing replaces long-lived API tokens with short-lived credentials by cryptographically linking package uploads to CI/CD platforms (GitHub, GitLab, Google Cloud Build, CircleCI), removing the primary attack vector.
Domain registrars and abuse services notify attackers when reports are filed, defeating rapid response; legal cease-and-desist letters are now required to effectively block malicious domains.
PyPI added mandatory email reconfirmation for TOTP logins from new devices/IPs (November 2025) as friction to slow phishing success while promoting WebAuthn as the frictionless alternative.
Attackers returned in September 2025 with the same attack pattern, proving persistence; domain registration remains cheap and threat actors are learning PyPI patterns and targeting popular transitive dependencies.
Dependency cool-down periods (3–7 days in pip, uv, Dependabot) let security researchers catch malicious packages before widespread installation; median detection time is ~5 hours during working hours.
WebAuthn adoption requires significant UX/cultural change and education; PyPI cannot mandate it without breaking existing workflows, but must nudge users toward phishing-resistant authentication over time.

Attackers registered a one-character lookalike of pypi.org, ran a MitM proxy to capture TOTP sessions and mint API tokens, then published a Scavenger Loader variant via num-to-words—a transitive dependency of Hugging Face Transformers. WebAuthn would have blocked the attack; response took 40 volunteer hours across registrars and maintainers.
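
The dependency cool-down idea from the list above can be sketched in a few lines. This is a minimal illustration, not the actual option or behavior of pip, uv, or Dependabot: the function name, the release history, and the choice to rank releases by upload time are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

def newest_release_past_cooldown(releases, now, cooldown_days=7):
    """Return the most recently uploaded version that is at least
    `cooldown_days` old, or None if every release is still too new."""
    cutoff = now - timedelta(days=cooldown_days)
    eligible = [(uploaded, version)
                for version, uploaded in releases.items()
                if uploaded <= cutoff]
    if not eligible:
        return None
    # max() compares upload times first, so this picks the latest eligible upload.
    return max(eligible)[1]

# Hypothetical release history for an imaginary package.
now = datetime(2025, 9, 20, 12, 0, tzinfo=timezone.utc)
releases = {
    "1.4.0": datetime(2025, 6, 1, tzinfo=timezone.utc),
    "1.4.1": now - timedelta(hours=3),  # freshly published, not yet vetted
}
chosen = newest_release_past_cooldown(releases, now)
```

With a 7-day window the three-hour-old release is skipped and the older one is selected, which is the gap that gives researchers time to flag a malicious upload.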

Signals standardization reshapes JS reactivity

TC39 moves to standardize signals as JS frameworks converge on fine-grained reactivity NDC Conferences
TL;DW
Signals are reactive variables that automatically recalculate dependent values when their dependencies change, eliminating manual recalculation work in application state management.
JavaScript frameworks shifted from pull-based rendering (server-side templates) to push-based DOM updates (jQuery era) to gain performance, losing predictability in the process.
Knockout introduced observables and data binding to regain the predictability of pull-based approaches while maintaining push performance—a pattern Vue, Svelte, Solid, and Angular have adopted.
React deliberately uses the pull approach (whole-app re-renders on state change) with memoization and virtual DOM instead of signals, prioritizing consistency and predictability over fine-grained reactivity.
Signal implementations must handle order-of-recalculation (topological sorting), batching updates to prevent UI glitches, and dirty-state tracking to skip unnecessary recalculations.
The TC39 proposal aims to standardize signals in the core JavaScript language rather than having each framework reinvent the wheel with custom implementations.
Solid pioneered the 'push and pull' hybrid approach for signals: pushing state changes but only recalculating derived values when actually read from the UI.
React's new compiler automatically memoizes functions and state, achieving similar efficiency gains to signal-based frameworks without changing React's fundamental pull-based architecture.

Traces the shift from React's pull model (re-render + memoize) to signals' push model (dependency tracking, surgical DOM updates), with a live implementation covering subscriptions, dirty-state tracking, and batching. Closes with the TC39 signals proposal and what native browser support eliminates for framework authors.
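
For readers who want to see the mechanics, here is a rough Python sketch of the push-and-pull idea: writes push a "dirty" flag to dependents, and derived values recalculate only when read. It is not the talk's live implementation or the TC39 API; the class names, the `runs` counter, and the global observer stack are assumptions for illustration, and stale subscriptions are never cleaned up.

```python
_active = []  # stack of Computed nodes currently being evaluated

class Signal:
    """A writable reactive value."""
    def __init__(self, value):
        self._value = value
        self._observers = set()

    def get(self):
        if _active:  # record who is reading us (dependency tracking)
            self._observers.add(_active[-1])
        return self._value

    def set(self, value):
        if value == self._value:
            return
        self._value = value
        for obs in list(self._observers):
            obs._mark_dirty()  # push: only a flag travels, no recalculation

class Computed:
    """A derived value that recalculates lazily, on read."""
    def __init__(self, fn):
        self._fn = fn
        self._value = None
        self._dirty = True
        self._observers = set()
        self.runs = 0  # counts recalculations, for demonstration only

    def _mark_dirty(self):
        if self._dirty:
            return
        self._dirty = True
        for obs in list(self._observers):
            obs._mark_dirty()

    def get(self):
        if _active:
            self._observers.add(_active[-1])
        if self._dirty:  # pull: recompute only when someone actually reads
            _active.append(self)
            try:
                self._value = self._fn()
            finally:
                _active.pop()
            self._dirty = False
            self.runs += 1
        return self._value

price, qty = Signal(10), Signal(2)
total = Computed(lambda: price.get() * qty.get())
first = total.get()   # first read computes the value
price.set(11)
qty.set(3)            # two writes, still no recalculation
second = total.get()  # one recalculation covers both writes
```

Because the two writes only flip a flag, the derived value is recomputed once on the next read, which is a simple form of the batching and dirty-state tracking described above.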

LLMs lack persistent world-state tracking

Kleinberg finds LLMs miscount objects in generated stories at 15-40% error rates Simons Institute for the Theory of Computing
TL;DW
LLMs fail basic world model tasks like counting people in narratives (~15% error rate), yet solve identical arithmetic instantly when framed as math problems—suggesting errors stem from attention allocation, not capability.
Order dependence in state tracking: describing budget categories from high-to-low causes systematic inflation across all categories, violating consistency expected from systems with genuine world models.
Repeated revision attempts settle into an equilibrium, much like a chemical reaction, rather than reaching zero errors: models fix some errors but introduce new ones at a matching rate, creating a stable error floor that humans cannot eliminate through iteration alone.
Framing dramatically affects numerical accuracy: stories generate 15% errors, blog posts 9.5%, news articles 2.5%, and math problems ~0.2%—the same underlying capability behaves radically differently based on genre framing.
Myhill-Nerode theorem applied to sequence-generating systems: states can be extracted as equivalence classes of sequences, enabling principled probing for world models in game-playing, navigation, and constraint-satisfaction tasks.
Models maintain state propagation consistency (e.g., sports scores) even when starting from corrupted states (~80-94% transition accuracy), suggesting they represent implicit dynamics rather than absolute facts.
Compass direction tracking shows models confabulate details (shadows always point toward sunset regardless of direction traveled), indicating they optimize for narrative plausibility over geometric consistency.
Multi-model revision scheduling is solvable via Bellman equations: using cheap models early to reduce errors, then expensive models to grind out remaining errors, yields minimum-cost error-reduction strategies.
Navigation descriptions have a 20% error rate per location even with tool use enabled: the model catches some errors while generating others, so the same model identifies failures it cannot prevent.
World models in LLMs may be fundamentally about our explanation of what's happening inside rather than the model's understanding of the world—making definition and measurement inherently observer-dependent.
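
The revision-equilibrium point in the list above can be made concrete with a toy recurrence. This is an assumed model for illustration, not the one presented in the talk: each pass fixes a fixed fraction of existing errors and introduces a fixed expected number of new ones, so the error count converges to new_errors_per_pass / fix_rate instead of zero.

```python
def expected_errors(initial_errors, fix_rate, new_errors_per_pass, passes):
    """Expected error count after each revision pass under the toy model
    e_next = (1 - fix_rate) * e + new_errors_per_pass."""
    e = initial_errors
    history = [e]
    for _ in range(passes):
        e = (1 - fix_rate) * e + new_errors_per_pass
        history.append(e)
    return history

# Hypothetical numbers: 10 starting errors, half fixed per pass, 1 new per pass.
history = expected_errors(10, fix_rate=0.5, new_errors_per_pass=1, passes=30)
floor = 1 / 0.5  # fixed point of the recurrence
```

Under these assumed rates the count drops quickly at first and then stalls at the floor, no matter how many further passes are run.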

Using Myhill-Nerode theorem analysis and navigation tasks, Cornell's Kleinberg shows LLMs lack persistent state maintenance during generation—models fail to track people and objects across narratives but catch the same errors when explicitly prompted, revealing a gap between language fluency and world-model coherence.
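
As a rough illustration of the Myhill-Nerode idea, the sketch below groups prefixes into candidate states by checking which continuations remain valid after each one. The function name, the toy language (bit strings with an even number of 1s), and the finite set of test suffixes are assumptions for the example; with only finitely many suffixes the grouping approximates the true equivalence classes.

```python
from collections import defaultdict

def nerode_classes(prefixes, suffixes, accepts):
    """Group prefixes that no test suffix can tell apart.

    Two prefixes land in the same class when every suffix leads to the
    same accept/reject outcome, i.e. they behave like the same state."""
    classes = defaultdict(list)
    for prefix in prefixes:
        signature = tuple(accepts(prefix + suffix) for suffix in suffixes)
        classes[signature].append(prefix)
    return list(classes.values())

def even_ones(s):
    # Toy "world": a sequence is valid when it contains an even number of 1s.
    return s.count("1") % 2 == 0

states = nerode_classes(
    prefixes=["", "1", "11", "10", "101", "0"],
    suffixes=["", "1"],
    accepts=even_ones,
)
```

The six prefixes collapse into two classes, matching the two states (even or odd count so far) that a system tracking this world would need to represent.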

Zero-downtime database migration at scale

Stripe's DocDB moves petabytes across 2,000+ MongoDB shards with 5.5 nines uptime InfoQ
TL;DW
Stripe processes $1.4 trillion annually with 5.5 nines reliability across 2,000+ MongoDB shards handling 5M+ queries per second.
Zero-downtime data movement platform uses versioned gating: proxy servers annotate requests with routing metadata version; source shard rejects stale versions until coordinator updates routes.
Traffic switch from source to target shard takes milliseconds to 2 seconds; all failed reads/writes succeed on client retries without manual intervention.
Optimized MongoDB bulk import throughput 10x by sorting data by index attributes before insertion, exploiting B-tree storage engine locality.
Bidirectional replication during migration enables fast rollback: writes are tagged to prevent replication cycles; source shard can be safely spun down after handoff.
Built DocDB (MongoDB-as-a-service in-house) instead of buying because Stripe requires custom security, reliability, performance, and multi-tenancy controls at financial scale.
Horizontal scaling, MongoDB version upgrades (entire 2,000+ shard fleet), and single/multi-tenancy migrations all leverage the same zero-downtime data movement platform.
Routing metadata updates propagate with eventual consistency to hundreds of stateless proxy servers; fencing at the primary shard prevents stale proxies from serving old routes.
Migrations move approximately 1.5–2 terabytes per target shard daily; total migration time depends on data size, index count, and ongoing write throughput on the source shard.
In-house capabilities justify investment when they drive 3–5 year strategic advantage, require unique reliability/security/compliance controls, or reduce vendor lock-in risk.

Stripe's DocDB platform orchestrates zero-downtime shard splits, merges, version upgrades, and tenant migrations across 2,000+ MongoDB shards handling 5M+ queries per second. Point-in-time snapshots, CDC-based bidirectional replication, and proxy-layer versioned gating switch traffic in milliseconds without disrupting $1.4T in annual payment volume.
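
The versioned-gating handoff can be sketched as follows. This is a simplified, single-process illustration and not Stripe's implementation: every class and method name is hypothetical, data copying and replication are left out, and the retry happens inside the proxy rather than in the client.

```python
from dataclasses import dataclass

class StaleRouteError(Exception):
    """Raised by a shard when a request carries an outdated routing version."""

class Shard:
    def __init__(self, name):
        self.name = name
        self.min_version = 1  # lowest routing version this shard will serve
        self.data = {}

    def write(self, key, value, route_version):
        if route_version < self.min_version:
            raise StaleRouteError(f"{self.name} is fenced at v{self.min_version}")
        self.data[key] = value

@dataclass(frozen=True)
class Route:
    version: int
    shard: Shard

class Coordinator:
    def __init__(self, source):
        self.route = Route(version=1, shard=source)

    def switch_to(self, target):
        new_version = self.route.version + 1
        self.route.shard.min_version = new_version  # fence the source first
        target.min_version = new_version
        self.route = Route(new_version, target)     # then publish the new route

class Proxy:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.cached = coordinator.route  # may go stale after a switch

    def write(self, key, value):
        try:
            self.cached.shard.write(key, value, self.cached.version)
        except StaleRouteError:
            self.cached = self.coordinator.route  # refresh, then retry once
            self.cached.shard.write(key, value, self.cached.version)
        return self.cached.shard.name

source, target = Shard("source"), Shard("target")
coordinator = Coordinator(source)
proxy = Proxy(coordinator)
before = proxy.write("a", 1)
coordinator.switch_to(target)
after = proxy.write("b", 2)  # stale route is rejected, retry lands on target
```

The key property is that the fenced source never accepts a write from a proxy holding the old route, so a stale cache costs one retry instead of a misrouted write.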

Hallucination as incentive misalignment

OpenAI finds evaluation rubrics, not training, drive LLM hallucinations Simons Institute for the Theory of Computing
TL;DW
Hallucinations in language models stem from test-taking incentives: models optimize for accuracy benchmarks without reward signals for admitting uncertainty, unlike humans who learn humility from real-world consequences.
Open rubric evaluation—explicitly stating scoring rules in prompts—aligns developer incentives with humble behavior; models respond immediately by saying 'I don't know' more when given credit for doing so.
Simple consistency check reduces hallucinations: query model twice, use third call to verify agreement; if inconsistent, output 'I don't know' instead of guessing.
Current accuracy-only benchmarks penalize humility and create a false trade-off between correctness and reduced hallucinations; this single metric drives deployment of overconfident models across all major LLM providers.
Language models are miscalibrated and overconfident; on SimpleQA benchmark, even giving 90% reward for saying 'I don't know' still beats model accuracy scores, revealing systematic miscalibration.
Hallucinations are not inevitable—they're a solvable mechanism design problem, not an inherent limitation of next-token prediction or model capacity.
Existing hallucination-reduction techniques (consistency checking, retrieval, self-critique) are already published and effective; the bottleneck is incentive structures, not algorithmic solutions.
Open rubrics are more objective and transparent than closed rubrics; they enable fair grading when developers and evaluators agree on scoring, unlike real-world chat where users don't state reward functions.
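
The consistency check from the list above might look like the sketch below. The `ask` and `same_answer` callables are hypothetical stand-ins for model calls (the third call being the agreement check); no real LLM API is assumed, and the stubs exist only so the example runs.

```python
def answer_with_consistency_check(question, ask, same_answer):
    """Query twice, verify agreement with a third call, abstain on mismatch."""
    first = ask(question)
    second = ask(question)
    if same_answer(question, first, second):
        return first
    return "I don't know"

def make_stub(answers):
    # Stand-in for a model: returns the canned answers in order.
    remaining = iter(answers)
    return lambda question: next(remaining)

def exact_match(question, a, b):
    # Stand-in for the verifying call; a real check could be another model query.
    return a.strip().lower() == b.strip().lower()

stable = answer_with_consistency_check(
    "What is the capital of France?", make_stub(["Paris", "paris"]), exact_match)
unstable = answer_with_consistency_check(
    "When was this person born?", make_stub(["1912", "1915"]), exact_match)
```

Agreement between the two samples passes the first answer through; disagreement is treated as a sign the model is guessing, so the wrapper abstains.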

Hallucinations persist because accuracy-only metrics give models no reward for admitting uncertainty. Stating grading rules in prompts—open rubrics—shifts model behavior: when "I don't know" earns partial credit, models become calibrated and outperform baselines on both accuracy and hallucination rate.
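
Why the scoring rule matters can be shown with a small expected-value calculation. The scoring scheme here is an assumption for illustration (1 point for a correct answer, `idk_credit` for abstaining, an optional penalty for a wrong answer); it is not a rubric quoted from the talk.

```python
def should_answer(confidence, idk_credit, wrong_penalty=0.0):
    """Return True when guessing has a higher expected score than abstaining.

    confidence: the model's probability of being correct (0 to 1).
    idk_credit: points awarded for saying "I don't know".
    wrong_penalty: points deducted for a wrong answer."""
    expected_if_answer = confidence - (1.0 - confidence) * wrong_penalty
    return expected_if_answer > idk_credit

# Accuracy-only grading: abstaining earns nothing, so even a 1% guess pays.
guess_under_accuracy_only = should_answer(0.01, idk_credit=0.0)
# Generous credit for abstaining: a 60%-confident model should now abstain.
guess_with_idk_credit = should_answer(0.6, idk_credit=0.9)
# A confident model still answers under the same rubric.
confident_answer = should_answer(0.95, idk_credit=0.9)
```

Under accuracy-only grading the best policy is to always guess, while any credit for abstaining creates a confidence threshold below which "I don't know" is the better-scoring response.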

LLMs break classical learning theory

CMU's Tom Mitchell argues LLMs break classical PAC learning across three paradigms Simons Institute for the Theory of Computing
TL;DW
LLMs enable explanation-based learning: systems generate natural-language justifications for labeled examples, distill them into interpretable rubrics, and improve classification without parameter tuning.
Feature engineering problem reframed: LLMs can autonomously suggest relevant predictors given only a target variable description (e.g., "predict flu hospitalizations"), eliminating manual feature selection.
Machine learning agents with self-reflection: systems generate their own learning subtasks by logging computations, analyzing failures via LLM-as-oracle, and iteratively debugging code to handle edge cases.
PAC learning framework requires extension: target functions now have natural-language definitions; hypothesis classes consist of learned rubrics plus LLM interpretation; sample complexity must account for representation ambiguity.
Conventional wisdom overturned: parameter tuning is no longer the dominant learning mechanism; big data plus statistics is insufficient; semantic knowledge representations with informal natural language now viable.
Explanation-based learning from the 1980s-90s deserves revival: prior work on learning from explanations (e.g., single-example chess tactics) failed due to inability to generate explanations; LLMs now enable this paradigm.
Self-training and semi-supervised learning provide better theoretical framings than PAC learning for LLM-based systems: implicit inductive bias assumes LLM explanations are task-relevant, requiring ground truth data to focus learning.
Data ground truth critically focuses LLM reasoning: flipping all training labels produces plausible but incorrect justifications; ground truth prevents models from exploiting multiple plausible explanations.
Agents write and debug code autonomously: systems generate Python functions to interface with web APIs, merge datasets from heterogeneous sources, and maintain memory through persistent file storage.
Theory should model natural language representation and approximate reasoning: key open questions concern formalizing informality in natural-language descriptions, agents with pervasive self-reflection, and endogenous learning task generation.

Mitchell presents explanation-based learning, LLM-driven feature discovery, and autonomous self-reflecting agents as three paradigms that invalidate fixed hypothesis classes and parameter tuning. He frames the shift as analogous to compilers over assembly: LLM improvements still matter, but a new research layer opens above them.