Thursday, May 7, 2026 — front page

Kubernetes load balancing flaw at scale

Databricks drops Kubernetes load balancing for client-side power-of-two-choices, cuts fleet 20% USENIX
TL;DW
Kubernetes load balances connections, not requests; gRPC over HTTP/2 multiplexes thousands of requests per connection, causing severe traffic skew (4-5x) across pods.
Connection resets, headless services with DNS, and service meshes were rejected as solutions due to CPU overhead, DNS caching issues, control plane inflexibility, or operational complexity.
Databricks built a custom endpoint discovery service (EDS) using push-based Xds protocol to replace Kubernetes load balancing with client-side intelligence.
Power-of-two-choices algorithm: randomly select two pods, score each by pending requests plus latency and error rate signals, send to the pod with better score.
Power-of-two-choices naturally avoids thundering herd problems unlike least-request algorithms, and is simple, cheap, and easily extensible with new scoring signals.
Uniform request distribution via power-of-two-choices reduced average fleet size by 20% through better autoscaling decisions and latency stability.
Naive round-robin distribution initially caused cold-start errors when new pods came online; power-of-two-choices with error-rate bias improved robustness during rollouts.
Solution relies on Kubernetes as control plane source but decouples data plane; works across three clouds, 1500+ clusters, and 70+ regions at Databricks scale.

Kubernetes balances connections, not requests; with gRPC over HTTP/2, some pods received orders-of-magnitude more traffic, causing SLO violations. Databricks built an XDS-based endpoint discovery service enabling client-side scoring on pending requests, latency, and error rate, achieving even distribution and a 20% fleet size reduction with no proxy overhead.

agentic commerce payment standards

Stripe proposes UCP and Machine Payments Protocol to give AI agents safe, authorized purchase flows Stripe Developers
TL;DW
Universal Commerce Protocol (UCP) enables agents to make purchases through structured API calls and JSON responses instead of scraping HTML forms and clicking buttons.
Shared Payment Token mechanism allows agents to send payment credentials to sellers with enforced spending limits—Stripe will decline charges exceeding the pre-agreed amount.
Machine Payments Protocol (MPP) uses HTTP 402 status code to declare resources requiring payment, enabling agents to pay micropayments for API calls and data on-demand.
Agents require new payment capabilities: machines need to buy API access, compute resources, and data—not t-shirts—creating distinct commerce patterns from human purchasing.
Crypto and blockchain settlement offer faster instant payment for agent-to-agent commerce, likely to dominate as instant digital transactions between machines eclipse traditional human commerce volume.
Open protocols and standards prevent siloing and lock-in that plagued centralized platforms like social media—critical safeguard when agentic commerce becomes 100x more powerful.
Agents can handle multiple competing protocols and payment methods simultaneously, unlike humans who struggle with complexity—protocols don't need to converge as quickly as they did historically.
Verification and acceptance testing loops close the gap between what agents are instructed to do and what they actually execute—essential new pattern for trustworthy agentic commerce.
Sellers retain full control of their backend and payment stack with UCP; agents simply consume standardized APIs instead of automating human checkout flows.
Machine Payments Protocol works with fiat, crypto, and shared payment tokens, supporting both subscription and usage-based billing models emerging in agent commerce.

Universal Commerce Protocol replaces HTML scraping with API-driven checkout and cryptographically enforced spending-limit tokens; Machine Payments Protocol revives HTTP 402 for per-request settlement on digital goods and API calls. Both protocols support fiat and crypto, with panelists from Block and Alchemy pressing for open standards over proprietary silos.

AI red teaming limits

AI red teaming can't eliminate prompt injection — only shrink the blast radius NDC Conferences
TL;DW
AI red teaming is fundamentally different from traditional security testing: attacks manipulate behavior, not exploit code bugs, with zero repeatability even at temperature zero.
LLMs hallucinate convincingly and reliably produce false outputs; black-box red teaming is useless without accessing real data to validate whether findings are genuine.
Prompt injection is an inherent flaw in LLM architecture due to lack of hierarchy between trusted system data and untrusted user/tool data in the context window.
Indirect and cross-prompt injection vectors like resume uploads, RAG databases, emails, logs, and poisoned web content are harder to detect than direct prompts.
Common attack techniques include jailbreaks (defeating model guardrails), prompt injection (attacking deployment framework), crescendo attacks (gradual escalation), and adversarial suffixes (mathematically optimized tokens).
Agents are far more dangerous than chatbots because they take real-world actions based on hallucinations and can exhibit emergent behaviors when multiple agents interact.
Existing AI red team tools are immature: they test for easy toxicity and jailbreaks but don't test real security concerns like data exfiltration or agent permission escalation.
Shift-left security into AI-native CICD pipelines with baseline benchmark testing and attack prompt libraries so developers can evaluate model safety during development.
Semantic analysis using AI is required for defense and attack because language is nearly infinite; naive string filtering (ignoring keywords) fails immediately and cannot prevent prompt injection.
Accept that you cannot eliminate prompt injection risk; instead, build mitigating controls around agents and clearly communicate residual risk to stakeholders.

Transformer architecture makes prompt injection structurally unavoidable, so NDC's session shifts focus to creative adversarial testing: jailbreaks, context poisoning, crescendo attacks, and adversarial suffixes. Covers Crop Duster and Tapper for AI-powered red teaming, and argues current vendor tooling misses real business risks like agent misbehavior and data exfiltration.

unified multimodal architecture

Luma unifies text, image, video, and audio in one transformer backbone to add reasoning to generation Stanford Online
TL;DW
Luma unified its language, vision, and video into one transformer backbone architecture rather than separate towers, enabling models to reason about all modalities in the same space—similar to how the human brain processes information centrally.
Shifted from 3D-first strategy to video-first after discovering that data scale in internet-available video dramatically outpaces proprietary 3D capture; algorithm design must follow data availability, not the reverse.
Dream Machine (March 2024) bootstrapped the flywheel by capturing preference signals from user downloads and likes, then built annotation systems to filter poor outputs and systematically improve subsequent versions.
Unified models must handle multi-turn interactions with memory, unlike current image/video models that are one-shot generators; this multi-turn capability was critical to making language models generally useful (RLHF → ChatGPT).
Creatives report newfound freedom to explore unconstrained iterations rather than vet single ideas exhaustively; prolific creators (Mozart, Einstein, Archimedes) thrive when able to try many variations and select the best outcomes.
Luma integrates domain-specific skills (e.g., 50-page slide design guide, energy grid diagrams) as context layers above the unified model, allowing knowledge transfer without retraining and enabling superiority over text-only models on specialized tasks.
Current image and video models lack physical understanding, temporal coherency, and introspection—unified models solve this by combining language intelligence with visual generation, enabling uses like educational videos showing counterfactual historical scenarios.
Hollywood's production decline stems from PE-driven franchise rentseking (multiple sequels, crossovers) over diverse storytelling; Netflix's 800 annual productions at $10–50M budgets versus major studios' 5–20 prove audience demand for variety, not sequels.
Unified architecture enables end-to-end work via REPL loops (read-eval-print) with one model orchestrating tool calls, context, and iterative refinement—mirroring the von Neumann architecture that powered computers for decades.
Major studios (Netflix, Amazon Prime) enforce data isolation guarantees via SOC 2 controls and marked project tracking to prevent training data leakage between competing productions, enabling trust with high-sensitivity customers.

Amit Jain outlines why video encodes 3D geometry through time, making it richer training data than images, then explains how Luma's single shared-latent-space transformer enables multi-turn dialogue and iterative refinement — capabilities absent from diffusion-only or modality-siloed architectures.

long-horizon multi-agent execution

Factory runs software projects for 16 days autonomously via serial agents and validation contracts AI Engineer
TL;DW
Factory's 'missions' system runs multi-agent teams serially on features with targeted parallelization, achieving 16-day autonomous runs without human intervention.
Validation contracts defined during planning—not after coding—establish correctness independently of implementation, preventing drift in long-running agent systems.
Missions combine five multi-agent patterns: delegation, creator-verifier, broadcast, negotiation, and structured handoffs across orchestrator, worker, and validator roles.
Three-role architecture: orchestrator plans with validation contracts, workers implement features with clean context, validators verify both code quality and end-to-end behavior through computer use.
Serial feature execution with read-only parallelization prevents agent conflicts and duplicated work, reducing errors dramatically despite appearing slower on paper.
Validation includes dedicated code review agents and QA agents that interact with running applications—neither has seen the code, ensuring adversarial validation by design.
Right model selection per role ('droid whispering') matters: planning needs reasoning, implementation needs fluency, validation needs instruction-following—no single model excels at all three.
Structured handoffs between agents document what was completed, attempted, left undone, exit codes, and issues discovered, enabling self-healing at milestone boundaries.
Slack clone example shows 60% time/tokens on implementation, tests comprise 50% of final code, 90% coverage, validation fails first attempt then creates follow-up features.
Prompt-based orchestration logic (700 lines) instead of hard-coded state machines ensures missions improve with each new model release rather than becoming obsolete.

Factory's Missions system chains planner, worker, and validator agents serially—avoiding conflicts from parallelization—with a correctness contract defined before coding begins. Workers inherit clean state from predecessors; validators span linting, type-checking, and live user-testing. Longest production run: 16 days.