Wednesday, May 6, 2026 — front page

unified multimodal intelligence architecture

Luma unifies text, image, video, and audio in one transformer backbone to add reasoning to generation Stanford Online
TL;DW
Luma unified its language, vision, and video into one transformer backbone architecture rather than separate towers, enabling models to reason about all modalities in the same space—similar to how the human brain processes information centrally.
Shifted from 3D-first strategy to video-first after discovering that data scale in internet-available video dramatically outpaces proprietary 3D capture; algorithm design must follow data availability, not the reverse.
Dream Machine (March 2024) bootstrapped the flywheel by capturing preference signals from user downloads and likes, then built annotation systems to filter poor outputs and systematically improve subsequent versions.
Unified models must handle multi-turn interactions with memory, unlike current image/video models that are one-shot generators; this multi-turn capability was critical to making language models generally useful (RLHF → ChatGPT).
Creatives report newfound freedom to explore unconstrained iterations rather than vet single ideas exhaustively; prolific creators (Mozart, Einstein, Archimedes) thrive when able to try many variations and select the best outcomes.
Luma integrates domain-specific skills (e.g., 50-page slide design guide, energy grid diagrams) as context layers above the unified model, allowing knowledge transfer without retraining and enabling superiority over text-only models on specialized tasks.
Current image and video models lack physical understanding, temporal coherency, and introspection—unified models solve this by combining language intelligence with visual generation, enabling uses like educational videos showing counterfactual historical scenarios.
Hollywood's production decline stems from private-equity-driven franchise rent-seeking (multiple sequels, crossovers) over diverse storytelling; Netflix's 800 annual productions at $10–50M budgets versus major studios' 5–20 prove audience demand for variety, not sequels.
Unified architecture enables end-to-end work via a REPL (read-eval-print loop) in which one model orchestrates tool calls, context, and iterative refinement, mirroring the von Neumann architecture that powered computers for decades.
Major studios (Netflix, Amazon Prime) enforce data isolation guarantees via SOC 2 controls and marked project tracking to prevent training data leakage between competing productions, enabling trust with high-sensitivity customers.

Amit Jain outlines why video encodes 3D geometry through time, making it richer training data than images, then explains how Luma's single shared-latent-space transformer enables multi-turn dialogue and iterative refinement — capabilities absent from diffusion-only or modality-siloed architectures.
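
The multi-turn, REPL-style loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Luma's implementation: `Session`, `stub_model`, `repl_turn`, and the tool names in `TOOLS` are all hypothetical, and the stub stands in for a real model by simply carrying the previous prompt forward so that refinement across turns is visible.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool registry: names and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "render_image": lambda prompt: f"<image:{prompt}>",
    "render_video": lambda prompt: f"<video:{prompt}>",
}


@dataclass
class Session:
    """Conversation memory shared across turns: (user message, prompt used)."""
    history: list[tuple[str, str]] = field(default_factory=list)


def stub_model(session: Session, user_msg: str) -> tuple[str, str]:
    """Stand-in for a unified model: choose a tool and build its prompt.

    A real model would reason over the whole history; this stub only
    appends the new request to the previous prompt.
    """
    previous = session.history[-1][1] if session.history else ""
    tool = "render_video" if "video" in user_msg.lower() else "render_image"
    prompt = f"{previous}; {user_msg}" if previous else user_msg
    return tool, prompt


def repl_turn(session: Session, user_msg: str) -> str:
    tool, prompt = stub_model(session, user_msg)  # read the input, decide
    output = TOOLS[tool](prompt)                  # eval: one tool call
    session.history.append((user_msg, prompt))    # remember for the next turn
    return output                                 # result shown to the user


session = Session()
first = repl_turn(session, "a lighthouse at dusk")
second = repl_turn(session, "make it a video with rolling fog")
print(first)
print(second)
```

The second turn reuses the first turn's prompt, which is the property a one-shot generator lacks: each call there would start from an empty history.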

multi-agent autonomous software development

Factory runs software projects for 16 days autonomously via serial agents and validation contracts AI Engineer
TL;DW
Factory's 'missions' system runs multi-agent teams serially on features with targeted parallelization, achieving 16-day autonomous runs without human intervention.
Validation contracts defined during planning—not after coding—establish correctness independently of implementation, preventing drift in long-running agent systems.
Missions combine five multi-agent patterns: delegation, creator-verifier, broadcast, negotiation, and structured handoffs across orchestrator, worker, and validator roles.
Three-role architecture: orchestrator plans with validation contracts, workers implement features with clean context, validators verify both code quality and end-to-end behavior through computer use.
Serial feature execution with read-only parallelization prevents agent conflicts and duplicated work, reducing errors dramatically despite appearing slower on paper.
Validation includes dedicated code review agents and QA agents that interact with running applications—neither has seen the code, ensuring adversarial validation by design.
Right model selection per role ('droid whispering') matters: planning needs reasoning, implementation needs fluency, validation needs instruction-following—no single model excels at all three.
Structured handoffs between agents document what was completed, attempted, left undone, exit codes, and issues discovered, enabling self-healing at milestone boundaries.
In the Slack clone example, implementation took 60% of time and tokens, tests made up 50% of the final code with 90% coverage, and validation failed on the first attempt and then created follow-up features.
Prompt-based orchestration logic (700 lines) instead of hard-coded state machines ensures missions improve with each new model release rather than becoming obsolete.
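
The structured handoff listed among the points above could be represented as a small record like the one below. This is a sketch only: the `Handoff` class, its field names, and the `needs_follow_up` rule are assumptions derived from the summary (completed, attempted, left undone, exit codes, issues), not Factory's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Handoff:
    """Hypothetical handoff record passed from one agent to the next."""
    feature: str
    completed: list[str] = field(default_factory=list)
    attempted: list[str] = field(default_factory=list)
    left_undone: list[str] = field(default_factory=list)
    exit_codes: dict[str, int] = field(default_factory=dict)
    issues: list[str] = field(default_factory=list)

    def needs_follow_up(self) -> bool:
        """True when the next agent should not assume a clean slate."""
        failed = any(code != 0 for code in self.exit_codes.values())
        return failed or bool(self.left_undone) or bool(self.issues)


handoff = Handoff(
    feature="message threads",
    completed=["thread data model", "reply endpoint"],
    attempted=["unread counters"],
    left_undone=["unread counters"],
    exit_codes={"pytest": 1, "lint": 0},
    issues=["counter test flakes under concurrent replies"],
)

# Serialize so the record can be stored or handed to the next agent as text.
payload = json.dumps(asdict(handoff), indent=2)
print(payload)
print(handoff.needs_follow_up())
```

Keeping the record explicit and serializable is what would let an orchestrator decide at a milestone boundary whether to schedule follow-up work.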

Factory's Missions system chains planner, worker, and validator agents serially—avoiding conflicts from parallelization—with a correctness contract defined before coding begins. Workers inherit clean state from predecessors; validators span linting, type-checking, and live user-testing. Longest production run: 16 days.
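
As a rough illustration of a contract written before any code exists, the sketch below runs two features serially and validates each against checks defined at planning time. Everything here is hypothetical (`Check`, `plan`, `WORKERS`, `validate`, `run_mission`, and the toy "app" dictionary); it shows the ordering of plan, work, and validate, not Factory's system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One contract clause: a name plus a predicate over observable behavior."""
    name: str
    passes: Callable[[dict], bool]


def plan() -> dict[str, list[Check]]:
    """Orchestrator step: the contract is written before implementation."""
    return {
        "signup": [Check("stores user", lambda app: "alice" in app.get("users", []))],
        "post": [Check("stores message", lambda app: app.get("messages") == ["hi"])],
    }


# Hypothetical workers: each one changes the shared app state for one feature.
def work_signup(app: dict) -> None:
    app.setdefault("users", []).append("alice")


def work_post(app: dict) -> None:
    app.setdefault("messages", []).append("hi")


WORKERS: dict[str, Callable[[dict], None]] = {"signup": work_signup, "post": work_post}


def validate(app: dict, checks: list[Check]) -> list[str]:
    """Validator step: inspects only behavior (app state), never worker code."""
    return [check.name for check in checks if not check.passes(app)]


def run_mission() -> dict[str, list[str]]:
    contract = plan()
    app: dict = {}
    report: dict[str, list[str]] = {}
    for feature, checks in contract.items():  # serial: one feature at a time
        WORKERS[feature](app)
        report[feature] = validate(app, checks)  # names of failed checks
    return report


report = run_mission()
print(report)
```

Because the checks are fixed in `plan()` and the validator never reads worker code, a worker cannot quietly redefine what "correct" means partway through a long run.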

LLM dialogue comprehension failure modes

GPT-2 hallucinates speaker switches in dialogue, mirroring human Moses illusion PyData
TL;DW
Language models hallucinate speaker transitions in dialogue, expecting speakers to alternate even when the same speaker continues, likely because training data heavily emphasizes speaker switches rather than because models truly understand conversational norms.
Formal linguistic competence (grammar, syntax) does not equal functional competence in interactive dialogue; models can produce grammatical text while failing at turn-taking and speaker identity tracking.
Language models hallucinate inputs (misinterpreting what was said), not just outputs; this input-level hallucination may explain some dialogue failures, mirroring human semantic illusions like the Moses illusion.
Examining probability distributions and attention weights inside models can reveal rare failure cases without sampling thousands of outputs; internal structure analysis finds edge cases more efficiently than probabilistic sampling.
Scaling model size alone does not fix speaker-tracking failures; GPT-2 exhibited this problem years ago, and larger modern models still struggle with same-speaker continuations, contradicting the "bigger solves everything" assumption.
Transformers predict only words explicitly, not speech acts or social intent that humans predict; humans predict conversation types (complaint, request, statement), adding representational layers transformers lack.
Language models perform worse than humans at theory-of-mind tasks (reasoning about others' false beliefs), scoring at the level of a 4-to-5-year-old child, which limits their dialogue competence in social interaction.
Fine-tuning models on dialogue transcripts helps but doesn't eliminate speaker-hallucination errors; models matched human surprisal patterns for speaker switches but reversed expectations for same-speaker continuations.
Adding long-term memory structures (RAG, GraphRAG) to transformer backbones could help close human-model gaps, but current architecture alone cannot replicate human attention-memory interactions that ground dialogue understanding.
Human audiences unknowingly demonstrate the same semantic hallucination bias as language models; a majority incorrectly recalled Popeye eating spinach for smartness instead of strength, showing humans hallucinate expected inputs too.

Julia Mertens fine-tunes GPT-2 on dialogue data and measures surprisal against human reading times on controlled stimuli. The model treats natural same-speaker continuations as more surprising than incongruent ones—a reversal that persists regardless of scale and mirrors the Moses illusion, where prior representations override bottom-up input processing.
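
For readers unfamiliar with the metric, surprisal is the negative log probability a model assigns to what actually came next. The sketch below computes it in bits for a toy distribution; the probabilities and the `<same_speaker>` / `<speaker_switch>` labels are invented for illustration and are not taken from the talk's data.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, in bits: -log2(p)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical next-token distribution at a turn boundary for a model
# that has learned to expect speakers to alternate.
next_token_probs = {"<same_speaker>": 0.125, "<speaker_switch>": 0.875}

same = surprisal_bits(next_token_probs["<same_speaker>"])
switch = surprisal_bits(next_token_probs["<speaker_switch>"])
print(f"same-speaker continuation: {same:.2f} bits")
print(f"speaker switch: {switch:.2f} bits")
```

A model biased toward alternation assigns low probability, and therefore high surprisal, to a same-speaker continuation even when that continuation is perfectly natural for a human reader.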

agentic AI real security priorities

NDC: Data lake exposure eclipses prompt injection as critical risk in agentic systems NDC Conferences
TL;DW
Prompt injection receives disproportionate focus: teams that spend 90% of their AI security budget on it miss critical threats, such as data lake breaches, that can destroy an enterprise.
Data lake access is the most dangerous security surface: treat every connection as internet-exposed; require security engineers to personally review and execute all data queries, never allow direct access.
Split data lakes by workload type (agentic vs. classic, active vs. static) to apply appropriate security controls and prevent cross-contamination between agents, teams, and business units.
Guard models (Llama Guard, IBM Granite, GPT-O) provide free, open-source ingress/egress sanitization layers—they take text in, output safe/dangerous, replacing manual prompt filtering.
Parameterize inputs and keep arbitrary data outside AI systems: don't send real emails, credit cards, or URLs to agents; tokenize or primitize them instead.
Encryption key material cannot be deep-faked by AI—encrypt sensitive data at rest and in transit, only decrypting at the correct model/workload/customer combination.
With AI, measure attack response time in seconds or minutes, not hours or days: implement kill switches and company-wide character-set sanitization at the gateway, not just per application.
Track injection attacks at each infrastructure layer differently: gateway hits are normal; cloud hits are concerning; backend database hits signal breach; agentic layer hits indicate lateral movement or insider threat.
Learn model families, training data sources, and known vulnerabilities for each model your engineers propose; even a cursory knowledge of model security prevents deployment mistakes.
Assign lifecycle ownership and maintenance costs to each security control; with limited team capacity, prioritize controls you'll actually sustain rather than deploying 30 controls and maintaining none.
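
The advice above to tokenize sensitive values before they reach an agent might look something like this. It is a minimal sketch under loud assumptions: `TokenVault` is a hypothetical in-memory stand-in for a hardened vault service, and the two regular expressions are deliberately simple illustrations, not production-grade detectors.

```python
import re
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping opaque tokens to real values."""

    def __init__(self) -> None:
        self._by_token: dict[str, str] = {}
        self._by_value: dict[str, str] = {}

    def tokenize(self, value: str, kind: str) -> str:
        if value in self._by_value:  # same value always gets the same token
            return self._by_value[value]
        token = f"<{kind}_{secrets.token_hex(4)}>"
        self._by_token[token] = value
        self._by_value[value] = token
        return token

    def detokenize(self, text: str) -> str:
        """Restore real values; meant to run outside the AI system."""
        for token, value in self._by_token.items():
            text = text.replace(token, value)
        return text


# Simplified patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def sanitize(text: str, vault: TokenVault) -> str:
    """Replace emails and card-like numbers with opaque tokens."""
    text = EMAIL.sub(lambda m: vault.tokenize(m.group(), "EMAIL"), text)
    return CARD.sub(lambda m: vault.tokenize(m.group(), "CARD"), text)


vault = TokenVault()
raw = "Refund 4111 1111 1111 1111 and notify ana@example.com"
safe = sanitize(raw, vault)
print(safe)
```

The agent only ever sees `safe`; the mapping back to real values stays in the vault, so a leaked prompt or transcript exposes tokens rather than card numbers or addresses.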

Jon McCoy argues that the concern over prompt injection reflects survivorship bias, while the real threat is multi-agent, multi-team data lake access without workload isolation. He recommends treating data lake connections as internet-exposed endpoints, decomposing lakes by workload, and having security teams own every data pull.
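
One way to picture workload isolation combined with security-owned data pulls is the deny-by-default check below. All names (`Workload`, `LakePartition`, `may_read`, `DataPull`) are hypothetical, and the matching rule is an assumption chosen for illustration; a real deployment would enforce this in the data platform's own access controls rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    team: str
    kind: str  # e.g. "agentic" or "classic"


@dataclass(frozen=True)
class LakePartition:
    name: str
    owner_team: str
    kind: str


def may_read(workload: Workload, partition: LakePartition) -> bool:
    """Deny by default: both the team and the workload kind must match."""
    return workload.team == partition.owner_team and workload.kind == partition.kind


@dataclass
class DataPull:
    """A request that cannot run until a security reviewer is recorded."""
    workload: Workload
    partition: LakePartition
    query: str
    approved_by: Optional[str] = None

    def execute(self) -> str:
        if not may_read(self.workload, self.partition):
            raise PermissionError("workload is not isolated to this partition")
        if self.approved_by is None:
            raise PermissionError("no security reviewer has approved this pull")
        # Placeholder for the reviewer actually running the query.
        return f"{self.approved_by} ran {self.query!r} on {self.partition.name}"


agent = Workload(team="billing", kind="agentic")
lake = LakePartition(name="billing-agentic", owner_team="billing", kind="agentic")
other = LakePartition(name="hr-classic", owner_team="hr", kind="classic")

pull = DataPull(agent, lake, "SELECT count(*) FROM invoices", approved_by="sec-eng-1")
print(pull.execute())
print(may_read(agent, other))
```

The agent never holds a direct connection in this sketch: it can only file a `DataPull`, and the pull fails unless the partition matches its workload and a reviewer is attached.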