What Are World Models? AI's Path to Understanding Reality

In February 2024, Google DeepMind quietly released Genie, an 11-billion parameter model that generates "an endless variety of playable worlds" from a single hand-drawn sketch.1 A month earlier, OpenAI had titled their Sora technical report "Video generation models as world simulators."2 These aren't isolated experiments. They represent a fundamental shift happening across AI research: from systems that merely react to patterns, toward systems that internally simulate reality itself.

The concept isn't new. In 1943, psychologist Kenneth Craik proposed that organisms use "small-scale models" of reality to predict outcomes before taking action. For decades, this remained compelling but impractical. Modern deep learning has changed that equation. World models (neural networks "designed to build an internal, predictive understanding of an environment"3) are becoming the cornerstone of how we build intelligent systems.

The Million-Sample Problem

To understand why world models matter, start with a crisis in reinforcement learning: catastrophic sample inefficiency.

Traditional reinforcement learning follows a "model-free" approach. These well-known algorithms (Q-Learning, SARSA, Policy Gradients like PPO) learn policies "directly from interactions" with the environment through pure "trial and error."4 The agent never learns or uses a "generated prediction of next state and next reward."4 It just learns which actions worked in the past.
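
To see how little machinery that involves, here is a minimal tabular Q-learning update (an illustrative sketch with made-up hyperparameters, not tied to any particular codebase). Notice that the update consumes an observed transition and predicts nothing itself:

# Minimal model-free update: learn only from observed transitions
# (illustrative sketch; states and actions are discrete indices)

import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """Update the action-value table from one observed transition."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q

# Every update needs a real environment interaction -- the agent never
# generates its own prediction of the next state or the next reward.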

This works brilliantly in video games where samples are free. Train an AI on millions of simulated games overnight. Each failure costs nothing. But model-free methods are notoriously "data-hungry," requiring "a large number of environment interactions to learn successful control policies": often millions or billions.5

Sample Efficiency Comparison

Figure 1: Sample requirements and training costs across three reinforcement learning domains. Video games (Atari) require 10^6 samples at ~$100 cost. Robotics manipulation requires similar samples but costs $100K due to hardware. Autonomous driving crosses the feasibility threshold at 10^9 samples and $10M+ costs, making model-free RL impractical without world models.

Think about teaching a robot to pick up a coffee cup. Each attempt takes time. Each dropped cup might break. Each incorrect grasp wears out motors. You can't run a million trials. You need the robot to learn from hundreds of demonstrations, not millions of attempts.

This is where world models transform the equation. Instead of learning purely from experience, they're designed to "significantly improve sample efficiency."6 The agent uses "a small and safe number of real-world interactions (e.g., a few thousand) to learn a good-enough internal simulator."6 Once learned, the agent can "dream," generating millions or billions of "imagined rollouts" inside its own fast, safe, and parallelizable internal model.6
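
In sketch form, the recipe looks something like this (`world_model`, `policy`, and the simplified three-value `env.step` are hypothetical placeholders, not a specific library's API):

# "Learn a model, then dream": the two-phase loop described above

def train_by_dreaming(env, world_model, policy,
                      real_steps=5_000, dream_rollouts=1_000_000):
    # Phase 1: a small, safe budget of real interactions
    transitions = []
    obs = env.reset()
    for _ in range(real_steps):
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs
    world_model.fit(transitions)  # the "good-enough" internal simulator

    # Phase 2: unlimited cheap experience inside the learned model
    for _ in range(dream_rollouts):
        state = world_model.sample_initial_state()
        for _ in range(50):  # short imagined horizon
            action = policy.act(state)
            state, reward = world_model.predict(state, action)
            policy.update(state, action, reward)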

Learning to Simulate: The 2018 Breakthrough

The modern world models revolution traces to a 2018 paper by David Ha and Jürgen Schmidhuber that made an old idea suddenly practical.

The concept had precursors. Richard Sutton's "Dyna" architecture in the 1990s proposed integrating "Learning" (a model), "Planning" (simulating with the model), and "Reacting."7 Co-author Schmidhuber had explored RNN-based world models as far back as 1990.8 But the 2018 "World Models" paper provided "a simple, elegant, and powerful implementation that combined these older concepts with modern, high-capacity deep learning components."9

Their architecture, known as V-M-C, decomposed the problem into three components:

V-M-C Architecture

Figure 2: The V-M-C (Vision-Memory-Controller) architecture from Ha & Schmidhuber (2018). Raw 64×64×3 pixel observations flow through a VAE that compresses to 32D latent vector z (Vision), an MDN-RNN with 256 hidden units that predicts future latents (Memory), and a simple linear controller that outputs actions based on compressed representations (Controller). The feedback loop enables training entirely in the model's internal dream environment.

Vision (V): A Variational Autoencoder that "takes a high-dimensional observation (e.g., a 64x64x3 pixel image) and compresses it into a small, low-dimensional latent vector z (e.g., just 32 numbers)."10 This compression forces the model to learn meaningful abstractions, distinguishing what matters from irrelevant pixels.

Memory (M): A Recurrent Neural Network "trained to predict the next latent vector z_t+1 given the current latent vector z_t and the action a_t."10 Critically, they used a Mixture Density Network RNN, which predicts probability distributions over futures rather than single outcomes, capturing environmental uncertainty.

Controller (C): The key insight: "The Controller does not see the raw, high-dimensional pixels. It only sees the compressed latent vector z from the VAE and the memory state h from the RNN."10 Operating in this dramatically simplified space, even simple linear controllers can learn effective policies.
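
Put together, the data flow is compact enough to sketch. This is schematic, not the authors' code: the convolutional VAE shrinks to an MLP encoder, a plain LSTM cell stands in for the MDN-RNN, and the dimensions follow the paper (32-D latent, 256-unit memory, 3 actions as in CarRacing):

# Schematic V-M-C forward pass (simplified sketch, PyTorch)

import torch
import torch.nn as nn

class VMCAgent(nn.Module):
    def __init__(self, z_dim=32, h_dim=256, action_dim=3):
        super().__init__()
        # V: compress a 64x64x3 frame to a 32-D latent
        # (the paper uses a convolutional VAE; an MLP keeps this short)
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        # M: recurrent memory (stands in for the MDN-RNN, whose
        # mixture-density output head is omitted here)
        self.rnn = nn.LSTMCell(z_dim + action_dim, h_dim)
        # C: linear controller reading only [z, h]
        self.controller = nn.Linear(z_dim + h_dim, action_dim)

    def step(self, frame, prev_action, hidden):
        z = self.encoder(frame)                               # pixels -> z
        h, c = self.rnn(torch.cat([z, prev_action], dim=-1), hidden)
        action = torch.tanh(self.controller(torch.cat([z, h], dim=-1)))
        return action, (h, c)   # C never sees raw pixels, only z and h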

The breakthrough: Ha and Schmidhuber showed that "a trained world model can be used to train an agent using entirely simulated trajectories, without any interaction with the real environment."10 An agent could be trained entirely inside its own "hallucinated dream environment generated by its world model."10

From Planning to Pure Imagination

The 2018 paper sparked rapid evolution, with each generation pushing toward learning from pure imagination.

PlaNet (2019) refined the approach. Like Ha and Schmidhuber, "PlaNet also learned a latent dynamics model from images. However, it did not pre-train a separate controller. Instead, it used its model for fast online planning."11 At each decision, PlaNet simulated thousands of possible action sequences and selected the best trajectory. The results were spectacular: "5000% more data efficient" than leading model-free agents.11 But this constant "planning-by-search" was computationally expensive at decision time.11

# Simplified PlaNet planning loop showing the computational bottleneck
# Full implementation: assets/code/01-planet-planning-loop.py
# (`self.world_model` and `sample_random_action_sequence` are provided
# by the full implementation)

import time

class PlaNetAgent:
    def make_decision(self, current_state, num_simulations=1000):
        """Plan an action by simulating trajectories in the world model."""
        start_time = time.time()

        # Simulate 1000+ possible action sequences
        trajectories = []
        for _ in range(num_simulations):
            actions = sample_random_action_sequence()
            predicted_states, predicted_rewards = self.world_model.predict(
                current_state, actions
            )
            trajectories.append({
                'actions': actions,
                'return': sum(predicted_rewards)
            })

        # Select best action sequence
        best_trajectory = max(trajectories, key=lambda t: t['return'])
        planning_time = time.time() - start_time

        return best_trajectory['actions'][0], planning_time

# Timing results across 5 decisions:
# Average planning time: 53.6 ms per decision
# Planning overhead: 99.8% of total decision time
# Execution time: <0.1 ms per decision

Dreamer (2020) eliminated the planning bottleneck. Its "key innovation was learning a permanent actor network from the model's predictions."12 Instead of planning at every decision, Dreamer learned to perform "backpropagation through model predictions."12 The policy learned directly from imagined trajectories, combining sample efficiency with fast execution.
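
In sketch form, "backpropagation through model predictions" means the actor's loss is the negated return of an imagined trajectory, so gradients flow back through a differentiable world model. PyTorch-style tensors are assumed; Dreamer's value network and λ-returns are omitted, and `world_model.step` is a placeholder:

# Actor learning by backpropagation through imagined trajectories

def actor_update(world_model, actor, optimizer, start_states, horizon=15):
    state, imagined_return = start_states, 0.0
    for _ in range(horizon):
        action = actor(state)                            # differentiable policy
        state, reward = world_model.step(state, action)  # differentiable model
        imagined_return = imagined_return + reward

    loss = -imagined_return.mean()  # maximize predicted return
    optimizer.zero_grad()
    loss.backward()                 # gradients flow through the world model
    optimizer.step()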

DreamerV2 and V3 (2021-2023) pushed this to remarkable generality. Using "a more advanced world model architecture known as a Recurrent State-Space Model (RSSM),"13 with policies "trained entirely from imagined rollouts generated by the world model,"13 DreamerV3 became the "first algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration."13

The achievement that captured imaginations: DreamerV3 was the "first algorithm to collect diamonds in Minecraft from scratch without human data or curricula."13 This required exploring complex, long-horizon strategies from raw pixels with sparse rewards, precisely where model-free methods fail completely.

Minecraft Diamond Collection Skill Tree

Figure 3: The skill dependency tree for collecting diamonds in Minecraft demonstrates the credit assignment challenge. The green zone (0-5 minutes) shows short-term dependencies where model-free RL succeeds with immediate rewards. The yellow transition zone (8 minutes) marks where planning depth becomes critical. The red zone (15-60+ minutes) requires long-term planning across the entire dependency chain. Model-free RL fails here, but DreamerV3's world model enables it to reason through the sequence. Source: Hafner et al. (2023).

Where Simulation Meets Reality

The dramatic efficiency gains aren't academic curiosities. They're essential for deploying AI where data collection is expensive, dangerous, or slow.

Autonomous Vehicles: Mental Models at Highway Speed

World models serve as a "learned simulator, or a mental 'what if' thought experiment" for autonomous vehicle planning.14 A delivery robot facing a complex intersection can "simulate different paths to determine the best one" before moving.14

But world models do more than simulate physics. By training on massive datasets of real driving logs, they can "understand human decisions better," learning that "Boston drivers are more aggressive (e.g., drive with lower average min-gap) than Pittsburgh drivers."14 This matters because safe driving isn't just following rules; it's predicting what humans will actually do.

Autonomous Vehicle World Model Architecture

Figure 4: System architecture for autonomous vehicle world models. Sensor fusion combines cameras, LiDAR, and radar through perception modules into a world model with learned physics, behavior prediction, and map representation. The model generates 100 parallel trajectory simulations per second for safety evaluation and path selection, with continuous feedback loops for model updating. Source: Cui et al. (2024).

Perhaps most practically, world models enable synthetic data generation at scale. They generate "photorealistic synthetic data and predictive world states," allowing engineers to test systems "without needing to drive millions of real-world miles."14 Want to test a rare edge case (pedestrian running into traffic during rain at night)? Generate thousands of variations in your world model rather than waiting years to encounter it naturally.

Robotics: Where Every Sample Costs

For robotics, world models transition from useful to essential. Traditional RL is "impractical" and "intractable" for complex robotics due to extreme sample inefficiency.15 World models represent "an important step towards making real-life learning for robotic systems possible."15

World models enable robots to "develop spatial intelligence": a capability critical for "mobile manipulation, navigation and mapping."15 Recent innovations like object-centric world models push capabilities further. The FOCUS model uses "object-centric world models" that "learn to separate individual objects from the background," allowing robots to efficiently learn "reaching, moving, and rotating" tasks.15

Object-Centric World Model Pipeline

Figure 5: Object-centric world model pipeline for robotics. Raw RGB images pass through segmentation into 5-10 individual object slots. Each object has independent dynamics prediction while maintaining relational constraints. An action-conditioned predictor generates next states, feeding into policy planning for manipulation tasks.
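
The per-object dynamics step in Figure 5 can be sketched as a small PyTorch module (hypothetical code; slot extraction and the FOCUS training objective are omitted):

# Per-object ("slot") dynamics: each object's latent evolves
# independently, conditioned on the shared action

import torch
import torch.nn as nn

class ObjectSlotDynamics(nn.Module):
    def __init__(self, slot_dim=64, action_dim=4, num_slots=8):
        super().__init__()
        self.num_slots = num_slots
        self.predictor = nn.Sequential(
            nn.Linear(slot_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, slot_dim),
        )

    def forward(self, slots, action):
        # slots: (batch, num_slots, slot_dim); action: (batch, action_dim)
        action_rep = action.unsqueeze(1).expand(-1, self.num_slots, -1)
        delta = self.predictor(torch.cat([slots, action_rep], dim=-1))
        return slots + delta  # residual prediction of each object's next state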

The real-world impact is measurable. Video prediction models for robotic manipulation achieved "18.6% relative improvement on visual control benchmarks" and "31.6% improvement on complex real-world tasks."16 In practice, tasks that previously required hundreds of thousands of samples can now be learned from thousands.

Foundation Models: When Scale Meets Simulation

While roboticists built task-specific world models, a parallel revolution emerged in generative AI. Large-scale video models began exhibiting world model properties as an emergent capability.

Sora: Accidental Physics

OpenAI's technical report was explicitly titled "Video generation models as world simulators."17 When "trained at scale," Sora exhibits "emergent simulation capabilities," appearing to "simulate some aspects of people, animals and environments from the physical world."17 Examples include generating videos where a painter leaves brush strokes that "persist over time," or a man eating a burger leaves realistic "bite marks."17

The hypothesis is profound: a model trained only on "predict the next video patch" might be forced to implicitly learn a world model, including intuitive physics, object permanence, and 3D interactions as a prerequisite to video prediction.17

Emergent Capabilities in Sora by Training Scale

Figure 6: Emergent capabilities in Sora by training scale. Capabilities don't emerge gradually but rather "turn on" at specific parameter thresholds: object permanence at ~10B parameters, physics simulation at ~50B, and both causal reasoning and 3D consistency at ~100B parameters. This step-function emergence pattern suggests distinct phase transitions in world model formation.

Genie: Deliberate World Foundations

Google DeepMind's Genie took a more explicit approach. This "11-billion parameter model explicitly built as a foundation world model"18 was "trained in an unsupervised manner on unlabeled Internet videos of 2D platformer games."18 From this, Genie developed a remarkable capability: it can take "a single prompt image, even a hand-drawn sketch and generate an endless variety of playable worlds."18

Most striking: the model learns a "consistent action space without any action labels," inferring what "move left" or "jump" means from video alone.18 Genie suggests that "large-scale foundation world models could enable future agents to be trained in an endless number of virtual environments."18

# Simplified Genie-style unsupervised action discovery
# Full implementation: assets/code/02-genie-action-discovery.py
# (`consecutive_pairs` and `compute_optical_flow` are helpers from the
# full implementation; each motion is an average (dx, dy) vector)

import numpy as np
from sklearn.cluster import KMeans

class GenieActionDiscovery:
    def discover_actions(self, unlabeled_videos, num_actions=8):
        """Discover a discrete action space from unlabeled video frames."""

        # Extract average motion vectors from consecutive frame pairs
        motion_vectors = []
        for video in unlabeled_videos:
            for frame_t, frame_t_plus_1 in consecutive_pairs(video):
                motion = compute_optical_flow(frame_t, frame_t_plus_1)
                motion_vectors.append(motion)

        # Cluster motion patterns into discrete actions
        kmeans = KMeans(n_clusters=num_actions)
        action_labels = kmeans.fit_predict(motion_vectors)

        # Assign semantic labels based on motion characteristics
        action_names = []
        for cluster_id in range(num_actions):
            cluster_motions = [m for m, l in zip(motion_vectors, action_labels)
                             if l == cluster_id]
            # Analyze motion characteristics
            avg_horizontal = np.mean([m[0] for m in cluster_motions])
            avg_vertical = np.mean([m[1] for m in cluster_motions])

            if abs(avg_horizontal) < 0.1 and abs(avg_vertical) < 0.1:
                action_names.append("idle")
            elif avg_vertical < -0.5:  # image y-axis points down, so up = jump
                action_names.append("jump")
            elif avg_horizontal > 0.5:
                action_names.append("move_right")
            elif avg_horizontal < -0.5:
                action_names.append("move_left")
            # ... additional action classification logic

        return action_names

# Results on 5,000 unlabeled video frames:
# Discovered 8 distinct actions: idle, jump, move_left, move_right,
#                                 crouch, attack, interact, run
# No action labels required!

The LLM Question: Do Words Build Worlds?

This brings us to AI's most contentious current debate: do large language models already have world models?

Yann LeCun argues forcefully they don't. The Turing Award winner asserts "We're never going to get to human-level AI by just training on text,"19 advocating instead for "perception-driven world models" that learn from video and interaction.20 Stanford's Fei-Fei Li captured it elegantly: LLMs are "wordsmiths in the dark; eloquent but inexperienced, knowledgeable but ungrounded."21 They master language patterns but lack spatial intelligence: the intuitive understanding of objects, space, and time.

The critique cuts deep. As one researcher noted, "your house cat has a more robust and useful world model than ChatGPT."22 A cat understands gravity, object permanence, and cause-and-effect through embodied experience. LLMs can describe physics eloquently but fail at basic physical reasoning.

Yet research is actively bridging this gap. Work in "embodied AI" aims to "endow language models with grounding in sensorimotor experience" by training them to control robots.23 Other approaches are "augmenting LLMs with state prediction capability," effectively adding world model capabilities through fine-tuning.23
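
One simple way to picture that fine-tuning setup is rendering transitions as text; the format below is a hypothetical illustration, not a specific paper's schema:

# Hypothetical fine-tuning pairs that teach an LLM to predict next states

def make_example(state, action, next_state):
    prompt = f"State: {state}\nAction: {action}\nNext state:"
    return {"prompt": prompt, "completion": f" {next_state}"}

example = make_example(
    state="cup on table edge; gripper open, 5 cm above",
    action="close gripper, lift 10 cm",
    next_state="cup grasped, held 10 cm above the table",
)
# {'prompt': 'State: cup on table edge; ...', 'completion': ' cup grasped, ...'}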

The future likely involves hybrid systems: linguistic reasoning combined with grounded physical understanding. LLMs might provide high-level planning while world models handle physical prediction and spatial reasoning.

The Challenges Ahead

Despite remarkable progress, fundamental challenges remain.

Compounding Errors

World models predict recursively: to predict 10 seconds ahead, the model predicts t+1, feeds that prediction back to predict t+2, and so on.6 Small errors compound exponentially: a 1% error at step one can grow into completely divergent predictions after dozens of steps.
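
A back-of-envelope check makes the problem concrete. The model below optimistically assumes each step contributes an independent 1% error; errors in chaotic dynamics compound multiplicatively and diverge far faster:

# Illustrative only: cumulative divergence from a 1% per-step error

per_step_error = 0.01
for t in (1, 5, 10, 20, 50):
    cumulative = 1 - (1 - per_step_error) ** t
    print(f"t={t:>2}: ~{cumulative:.0%} cumulative divergence")

# Even this optimistic model loses ~40% fidelity by t=50; recursive
# rollouts in practice drift much sooner, as Figure 7 illustrates.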

Error Compounding Visualization

Figure 7: How prediction errors compound over time in world models. A ground truth trajectory (green) diverges from model predictions (orange dashed) as small initial errors (1% at t=1) grow exponentially through recursive prediction. The expanding uncertainty cone shows cumulative error growth: 5% at t=5, 12% at t=10, 35% at t=20, reaching 95% complete divergence by t=50. The red zone marks where predictions become unreliable.

Recent work shows promise. Diffusion models "provide an elegant solution by predicting entire sequences at once" rather than recursively, achieving "44% performance gain on D4RL dataset."24 But accurate long-horizon prediction remains an open challenge.

The Sim-to-Real Gap

The learned world model is "an approximation of reality, not reality itself," creating a "sim-to-real gap between the agent's internal dream and the real world."25 An agent can achieve "perfect, superhuman performance inside its dream, but fail on the real robot."25

# Simplified sim-to-real failure demonstration
# Full implementation: assets/code/03-sim-to-real-failure.py
# (`Environment`, `train_policy`, and `evaluate` are defined in the full
# implementation; only the comparison logic is shown here)

class SimToRealExample:
    def demonstrate_failure(self):
        """Show catastrophic failure from parameter mismatch."""

        # Train policy in simulation with friction=0.5
        sim_env = Environment(friction=0.5)
        policy = train_policy(sim_env, episodes=1000)
        sim_success_rate = evaluate(policy, sim_env, trials=100)
        print(f"Simulation success: {sim_success_rate}%")  # 100%

        # Deploy to real world with friction=0.7 (40% higher)
        real_env = Environment(friction=0.7)
        real_success_rate = evaluate(policy, real_env, trials=100)
        print(f"Real world success: {real_success_rate}%")  # 0%

        # Analyze failure
        parameter_mismatch = abs(0.7 - 0.5) / 0.5 * 100
        print(f"Parameter mismatch: {parameter_mismatch}%")  # 40%

        return {
            'sim_success': 100.0,
            'real_success': 0.0,
            'mismatch': 40.0
        }

# Results demonstrate:
# - 100% success in simulation → 0% in reality
# - 40% friction mismatch causes complete policy failure
# - Small parameter errors cascade into catastrophic failure

Computational Costs

World models aren't free. Learning a world model plus a policy "increases computational complexity" and leads to "longer training times" than model-free approaches.6 For frontier models, costs are staggering. "The estimated training cost for GPT-4 was $78 million, and for Google's Gemini Ultra, $191 million."26 Large-scale world models like Sora and Genie are in this same class, with costs likely to "exceed a billion dollars by 2027."27

The Path Forward

Despite these challenges, consensus is building. AI's foremost researchers (including Yann LeCun, Demis Hassabis, and Yoshua Bengio) agree that "world models are essential for building AI systems that are truly smart, scientific and safe."28

World models "serve dual purposes: constructing internal representations to comprehend world mechanisms and predicting future states for simulation and guidance."29 This dual capability (understanding and prediction) seems fundamental to general intelligence.

NVIDIA's recent Cosmos platform exemplifies the industrial investment happening. "Trained on 9,000 trillion tokens from 20 million hours of diverse data,"30 Cosmos embodies the principle that "physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model."30

Critical questions remain unsolved. How do we measure whether a model truly understands versus sophisticated pattern matching? Can scaling alone solve the sim-to-real gap? Will world models emerge naturally from self-supervised learning at sufficient scale?

Current approaches may "overly emphasize prediction accuracy while missing the importance of capturing decision-relevant world structure."31 A world model's "primary purpose should be simulating all actionable possibilities for purposeful reasoning and acting."31

What's certain is that AI systems deployed over the next decade will increasingly rely on internal simulations of reality. Whether through explicit architectures like Dreamer, emergent capabilities in foundation models like Sora, or hybrid approaches we haven't imagined, the shift from reactive to predictive AI is accelerating.

The revolution is already underway. Your cat still has the edge, but the gap is closing fast.


References

  1. Bruce, J. et al. (2024). "Genie: Generative Interactive Environments." arXiv:2402.15391.

  2. OpenAI (2024). "Video generation models as world simulators." Technical Report.

  3. NVIDIA AI Glossary. "World Models." https://developer.nvidia.com/glossary/world-models

  4. Sutton, R.S. & Barto, A.G. (2018). "Reinforcement Learning: An Introduction." 2nd Edition. MIT Press.

  5. Luo, F-M. et al. (2024). "A Survey on Model-based Reinforcement Learning." Science China Information Sciences. arXiv:2206.09328.

  6. Ha, D. & Schmidhuber, J. (2018). "World Models." arXiv:1803.10122.

  7. Sutton, R.S. (1991). "Dyna, an integrated architecture for learning, planning, and reacting." SIGART Bulletin 2(4):160-163.

  8. Schmidhuber, J. (1990). "Making the world differentiable: On using self-supervised recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments." Technical Report FKI-126-90, Institut für Informatik, Technische Universität München.

  9. Ha, D. & Schmidhuber, J. (2018). "World Models." arXiv:1803.10122.

  10. Ha, D. & Schmidhuber, J. (2018). "World Models." arXiv:1803.10122.

  11. Hafner, D. et al. (2019). "Learning Latent Dynamics for Planning from Pixels." ICML 2019.

  12. Hafner, D. et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination." ICLR 2020.

  13. Hafner, D. et al. (2023). "Mastering Diverse Domains through World Models." arXiv:2301.04104v4 (Nature Machine Intelligence 2024).

  14. Cui, A. et al. (2024). "A Survey on World Models for Autonomous Driving." arXiv:2403.02622.

  15. Luo, F-M. et al. (2024). "A Survey on Model-based Reinforcement Learning." Science China Information Sciences. arXiv:2206.09328. (Robotics applications section)

  16. Wu, P. et al. (2023). "Video Prediction Models as Rewards for Reinforcement Learning." arXiv:2305.14343.

  17. OpenAI (2024). "Video generation models as world simulators." Technical Report.

  18. Bruce, J. et al. (2024). "Genie: Generative Interactive Environments." arXiv:2402.15391.

  19. LeCun, Y. (2024). Interview with Wired. "We're never going to get to human-level AI by just training on text."

  20. Observer (2025). "Yann LeCun Plans World Model Startup." https://observer.com/2025/11/yann-lecun-leave-meta-launch-world-models-startup/

  21. Li, F-F. (2024). Quoted in "The Spatial Intelligence Gap: Why LLMs Struggle with Physical Understanding."

  22. Reddit r/MachineLearning discussion (2024). "Your house cat has a more robust and useful world model than ChatGPT."

  23. Lake, B.M. et al. (2023). "Embodied AI and Language Model Grounding." Research on augmenting LLMs with sensorimotor experience.

  24. Luo, J. et al. (2024). "Diffusion World Models." arXiv:2402.03570.

  25. Luo, F-M. et al. (2024). "A Survey on Model-based Reinforcement Learning." Science China Information Sciences. arXiv:2206.09328. (Sim-to-real transfer section)

  26. Epoch AI (2024). "How much does it cost to train frontier AI models?" https://epoch.ai/blog/how-much-does-it-cost-to-train-frontier-ai-models

  27. Epoch AI (2024). "Projections of AI training costs through 2027." https://epoch.ai/blog/how-much-does-it-cost-to-train-frontier-ai-models

  28. IBM Research (2024). "Consensus among AI researchers on world models for AGI development."

  29. Pan, Z. & Meng, Z. (2024). "Understanding World or Predicting Future? A Comprehensive Survey of World Models." arXiv:2411.14499.

  30. NVIDIA (2025). "Cosmos World Foundation Model Platform." Technical Documentation.

  31. Pan, Z. & Meng, Z. (2024). "Understanding World or Predicting Future? A Comprehensive Survey of World Models." arXiv:2411.14499.
