Jim Fan’s talk reveals the bigger shift: AI is moving from predicting tokens to predicting the physical world
I watched Jim Fan’s Sequoia AI Ascent talk expecting a robotics update.
But I realized Jim is giving us a heads-up: AI is about to break out of text boxes and into our daily lives. “Physical AI,” as the cool kids call it these days 😉
LLMs scaled because next-token prediction became a general-purpose simulator of language.
At small scale, it looked like autocomplete. At scale, it learned the latent structure of grammar, code, reasoning, style, knowledge, and intent.
Physical AI is trying to build the equivalent simulator for the real world.
The prediction target changes from:
next token
to:
next world state
That means learning how physical scenes evolve: motion, contact, gravity, occlusion, force, friction, lighting, deformation, slippage, failure, and recovery.
Generative video suddenly looks different through this lens. The useful part is not the generated clip. It is the learned approximation of dynamics underneath it. To predict future frames, a model has to learn something about object permanence, movement, lighting, transitions, spatial consistency, and cause-effect over time.
Cosmos becomes the first layer of NVIDIA’s Physical AI stack: a world model for controllable physical-world generation, synthetic data, and long-tail scenarios.
Generated worlds still need structure.
Omniverse provides the spatial and industrial backbone: assets, geometry, materials, lighting, sensors, digital twins, factories, roads, rooms, machines, and workcells. It turns generated scenes into structured worlds.
Structured worlds then need to become training environments.
Isaac Sim and Isaac Lab provide the robotics learning layer: simulation, policy training, benchmarking, randomization, resets, stress testing, imitation learning, and reinforcement learning.
This is why Fan’s line is so powerful:
compute = environment = data
In LLMs, compute scaled token learning.
In Physical AI, compute scales environments: more rollouts, more failures, more counterfactuals, more rare cases, and more policy improvement before real-world deployment.
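To make the compute = environment = data idea concrete, here is a minimal sketch, not NVIDIA's pipeline. It is a toy 1D pushing task in plain Python where every extra environment gets a randomized friction value and yields one more labeled rollout, failures included. The simulator, the `Rollout` record, the friction range, and the success zone are all illustrative assumptions; Isaac Sim and Isaac Lab expose their own, much richer interfaces.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    friction: float   # randomized environment parameter
    final_pos: float  # where the block ended up
    success: bool     # did it stop inside the target zone?


def simulate_push(force: float, friction: float, steps: int = 50, dt: float = 0.02) -> float:
    """Toy 1D block of unit mass: constant push opposed by friction."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        accel = force - friction * 9.81
        vel = max(0.0, vel + accel * dt)  # the block never slides backwards
        pos += vel * dt
    return pos


def collect_rollouts(n_envs: int, force: float, seed: int = 0) -> list[Rollout]:
    """More environments (more compute) means more rollouts (more data)."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(n_envs):
        friction = rng.uniform(0.1, 0.9)  # domain randomization (assumed range)
        final_pos = simulate_push(force, friction)
        rollouts.append(Rollout(friction, final_pos, 0.5 <= final_pos <= 1.5))
    return rollouts
```

Calling `collect_rollouts(200, force=5.0)` returns a mix of successes, overshoots, and blocks that never move. Doubling `n_envs` doubles the dataset without touching a real robot.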
The trained system then needs an action policy.
GR00T is NVIDIA’s push toward the policy layer: connecting vision, language, world context, and real-time motor action. It moves the system from predicting what may happen to acting in a way that changes what happens.
The stack becomes clear:
Cosmos models the world. Omniverse structures the world. Isaac trains inside the world. GR00T acts in the world.
That is NVIDIA’s Physical AI factory.
The factory, however, does not solve deployment by itself.
Once Physical AI leaves simulation, the bottleneck shifts from model scale to field execution: latency, power, thermal limits, camera pipelines, sensor fusion, wireless connectivity, local safety, and cost.
That is Qualcomm’s terrain.
A model trained in a GPU cluster still has to run inside drones, cameras, vehicles, AMRs, industrial devices, safety systems, and embedded machines. These endpoints need local inference, efficient NPUs, camera and sensor processing, connectivity, and enough performance per watt to survive outside the lab.
A split strategy is emerging.
NVIDIA is building the factory where Physical AI learns.
Qualcomm is building the fabric where Physical AI runs.
Training pulls toward accelerated compute, synthetic worlds, simulation, and policy learning.
Deployment pulls toward efficient, connected, sensor-rich edge platforms.
At some layers they will overlap. At others, they may cooperate.
But the architecture is becoming visible.
Physical AI is the stack that lets machines perceive the world, predict the next state, act under physical constraints, and learn from the result.
ChatGPT made AI conversational.
Physical AI makes AI executable.
VLA Was the Bridge, Not the Destination
One slide in the talk explains why the first wave of robot foundation models looked the way they did.
Take an image. Add an instruction. Pass both into a vision-language model. Attach an action head. Decode motor commands.
That is the basic shape of Vision-Language-Action models.
It was a necessary step. VLA gave robots a way to connect perception, language, and action. Instead of hand-coding every behavior, you could prompt the system with a task and let the model translate visual context into movement.
Models like GR00T N1.7, π0.7, RT-2, and OpenVLA sit in this family.
But Jim Fan’s critique is that these models are still “head-heavy” in the wrong place. Much of the intelligence sits inside the language and vision backbone. The action layer is often attached at the end.
That makes VLA good at semantics.
It can understand objects, instructions, categories, and relationships.
But physical work is dominated by dynamics.
A robot does not only need to know what the vegetable is or where the grocery bag is. It needs to understand whether the object will roll, slip, deform, collide, resist, fall, or block another motion. It needs to predict how the scene changes when the robot acts.
That is the limitation of a language-first architecture.
It knows the task.
It does not necessarily understand the physics of the task.
Fan’s next move is the important one: shift from VLA to world-action models.
The goal is not just:
vision + language → action
The goal is closer to:
world state + action → future world state
Once the model can predict how the world changes under action, physical intelligence becomes less like instruction following and more like short-horizon simulation.
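The difference between the two signatures can be sketched as two Python interfaces. This is an illustration under stated assumptions, not the API of GR00T, DreamZero, or any other real model: `State`, `Action`, `VLAPolicy`, `WorldActionModel`, and the toy `PointMassWorld` are hypothetical names invented here.

```python
from typing import Protocol, Sequence

State = Sequence[float]   # hypothetical: flat vector of scene features
Action = Sequence[float]  # hypothetical: flat vector of motor commands


class VLAPolicy(Protocol):
    """vision + language -> action"""
    def act(self, image: State, instruction: str) -> Action: ...


class WorldActionModel(Protocol):
    """world state + action -> future world state"""
    def predict(self, state: State, action: Action) -> State: ...


class PointMassWorld:
    """Toy stand-in for a learned model: state = (position, velocity)."""

    def __init__(self, dt: float = 0.1) -> None:
        self.dt = dt

    def predict(self, state: State, action: Action) -> State:
        pos, vel = state
        (accel,) = action
        new_vel = vel + accel * self.dt
        return (pos + new_vel * self.dt, new_vel)


def rollout(model: WorldActionModel, state: State, actions: Sequence[Action]) -> list:
    """Short-horizon simulation: apply actions one by one, keep every state."""
    trajectory = [tuple(state)]
    for action in actions:
        state = model.predict(state, action)
        trajectory.append(tuple(state))
    return trajectory
```

The VLA interface returns a command and stops there. The world-action interface can be called repeatedly, which is what makes the short-horizon simulation in `rollout` possible.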
That is the bridge from semantic robots to physically intelligent machines.
World Models: The New Pre-Training Layer
A generative video model is trained to predict how pixels evolve over time. At the surface, that produces strange clips, broken geometry, and artifacts. Underneath, the model is being forced to learn useful approximations of physical dynamics.
A ball falls because gravity is statistically consistent.
Water bends light because refraction is visually consistent.
A reflective sphere preserves a warped version of its environment because reflections are structured.
The model is not solving physics equations. It is learning a latent approximation of how the visible world changes.
That is enough to make world models valuable for Physical AI.
For language models, pre-training created a broad simulator of language. For Physical AI, world modeling creates a broad simulator of physical futures.
The immediate value is prediction.
Given the current scene, what is likely to happen next?
The deeper value is counterfactual generation.
What would happen under different lighting, different object placement, different camera angle, different weather, different surface, different trajectory, different failure?
That is where models like Cosmos become strategically important.
A controllable world model can generate synthetic physical scenarios instead of waiting for every rare event to occur in the real world. It can expand the training set around edge cases: near misses, occlusions, unusual reflections, difficult grasps, smoke evolution, cluttered scenes, changing viewpoints, and unsafe states.
The generated clip is not the product.
The generated variation is the data.
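A rough sketch of what "variation as data" means in practice: start from one base scene and enumerate counterfactual versions along a few controllable axes. The scene fields and the values below are made up for illustration. How a world model such as Cosmos actually accepts conditioning is not shown here and is not assumed.

```python
from itertools import product

# Hypothetical base scene and variation axes, invented for this example.
BASE_SCENE = {
    "task": "pick up mug",
    "lighting": "daylight",
    "surface": "wood",
    "camera": "front",
}

VARIATIONS = {
    "lighting": ["daylight", "dim", "backlit"],
    "surface": ["wood", "wet steel"],
    "camera": ["front", "overhead"],
}


def counterfactuals(base: dict, variations: dict):
    """Yield every combination of the variation axes except the base scene itself."""
    keys = list(variations)
    for values in product(*(variations[k] for k in keys)):
        scene = dict(base)
        scene.update(zip(keys, values))
        if scene != base:
            yield scene
```

Three lighting options, two surfaces, and two cameras turn one recorded scene into eleven counterfactual conditions. Each one is a candidate prompt for generating a rare case that would otherwise have to be waited for.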
World models become the pre-training layer for Physical AI because they give machines a way to learn before acting.
They create the raw material for the next step: action fine-tuning.
World-Action Models: Prediction Becomes Control
World models predict how a scene may evolve.
Physical AI needs one step more: predict how the scene evolves because of an action.
That is the shift Fan makes with DreamZero and World-Action Models.
A robot policy is no longer just:
observation → motor action
It becomes:
observation + candidate action → future world state + motor action
The model jointly predicts the next visual state and the next action trajectory. Motor commands become part of the same generative process as pixels. If the imagined future is coherent, the action is more likely to work. If the imagined future collapses, the action usually fails.
This is the bridge from world modeling to action fine-tuning.
The robot is not only recognizing the task. It is rolling forward a short physical future, choosing an action, observing the result, and correcting.
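The loop of imagining, choosing, acting, and correcting can be sketched with a much simpler stand-in technique: a random-shooting planner over a toy dynamics model. This is not DreamZero's method, which jointly generates pixels and actions. It only illustrates rolling forward candidate actions and keeping the one whose imagined future looks best. The 1D cart, the cost function, and every parameter are assumptions made for the example.

```python
import random


def world_model(pos: float, vel: float, push: float, dt: float = 0.1, friction: float = 0.5):
    """Hypothetical stand-in for learned dynamics: 1D cart with viscous friction."""
    vel = vel + (push - friction * vel) * dt
    pos = pos + vel * dt
    return pos, vel


def imagine(pos: float, vel: float, pushes):
    """Roll a candidate action sequence forward without touching the real system."""
    for push in pushes:
        pos, vel = world_model(pos, vel, push)
    return pos, vel


def plan(pos, vel, target, horizon=5, n_candidates=64, rng=None):
    """Random shooting: sample action sequences, score the imagined futures, keep the best."""
    rng = rng or random.Random(0)
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        end_pos, end_vel = imagine(pos, vel, seq)
        cost = abs(end_pos - target) + 0.1 * abs(end_vel)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]  # execute only the first action, then replan


def run_closed_loop(target: float = 1.0, steps: int = 60) -> float:
    rng = random.Random(0)
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        push = plan(pos, vel, target, rng=rng)
        # In this toy the "real world" is the same model; on a robot it would not be.
        pos, vel = world_model(pos, vel, push)
    return pos
```

Only the first action of the best imagined sequence is executed before replanning, which is the "observe the result and correct" part of the loop.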
In the examples from Fan’s talk, the robot is not perfect. That is not the point. Like early GPT models, it is learning the shape of the behavior before it masters every case.
For Physical AI, this is the architectural jump:
see → predict
becomes
see → simulate action → act → update
That is how world models become useful machines.
The Data Problem: Physical AI Never Had an Internet
LLMs had web-scale text.
Physical AI does not have web-scale action.
That is the real data problem.
A language model can learn from billions of documents already sitting online. A physical model needs examples of hands, tools, objects, wheels, cameras, machines, failures, recoveries, and real-world context.
Teleoperation was the obvious first answer: put a human in the loop, control the robot, record the trajectory.
Useful, but not scalable.
A robot can only be teleoperated for so many hours. The operator is expensive. The setup is fragile. The robot itself becomes the bottleneck.
Fan’s chart shows the escape route.
Move the robot out of the data loop.
First: data wearables. Capture human motion directly through gloves, exoskeletons, sensors, and hand-tracking.
Then: egocentric video. Capture the world from the human point of view while real tasks happen naturally.
This is why driving became the first large-scale Physical AI use case.
Driving already had the right data flywheel: cameras, sensors, GPS, human actions, repeated routes, real-world edge cases, and millions of hours of naturally occurring behavior. The driver did not need to “collect data.” Driving itself became data collection.
Household robotics needs the same kind of flywheel.
That is why head-mounted cameras on cleaners are such a revealing example. A person cleaning a kitchen, folding laundry, organizing shelves, or washing dishes is not just performing a service. They are generating first-person physical task data: gaze, hand motion, object interaction, sequence, mistakes, recovery, and context.
The future data mix will likely combine five sources:
teleop for high-precision robot-specific alignment
wearables for dexterous human motion
egocentric video for scale and diversity
fleet telemetry for real-world failures
simulation for long-tail expansion
The key is not one perfect dataset.
The key is a data ladder.
Physical AI scales when data collection stops looking like a lab procedure and starts becoming ambient.
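One way to picture the data ladder is as a weighted sampler that decides which source each training example comes from. The sketch below assumes mixing weights that are purely hypothetical; the talk names the sources but gives no ratios.

```python
import random

# Hypothetical mixing weights, chosen only so that cheap, abundant sources dominate.
DATA_MIX = {
    "teleop": 0.05,
    "wearables": 0.10,
    "egocentric_video": 0.50,
    "fleet_telemetry": 0.15,
    "simulation": 0.20,
}


def sample_batch_sources(batch_size: int, mix: dict = DATA_MIX, seed: int = 0) -> list:
    """Pick a data source for each example in a batch, proportional to its weight."""
    rng = random.Random(seed)
    sources = list(mix)
    weights = [mix[s] for s in sources]
    return rng.choices(sources, weights=weights, k=batch_size)
```

Under these assumed weights, expensive teleop data shows up rarely and is reserved for alignment, while ambient sources such as egocentric video supply most of the volume.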
The Endgame: Physical APIs
Fan ends the talk with three milestones.
The first is the Physical Turing Test: machines performing useful physical work at a level where the output becomes comparable to human labor.
The second is the Physical API.
This is the bigger platform idea.
Once robots, drones, cameras, vehicles, and industrial machines can be orchestrated through software, physical work starts to look programmable.
A factory task becomes an API call. A drone inspection becomes an API call. A safety verification becomes an API call. A warehouse movement becomes an API call. A lab experiment becomes an API call.
The physical world becomes addressable by software.
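As a thought experiment, a Physical API could look like an ordinary request dispatcher where the handlers happen to be machines. Everything below is hypothetical: `TaskRequest`, `TaskResult`, `PhysicalAPI`, and the capability names are invented for illustration and do not describe any existing product or standard.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TaskRequest:
    capability: str                      # e.g. "drone.inspect" (made-up name)
    params: dict = field(default_factory=dict)


@dataclass
class TaskResult:
    ok: bool
    detail: str


class PhysicalAPI:
    """Routes task requests to whichever machine registered the capability."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], TaskResult]] = {}

    def register(self, capability: str, handler: Callable[[dict], TaskResult]) -> None:
        self._handlers[capability] = handler

    def call(self, request: TaskRequest) -> TaskResult:
        handler = self._handlers.get(request.capability)
        if handler is None:
            return TaskResult(False, f"no machine offers {request.capability!r}")
        return handler(request.params)
```

A caller would register a handler for something like `"drone.inspect"` and then issue `api.call(TaskRequest("drone.inspect", {"site": "roof-3"}))`. The caller never needs to know which drone does the work, or how.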
World models create the physical prior. Simulation creates the training ground. Action models create executable behavior. Edge AI puts intelligence on machines. Fleet learning closes the loop. APIs turn individual machines into programmable infrastructure.
The third milestone is Physical Auto-Research: machines helping design, test, build, and improve the next generation of machines.
That is the endgame Fan is pointing toward.
The last wave of AI made knowledge work programmable.
This wave starts to make physical work programmable.
And once physical work becomes programmable, Physical AI stops being a robotics category.
It becomes a new computing platform.
