If you can generate text, if you can generate images, can you also generate motion? The answer is probably yes. And if you can generate motion, you can understand intent and generate a generalized version of articulation. Therefore, humanoid robotics should be right around the corner.

And I think the work on state-space models, or SSMs, which allow you to learn extremely long patterns and sequences without computation growing quadratically with sequence length, is probably the next transformer.

Jensen Huang
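
To make the "without growing quadratically" remark concrete, here is a minimal sketch of the kind of recurrence a state-space model is built on. It is not taken from the talk or from any particular SSM library: the function name `ssm_scan`, the diagonal parameters `a`, `b`, `c`, and the toy values are all assumptions chosen for illustration. The point is only that the sequence is consumed in one pass with a fixed-size state, so the work grows linearly with length, whereas self-attention compares every position with every other position.

```python
def ssm_scan(a, b, c, inputs):
    """Run a diagonal linear state-space recurrence over a scalar sequence.

    Hypothetical illustration, not a specific library's API:
        x_k = a * x_{k-1} + b * u_k   (elementwise over state channels)
        y_k = sum(c * x_k)
    """
    state = [0.0] * len(a)  # fixed-size state, independent of sequence length
    outputs = []
    for u in inputs:  # one state update per input: linear in len(inputs)
        state = [a_i * x_i + b_i * u for a_i, x_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


if __name__ == "__main__":
    # An impulse followed by zeros: the single state channel decays by a = 0.5.
    print(ssm_scan([0.5], [1.0], [1.0], [1.0, 0.0, 0.0]))

    # Rough operation counts for a sequence of length L.
    L = 10_000
    print("recurrent state updates:", L)
    print("pairwise attention scores:", L * L)
```

Each step touches only the current input and the state, so doubling the sequence length doubles the work. Practical SSMs learn `a`, `b`, and `c` and use faster parallel formulations, which this sketch leaves out.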