Researchers at Google DeepMind, in collaboration with UC Berkeley, MIT, and the University of Alberta, have developed UniSim, a machine learning model that creates realistic simulations for training a wide range of artificial intelligence (AI) systems. The work marks a notable milestone for generative models.
The Potential of UniSim: Simulating Real-World Interactions
UniSim is a generative AI system that functions as a “universal simulator of real-world interaction.” The objective of UniSim is to simulate realistic experiences in response to actions taken by humans, robots, and other interactive agents. This innovation has far-reaching implications for fields that rely on intricate real-world interactions, such as robotics and autonomous vehicles.
UniSim can simulate both high-level instructions, such as “open the drawer,” and low-level controls, such as “move by x, y,” covering the range of ways humans and agents act on an environment. The simulated data it produces serves as valuable training examples for models that would otherwise need to collect data from the real world.
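To make the two granularities concrete, here is a minimal sketch of how a simulator like UniSim might accept both kinds of actions through one interface. The `Action` structure and its field names are illustrative assumptions, not the paper’s actual API.

```python
# Hypothetical sketch: a single action type that carries either a
# high-level text instruction or a low-level continuous control.
# Names and structure are illustrative, not UniSim's actual interface.
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Action:
    text: Optional[str] = None                 # e.g. "open the drawer"
    control: Optional[Sequence[float]] = None  # e.g. a delta (x, y) motor command

# Both granularities map to the same conditioning input for the video model.
high_level = Action(text="open the drawer")
low_level = Action(control=[0.02, -0.15])
```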
“We propose to combine a wealth of data—ranging from internet text-image pairs, to motion and action rich data from navigation, manipulation, human activities, robotics, and data from simulations and renderings—in a conditional video generation framework.”
– Google DeepMind researchers
UniSim’s ability to merge diverse data sources and generalize beyond its training examples enables rich interaction through fine-grained motion control of otherwise static scenes and objects. This sets UniSim apart as a versatile tool that can be used to train embodied planners, low-level control policies, video captioning models, and other machine learning models that demand high-quality and consistent visual data.
Addressing Data Format Diversity Challenges
The researchers encountered challenges training UniSim due to the diversity of data formats. Datasets curated by different industrial and research communities for different tasks made it difficult to build a real-world simulator that faithfully captures realistic experience.
“Since different datasets are curated by different industrial or research communities for different tasks, divergence in information is natural and hard to overcome, posing difficulties to building a real-world simulator that seeks to capture realistic experience of the world we live in.”
– Google DeepMind researchers
To overcome this challenge, the researchers converted the disparate datasets into a unified format. Transformer models, the deep learning architecture behind modern language models, were used to create embeddings from text descriptions, motor controls, and camera angles. A diffusion model was then trained to encode the visual observations associated with the actions, conditioned on those embeddings to connect observations, actions, and outcomes.
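The shape of that pipeline is easier to see in code. Below is a minimal PyTorch sketch of the idea: a transformer encoder turns tokenized text or action inputs into embeddings, and a denoising network is conditioned on those embeddings using the standard diffusion objective. All module sizes, the flattened-frame representation, and the simplified noising schedule are assumptions for illustration, not UniSim’s actual architecture.

```python
# Minimal sketch of conditional diffusion training on action embeddings.
# Sizes, names, and the noising schedule are illustrative assumptions.
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Transformer that maps tokenized text/actions to a single embedding."""
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens):                            # tokens: (batch, seq)
        return self.encoder(self.embed(tokens)).mean(dim=1)  # (batch, dim)

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to a frame, conditioned on an action embedding."""
    def __init__(self, frame_dim=64 * 64 * 3, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + cond_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, frame_dim),
        )

    def forward(self, noisy_frame, t, cond):
        return self.net(torch.cat([noisy_frame, cond, t], dim=-1))

# One training step: noise an observed frame, then predict that noise back
# given the action embedding (the standard denoising-diffusion objective).
enc, denoiser = ActionEncoder(), ConditionalDenoiser()
tokens = torch.randint(0, 32000, (2, 16))   # stand-in tokenized actions
frames = torch.randn(2, 64 * 64 * 3)        # stand-in flattened video frames
t = torch.rand(2, 1)                        # diffusion timestep in [0, 1)
noise = torch.randn_like(frames)
noisy = frames + t * noise                  # simplified noising schedule
loss = ((denoiser(noisy, t, enc(tokens)) - noise) ** 2).mean()
```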
The resulting UniSim model can generate a wide range of photorealistic videos, depicting people performing actions and navigating environments. It can execute long-horizon simulations, such as a robot hand performing multiple sequential actions. Notably, UniSim can also generate stochastic environment transitions, such as revealing hidden objects under a cloth or towel, making it an invaluable tool for simulating counterfactuals and different scenarios in computer vision applications.
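Long-horizon behavior of this kind comes from rolling the model forward autoregressively: each generated segment becomes the conditioning observation for the next action. A hedged sketch of that loop, where `simulate` stands in for a hypothetical call to the trained video model:

```python
# Hedged sketch of long-horizon rollout: outputs are fed back as inputs,
# so multi-step sequences (e.g. a robot hand's sequential actions) chain
# together. `simulate` is a hypothetical wrapper, not UniSim's real API.
def rollout(simulate, initial_frame, actions):
    """Run a sequence of actions, feeding each output back as the next input."""
    frame, history = initial_frame, []
    for action in actions:
        frame = simulate(frame, action)  # generate the next observation segment
        history.append(frame)
    return history

# e.g. rollout(simulate, frame0, ["pick up the cup", "place it on the shelf"])
```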
UniSim’s Integration with Reinforcement Learning Environments
UniSim’s ability to generate realistic videos from text descriptions is only part of the story. Its real value lies in its integration with reinforcement learning environments: UniSim can simulate various outcomes in applications like robotics, enabling offline training of models and agents without real-world dependencies.
“Using UniSim as an environment to train policies has a few advantages including unlimited environment access (through parallelizable video servers), real-world-like observations (through photorealistic diffusion outputs), and flexible temporal control frequencies (through temporally extended actions across low-level robot controls and high-level text actions).”
– Google DeepMind researchers
Simulation environments are widely used in reinforcement learning, and UniSim’s high visual quality helps bridge the gap between learning in simulation and in the real world (known as the “sim-to-real gap”). Models trained with UniSim can generalize to real robot settings in a zero-shot manner, making significant strides towards overcoming the challenges associated with embodied learning.
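To illustrate how a learned video simulator could slot into a standard reinforcement learning loop, here is a speculative Gymnasium-style wrapper. The `video_model` and `reward_model` components are assumptions (UniSim’s actual training setup is not published in this form); the sketch only shows the shape of the integration.

```python
# Speculative sketch: wrapping a learned video simulator as a Gymnasium
# environment so off-the-shelf RL code can train policies against it.
# `video_model` and `reward_model` are assumed, hypothetical components.
import numpy as np
import gymnasium as gym

class LearnedSimEnv(gym.Env):
    def __init__(self, video_model, reward_model, initial_frame):
        self.video_model = video_model
        self.reward_model = reward_model
        self.initial_frame = initial_frame
        self.observation_space = gym.spaces.Box(0, 255, (64, 64, 3), np.uint8)
        self.action_space = gym.spaces.Text(max_length=64)  # high-level text actions

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.frame = self.initial_frame
        return self.frame, {}

    def step(self, action):
        self.frame = self.video_model(self.frame, action)  # photorealistic rollout
        reward = self.reward_model(self.frame, action)     # e.g. a learned critic
        return self.frame, reward, False, False, {}
```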
The potential applications of UniSim are vast. From controllable content creation in games and movies to training embodied agents purely in simulation for deployment in the real world, UniSim opens up new possibilities. It is particularly beneficial for vision-language models (VLMs) like DeepMind’s recent RT-X models, which require substantial real-world data to execute complex, multi-step tasks. UniSim can generate large volumes of training data for VLM policies, and its benefits extend to other models, such as video captioning models.
Additionally, UniSim’s ability to simulate rare events is immensely valuable in applications involving robotics and self-driving cars, where data collection can be costly and risky.
“UniSim requires large compute resources to train similar to other modern foundation models. Despite this disadvantage, we hope UniSim will instigate broad interest in learning and applying real-world simulators to improve machine intelligence.”
– Google DeepMind researchers