PLAI group members Will Harvey, Saeid Naderiparizi, Vaden Masrani, and Christian Weilbach (under the supervision of Dr. Frank Wood) have just released a paper on an astounding new deep generative model for video. Think OpenAI’s GPT-3, but instead of generating text given a prompt, their “Flexible Diffusion Model” (FDM) completes videos given a few frames of context. What is more, FDM (described in a recent arXiv paper entitled Flexible Diffusion Modeling of Long Videos) generates photorealistic, coherent long videos like this one (128x128x45000).
Dr. Wood says, “This is simply the most impressive AI result I have personally seen in my career. Long-range coherence is a challenge even for modern language models with massive parameter counts. Will, Saeid, Vaden, and Christian have taken a huge step forward by being able to stably generate coherent, photo-realistic 1 hour+ long videos; 70x longer than their longest training video, and more than 2000x longer than the maximum of 20 frames they ever look at at once during training. There is something very special in the training procedure they have developed and the architecture they employ. Never have we been closer to being able to formulate AI agents that plan visually in domains with life-like complexity.”
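To make the scale of that claim concrete, here is a minimal, hypothetical sketch (PyTorch-flavoured Python, not the actual FDM code) of how a model that only ever attends to a small window of frames can be rolled out into an arbitrarily long video: repeatedly condition on the most recently generated frames and sample the next chunk. The `sample_frames` placeholder stands in for one pass of a trained diffusion sampler; the sampling schemes described in the paper are more flexible than this simple sliding window.

```python
# Minimal sliding-window rollout sketch (illustrative, not the actual FDM code).
# A model that sees at most K frames at once is rolled out to arbitrary length by
# conditioning each new chunk on the most recently generated frames.
import torch

K = 20        # maximum frames the model ever sees at once (as in the quote above)
CONTEXT = 10  # how many previously generated frames to condition on
H = W = 128   # frame resolution

def sample_frames(context: torch.Tensor, n_new: int) -> torch.Tensor:
    """Hypothetical stand-in for the trained diffusion model: given `context`
    frames of shape (c, 3, H, W), return `n_new` newly sampled frames."""
    return torch.rand(n_new, 3, H, W)  # random frames stand in for real samples

def generate_long_video(first_frames: torch.Tensor, total_frames: int) -> torch.Tensor:
    video = [first_frames]
    n_generated = first_frames.shape[0]
    while n_generated < total_frames:
        context = torch.cat(video)[-CONTEXT:]      # sliding window of recent frames
        new = sample_frames(context, K - CONTEXT)  # fill the rest of the K-frame window
        video.append(new)
        n_generated += new.shape[0]
    return torch.cat(video)[:total_frames]

long_video = generate_long_video(torch.rand(4, 3, H, W), total_frames=1000)
print(long_video.shape)  # torch.Size([1000, 3, 128, 128])
```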
The team experimented with training their model on videos gathered in multiple environments: the CARLA self-driving car simulator (provisioned via Inverted AI’s Simulate cloud platform), the MineRL Minecraft reinforcement learning environment, and the Mazes environment from the DeepMind Lab suite. The video used to train FDM was collected from the first-person point of view as agents moved around in these environments. In CARLA, a car drove around a single small town (Town 01), stopping at traffic lights but otherwise cruising randomly in different weather conditions and at different times of day. In MineRL, video was collected from agents that moved more or less in straight lines through different Minecraft worlds to a goal block 64 blocks away. And in the Mazes environment, agents moved from random starting positions to random goal positions in procedurally generated maze worlds with brightly coloured and textured walls and floors.
Let’s see what FDM learned in each environment.
Here’s CARLA Town01:
Here’s MineRL:
Here’s Mazes:
Whole books could be written about what we see, good and bad, in these example videos. In CARLA, the video model sometimes jumps from one location to another distant location that looks similar. Traffic lights aren’t captured well. But the yellow lines and the object constancy evidenced by building appearance around corners are encouraging, as are weather, shadows, and so forth. FDM arguably just has to memorize this small town, but the CARLA training dataset was about 11 GB of video data and FDM only has 78M parameters, so some kind of generalization is happening.

The MineRL environment makes this much clearer. Every training video in MineRL is from a different world! This means that the visual futures FDM imagines also reflect the parameters of the Minecraft engine’s world generative model: for instance, how often hills are followed by plateaus with forest vs. valleys with villages and water. As FDM imagines Minecraft futures it is visually hallucinating entirely new worlds, obeying Minecraft blockiness and biome transition rules. The MineRL agent action space also includes block breaking. Look at the middle column, which exhibits block breaking and mining in the video generative model!

In Mazes we see nearly pixel-perfect environment generation, but when the agent clearly (to us) “returns” to the same place in the maze, the world it generates there can be different. This means that semantic drift shows up visually here and indicates the necessity of some kind of complementary memory system. The video is still coherent though, and the generative model, amazingly, does not diverge into blurriness or craziness; it just continues creating an ever-changing maze-like world.
Here is one last thing this model can do. Using the CARLA Town 01 training data, we trained a CNN to decode the (x, y) map position from a single front-view frame. This is possible because Town 01 is small and the view from each point on its map is reasonably distinct. We then generated a CARLA Town 01 video from FDM and fed the generated frames into this “place regressor.” The results are amazing.
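For readers curious what such a place regressor could look like, here is a minimal sketch assuming each training frame is paired with its ground-truth (x, y) map coordinate. The architecture, tensor names, and training step below are illustrative assumptions, not the exact network we trained.

```python
# Illustrative "place regressor" sketch: a small CNN that maps a single
# 128x128 front-view frame to an (x, y) map position, trained with an MSE loss.
# Architecture and dummy data are assumptions, not the network used in the post.
import torch
import torch.nn as nn

class PlaceRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 2),  # outputs (x, y) in map coordinates
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

model = PlaceRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch standing in for CARLA Town 01 frames and their ground-truth positions.
frames = torch.rand(16, 3, 128, 128)
positions = torch.rand(16, 2)

loss = nn.functional.mse_loss(model(frames), positions)
loss.backward()
optimizer.step()
```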
We are working on improving goal-conditioned video generation for vision-based planning and on integrating actions and rewards explicitly into FDM. We are also studying FDM’s capabilities in more complex environments, and we are particularly looking forward to seeing what happens when we add other agents to the CARLA environment, when we include more CARLA towns, and when we train on dash-cam video from the real world.