Lifelong Learning of Video Diffusion Models From a Single Video Stream


PLAI group member Jason Yoo and colleagues, under the supervision of Dr. Frank Wood and Dr. Geoff Pleiss, have released a new paper on training autoregressive video diffusion models from a continuous video stream that arrives one frame at a time. The AI community has long sought models and algorithms that learn in a fundamentally human way: from birth to death, learning as we live. Our paper demonstrates that learning video diffusion models in this way is not only possible but, remarkably, can be competitive with standard offline training given the same number of gradient steps. In addition, the paper introduces three new lifelong video generative modeling datasets generated from synthetic environments of increasing complexity: Lifelong Bouncing Balls, Lifelong 3D Maze, and Lifelong PLAICraft.

Figure 1: Ground truth video frames (top row) and a lifelong learned video diffusion model’s generated video frames (middle and bottom rows) for the Lifelong PLAICraft dataset. The model is lifelong learned from a 50-hour Minecraft video using experience replay. Given the 10 initial frames marked by red borders, the model produces the next 20 frames. Despite the model’s limited parameter count of 80 million, the generated videos are diverse and closely resemble Minecraft gameplay.

How Are the Models Lifelong Learned?

In standard offline learning, video diffusion models typically train on video frames sampled independently and identically distributed (i.i.d.) from a large dataset of loosely related videos. In our lifelong learning setup, video diffusion models are instead trained online on a video stream that sequentially iterates through a single, very long video. At each training iteration, the model observes one new video frame and takes one gradient step.

The models’ task is to predict future video frames conditioned on the preceding ones. Our lifelong learning setup trains the models using a sliding window scheme. At training step t, the model conditions on a fixed number of the most recent video frames from the video stream and learns to denoise the subsequent video frames. At training step t+1, the model’s context window slides forward by one video frame, and the same procedure repeats indefinitely. This process is illustrated in the figure below.

Lifelong learning of video diffusion models from a single video stream. At training step t, the model conditions on two frames in the first half of its context window (red) and learns to denoise two frames in the second half of its context window (blue). At training step t+1, the model’s context window shifts right by one video frame, and the same procedure repeats indefinitely.
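
To make the procedure concrete, here is a minimal PyTorch-style sketch of the sliding-window training loop. The 10-frame context and 20-frame target split mirrors the generation setup in Figure 1 but is only an illustrative assumption here, and the stream interface and `diffusion_loss` helper are hypothetical placeholders rather than our exact implementation.

```python
import torch

CONTEXT_FRAMES = 10   # frames the model conditions on (red in the figure)
TARGET_FRAMES = 20    # frames the model learns to denoise (blue in the figure)
WINDOW_SIZE = CONTEXT_FRAMES + TARGET_FRAMES

def lifelong_train(model, optimizer, video_stream, diffusion_loss):
    """Train online: observe one new frame and take one gradient step per iteration."""
    window = []
    for frame in video_stream:               # the stream yields one frame tensor at a time
        window.append(frame)
        if len(window) < WINDOW_SIZE:
            continue                          # wait until the first full sliding window
        window = window[-WINDOW_SIZE:]        # slide the window forward by one frame

        frames = torch.stack(window)                   # (WINDOW_SIZE, C, H, W)
        context = frames[:CONTEXT_FRAMES]              # conditioning frames
        target = frames[CONTEXT_FRAMES:]               # frames to denoise

        loss = diffusion_loss(model, context, target)  # standard denoising diffusion loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```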

Unsurprisingly, performing SGD on a minibatch composed solely of the current timestep's sliding window frames leads to suboptimal performance. We therefore augment the minibatch with sliding windows from past timesteps that are saved in a replay buffer, a technique commonly known as experience replay. While our paper's lifelong learning results are based on experience replay, we note that this training setup is compatible with other lifelong learning algorithms.
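
As a rough illustration, the snippet below sketches how experience replay can be layered on top of the sliding-window loop above. The buffer capacity, replay batch size, uniform sampling, and random eviction policy shown here are illustrative choices, not the exact settings from the paper.

```python
import random
import torch

class ReplayBuffer:
    """Stores past sliding-window clips and samples them uniformly at random."""
    def __init__(self, capacity=50_000):
        self.capacity = capacity
        self.windows = []

    def add(self, window):
        if len(self.windows) >= self.capacity:
            # evict a randomly chosen past window once the buffer is full
            self.windows.pop(random.randrange(len(self.windows)))
        self.windows.append(window.detach())

    def sample(self, k):
        return random.sample(self.windows, min(k, len(self.windows)))

def replay_train_step(model, optimizer, buffer, current_window, diffusion_loss,
                      replay_batch_size=7):
    """One gradient step on the current window plus replayed past windows."""
    clips = [current_window] + buffer.sample(replay_batch_size)
    batch = torch.stack(clips)                       # (B, WINDOW_SIZE, C, H, W)

    context, target = batch[:, :10], batch[:, 10:]   # same 10/20 split as in the sketch above
    loss = diffusion_loss(model, context, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    buffer.add(current_window)                       # store the current window for future replay
```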

Datasets and Model Samples

As no prior work has attempted to lifelong learn video models on a continuous video stream, we introduce and experiment with three new video lifelong learning datasets: Lifelong Bouncing Balls, Lifelong 3D Maze, and Lifelong PLAICraft. Each dataset contains over a million video frames derived from a single video and is designed to test how data stream characteristics such as perceptual complexity, frame repetitiveness, rare events, and nonstationarity affect lifelong learning. These datasets present novel opportunities to study video models in a learning regime one step closer to that of biological agents. We now briefly describe each dataset and showcase samples from the lifelong learned video diffusion models.

Lifelong Bouncing Balls

Figure 2: Ground truth video frames (top row) and a lifelong learned video diffusion model’s generated video frames (middle and bottom rows) for the Lifelong Bouncing Balls datasets. Given the 10 initial video frames marked by red borders, the model produces the next 40 frames.

Lifelong Bouncing Balls is the simplest of the three datasets. It contains 1 million 32×32 RGB video frames, spanning 28 hours, that depict two colored balls deterministically bouncing around and changing colors. To assess the effect of frame detail repetitiveness on video lifelong learning, there are two versions of the dataset in which the ball colors either do or do not change irreversibly over the course of the video stream. These two versions are depicted in the left and right subfigures of Figure 2. Video diffusion models lifelong learned with experience replay generate videos with realistic ball motion and color transitions.

Lifelong 3D Maze

Figure 3: Ground truth video frames (top row) and a lifelong learned video diffusion model’s generated video frames (middle and bottom rows) for the Lifelong 3D Maze dataset. Given the 10 initial video frames marked by red borders, the model produces the next 40 frames.

Lifelong 3D Maze contains 1 million 64×64 RGB video frames that depict a first-person view of an agent navigating a 3D maze for 14 hours (if the maze feels familiar, it is because it was one of the Windows 95 screensavers). The maze is randomly generated and contains various sparsely occurring objects, such as polyhedral rocks that flip the agent and smiley faces that regenerate the maze. Video diffusion models lifelong learned with experience replay generate realistic maze traversal footage that correctly handles these rare events.

Lifelong PLAICraft

Figure 4: Ground truth video frames (top row) and a lifelong learned video diffusion model’s generated video frames (middle and bottom rows) for the Lifelong PLAICraft dataset. Given the 10 initial video frames marked by red borders, the model produces the next 20 frames.

Lifelong PLAICraft is the most complex of the three datasets. It contains 1.85 million 1280×768 RGB video frames that depict a first-person view of an anonymous player playing in a multiplayer Minecraft survival world for 54 hours. The video stream captures continuous play sessions from the PLAICraft project and contains clips featuring various biomes, mining, crafting, construction, mob fighting, and player-to-player interactions. The video stream is therefore highly nonstationary, and its characteristics change on multiple timescales (e.g., the day-night cycle vs. the player sporadically visiting their home). Video diffusion models lifelong learned with experience replay on Stable Diffusion-encoded video frames successfully capture perceptual details of the Minecraft video frames, in particular details associated with objects present in every gameplay frame (e.g., the player name, item bar, and equipped item). Interestingly, the model also captures player-like behaviors such as spontaneously opening the user inventory (Figure 1, bottom row, leftmost column) and the in-game chat interface (Figure 4, middle row, rightmost column).
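
For readers curious about the preprocessing, here is a minimal sketch of how frames can be mapped into Stable Diffusion's latent space with the diffusers library before training. The specific VAE checkpoint and normalization shown are assumptions for illustration, not necessarily the exact pipeline used in the paper.

```python
import torch
from diffusers import AutoencoderKL

# load a publicly released Stable Diffusion VAE (checkpoint choice is an assumption)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def encode_frames(frames):
    """Map RGB frames in [0, 1] with shape (B, 3, H, W) to compact latents."""
    x = frames * 2.0 - 1.0                        # the VAE expects inputs in [-1, 1]
    latents = vae.encode(x).latent_dist.sample()  # stochastic encoding
    return latents * vae.config.scaling_factor    # shape (B, 4, H/8, W/8)

@torch.no_grad()
def decode_latents(latents):
    """Invert the encoding to recover RGB frames for visualization."""
    x = vae.decode(latents / vae.config.scaling_factor).sample
    return (x.clamp(-1.0, 1.0) + 1.0) / 2.0
```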

Final Remarks

We are genuinely excited about the future of video model lifelong learning. Our findings show that moderate-sized video diffusion models, lifelong learned on just two days’ worth of video frames, can generate short, plausible videos of challenging environments like Minecraft. Looking ahead, we hypothesize that large video diffusion models lifelong learned on years’ worth of video frames could unlock the ability to generate long and temporally coherent videos of highly complex environments. As video modeling is a key component of many world models such as GameNGen and Oasis, these advancements could pave the way for new, life-like approaches to learning, planning, and control in embodied AI agents. We are eager to see where this journey leads and invite you to check out our full paper for additional details and analysis. Thank you for reading!

plaicraft.ai launch

We are proud to announce that UBC’s Behavioral Research Ethics Board has issued a certificate of approval under the minimal risk category for us to publicly release plaicraft.ai, a “free Minecraft in the cloud” generative AI research data collection project. Please consider contributing by signing up and playing Minecraft in your browser at www.plaicraft.ai.

Our audacious but achievable goal is to collect over 10,000 hours of multiplayer Minecraft gameplay and then to use this data to train AGI-like agents that can respond sensibly in video and audio perceptual environments.  No more dumb NPCs!

Visual Chain-of-Thought Diffusion Models

Images generated by our baseline, EDM. They mostly look realistic, but there are occasional artifacts – see the blobs on the chin in the first and seventh images.

Images generated by our method. We don’t see any of the artifacts that were present in images from the baseline.

At this year’s CVPR workshop on Generative Models for Computer Vision, we’ll present a simple new approach to unconditional and class-conditional image generation. It takes advantage of the following fact: conditional diffusion generative models (DGMs) produce much more realistic images than unconditional DGMs. We show in the paper that images produced by conditional DGMs get even more realistic as you condition on more information. This holds true even if you add information by making your text prompt longer for Stable Diffusion (see the images we sample from Stable Diffusion at the end). If we want to generate a large set of images (or a video), it seems like we have to either (a) start by writing out a detailed description of each image or frame, or (b) accept inferior quality.

Our paper proposes a third option: prompt a first DGM to generate a detailed image description, and then prompt a conditional DGM to generate the image given this detailed description. To avoid the cost of generating long paragraphs of text, we use a vector in the form of a CLIP image embedding for the image description. A CLIP image embedding is a vector that encodes the semantically-meaningful parts of an image in a compact format. Let’s look at some images sampled conditioned on CLIP embeddings:

Animal faces. We sampled every frame in this video using the same set of 20 CLIP embeddings so high-level features (like animal species) are shared across all frames.

Human faces. We sampled every frame in this video using the same set of 20 CLIP embeddings so, once again, high-level features are shared across frames.

We see that two images conditioned on the same CLIP embedding share a lot of the same features: the animals’ species and color patterns stay the same; the people’s age, their facial expression, and their accessories stay roughly the same. Even better, these images sampled from a conditional DGM are much more realistic than those sampled from an unconditional DGM: if we take a CLIP embedding of an image from the animal faces dataset and then sample from our conditional DGM, the resulting image is on average 56% more realistic than if we’d used an unconditional DGM (according to the Fréchet Inception Distance, a commonly-used measure of image quality).
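
For reference, the Fréchet Inception Distance compares Gaussian fits to Inception-network features of real and generated images; the "56% more realistic" figure above presumably corresponds to a relative reduction in this distance.

```latex
% Frechet Inception Distance between Inception features of real images
% (mean \mu_r, covariance \Sigma_r) and generated images (mean \mu_g, covariance \Sigma_g);
% lower values indicate more realistic samples.
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```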

Now, how well does the conditional DGM work as part of our proposed method, when we prompt it with a CLIP embedding generated by a second DGM? We find that our generated images are still 48% more realistic than those from an unconditional DGM, almost as good as when we “cheated” by taking CLIP embeddings of dataset images. In summary, even though our task is unconditional generation, we can make use of conditional DGMs, which typically make better-looking images than unconditional DGMs!
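
To summarize the pipeline in code, here is a minimal sketch of the two-stage sampling procedure described above. The names `embedding_prior`, `image_decoder`, and `sample_diffusion` are hypothetical placeholders standing in for a standard DGM sampler, and the embedding dimensionality and image resolution are illustrative assumptions, not the paper's actual interfaces.

```python
import torch

CLIP_DIM = 512   # dimensionality of the CLIP image embedding (an assumption)
IMG_RES = 64     # output image resolution (an assumption)

def visual_chain_of_thought_sample(embedding_prior, image_decoder, sample_diffusion,
                                   num_images=4):
    """Two-stage sampling: sample a CLIP embedding first, then an image conditioned on it."""
    # Stage 1: "describe" each image by sampling a CLIP embedding from a prior DGM.
    z_noise = torch.randn(num_images, CLIP_DIM)
    clip_embeddings = sample_diffusion(embedding_prior, z_noise)

    # Stage 2: generate images conditioned on the sampled embeddings; the richer
    # conditioning is what makes these samples more realistic than unconditional ones.
    x_noise = torch.randn(num_images, 3, IMG_RES, IMG_RES)
    images = sample_diffusion(image_decoder, x_noise, cond=clip_embeddings)
    return images
```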

We’re excited about future work following this direction. It’s likely that there are better quantities to condition on than CLIP embeddings – we have so far tried just a couple of alternatives. We might even be able to learn an embedder directly to maximize our image quality – doing so could lead us to a generalization of Variational Diffusion Models. We could also condition on multiple quantities – perhaps a future state-of-the-art generative model will consist of a “chain” of DGMs, each conditioning on the output of the one before. If this sounds too complex, an alternative is to simplify our method by learning a single DGM that jointly generates an image and CLIP embedding. See our paper for the full details!

An unrealistic image featuring a distorted road. We prompted Stable Diffusion to generate “Aerial photography.”

A more realistic image. We prompted Stable Diffusion to generate “Aerial photography of a patchwork of small green fields separated by brown dirt tracks between them. A large tarmac road passes through the scene from left to right.”