Stability AI Announces Release of Stable Video Diffusion

OpenAI is celebrating the return of Sam Altman, while its competitors are stepping up their game in the AI race. Anthropic has recently released Claude 2.1, and Adobe is reportedly pursuing an acquisition. Now Stability AI joins the race with the announcement of its latest release, Stable Video Diffusion.

Stable Video Diffusion (SVD) introduces two cutting-edge AI models – SVD and SVD-XT – designed to generate short clips from images. According to Stability AI, these models produce high-quality outputs that rival or even surpass other AI video generators on the market.

Stability AI is open-sourcing the image-to-video models as part of a research preview, for research purposes only. The company plans to gather user feedback to improve and refine the models, with the ultimate goal of commercial application.

Understanding Stable Video Diffusion

SVD and SVD-XT are latent diffusion models that take a still image as a conditioning frame and generate 576 × 1024 videos from it. The models output video at frame rates ranging from three to 30 frames per second, but the resulting videos are relatively short, lasting only up to four seconds. The SVD model produces 14 frames from a still, while SVD-XT extends this to 25 frames.
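
As a rough illustration of how frame count and frame rate relate, the short Python sketch below computes playback length from the two figures quoted above. The function name `clip_duration` and the 7 fps playback rate are assumptions chosen for illustration; they do not come from Stability AI.

```python
# Frame counts per model, as reported in the article.
FRAMES = {"SVD": 14, "SVD-XT": 25}


def clip_duration(num_frames: int, fps: float) -> float:
    """Return playback length in seconds for num_frames shown at fps."""
    # The article gives a supported range of 3 to 30 frames per second.
    if not 3 <= fps <= 30:
        raise ValueError("fps must be between 3 and 30")
    return num_frames / fps


if __name__ == "__main__":
    fps = 7  # assumed playback rate, for illustration only
    for name, frames in FRAMES.items():
        print(f"{name}: {frames} frames at {fps} fps = {clip_duration(frames, fps):.2f} s")
```

At the assumed rate, both variants come out under the four-second ceiling mentioned above; a higher frame rate gives smoother but shorter clips.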

To develop Stable Video Diffusion, Stability AI utilized a large, carefully curated video dataset consisting of approximately 600 million samples. They trained a base model using this dataset, which was then fine-tuned on a smaller, high-quality dataset containing up to a million clips. This process enabled the models to handle tasks such as text-to-video and image-to-video, accurately predicting a sequence of frames from a single conditioning image.

In a whitepaper detailing SVD, Stability AI stated that the model can serve as a foundation for fine-tuning a diffusion model capable of multi-view synthesis. This would allow generating multiple consistent views of an object using just one still image.

Potential Applications and Current Limitations

Stability AI believes that Stable Video Diffusion has the potential for a wide range of applications across sectors such as advertising, education, and entertainment. In an external evaluation, SVD outputs were praised for their high quality and judged to surpass those of leading closed text-to-video models from Runway and Pika Labs.

However, Stability AI acknowledges that the models are still a work in progress and are far from perfect. In some cases they fall short of photorealism, generate videos with no motion or only very slow camera pans, and fail to produce recognizable faces and people.

To address these limitations, Stability AI plans to refine both models based on user feedback gathered during the research preview phase. The company aims to eliminate present gaps and introduce new features, such as support for text prompts or text rendering in videos, for commercial applications.

The current release of Stable Video Diffusion is primarily intended to invite open investigation of the models, allowing for the identification of potential issues, including biases, to ensure safe deployment in the future. Stability AI expressed its plans to develop a variety of models that build on and expand this base, creating an ecosystem similar to that of Stable Diffusion.

To facilitate user engagement, Stability AI is encouraging users to sign up for an upcoming web experience that will enable video generation from text. However, the specific release date for this experience has not yet been announced.

For those interested in exploring the Stable Video Diffusion models, the code can be found on Stability AI’s GitHub repository, and the weights required to run the models locally are available on its Hugging Face page. Use of the models is permitted once the user accepts the company’s terms, which outline both allowed and excluded applications.
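
For readers who want a starting point, here is a minimal Python sketch of running the weights locally. It is not Stability AI's own code from GitHub: it assumes the Hugging Face `diffusers` library and its `StableVideoDiffusionPipeline`, a CUDA-capable GPU, and that the model terms have already been accepted on Hugging Face. The helper names `repo_id_for` and `generate_clip` are hypothetical.

```python
# Hugging Face repository IDs for the two released checkpoints.
REPO_IDS = {
    "svd": "stabilityai/stable-video-diffusion-img2vid",
    "svd-xt": "stabilityai/stable-video-diffusion-img2vid-xt",
}


def repo_id_for(variant: str) -> str:
    """Map a short variant name ("svd" or "svd-xt") to its repository ID."""
    try:
        return REPO_IDS[variant.lower()]
    except KeyError:
        raise ValueError(f"unknown variant {variant!r}") from None


def generate_clip(image_path: str, out_path: str,
                  variant: str = "svd-xt", fps: int = 7) -> None:
    """Generate a short clip from one still image (assumes diffusers + GPU)."""
    # Heavy third-party imports stay inside the function so the helpers
    # above remain usable without torch or diffusers installed.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        repo_id_for(variant), torch_dtype=torch.float16, variant="fp16"
    )
    pipe.enable_model_cpu_offload()  # trades speed for lower GPU memory use

    # Resize the conditioning image to the models' 1024 x 576 (width x height).
    image = load_image(image_path).resize((1024, 576))
    frames = pipe(image, decode_chunk_size=8, fps=fps).frames[0]
    export_to_video(frames, out_path, fps=fps)
```

A call such as `generate_clip("still.png", "clip.mp4")` would then write a short video next to the script; the file names and the default of 7 fps are placeholders.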

At present, in addition to research and exploration purposes, permitted use cases for the models include generating artworks for design and other creative processes, as well as educational applications. However, Stability AI emphasizes that generating factual or “true representations of people or events” is currently outside the scope of their models.
