On Friday, Meta announced a preview of Movie Gen, a new set of AI models designed to create and manipulate video, audio, and images, including generating a realistic video from a single photo of a person. The company claims that the models outperform other video synthesis models when evaluated by humans, bringing us closer to a future where anyone can synthesize a full video of any subject on demand.
The company has not yet said when or how it will release these capabilities to the public, but Meta frames Movie Gen as a tool that allows people to “expand their inherent creativity” rather than replace human artists and animators. The company envisions future applications such as easily creating and editing “day in the life” videos for social media platforms or generating personalized, animated birthday greetings.
Movie Gen builds on Meta's previous work in video synthesis, following 2022's Make-A-Video generator and the Emu image synthesis model. Using text prompts as guidance, this latest system can for the first time generate custom videos with sound, make targeted edits to existing videos, and convert images of people into realistic, personalized videos.
Meta isn't the only game in town when it comes to AI video synthesis. Google showed off a new model called “Veo” in May, and Meta says that in human preference tests, Movie Gen's outputs beat those of OpenAI's Sora, Runway Gen-3, and the Chinese video model Kling.
Movie Gen's video generation model can create high-definition 1080p videos up to 16 seconds long at 16 frames per second from a text description, an image input, or both. Meta claims the model can handle complex concepts such as object motion, subject-object interactions, and camera movement.
Still, as we've seen with previous AI video generators, Movie Gen's ability to generate coherent scenes on a given topic likely depends on the concepts represented in the sample videos Meta used to train the model. It's also worth keeping in mind that handpicked results from video generators often differ dramatically from typical outputs, and getting a coherent result can take many rounds of trial and error.