Cade Metz has been writing about advancements in artificial intelligence for over a decade.
Ian Sansavera, a software architect at a New York start-up called Runway AI, typed out a short description of what he wanted to see in a video. “A quiet river in the woods,” he wrote.
Less than two minutes later, an experimental internet service generated a short video of a calm river in a forest. The river’s rushing water glinted in the sun as it sliced through trees and ferns, turned a corner, and splashed softly over rocks.
Runway, which plans to open its service to a small group of testers this week, is one of several companies building artificial intelligence technology that will soon allow people to generate videos simply by typing a few words into a box on a computer screen.
They represent the next stage in an industry race – one that involves giants like Microsoft and Google as well as many smaller start-ups – to create new types of artificial intelligence systems that some believe could be the next big thing in technology, as important as web browsers or the iPhone.
The new video generation systems could speed up the work of filmmakers and other digital artists while becoming a new and fast way to create hard-to-detect online misinformation, making it even harder to tell what’s real on the internet.
The systems are examples of what is known as generative AI, which can create text, images and sounds on the fly. Another example is ChatGPT, the online chatbot created by OpenAI, a San Francisco start-up, that stunned the tech industry with its capabilities late last year.
Google and Meta, Facebook’s parent company, unveiled the first video generation systems last year, but didn’t share them with the public because they feared the systems could eventually be used to spread disinformation with newfound speed and efficiency.
But Runway CEO Cris Valenzuela said he believed the technology was too important to keep in a research lab despite the risks. “This is one of the most impressive technologies we’ve built in the last hundred years,” he said. “You have to get people to actually use it.”
The ability to edit and manipulate film and video is, of course, nothing new. Filmmakers have been doing it for over a century. In recent years, researchers and digital artists have used various AI technologies and software programs to create and edit videos that are often referred to as deepfake videos.
But systems like the one Runway has created could, over time, replace those editing skills at the press of a button.
Runway’s technology generates videos from any short description. To get started, you simply type a description as you would a quick note.
This works best if the scene has some action but not too much action, such as “a rainy day in the big city” or “a dog with a cell phone in the park.” Press Enter and the system will generate a video in a minute or two.
The technology can reproduce ordinary scenes, such as a cat sleeping on a rug. Or it can combine disparate concepts to generate videos that are oddly funny, like a cow at a birthday party.
The videos are only four seconds long, and they are choppy and blurry if you look closely. Sometimes the images are weird, distorted and disturbing. The system has a way of merging animals like dogs and cats with inanimate objects like balls and cell phones. But given the right prompt, it produces videos that show where the technology is headed.
“If I see a high-resolution video right now, I’m probably going to trust it,” said Phillip Isola, a Massachusetts Institute of Technology professor who specializes in AI. “But that will change pretty soon.”
Like other generative AI technologies, Runway’s system learns by analyzing digital data – in this case, photos, videos and captions describing what those images contain. By training this kind of technology on ever-increasing amounts of data, researchers are confident they can rapidly improve and expand its abilities. Experts believe such systems will soon be producing professional-looking mini-movies, complete with music and dialogue.
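A minimal sketch of the kind of caption-paired training data described here, assuming hypothetical file names and a simple Python structure; it is not Runway’s actual pipeline:

```python
# A sketch (not Runway's code) of the paired data a text-to-video model trains on:
# each example couples a short clip with a caption describing what it shows.
# The file paths below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingExample:
    video_path: str   # path to a short clip, e.g. a few seconds of footage
    caption: str      # text describing what the clip contains

dataset: List[TrainingExample] = [
    TrainingExample("clips/river_forest.mp4", "a quiet river in the woods"),
    TrainingExample("clips/city_rain.mp4", "a rainy day in the big city"),
]

# Training loops over pairs like these, learning to associate the words in each
# caption with the pixels and motion in its clip; more pairs generally mean a
# more capable model, which is why researchers expect rapid improvement.
for example in dataset:
    print(example.caption, "->", example.video_path)
```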
It is difficult to define what the system is currently creating. It’s not a photo. It’s not a cartoon. It’s a collection of many pixels blended together to create a realistic video. The company plans to offer its technology alongside other tools that it believes will accelerate the work of professional artists.
Last month, social media services were teeming with images of Pope Francis in a white Balenciaga down jacket — surprisingly trendy attire for an 86-year-old pope. But the images were not real. A 31-year-old construction worker from Chicago had created the viral sensation using a popular AI tool called Midjourney.
Dr. Isola spent years building and testing this kind of technology, first as a researcher at the University of California, Berkeley, and OpenAI, and then as a professor at MIT. Yet he was fooled by the sharp, high-resolution but completely fake images of Pope Francis.
“There was a time when people posted deepfakes and they didn’t fool me, because they were so bizarre or not very realistic,” he said. “Now we can’t take any of the images we see on the internet at face value.”
Midjourney is one of many services that can generate realistic still images from a short prompt. Others include Stable Diffusion and DALL-E, an OpenAI technology that kick-started this wave of image generators when it was unveiled a year ago.
Midjourney relies on a neural network, which learns its skills by analyzing massive amounts of data. It looks for patterns as it combs through millions of digital images, as well as text captions describing what each image represents.
When someone describes an image to the system, it generates a list of features that the image could contain. One feature may be the curvature at the top of a dog’s ear. Another could be the edge of a cell phone. Then a second neural network called a diffusion model creates the image and generates the pixels needed for the features. Ultimately, it transforms the pixels into a cohesive image.
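Stable Diffusion, one of the image generators mentioned above, makes this two-stage design visible through the open-source diffusers library. The sketch below is only an illustration, assuming the library, the publicly released Stable Diffusion 1.5 weights and a GPU are available; it is not Midjourney’s code, which has not been released.

```python
# A minimal text-to-image diffusion sketch using the open-source diffusers
# library. It illustrates the two stages described above: a text encoder turns
# the prompt into features, and a diffusion model turns those features into pixels.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # publicly released weights, assumed to be available
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a GPU is assumed here

# The pipeline encodes the prompt, runs the diffusion model to denoise random
# pixels toward an image matching those features, and decodes the result.
image = pipe("a dog with a cell phone in the park").images[0]
image.save("dog_with_phone.png")
```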
Companies like Runway, which has about 40 employees and raised $95.5 million, are using this technique to generate moving images. By analyzing thousands of videos, their technology can learn to string together many still images in an equally coherent way.
“A video is just a series of frames – still images – that are combined in such a way as to give the illusion of movement,” said Mr. Valenzuela. “The trick lies in training a model that understands the relationship and consistency between each frame.”
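A toy sketch can make the point in the quote concrete. The snippet below, which uses only NumPy and is not Runway’s model, builds a “video” as a stack of still frames in which each frame differs only slightly from the one before it; that frame-to-frame consistency is what reads as motion.

```python
# Toy illustration: a video is just a sequence of still frames, and the
# appearance of smooth motion comes from consecutive frames staying consistent.
import numpy as np

rng = np.random.default_rng(0)
height, width, n_frames = 64, 64, 32

frames = []
frame = rng.random((height, width, 3))
for _ in range(n_frames):
    # Each frame is only a small step away from the previous one; large,
    # uncorrelated jumps between frames would read as flicker, not motion.
    frame = np.clip(frame + 0.02 * rng.standard_normal(frame.shape), 0.0, 1.0)
    frames.append(frame)

# At the data level, a video is simply the stills stacked along a time axis:
# an array of shape (frames, height, width, channels) played back in order.
video = np.stack(frames)
print(video.shape)  # (32, 64, 64, 3)
```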
Like early versions of tools like DALL-E and Midjourney, the technology sometimes mixes concepts and images in curious ways. If you ask for a teddy bear that plays basketball, you might get some sort of mutated stuffed animal with a basketball for a hand. If you ask for a dog with a cellphone at the park, you might get a cellphone-wielding pup with a strange human body.
But experts believe they can eliminate those flaws as they train their systems on more and more data. They believe the technology will eventually make creating a video as easy as writing a sentence.
“In the past, to do anything remotely like this, you had to have a camera. You had to have props. You had to have a location. You had to have permission. You had to have money,” said Susan Bonser, an author and publisher in Pennsylvania who has experimented with early incarnations of generative video technology. “You don’t need all that now. You can just sit down and imagine it.”