
We made a cat drink beer with Runway's AI video generator, and human hands came out

    A screenshot of an AI-generated video of a cat drinking a can of beer, created by Runway Gen-3 Alpha.

    In June, Runway debuted a new text-to-video synthesis model called Gen-3 Alpha. It converts written descriptions, called “prompts,” into HD video clips with no sound. We’ve had a chance to use it and wanted to share our results. Our tests show that careful prompting isn’t as important as matching concepts that are likely to occur in the training data, and that achieving funny results likely requires many generations and cherry-picking.

    An enduring theme of all the generative AI models we’ve seen since 2022 is that they can be excellent at blending concepts found in training data, but are typically very poor at generalization (applying learned “knowledge” to new situations on which the model was not explicitly trained). That means they can excel at stylistic and thematic novelty, but struggle with fundamental structural novelty that goes beyond the training data.

    What does all this mean? In the case of Runway Gen-3, the lack of generalization means that you could ask for a sailing ship in a spinning cup of coffee, and provided that the Gen-3 training data contains video examples of sailing ships and spinning coffee, that's an “easy” new combination for the model to make reasonably convincing. But if you ask for a cat drinking a can of beer (in a beer commercial), it will generally fail because there probably aren't many videos of photorealistic cats drinking human beverages in the training data. Instead, the model will draw on what it has learned about cat videos and beer commercial videos and combine the two. The result is a cat with human hands chugging back a beer.

    A few basic questions

    During the Gen-3 Alpha testing phase, we signed up for Runway’s Standard plan, which offers 625 credits for $15 per month, plus some free trial credits. Each generation costs 10 credits per second of video, and we generated 10-second videos at 100 credits each, so the paid plan covered only about six clips per month and the number of generations we could create was limited.
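    For reference, here is a quick back-of-the-envelope sketch of that credit math in Python. The figures come from the plan details above; the variable names are just illustrative and not part of any Runway API.

    ```python
    # Rough credit math for Runway's Standard plan at the time of our testing (illustrative only).
    MONTHLY_CREDITS = 625        # credits included with the $15/month Standard plan
    CREDITS_PER_SECOND = 10      # Gen-3 Alpha's cost per second of generated video
    CLIP_LENGTH_SECONDS = 10     # we generated 10-second clips

    credits_per_clip = CREDITS_PER_SECOND * CLIP_LENGTH_SECONDS   # 100 credits per clip
    clips_per_month = MONTHLY_CREDITS // credits_per_clip         # 6 full clips, before trial credits

    print(f"{credits_per_clip} credits per clip, about {clips_per_month} clips per month")
    ```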

    We first tried a few standards from our past image synthesis tests, like cats drinking beer, barbarians with CRT TVs, and queens of the universe. We also explored Ars Technica lore with the “moonshark,” our mascot. You can see all those results and more below.

    We had so few credits that we couldn't afford to re-run prompts and cherry-pick the best results, so what you see for each prompt below is the single generation we received from Runway.

    “A very intelligent person is reading 'Ars Technica' on his computer when the screen explodes”

    “advertisement for a new flaming cheeseburger from McDonald's”

    “The moon shark that jumps out of a computer screen and attacks a person”

    “A cat in a car drinking a can of beer, beer commercial”

    “Will Smith eats spaghetti” activated a filter, so we tried “a black man eating spaghetti.” (Watch to the end.)

    “Robotic humanoid animals in vaudeville costumes roam the streets collecting protection money in tokens”

    “A basketball player in a ghost train car with a basketball court, and he plays against a team of ghosts”

    “A herd of a million cats running on a hill, aerial photo”

    “video game footage from a dynamic third-person 3D platformer from the 1990s starring an anthropomorphic shark boy”