Video scraping is just one of many new tricks possible now that the latest large language models (LLMs), such as Google's Gemini and GPT-4o, are actually 'multimodal' models, accepting audio, video, image, and text input. These models translate any multimedia input into tokens (chunks of data), which they use to make predictions about which tokens should come next in a sequence.
A term like “token prediction model” (TPM) might be more accurate than “LLM” these days for AI models with multimodal inputs and outputs, but a generally agreed-upon alternative term hasn't really taken off yet. Whatever you call it, an AI model that can process video inputs has interesting implications, both good and potentially bad.
Breaking input barriers
Willison is far from the first person to feed video into AI models to achieve interesting results (more on that below, and here's a 2015 paper that uses the term “video scraping”), but once Gemini launched its video input capability, he began experimenting with it in earnest.
In February, Willison demonstrated another early use of AI video scraping on his blog, where he shot a seven-second video of the books on his bookshelves, then had Gemini 1.5 Pro extract all of the book titles it saw in the video and put them in a structured list.
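For readers who want to try something similar, here's a rough sketch of that kind of workflow using Google's google-generativeai Python SDK. The file name bookshelf.mp4, the prompt wording, and the requested JSON output format are illustrative assumptions rather than Willison's exact setup:

```python
import time
import google.generativeai as genai

# Assumes an API key from Google AI Studio is available.
genai.configure(api_key="YOUR_API_KEY")

# Upload the video via the Files API, then poll until the service
# has finished processing it (video uploads are handled asynchronously).
video = genai.upload_file("bookshelf.mp4")
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = genai.get_file(video.name)

# Ask the model to turn the unstructured video into structured data.
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "List every book title visible in this video as a JSON array of strings.",
])
print(response.text)
```

The polling loop matters: the Files API processes uploaded video in the background, and a generation request made against a file that is still processing will fail.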
Converting unstructured data into structured data is important to Willison because he is also a data journalist. He has built tools for data journalists in the past, such as the Datasette project, which lets anyone publish data as an interactive website.
To the frustration of data journalists everywhere, some data sources prove resistant to scraping (capturing data for analysis) due to how the data is formatted, stored, or presented. In those cases, Willison delights in the potential of AI video scraping because it bypasses these traditional barriers to data extraction.