The most capable open source AI model with visual capabilities yet could lead more developers, researchers, and startups to build AI agents that can carry out useful chores on your computer for you.
The Multimodal Open Language Model, or Molmo, released today by the Allen Institute for AI (Ai2), can both interpret images and converse through a chat interface. This means it can understand a computer screen and potentially help an AI agent perform tasks such as browsing the web, navigating file folders, and drafting documents.
“With this release, many more people can deploy a multimodal model,” says Ali Farhadi, CEO of Ai2, a research organization based in Seattle, Washington, and a computer scientist at the University of Washington. “It should be an enabler for next-generation apps.”
So-called AI agents are widely touted as the next big thing in AI, with OpenAI, Google, and others rushing to develop them. The term has become a buzzword lately, but the grand vision is for AI to go well beyond chatting and reliably carry out complex, sophisticated actions on a computer when given a command. That capability has yet to materialize at any real scale.
Some powerful AI models already have visual capabilities, including OpenAI's GPT-4, Anthropic's Claude, and Google DeepMind's Gemini. These models can be used to power some experimental AI agents, but the models themselves are hidden from view, accessible only through a paid application programming interface (API).
Meta has released a family of AI models called Llama under a license that limits their commercial use, but has yet to offer developers a multimodal version. Meta is expected to announce several new products at today's Connect event, which could include new Llama AI models.
“Having an open source, multimodal model means that any startup or researcher who has an idea can try to make it happen,” says Ofir Press, a postdoc at Princeton University who studies AI agents.
Press says that because Molmo is open source, developers can more easily tailor their agents to specific tasks, such as working with spreadsheets, by providing additional training data. Models like GPT-4 can be fine-tuned only to a limited extent through their APIs, whereas a fully open model can be customized extensively. “When you have an open source model like that, you have a lot more options,” he says.
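To make Press's point concrete, here is a minimal sketch of the kind of customization open weights allow: attaching LoRA adapters with the peft library so only a small set of new weights is trained on task-specific data. The checkpoint name is one of Molmo's published releases, but the target module names vary by architecture and are shown only as a common transformer convention; this is an illustrative sketch, not Ai2's recommended recipe.

```python
# Illustrative only: with open weights you can attach LoRA adapters and
# fine-tune on your own data (e.g. spreadsheet interactions), a degree of
# customization a closed API exposes only in limited form.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True
)
# Module names ("q_proj", "v_proj") are a common transformer convention
# and may differ in Molmo's actual architecture.
lora = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # base weights stay frozen
model.print_trainable_parameters()   # only the small adapter trains

# From here, a standard training loop over task-specific examples (say,
# screenshots of spreadsheets paired with target actions) would specialize
# an agent without retraining the full model.
```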
Ai2 is releasing Molmo today in several sizes, including a 70-billion-parameter model and a 1-billion-parameter model small enough to run on a mobile device. A model's parameter count refers to the number of units it contains for storing and manipulating data, and roughly corresponds to its capabilities.
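For developers curious what using one of the released checkpoints looks like, below is a hedged sketch of querying Molmo through the Hugging Face transformers library. The checkpoint ID and the processor.process and generate_from_batch helpers follow the release-time model card and are loaded from the model's own remote code; treat the exact names and signatures as assumptions rather than a stable API.

```python
# A minimal sketch of asking a Molmo checkpoint about a screenshot.
# Model ID and the custom processor.process / generate_from_batch calls
# follow the release-time model card; they may change.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/MolmoE-1B-0924"  # the small, mobile-friendly release

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Pair an image of a screen with a natural-language question.
inputs = processor.process(
    images=[Image.open("screenshot.png")],
    text="What application is open, and what should I click to save the file?",
)
# Add a batch dimension and move tensors to the model's device.
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```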
Ai2 says that despite its relatively small size, Molmo is as capable as considerably larger commercial models because it was carefully trained on high-quality data. The new model is also fully open source in the sense that, unlike Meta's Llama, there are no restrictions on its use. Ai2 is releasing the training data used to create the model as well, giving researchers more detail about how it works.
Releasing powerful models is not without risk. Such models can more easily be adapted for nefarious purposes; we may one day, for example, see the rise of malicious AI agents designed to automate the hacking of computer systems.
Ai2's Farhadi says Molmo's efficiency and portability will let developers build more powerful software agents that run natively on smartphones and other portable devices. “The billion-parameter model now performs at the level of, or in the class of, models at least ten times bigger,” he says.
Building useful AI agents, however, may depend on more than just more efficient multimodal models. A key challenge is making the models work more reliably. That may well require further breakthroughs in AI's reasoning abilities, something OpenAI has sought to address with its latest model, o1, which demonstrates step-by-step reasoning skills. The next step could be to give multimodal models such reasoning capabilities.
For now, Molmo's release means that AI agents are closer than ever, and that they could soon be useful even beyond the giants that rule the world of AI.