On Tuesday, Elon Musk’s AI company xAI announced the beta release of two new language models, Grok-2 and Grok-2 mini, which are available to subscribers of his social media platform X (formerly Twitter). The models also pair with the recently released Flux image synthesis model, which allows X users to create largely uncensored photorealistic images that can be shared on the site.
“Flux, accessible via Grok, is an excellent text-to-image generator, but it’s also very good at making fake photos of real locations and people and sending them directly to Twitter,” wrote frequent AI commentator Ethan Mollick on X. “Does anyone know if they watermark these in any way? That would be a good idea.”
In a report published earlier today, The Verge noted that Grok’s image generation capabilities appear to have minimal safeguards, potentially allowing users to create controversial content. According to their tests, Grok produced images of political figures in compromising situations, copyrighted characters, and violent scenes when requested.
The Verge found that while Grok claims to have certain restrictions, such as avoiding pornographic or excessively violent content, these rules appear to be enforced inconsistently in practice. Unlike other major AI image generators, Grok does not appear to reject prompts featuring real people or add identifying watermarks to its output.
Given what people are generating so far — including images of Donald Trump and Kamala Harris kissing or giving a thumbs-up on their way to the Twin Towers in an apparent 9/11 attack — the unlimited output may not last long. Then again, Elon Musk has made a big deal about “free speech” on his platform, so perhaps the capability will remain (until someone files a libel or copyright lawsuit, presumably).
People using Grok’s image generator to shock others raises an old question in AI: Should the misuse of an AI image generator be the responsibility of the person creating the prompt, the organization that created the AI model, or the platform hosting the images? So far, there’s no clear consensus, and the situation has yet to be legally resolved, though a newly proposed U.S. law called the NO FAKES Act would presumably hold people liable for creating realistic image deepfakes.
With Grok-2, the GPT-4 ceiling is maintained
Looking beyond the images, xAI claims in a release blog that Grok-2 and Grok-2 mini represent significant improvements in capabilities, with Grok-2 reportedly outperforming several leading competitors in recent benchmarks and what we’re calling “vibemarks.” It’s always wise to approach such claims with a dose of skepticism, but it does seem that while the “GPT-4 class” of AI language models (those with capabilities similar to OpenAI’s model) may be getting larger, the GPT-4 barrier has yet to be broken.
“There are now five GPT-4 class models: GPT-4o, Claude 3.5, Gemini 1.5, Llama 3.1, and now Grok 2,” wrote Ethan Mollick on X. “All the labs say there is room for continued massive improvements, but we haven't seen a single model that really stands out above GPT-4… yet.”
xAI says it recently introduced an early version of Grok-2 to the LMSYS Chatbot Arena under the name “sus-column-r”, where it reportedly achieved a higher overall Elo score than models like Claude 3.5 Sonnet and GPT-4 Turbo. Chatbot Arena is a popular subjective vibemarking website for AI models, but it has recently been the subject of controversy after people took issue with OpenAI's GPT-4o mini's high placement in the rankings.
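For readers unfamiliar with how those Arena rankings work: each human vote is a pairwise comparison between two anonymous models, and ratings move according to an Elo-style update. As a rough illustration only — this is the standard Elo formula, not Chatbot Arena's exact bookkeeping, and the starting ratings are made up — a single head-to-head vote shifts scores like this:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update for one pairwise comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    k controls how far a single vote moves the ratings.
    """
    # Expected win probability for A given the current rating gap
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models start at 1200; A wins one matchup.
a, b = elo_update(1200, 1200, 1.0)
print(round(a), round(b))  # equally rated models split the k points: 1216 1184
```

Because an upset win over a higher-rated model moves ratings more than a win over a lower-rated one, thousands of such votes converge on the leaderboard ordering xAI is citing for “sus-column-r.”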
According to xAI, both new Grok models show improvements over their predecessor Grok-1.5 in areas such as university-level scientific knowledge, general knowledge, and mathematical problem solving in benchmarks that have also proven controversial. The company also highlighted Grok-2's performance on visual tasks, claiming state-of-the-art results in visual mathematical reasoning and document-based question answering.
The models are now available to X Premium and Premium+ subscribers via an updated app interface. Unlike some competitors in the open weights space, xAI does not release the model weights for download or independent verification. This closed approach stands in stark contrast to recent moves by Meta, which recently released its Llama 3.1 405B model for anyone to download and run locally.
xAI plans to release both models via an enterprise API later this month. The company says this API will include multi-regional deployment options and security measures such as mandatory multi-factor authentication. Details on pricing, usage limits, or data handling policies have not yet been disclosed.
Aside from the photorealistic image generation, Grok-2's biggest weakness may be its tight integration with X, which makes it prone to pulling incorrect or irrelevant information from tweets. It's a bit like having a friend who insists on checking the social media site before answering one of your questions, even if it's not really relevant.
As Mollick noted on X, this tight link can be annoying: “I only have access to Grok 2 mini right now, and it seems like a solid model, but often seems poorly served by the RAG connection to Twitter,” he wrote. “The model gets results from Twitter that seem irrelevant to the prompt, and then desperately tries to connect them into something coherent.”
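The failure mode Mollick describes is a known hazard of retrieval-augmented generation (RAG): the retriever always returns its top-ranked documents, relevant or not, and they get injected into the prompt regardless. A toy sketch of the pattern (hypothetical function and data — this is not xAI's actual pipeline) shows why:

```python
def build_rag_prompt(question, posts, max_posts=3):
    """Naive RAG prompt assembly: rank posts by word overlap with the
    question, then prepend the top hits as context for the model.

    Note the flaw: even when nothing meaningfully overlaps, the
    top-ranked posts are still injected, so the model receives
    irrelevant context and may try to weave it into the answer.
    """
    q_words = set(question.lower().split())
    ranked = sorted(
        posts,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    context = "\n".join(f"- {p}" for p in ranked[:max_posts])
    return f"Context from recent posts:\n{context}\n\nQuestion: {question}"

# Made-up posts, none of which relate to the question:
posts = ["Cats make great pets", "GPU prices dropped today", "Try this pasta recipe"]
print(build_rag_prompt("How do transformers work?", posts))
```

A production system would add a relevance threshold and simply skip retrieval when scores are too low; the behavior Mollick observed suggests Grok's X retrieval fires even when it shouldn't.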