For her 38th birthday, Chela Robles and her family took a trek to One House, her favorite bakery in Benicia, California, for a brisket sandwich and brownies. On the car ride home, she tapped a small touchscreen on her temple and asked for a description of the outside world. “A cloudy sky,” came the answer through her Google Glass.
Robles lost the ability to see in her left eye when she was 28, and in her right eye a year later. Blindness, she says, denies you the little details that help people connect, such as facial cues and expressions. Her father, for example, tells a lot of dry jokes, so she can’t always be sure when he’s being serious. “If a picture can say more than 1,000 words, imagine how many words an expression can tell,” she says.
Robles has tried services in the past that put her in touch with sighted people for help. But in April, she signed up for a trial of Ask Envision, an AI assistant that uses OpenAI’s GPT-4, a multimodal model that can take in images and text and produce conversational responses. The system is one of several tools for visually impaired people that have begun integrating language models, and it promises to give users much more visual detail about the world around them, and much more independence.
Envision launched in 2018 as a smartphone app for reading text in photos, and came to Google Glass in early 2021. Earlier this year, the company began testing an open source conversational model that could answer basic questions. Then Envision incorporated OpenAI’s GPT-4 for image-to-text descriptions.
Be My Eyes, a 12-year-old app that helps users identify objects around them, went live with GPT-4 in March. Microsoft, a major investor in OpenAI, has begun testing GPT-4 integration for its SeeingAI service, which offers similar features, according to Microsoft responsible AI lead Sarah Bird.
In the earlier iteration, Envision read text in an image from beginning to end. Now it can summarize text in a photo and answer follow-up questions. That means Ask Envision can now read a menu and answer questions about things like pricing, dietary restrictions, and dessert options.
Another early tester of Ask Envision, Richard Beardsley, says he mostly uses the service for tasks like finding contact information on a bill or reading ingredient lists on boxes of food. With a hands-free option through Google Glass, he can use it while holding his guide dog’s leash and a cane. “Before, you couldn’t jump to a specific part of the text,” he says. “Having this really makes life a lot easier because you can jump to exactly what you’re looking for.”
Integrating AI into products for blind people could have major implications for users, says Sina Bahram, a blind computer scientist and head of a consulting firm that advises museums, theme parks, and tech companies like Google and Microsoft on accessibility and inclusion.
Bahram uses Be My Eyes with GPT-4 and says the large language model makes an “orders of magnitude” difference over previous generations of technology, both because of its capabilities and because the products are effortless to use and require no technical skills. Two weeks ago, he says, he was walking down the street in New York City when his business partner stopped to take a closer look at something. Bahram used Be My Eyes with GPT-4 to learn that it was a collection of stickers, some cartoonish, along with some text and some graffiti. This level of information is “something that didn’t exist outside the lab a year ago,” he says. “It just couldn’t.”