TalkBack uses Gemini Nano to increase image accessibility for users with low vision




Posted by Terence Zhang – Developer Relations Engineer and Lisie Lillianfeld – Product Manager


TalkBack is Android’s screen reader in the Android Accessibility Suite that describes text and images for Android users who are blind or have low vision. The TalkBack team is always working to make Android more accessible. Today, thanks to Gemini Nano with multimodality, TalkBack automatically provides users who are blind or have low vision with more vivid and detailed image descriptions to better understand the images on their screen.

Increasing accessibility using Gemini Nano with multimodality

Advancing accessibility is a core part of Google’s mission to build for everyone. That’s why TalkBack has a feature to describe images when developers didn’t include descriptive alt text. This feature was powered by a small ML model called Garcon. However, Garcon produced short, generic responses and couldn’t specify relevant details like landmarks or products.

The development of Gemini Nano with multimodality was the perfect opportunity to use the latest AI technology to increase accessibility with TalkBack. Now, when TalkBack users opt in on eligible devices, the screen reader uses Gemini Nano’s new multimodal capabilities to automatically provide users with clear, detailed image descriptions in apps including Google Photos and Chrome, even when the device is offline or has an unstable network connection.

“Gemini Nano helps fill in missing information,” said Lisie Lillianfeld, product manager at Google. “Whether it’s more details about what’s in a photo a friend sent or the style and cut of clothing when shopping online.”

Going beyond basic image descriptions

Here’s an example that illustrates how Gemini Nano improves image descriptions: When Garcon is presented with a panorama of the Sydney, Australia shoreline at night, it might read: “Full moon over the ocean.” Gemini Nano with multimodality can paint a richer picture, with a description like: “A panoramic view of Sydney Opera House and the Sydney Harbour Bridge from the north shore of Sydney, New South Wales, Australia.”

“It’s amazing how Nano can recognize something specific. For instance, the model will recognize not just a tower, but the Eiffel Tower,” said Lisie. “This kind of context takes advantage of the unique strengths of LLMs to deliver a helpful experience for our users.”

Using an on-device model like Gemini Nano was the only feasible solution for TalkBack to provide automatically generated, detailed image descriptions, even while the device is offline.

“The average TalkBack user comes across 90 unlabeled images per day, and those images weren’t as accessible before this new feature,” said Lisie. The feature has received positive user feedback, with early testers writing that the new image descriptions are a “game changer” and that it’s “wonderful” to have detailed image descriptions built into TalkBack.

Gemini Nano with multimodality was critical to improving the experience for users with low vision. Providing detailed on-device image descriptions wouldn’t have been possible without it. — Lisie Lillianfeld, Product Manager at Google

Balancing inference verbosity and speed

One important decision the Android accessibility team made when implementing Gemini Nano with multimodality was the tradeoff between inference verbosity and speed, which is partially determined by image resolution. Gemini Nano with multimodality currently accepts images at either 512 pixels or 768 pixels.

“The 512-pixel resolution emitted its first token almost two seconds faster than 768 pixels, but the output wasn’t as detailed,” said Tyler Freeman, a senior software engineer at Google. “For our users, we decided a longer, richer description was worth the increased latency. We were able to hide the perceived latency a bit by streaming the tokens directly to the text-to-speech system, so users don’t have to wait for the full text to be generated before hearing a response.”
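The latency-hiding technique Tyler describes can be sketched in plain Kotlin. This is a minimal, illustrative sketch, not TalkBack's actual implementation: the `streamToSpeech` function and its sentence-boundary heuristic are assumptions. The idea is to buffer streamed model tokens and hand each completed sentence to a speak() callback (such as Android's TextToSpeech), so speech begins before generation finishes.

```kotlin
// Hypothetical sketch: flush buffered tokens to a speak() callback at
// sentence boundaries, so the user hears the first sentence while the
// model is still generating the rest of the description.
fun streamToSpeech(tokens: Sequence<String>, speak: (String) -> Unit) {
    val buffer = StringBuilder()
    for (token in tokens) {
        buffer.append(token)
        // Naive sentence-boundary heuristic: flush on ., !, or ?
        val trimmed = token.trimEnd()
        if (trimmed.endsWith('.') || trimmed.endsWith('!') || trimmed.endsWith('?')) {
            speak(buffer.toString().trim())
            buffer.clear()
        }
    }
    // Speak any trailing partial sentence once the stream ends.
    if (buffer.isNotBlank()) speak(buffer.toString().trim())
}
```

In a real app, speak() could enqueue each chunk with TextToSpeech's QUEUE_ADD mode, so sentences play back-to-back as they arrive.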

A hybrid solution using Gemini Nano and Gemini 1.5 Flash

TalkBack developers also implemented a hybrid AI solution using Gemini 1.5 Flash. With this server-based AI model, TalkBack can offer the best of both on-device and server-based generative AI features to make the screen reader even more powerful.

When users want more details after hearing an automatically generated image description from Gemini Nano, TalkBack gives them the option to hear more by running the image through Gemini Flash. When users focus on an image, they can use a three-finger tap to open the TalkBack menu and select the “Describe Image” option to send the image to Gemini 1.5 Flash on the server and get even more detail.

By combining the unique advantages of Gemini Nano’s on-device processing with the full power of cloud-based Gemini 1.5 Flash, TalkBack provides blind and low-vision Android users a helpful and informative experience with images. The “describe image” feature powered by Gemini 1.5 Flash launched to TalkBack users on more Android devices, so even more users can get detailed image descriptions.

[Animated UI example of TalkBack in action, describing a photo of a sunny view of Sydney Harbour, Australia, with the Sydney Opera House and Sydney Harbour Bridge in the frame.]

Compact model, big impact

The Android accessibility team recommends that developers looking to use Gemini Nano with multimodality first prototype and test on a powerful, server-side model. There, developers can understand the UX faster, iterate on prompt engineering, and get a better idea of the highest quality possible using the most capable model available.

While Gemini Nano with multimodality can supply missing context to improve image descriptions, it’s still best practice for developers to provide detailed alt text for all images in their apps or websites. If alt text is not provided, TalkBack can help fill in the gaps.
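As a minimal illustration of that best practice, an Android layout can supply alt text through the contentDescription attribute, which TalkBack announces directly without needing to infer anything (the view id, drawable, and string resource below are hypothetical):

```xml
<!-- Illustrative only: give every meaningful ImageView a contentDescription
     so screen readers like TalkBack can announce it as-is. -->
<ImageView
    android:id="@+id/harbour_photo"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:src="@drawable/sydney_harbour"
    android:contentDescription="@string/sydney_harbour_description" />
```

Keeping the description in a string resource rather than hardcoding it also lets it be localized alongside the rest of the app.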

The Android accessibility team’s goal is to create inclusive and accessible features, and leveraging Gemini Nano with multimodality to automatically provide vivid and detailed image descriptions is a big step towards that. Furthermore, their hybrid approach to AI, combining the strengths of both Gemini Nano on device and Gemini 1.5 Flash on the server, showcases the transformative potential of AI in promoting inclusivity and accessibility and highlights Google’s ongoing commitment to building for everyone.

Get started

Learn more about Gemini Nano for app development.

This blog post is part of our series: Spotlight Week on Android 15, where we provide resources — blog posts, videos, sample code, and more — all designed to help you prepare your apps and take advantage of the latest features in Android 15. You can read more in the overview of Spotlight Week: Android 15, which will be updated throughout the week.
