Contextualization of ASR with LLM Using Phonetic Retrieval-Based Augmentation



Large language models (LLMs) have shown excellent capability in modeling multimodal signals, including audio and text, allowing the model to generate a spoken or textual response given a speech input. However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM: we first let the LLM detect named entities in speech without any context, then use these named entities as queries to retrieve phonetically similar named entities from a personal database and feed them to the LLM, and finally run context-aware LLM decoding. In a voice assistant task, our solution achieved up to 30.2% relative word error rate reduction and 73.6% relative named entity error rate reduction compared to a baseline system without contextualization. Notably, our solution by design avoids prompting the LLM with the full named entity database, making it highly efficient and applicable to large named entity databases.
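The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a real system would compare phoneme sequences produced by a grapheme-to-phoneme model, whereas here the standard-library `difflib.SequenceMatcher` on lowercased spellings serves as a stand-in similarity measure, and the function and database names are hypothetical.

```python
from difflib import SequenceMatcher

def phonetic_retrieve(query, contacts, k=3):
    """Return the k database entries most similar to the detected entity.

    `query` is the named entity the LLM detected without context;
    `contacts` is the personal named-entity database. String similarity
    stands in for the phonetic distance a real system would compute
    over phoneme sequences.
    """
    scored = sorted(
        contacts,
        key=lambda c: SequenceMatcher(None, query.lower(), c.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_context_prompt(transcript, candidates):
    # Feed only the retrieved candidates (not the whole database) to the
    # LLM for the final context-aware decoding pass.
    return (
        f"Possible contact names: {', '.join(candidates)}.\n"
        f"Transcribe the utterance, preferring these names: {transcript}"
    )

contacts = ["Bob", "Catalina", "Kathryn"]
candidates = phonetic_retrieve("Catherine", contacts, k=2)
prompt = build_context_prompt("call catherine", candidates)
```

Because only the top-k retrieved names enter the prompt, the context length stays constant regardless of database size, which is the efficiency property the abstract highlights.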
