Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Moreover, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT by 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
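The abstract does not include an implementation, but the core ideas (a frozen pre-trained layer augmented by a trainable low rank adapter, plus adapter dropout so the model tolerates a missing modality branch) can be illustrated with a minimal PyTorch-style sketch. This is an assumption-laden illustration, not the authors' code: the class name LoRALinear, the rank, and the adapter_dropout_p parameter are all hypothetical, and the paper's actual fusion architecture is not specified here.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low rank adapter.

    Illustrative sketch only: with probability `adapter_dropout_p` the whole
    adapter branch is skipped during training, so the model also learns to
    operate from the frozen pathway alone (a stand-in for the adapter
    dropout idea described in the abstract).
    """

    def __init__(self, base: nn.Linear, rank: int = 8, adapter_dropout_p: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        # Low rank decomposition: in_features -> rank -> out_features
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.adapter_dropout_p = adapter_dropout_p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)
        if self.training and torch.rand(()) < self.adapter_dropout_p:
            return y  # drop the adapter branch for this batch
        return y + self.lora_b(self.lora_a(x))

In such a setup, only the adapter matrices are trainable while the base LLM stays frozen, which is consistent with the abstract's claim of tuning only a fraction of the parameters; how the audio or video features are projected and fused is an open detail not covered by this post.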
