*=Equal Contributors
This paper was accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2023.
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more natural, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We address this task by combining the decoder signals of an automatic speech recognition (ASR) system with acoustic and lexical representations as input features to a large language model (LLM). We are interested in data- and resource-efficient systems that require only a small amount of training data and can potentially run on devices such as smartphones. For this reason, our model is finetuned on a small amount of multimodal data using low-rank adaptation. We compare the proposed system to unimodal models that rely either on lexical or acoustic information only. The effectiveness of our method is analyzed by finetuning decoder-only LLMs with sizes between 3 billion and 13 billion parameters on training data consisting of 10k to 80k utterances. We show that our best multimodal system yields better results than unimodal baselines while using only a fraction of the training data.
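The abstract highlights low-rank adaptation (LoRA) as the key to finetuning a large model on a small multimodal dataset. As a rough illustration of the mechanism (not the paper's implementation; all dimensions and initializations below are hypothetical), a LoRA-adapted linear layer freezes the pretrained weight W and trains only two small factors A and B:

```python
import numpy as np

# Minimal sketch of low-rank adaptation (LoRA): instead of updating the
# full weight matrix W (d_out x d_in), train only two low-rank factors
# A (r x d_in) and B (d_out x r), so the adapted layer computes
# y = W x + (alpha / r) * B A x.  All sizes here are illustrative.

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8  # hypothetical dimensions

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # zero init: adapter starts as a no-op

def lora_forward(x):
    """Forward pass of a LoRA-adapted linear layer."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0, the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)

# Only A and B are trained: a small fraction of the full weight count.
lora_params = A.size + B.size   # 256 + 256 = 512
full_params = W.size            # 64 * 64 = 4096
print(f"trainable fraction: {lora_params / full_params:.3f}")
```

The zero initialization of B means finetuning starts from the unmodified pretrained model, which is part of why the method works well with limited training data.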