Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for instance, acoustic, text, and/or automatic speech recognition (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combining scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities at inference time.
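To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch of non-linear intermediate fusion with modality dropout. The paper does not publish an implementation, so all dimensions, layer choices, and the drop probability below are assumptions for illustration only: each modality (acoustic, ASR/text, prosody) is assumed to arrive as a fixed-size embedding, and a missing modality is simulated during training by zeroing out its projected embedding.

```python
import torch
import torch.nn as nn

class IntermediateFusionDDSD(nn.Module):
    """Hypothetical sketch: fuse per-modality embeddings with a non-linear
    head, and randomly drop whole modalities during training so the model
    learns to cope with missing inputs at inference time."""

    def __init__(self, dims=(128, 128, 64), hidden=256, p_modality_drop=0.2):
        super().__init__()
        self.p_modality_drop = p_modality_drop
        # Per-modality projections into a common space (assumed design).
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        # Non-linear fusion head over the concatenated projections.
        self.fusion = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # binary logit: device-directed or not
        )

    def forward(self, embeddings):
        projected = []
        for proj, x in zip(self.proj, embeddings):
            h = torch.relu(proj(x))
            # Modality dropout: in training mode, zero out an entire
            # modality with probability p_modality_drop.
            if self.training and torch.rand(()) < self.p_modality_drop:
                h = torch.zeros_like(h)
            projected.append(h)
        return self.fusion(torch.cat(projected, dim=-1))

# Usage: a batch of 4 utterances with acoustic, ASR, and prosody embeddings.
model = IntermediateFusionDDSD()
acoustic, asr, prosody = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 64)
logits = model([acoustic, asr, prosody])  # shape: (4, 1)
```

Zeroing a modality at training time is one common way to realize modality dropout; at inference, an unavailable modality can be fed as the same all-zeros placeholder, which is what makes the fused model degrade gracefully rather than fail outright.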