This paper was accepted at the HSCMA workshop at ICASSP 2024.
Voice triggering (VT) enables users to activate their devices by simply speaking a trigger phrase. A front-end system is typically used to perform speech enhancement and/or separation, and produces multiple enhanced and/or separated signals. Since conventional VT systems take only single-channel audio as input, channel selection is performed. A drawback of this approach is that unselected channels are discarded, even if the discarded channels could contain useful information for VT. In this work, we propose multichannel acoustic models for VT, where the multichannel output from the front-end is fed directly into a VT model. We adopt a transform-average-concatenate (TAC) block and modify the TAC block by incorporating the channel from the conventional channel selection so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to a 30% reduction in the false rejection rate compared to the baseline channel selection approach.
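To make the transform-average-concatenate idea more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It applies a shared transform to every channel, averages the transformed features across channels, and concatenates that average back onto each channel. The abstract does not say how the selected channel is incorporated, so the `selected` argument, which appends the selected channel's transformed features to every channel, is purely an assumption for illustration. All names, layer sizes, and the random weights are hypothetical.

```python
import random

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def tac_block(channels, W_t, W_a, W_c, selected=None):
    """One TAC step over per-channel feature vectors.

    channels: list of C feature vectors, each of length D
    W_t:      H x D shared transform weights
    W_a:      H x H weights applied to the cross-channel average
    W_c:      D x (2H) weights, or D x (3H) when `selected` is given
    selected: index of the channel chosen by channel selection
              (hypothetical way to inject it; not from the paper)
    """
    # Transform: the same layer is applied to every channel.
    h = [relu(matvec(W_t, z)) for z in channels]
    # Average: pool the transformed features across channels.
    mean = [sum(col) / len(h) for col in zip(*h)]
    g = relu(matvec(W_a, mean))
    # Assumed modification: also expose the selected channel's features.
    extra = h[selected] if selected is not None else []
    # Concatenate: each channel sees its own features plus the shared ones.
    return [relu(matvec(W_c, hc + g + extra)) for hc in h]

# Toy example: C = 3 front-end output channels, D = 4 features, H = 8 hidden units.
rng = random.Random(0)
D, H, C = 4, 8, 3
channels = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(C)]
W_t = rand_matrix(H, D, rng)
W_a = rand_matrix(H, H, rng)
W_c = rand_matrix(D, 3 * H, rng)
out = tac_block(channels, W_t, W_a, W_c, selected=0)
```

Without the `selected` argument, the block treats channels symmetrically: reordering the input channels only reorders the outputs. Feeding in the selected channel breaks that symmetry, which is one plausible way a model could be pointed at a target speaker.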