Keywords:
-
Abstract:
Localisation of multiple active speakers in natural environments with only two
microphones is a challenging problem. Reverberation degrades the performance of
speaker localisation based exclusively on directional cues. This paper presents
an approach based on audio-visual fusion. The audio modality performs the
multiple speaker localisation using the Skeleton method, energy
weighting, and precedence effect filtering and weighting. The video modality
performs the active speaker detection based on the analysis of the lip region
of the detected speakers. The audio modality alone has problems with
localisation accuracy, while the video modality alone has problems with false
detections. The estimation results of both modalities are represented as
probabilities in the azimuth domain. A Gaussian fusion method is proposed to
combine the estimates in a late stage. As a consequence, localisation
accuracy and robustness are significantly increased compared with either the
audio or the video modality alone. Experimental results in different scenarios
confirm the improved performance of the proposed method.
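
The late fusion step can be illustrated with a minimal sketch. The abstract does not give the fusion formula, so the code below assumes that each modality reports one speaker as a Gaussian over azimuth (mean and standard deviation in degrees) and that the two are combined as a precision-weighted product of Gaussians. The function name `fuse_gaussians` and the example numbers are hypothetical, and angle wrap-around is ignored.

```python
import math


def fuse_gaussians(mu_a, sigma_a, mu_v, sigma_v):
    """Fuse two 1-D Gaussian azimuth estimates (degrees).

    Assumption: fusion is the normalised product of the audio and video
    Gaussians. This is an illustrative choice, not necessarily the
    method used in the paper.
    """
    # Precision (inverse variance) of each modality.
    w_a = 1.0 / sigma_a ** 2
    w_v = 1.0 / sigma_v ** 2
    # The fused mean is the precision-weighted average of the two means.
    mu = (w_a * mu_a + w_v * mu_v) / (w_a + w_v)
    # The fused precision is the sum of the individual precisions.
    sigma = math.sqrt(1.0 / (w_a + w_v))
    return mu, sigma


if __name__ == "__main__":
    # Hypothetical estimates: a broad audio estimate and a sharp video one.
    mu, sigma = fuse_gaussians(mu_a=30.0, sigma_a=10.0, mu_v=20.0, sigma_v=2.0)
    print(f"fused azimuth: {mu:.2f} deg, std: {sigma:.2f} deg")
```

Under this assumption the fused estimate is pulled towards the more confident modality and its standard deviation is never larger than that of either input, which matches the qualitative claim that fusion improves accuracy over each modality alone.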