
Record


Released

Conference Paper

Audio-visual Multiple Active Speaker Localisation in Reverberant Environments

MPG Authors
http://pubman.mpdl.mpg.de/cone/persons/resource/persons44529

Grochulla, Martin Peter
Computer Graphics, MPI for Informatics, Max Planck Society;

http://pubman.mpdl.mpg.de/cone/persons/resource/persons45618

Thormählen, Thorsten
Computer Graphics, MPI for Informatics, Max Planck Society;

External Resources
No external resources are available
Full texts (freely accessible)

dafx12_submission_29.pdf
(any fulltext), 2MB

Supplementary Material (freely accessible)
No freely accessible supplementary materials are available
Citation

Li, Z., Herfet, T., Grochulla, M. P., & Thormählen, T. (2012). Audio-visual Multiple Active Speaker Localisation in Reverberant Environments. In Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12) (pp. 1-8). York, UK.


Citation link: http://hdl.handle.net/11858/00-001M-0000-0014-F30F-5
Abstract
Localisation of multiple active speakers in natural environments with only two microphones is a challenging problem. Reverberation degrades the performance of speaker localisation based exclusively on directional cues. This paper presents an approach based on audio-visual fusion. The audio modality performs the multiple speaker localisation using the Skeleton method, energy weighting, and precedence effect filtering and weighting. The video modality performs the active speaker detection based on the analysis of the lip region of the detected speakers. The audio modality alone has problems with localisation accuracy, while the video modality alone has problems with false detections. The estimation results of both modalities are represented as probabilities in the azimuth domain. A Gaussian fusion method is proposed to combine the estimates in a late stage. As a consequence, the localisation accuracy and robustness compared to the audio/video modality alone is significantly increased. Experimental results in different scenarios confirmed the improved performance of the proposed method.
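The late-fusion idea described in the abstract can be illustrated with a minimal sketch: if each modality's azimuth estimate is modelled as a Gaussian, combining them multiplicatively yields a precision-weighted average, so the fused estimate leans toward the more confident modality. This is a hypothetical illustration of Gaussian fusion in general, not the authors' exact formulation; the function name and parameters are assumptions.

```python
import math

def fuse_gaussians(mu_audio, sigma_audio, mu_video, sigma_video):
    """Late fusion of two azimuth estimates (degrees), each modelled as a
    Gaussian. The product of two Gaussian densities is proportional to a
    Gaussian whose mean is the precision-weighted average of the inputs.
    Hypothetical sketch -- not the paper's exact method."""
    w_a = 1.0 / sigma_audio ** 2   # precision (inverse variance) of audio estimate
    w_v = 1.0 / sigma_video ** 2   # precision of video estimate
    mu = (w_a * mu_audio + w_v * mu_video) / (w_a + w_v)  # fused azimuth
    sigma = math.sqrt(1.0 / (w_a + w_v))  # fused uncertainty, never larger than either input
    return mu, sigma

# Illustrative values: audio says 30 deg with high uncertainty, video says
# 25 deg with low uncertainty; the fused estimate lands closer to video.
mu, sigma = fuse_gaussians(30.0, 8.0, 25.0, 3.0)
```

Because the fused variance is strictly smaller than either input variance, fusion increases confidence even when the two modalities agree, which matches the abstract's claim that fusion improves both accuracy and robustness over either modality alone.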