
Record


Released

Conference Paper

Audio-visual Multiple Active Speaker Localisation in Reverberant Environments

MPG Authors
http://pubman.mpdl.mpg.de/cone/persons/resource/persons44529

Grochulla, Martin Peter
Computer Graphics, MPI for Informatics, Max Planck Society;

http://pubman.mpdl.mpg.de/cone/persons/resource/persons45618

Thormählen, Thorsten
Computer Graphics, MPI for Informatics, Max Planck Society;

External Resources
No external resources are available
Full texts (freely accessible)

dafx12_submission_29.pdf
(any fulltext), 2MB

Supplementary Material (freely accessible)
No freely accessible supplementary materials are available
Citation

Li, Z., Herfet, T., Grochulla, M. P., & Thormählen, T. (2012). Audio-visual Multiple Active Speaker Localisation in Reverberant Environments. In Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12) (pp. 1-8). York, UK.


Citation link: http://hdl.handle.net/11858/00-001M-0000-0014-F30F-5
Abstract
Localisation of multiple active speakers in natural environments with only two microphones is a challenging problem. Reverberation degrades the performance of speaker localisation based exclusively on directional cues. This paper presents an approach based on audio-visual fusion. The audio modality performs the multiple speaker localisation using the Skeleton method, energy weighting, and precedence effect filtering and weighting. The video modality performs the active speaker detection based on the analysis of the lip region of the detected speakers. The audio modality alone has problems with localisation accuracy, while the video modality alone has problems with false detections. The estimation results of both modalities are represented as probabilities in the azimuth domain. A Gaussian fusion method is proposed to combine the estimates in a late stage. As a consequence, the localisation accuracy and robustness compared to the audio/video modality alone is significantly increased. Experimental results in different scenarios confirmed the improved performance of the proposed method.
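The late-fusion idea described in the abstract can be illustrated with a minimal sketch: if each modality's azimuth estimate is modelled as a Gaussian, combining them multiplicatively yields a precision-weighted average, so the fused estimate leans toward the more confident modality. This is a hypothetical illustration of Gaussian fusion in general, not the authors' exact formulation; the function name and parameters are assumptions.

```python
import math

def fuse_gaussians(mu_audio, sigma_audio, mu_video, sigma_video):
    """Late fusion of two azimuth estimates (degrees), each modelled as a
    Gaussian. The product of two Gaussian densities is proportional to a
    Gaussian whose mean is the precision-weighted average of the inputs.
    Hypothetical sketch -- not the paper's exact method."""
    w_a = 1.0 / sigma_audio ** 2   # precision (inverse variance) of audio estimate
    w_v = 1.0 / sigma_video ** 2   # precision of video estimate
    mu = (w_a * mu_audio + w_v * mu_video) / (w_a + w_v)  # fused azimuth
    sigma = math.sqrt(1.0 / (w_a + w_v))  # fused uncertainty, never larger than either input
    return mu, sigma

# Illustrative values: audio says 30 deg with high uncertainty, video says
# 25 deg with low uncertainty; the fused estimate lands closer to video.
mu, sigma = fuse_gaussians(30.0, 8.0, 25.0, 3.0)
```

Because the fused variance is strictly smaller than either input variance, fusion increases confidence even when the two modalities agree, which matches the abstract's claim that fusion improves both accuracy and robustness over either modality alone.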