Loughborough University
Leicestershire, UK
LE11 3TU
+44 (0)1509 263171

Loughborough University Institutional Repository

Please use this identifier to cite or link to this item: https://dspace.lboro.ac.uk/2134/12237

Title: Video-aided model-based source separation in real reverberant rooms
Authors: Khan, Muhammad Salman
Naqvi, Syed M.R.
ur-Rehman, Ata
Wang, Wenwu
Chambers, Jonathon
Keywords: Source separation
Spatial cues
Time-frequency masking
Issue Date: 2013
Publisher: © IEEE
Citation: KHAN, M.S. ... et al., 2013. Video-aided model-based source separation in real reverberant rooms. IEEE Transactions on Audio, Speech and Language Processing, 21 (9), pp. 1900 - 1912.
Abstract: Source separation algorithms that utilize only audio data can perform poorly if multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localize individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, namely the interaural phase difference and the interaural level difference, as well as the mixing vectors, are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that by utilizing the visual modality the proposed algorithm can produce better time-frequency masks, thereby giving improved source estimates. We provide experimental results to test the proposed algorithm in different scenarios, together with comparisons against other audio-only and audio-visual algorithms, achieving improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm in order to suppress the late reverberant components of the observed stereo mixture and further enhance the overall output of the algorithm. This advantage makes our algorithm a suitable candidate for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
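The abstract describes forming soft time-frequency masks by comparing the observed interaural phase difference (IPD) against the phase each source's direction would predict. The following is a minimal sketch of that masking idea only, not the authors' full EM-refined model: the function name, the von Mises-style phase likelihood, and the idea that the per-source delays come from video localization are all illustrative assumptions.

```python
# Hedged sketch: soft time-frequency masks from IPD cues, assuming each
# source's interaural time delay has already been estimated (in the paper,
# from video localization). This omits the ILD and mixing-vector models
# and the EM refinement described in the abstract.
import numpy as np

def ipd_soft_masks(X_left, X_right, freqs, delays, kappa=5.0):
    """Return per-source soft masks, shape (n_sources, n_freq, n_frames).

    X_left, X_right : complex STFTs of the two channels, (n_freq, n_frames)
    freqs           : frequency of each bin in Hz, (n_freq,)
    delays          : assumed per-source interaural time delays in seconds
    kappa           : concentration of the (illustrative) phase model
    """
    ipd = np.angle(X_left * np.conj(X_right))          # observed IPD
    masks = []
    for tau in delays:
        predicted = 2 * np.pi * freqs[:, None] * tau   # IPD implied by delay tau
        # Unnormalized likelihood: large where observed and predicted
        # phases agree (modulo 2*pi), small where they disagree.
        masks.append(np.exp(kappa * np.cos(ipd - predicted)))
    masks = np.stack(masks)
    return masks / masks.sum(axis=0, keepdims=True)    # posteriors sum to 1
```

Each mask would then multiply one channel's STFT before an inverse STFT to reconstruct that source, which is the reconstruction step the abstract refers to.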
Description: This article was published in the journal IEEE Transactions on Audio, Speech and Language Processing [© IEEE] and the definitive version is available at: http://dx.doi.org/10.1109/TASL.2013.2261814 [Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.]
Version: Accepted for publication
DOI: 10.1109/TASL.2013.2261814
URI: https://dspace.lboro.ac.uk/2134/12237
Publisher Link: http://dx.doi.org/10.1109/TASL.2013.2261814
Appears in Collections: Published Articles (Mechanical, Electrical and Manufacturing Engineering)

Files associated with this item:

File: Video-aided model-based-Khan et al.pdf
Description: Accepted version
Size: 2.41 MB
Format: Adobe PDF



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.