Pawel Swietojanski (Scientia Fellow @CSE)

Static Visual Spatial Priors for Speech Processing

Abstract

Human perception relies on multi-modal processing that integrates many sensory inputs and auxiliary knowledge sources. In the particular case of hearing, research has shown that under high acoustic uncertainty (noisy or reverberant acoustic environments, unfamiliar speakers, simultaneous speech, etc.), our brains can efficiently leverage auxiliary sources such as vision and/or higher-level conceptual understanding or common-sense reasoning to better separate target speech sources or to fill inaudible gaps. It is thus interesting to consider similar mechanisms when designing machine perception algorithms. In this talk I will focus on speech processing, and on one special case of visual information that we call a static visual spatial prior. Such priors can be estimated infrequently, and independently of and asynchronously from the primary audio stream. This relaxes the compute requirements for parallel processing and allows for edge deployments, which in turn offer a level of privacy similar to that of an audio-only system. We show the efficacy of such priors on two benchmarks. The first measures the accuracy of direction-of-arrival (DoA) estimation of sound; DoA is a basic building block in many applications that rely on microphone arrays to capture spatial sound. The second builds on DoA and shows how such priors, combined with additional semantic information, can improve acoustic modelling for distant speech recognition in ambiguous acoustic environments.

Bio

Pawel Swietojanski is currently a Scientia Fellow and Lecturer in the School of Computer Science and Engineering at UNSW Sydney. His main research interests include machine learning and its applications in spoken language processing. Prior to UNSW, Pawel held research-related positions in both industry and academia. He was awarded a PhD in Computer Science from the University of Edinburgh in 2016.