Abstract: Audio comprehension—including speech, non-speech sounds, and music—is essential for AI agents to interact effectively with the world. Yet research in audio processing has lagged behind areas like language and vision, hindered by limited datasets and by the need for architectures and training methods suited to the inherent complexities of audio. At the same time, the rise of Large Language Models (LLMs) offers promising new directions: LLMs have shown a remarkable ability to understand and reason about the world through language, pushing forward foundational audio tasks like Automatic Speech Recognition (ASR), cross-modal retrieval, and audio captioning.
Our group is working to bridge this gap through a series of innovative solutions, starting with GAMA, our large audio-language model designed for advanced audio perception and complex reasoning. GAMA combines a specialized architecture, optimized audio encoding, and a novel alignment dataset, positioning it as a leader across benchmarks for audio understanding, reasoning, and hallucination reduction. Good representations are key to advancing perception, and GAMA’s development builds on our past work, such as MAST, SLICER, and EH-MAM, novel approaches for learning strong audio representations from unlabeled data. Complementing this, we introduced ReCLAP, a state-of-the-art audio-language encoder, and CompA, one of the first projects to tackle compositional reasoning in audio-language models—a critical challenge given audio’s inherently compositional nature.
Looking forward, we envision Large Audio-Language Models (LALMs) becoming integral to daily life, capable of conversational speech QA, information-extraction-based QA, and answering knowledge-driven questions about diverse audio inputs. Achieving these ambitious goals requires advances in both data and architectures. Synthio, our latest synthetic data generation framework, supports this mission by generating training data for complex audio understanding. Progress must also be measurable, so we are dedicated to establishing comprehensive benchmarks; our recent work, MMAU, rigorously tests LALMs on real-world tasks.
Speaker’s Bio: Dinesh Manocha is the Paul Chrisman-Iribe Chair in Computer Science & ECE and a Distinguished University Professor at the University of Maryland, College Park. His research interests include virtual environments, physically-based modeling, and robotics. His group has developed a number of software packages that have become standards and are licensed to 60+ commercial vendors. He has published more than 800 papers and supervised 52 PhD dissertations. He is a Fellow of AAAI, AAAS, ACM, IEEE, and NAI; a member of the ACM SIGGRAPH and IEEE VR Academies; and a recipient of the Bézier Award from the Solid Modeling Association. He received the Distinguished Alumni Award from IIT Delhi and the Distinguished Career in Computer Science Award from the Washington Academy of Sciences. He was a co-founder of Impulsonic, a developer of physics-based audio simulation technologies, which was acquired by Valve Inc. in November 2016.