Learning latent representations for spatial audio

PhD offer

Application deadline

31-05-2026

Contract start date

01-10-2026

Thesis supervisor

SERIZEL Romain

Supervision

Regular follow-up meetings with the supervisors.

Contract type

ANR (funding from research funding agencies)

Doctoral school

IAEM - INFORMATIQUE - AUTOMATIQUE - ELECTRONIQUE - ELECTROTECHNIQUE - MATHEMATIQUES

Team

MULTISPEECH

Context

This PhD takes place within the ANR-TSIA project Crosstalk. The project involves researchers from Université de Lorraine/LORIA in Nancy (France) and Institut de l'Audition in Paris (France).

Specialty

Computer science

Laboratory

LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications

Keywords

Signal processing, AI, Audio signal, Machine learning

Offer details

Motivations
Humans rely on binaural hearing to process complex sound scenes, such as speech communication in noisy environments (the cocktail party problem). Despite extensive research, this ability remains poorly understood [1]. Researchers draw inspiration from this physiological phenomenon to develop multichannel audio processing techniques, such as speech enhancement or computational auditory scene analysis. While some early work in CASA (Computational Auditory Scene Analysis) explored human perception of audio scenes, most research focuses on the acoustic properties of the signal in order to extract a source [2] or localize sounds [3]. Few studies address how these pieces of information can be integrated.
Recent advances in deep learning have shown that models can structure the latent space of audio signals effectively, with an emphasis on phonetic content. In parallel, efforts are under way to relate audio processing models to auditory perception, at the brain level [4] or the psychophysical level [5]. However, this work is often limited to a single aspect, either the content or the spatial location of the source, whereas both are essential to understanding how humans solve the cocktail party problem. This project aims to advance audio modeling by proposing new multichannel representation models with structured latent spaces.

Objectives
This thesis explores how to represent the position of a sound source independently of its content, as well as the interaction between localization and signal representation. Most existing models focus on the content of the signal and ignore spatial localization, in particular because they are single-channel and therefore carry little spatial information. We will first extend these single-channel models to exploit multichannel inputs, thereby incorporating spatial information. We will then constrain the latent space of the multichannel model so that it explicitly encodes the location of the source.
Two approaches will be studied. With explicit knowledge of the source position during training: use of a multitask framework (signal reconstruction + localization) or of a direct constraint on the latent space (additional localization task). This method has proven effective at reproducing, in the latent space, natural brain structures related to sound identity. Without explicit knowledge: use of student-teacher approaches (localization target obtained from an auxiliary network [6]) or self-supervised approaches [7] (contrastive learning [8], in which two versions of the same scene are fed to the model after transformations that do not affect localization).
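As an illustration only, the multitask objective (signal reconstruction + localization) could be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the project's method: the random linear maps `w_enc`, `w_rec` and `w_loc` are hypothetical stand-ins for trained networks, the weight `loc_weight` is an assumed hyperparameter, and predicting the (cos, sin) of the azimuth is one possible choice made here to avoid the wrap-around at ±π.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, channels, samples, latent_dim = 8, 2, 256, 16

# Toy data: flattened two-channel signals and a known azimuth (radians) per example.
signals = rng.standard_normal((batch, channels * samples))
azimuth = rng.uniform(-np.pi, np.pi, size=batch)

# Hypothetical, randomly initialised linear maps standing in for trained networks.
w_enc = rng.standard_normal((channels * samples, latent_dim)) * 0.05
w_rec = rng.standard_normal((latent_dim, channels * samples)) * 0.05
w_loc = rng.standard_normal((latent_dim, 2)) * 0.05


def multitask_loss(x, az, loc_weight=1.0):
    """Return (total, reconstruction, localization) losses for one batch."""
    z = x @ w_enc                    # shared latent representation
    x_hat = z @ w_rec                # branch 1: reconstruct the signal
    direction = z @ w_loc            # branch 2: predict (cos, sin) of the azimuth
    target = np.stack([np.cos(az), np.sin(az)], axis=1)
    rec_loss = np.mean((x_hat - x) ** 2)
    loc_loss = np.mean(np.sum((direction - target) ** 2, axis=1))
    return rec_loss + loc_weight * loc_loss, rec_loss, loc_loss


total, rec, loc = multitask_loss(signals, azimuth, loc_weight=0.5)
print(f"total={total:.3f} reconstruction={rec:.3f} localization={loc:.3f}")
```

Only the forward pass and the combined loss are shown; in practice the three maps would be neural networks trained jointly by gradient descent on the total loss.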

Keywords

Signal Processing, AI, Machine Learning, Audio signal

Subject details

Motivations
Humans rely on binaural hearing to process complex sound scenes. One example is speech communication in noisy environments with many competing sources, a challenge known as the cocktail party problem. Despite numerous studies on the topic, this ability is not fully understood [1]. Nevertheless, researchers have drawn inspiration from this physiological phenomenon for decades to develop multi-channel audio processing techniques, such as speech enhancement algorithms and computational auditory scene analysis (CASA). While some early work in CASA focused on how humans perceive complex audio scenes, most efforts have concentrated on the acoustic properties of the audio signal to achieve specific goals, such as extracting a source of interest [2] or localizing sound sources in space [3]. These studies rarely address how this information could be coded and processed together.
Recent work on deep learning-based audio representations has demonstrated the strong ability of models to provide a well-structured latent space for audio signals [4], emphasizing aspects such as phonetic content. Concurrently, there has been an increasing effort to connect audio processing models (such as speech recognition or sound source localization models) to auditory perception, either at the brain level [4] or at the psychophysical level [5]. However, these works primarily focus on a single aspect of the signal, either its content or the spatial localization of sound sources, but not both at the same time. Considering both is important for understanding how humans tackle the cocktail party problem. The aim of the project is to advance research in computer-based audio modeling and auditory modeling by proposing new multi-channel audio representation models with structured latent spaces.
Goals and Objectives
This PhD aims to explore the representation of a sound source's position independently of its content, as well as the interplay between localization and source signal representation. Most existing models focus on representing the signal content while ignoring acoustic aspects such as sound source localization. This is partly because these models are mainly single-channel models that can hardly access any spatial information. As a first step, we will extend existing single-channel audio representation models to leverage multi-channel inputs that can account for spatial information. Then we will shape the latent space of the multi-channel audio representation model to explicitly encode the spatial localization of the sound source.
We will investigate two approaches: with or without explicit knowledge of the source position during training. When the source position is explicitly known at training time, we will leverage this information in a multitask framework (where the decoder is decomposed into one branch to reconstruct the signal and another to localize the source) or as a constraint applied directly on the latent space (by performing an additional source localization task on the latent representation). This has been shown to be effective in reproducing, for example, natural auditory cortex representation structures for sound identity in the latent space. To leverage multi-channel data without explicit position information, we will investigate student-teacher approaches, where the localization target is obtained from an auxiliary localization network [6], and self-supervised approaches [7], for example based on contrastive learning [8], where two versions of the same scene (subject to transforms that do not impact localization) are fed to the model.
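To make the contrastive idea concrete, here is a minimal NumPy sketch under stated assumptions; it is an illustration, not the project's implementation. A common gain applied to all channels is used as one example of a transform that leaves inter-channel level differences, and hence this localization cue, unchanged. The function `toy_encoder` is a hypothetical stand-in for a learned multi-channel encoder, and the InfoNCE loss shown is one common contrastive objective among several.

```python
import numpy as np


def common_gain(scene, gain_db):
    """Scale every channel by the same gain; inter-channel level differences are unchanged."""
    return scene * 10.0 ** (gain_db / 20.0)


def level_difference_db(scene):
    """Level difference in dB between channel 0 and channel 1, per scene."""
    rms = np.sqrt(np.mean(scene ** 2, axis=-1))
    return 20.0 * np.log10(rms[..., 0] / rms[..., 1])


def toy_encoder(scene):
    """Hypothetical stand-in for a learned encoder: embeds each scene from its level difference."""
    ild = level_difference_db(scene)
    return np.stack([ild, np.ones_like(ild)], axis=1)


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a should match row i of z_b; other rows act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
batch, channels, samples = 4, 2, 1024
scenes = rng.standard_normal((batch, channels, samples))
scenes[:, 1] *= rng.uniform(0.2, 1.0, size=(batch, 1))  # a different level on channel 1 per scene

# Two views of each scene that differ only by a localization-preserving transform.
view_a = common_gain(scenes, gain_db=-6.0)
view_b = common_gain(scenes, gain_db=3.0)

z_a, z_b = toy_encoder(view_a), toy_encoder(view_b)
loss_matched = info_nce(z_a, z_b)
loss_shuffled = info_nce(z_a, z_b[::-1])  # deliberately mismatched pairs
print(loss_matched, loss_shuffled)
```

With correctly paired views the loss is lower than with mismatched pairs, which is the signal a trainable encoder would exploit to become invariant to the transform while remaining sensitive to spatial cues.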

Candidate profile

Excellent Python programming skills. Knowledge of PyTorch is a plus.
Training in deep learning and signal processing. Knowledge of or interest in audio, acoustics, numerical methods or optimization is an additional plus.
Second-year master's level (in computer science, signal processing, machine learning, acoustics or applied mathematics) with a strong interest in academic research.

Candidate profile

• Excellent Python programming skills. Knowledge of PyTorch is a plus.
• Training in deep learning and signal processing. Additional knowledge of or interest in audio, acoustics, numerical methods or optimization is a plus.
• Second-year master's level (in computer science, signal processing, machine learning, acoustics or applied mathematics) with a strong interest in academic research.

References

[1] McDermott, J. H. The cocktail party problem. Curr. Biol. 19, R1024–R1027 (2009).
[2] Makino, S. Audio Source Separation. (Springer International Publishing AG, New York, NY, 2018).
[3] Krause, D. A., García-Barrios, G., Politis, A. & Mesaros, A. Binaural Sound Source Distance Estimation and Localization for a Moving Listener. IEEE/ACM Trans. Audio Speech Lang. Process. 32, 996–1011 (2024).
[4] Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V. & McDermott, J. H. A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy. Neuron 98, 630–644.e16 (2018).
[5] Francl, A. & McDermott, J. H. Deep neural network models of sound localization reveal how perception is adapted to real-world environments. Nat. Hum. Behav. 6, 111–133 (2022).
[6] Watanabe, S., Hori, T., Le Roux, J. & Hershey, J. R. Student-teacher network learning with enhanced features. in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5275–5279 (2017). doi:10.1109/ICASSP.2017.7953163.
[7] Albelwi, S. Survey on Self-Supervised Learning: Auxiliary Pretext Tasks and Contrastive Learning Methods in Imaging. Entropy 24, 551 (2022).
[8] Le-Khac, P. H., Healy, G. & Smeaton, A. F. Contrastive Representation Learning: A Framework and Review. IEEE Access 8, 193907–193934 (2020).