Building a generic multilingual, multi-speaker articulatory model from real-time MRI data

PhD position

Application deadline

15-05-2026

Contract start date

01-10-2026

Thesis supervisor

LAPRIE Yves

Supervision

The doctoral student will have access to both laboratories, Loria and IADI, and to the technical resources (a computer and access to the computing clusters) needed to work in very good conditions. A progress meeting will be held every week, and each of the two teams runs a weekly scientific seminar. The student will also have the opportunity to attend one or two summer schools and conferences on MRI and on automatic speech processing, and will receive support in writing conference and journal papers.

Contract type

ANR (funding from research funding agencies)

Doctoral school

IAEM - INFORMATIQUE - AUTOMATIQUE - ELECTRONIQUE - ELECTROTECHNIQUE - MATHEMATIQUES

Team

MULTISPEECH

Context

Research environment: The doctoral student will have access to real-time MRI databases already acquired as part of the ANR ArtSpeech project (approximately 10 minutes of speech from 10 speakers [6]) and the Full3FDTalkingHead project (approximately 6 hours of speech from 2 speakers), as well as several other projects. The IADI and Loria laboratories are among the most advanced in real-time MRI data acquisition and analysis, with recent work on acoustic-to-articulatory inversion of speech signals [7]. The PhD student will also be able to take part in ongoing data collection for several languages using the MRI system available at the IADI laboratory. The scientific environments of the two teams are highly complementary: the IADI laboratory has strong expertise in all areas of MRI and anatomy, and the Loria MultiSpeech team in deep learning. The two teams are geographically close (1.5 km apart).

Specialty

Computer science

Laboratory

LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications

Position details

Context and challenges
Speech production relies on the precise coordination of the articulators (tongue, jaw, lips), whose movement changes the geometry of the vocal tract and, in turn, its acoustic properties. Unlike conventional synthesis methods, which focus solely on sound quality, articulatory synthesis aims to control the entire production process. Its applications are broad: understanding phonetic contrasts, simulating motor impairments, adapting to new speakers, and acoustic-to-articulatory inversion.
The core of this technology is the articulatory model, which defines the shape of the vocal tract with a small number of parameters. Until now, these models have often been limited to a single speaker or a single language, because they were based on geometric primitives or static MRI data.

Thesis objectives
The ambition of this project, led by the MultiSpeech team (Loria) in collaboration with the IADI laboratory (INSERM), is to overcome these restrictions by creating a generic model that is independent of speaker and language. The work will be organized around two main axes:
1. Construction of the generic model: Develop an anatomical normalization method that removes the morphological differences between individuals, in order to create a universal control system for the articulators.
2. Dynamic adaptation: Define a methodology for projecting the phonemes, places of articulation and specific movements (diphthongs, affricates) of a new language or a new speaker into the reference frame of the generic model.

Methodology and technologies
The project relies on large-scale real-time MRI data (50 images per second), which makes it possible to visualize the vocal tract in motion.
• Acquisition: A database covering about thirty speakers and languages is being built at the CHRU de Nancy.
• Image processing: To improve the precise tracking of contours (in particular the tongue tip, which is crucial for acoustics), the project will use state-of-the-art deep learning architectures such as nnU-Net.
• Deep learning: The student will use deep learning techniques to model the relationship between phoneme sequences and articulatory trajectories.

Environment and candidate profile
The doctoral student will work in a strong research environment, with access to Loria's computing resources and IADI's expertise in medical imaging. Funding is provided by the ANR ArtAny project.

Required skills:
• Proficiency in deep learning.
• A solid background in applied mathematics and computer science.
• An interest in speech processing and medical imaging is an asset.
________________________________________
Key points
• Laboratories: Loria (MultiSpeech) & IADI.
• Keywords: deep learning, articulatory synthesis, dynamic MRI, nnU-Net, anatomical normalization.
• Thesis supervisor: Yves Laprie.

Keywords

deep learning, computer science, speech processing, applied mathematics, real time MRI, articulatory speech synthesis

Subject details

Context and Rationale
Speech production is a complex motor process involving the coordination of articulators (tongue, jaw, lips) to reshape the vocal tract, which in turn defines the acoustic properties of the voice. While modern speech synthesis often focuses on output quality, articulatory synthesis aims to replicate the physical production process itself. This approach is invaluable for understanding phonetic contrasts, simulating speech impairments, and performing acoustic-to-articulatory inversion. Historically, articulatory models have been limited by their reliance on static data or single-speaker/single-language datasets. The MultiSpeech team (Loria) recently advanced this field by using dynamic MRI to model a single speaker's vocal tract. The next frontier, and the focus of this thesis, is to move beyond individual constraints toward a universal model.

Research Objectives
The primary goal of this PhD project is to construct a generic articulatory model that is independent of both the specific speaker and the language being spoken. The research is divided into two strategic axes:
1. Generic Model Construction: This involves developing an anatomical normalization framework. By neutralizing individual morphological differences, the model can establish a universal control system for articulators that applies to any human anatomy.
2. Multilingual and multispeaker adaptation: The student will map the specific articulatory movements and places of articulation (including complex sounds like diphthongs and affricates) of new languages into the coordinate system of the generic model.

Methodology and Innovation
The project leverages cutting-edge real-time MRI (rtMRI) data, captured at 50 frames per second. This provides a high-fidelity view of the mid-sagittal plane of the vocal tract during natural speech.
• Data Acquisition: In collaboration with the IADI laboratory, data from approximately 30 speakers and languages is being collected.
• Advanced Tracking: To ensure acoustic accuracy, the project will implement state-of-the-art deep learning techniques, specifically the nnU-Net framework, to improve the tracking of critical articulators like the tongue tip.
• Modeling: The student will use deep learning to bridge the gap between phonemic sequences and the geometric evolution of the vocal tract.

Scientific Environment
The doctoral candidate will operate within a highly complementary environment:
• IADI (INSERM): World-class expertise in MRI acquisition and anatomical analysis.
• Loria (MultiSpeech): Leaders in deep learning and speech processing.
• Resources: Access to extensive existing databases (ANR ArtSpeech, Full3FDTalkingHead), high-performance computing clusters, and the ANR ArtAny project funding.

Candidate Profile
The ideal candidate (master's degree in computer science and/or applied mathematics) should possess strong expertise in deep learning, applied mathematics, and computer science. Additional knowledge in speech processing or medical imaging (MRI) is highly desirable to navigate this interdisciplinary project.

Key Information
• Supervisor: Yves Laprie
• Keywords: Articulatory Synthesis, Real-time MRI, Deep Learning, Anatomical Normalization, Speech Processing.
• Location: Nancy, France (Loria & IADI).

Candidate profile

The candidate (master's degree in computer science or applied mathematics) must have a strong background in deep learning, applied mathematics, and computer science. Knowledge of speech processing and magnetic resonance imaging (MRI) is also desirable.

References

[1] B. J. Kröger, V. Graf-Borttscheller, A. Lowit. Two- and Three-Dimensional Visual Articulatory Models for Pronunciation Training and for Treatment of Speech Disorders. Proc. of Interspeech 2008, Brisbane, Australia, 2008.
[2] Y. Laprie, J. Busset. Construction and evaluation of an articulatory model of the vocal tract. 19th European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain, 2011.
[3] Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie. Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication, 2022, 141, pp. 1-13. ⟨10.1016/j.specom.2022.04.004⟩. ⟨hal-03650212⟩
[4] Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Jacques Felblinger, Pierre-André Vuissoz, et al. Automatic segmentation of vocal tract articulators in real-time magnetic resonance imaging. Computer Methods and Programs in Biomedicine, 243 (2). ⟨10.1016/j.cmpb.2023.107907⟩. ⟨hal-04376938⟩
[5] F. Isensee, P. F. Jaeger, S. A. A. Kohl, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 2021, 18, pp. 203-211. https://doi.org/10.1038/s41592-020-01008-z
[6] Karyna Isaieva, Yves Laprie, Justine Leclère, Ioannis K. Douros, Jacques Felblinger, et al. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data, 2021, 8 (1), pp. 258. ⟨10.1038/s41597-021-01041-3⟩. ⟨hal-03507532⟩
[7] Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie. Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data. Interspeech 2025, Aug 2025, Rotterdam, Netherlands. pp. 978-982. ⟨10.21437/Interspeech.2025-963⟩. ⟨hal-05293831⟩