Generative Probabilistic Alignment Models for Words and Subwords: a Systematic Exploration of the Limits and Potentials of Neural Parametrizations

Thesis, Year: 2021

French title: Modèles d’alignement probabilistes génératifs pour les mots et sous-mots : une exploration systématique des limites et potentialités des paramétrisations neuronales

1 LIMSI - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (Bâtiment 507, rue du Belvédère, 91405 Orsay cedex, France)
Université Paris-Saclay (Bâtiment Bréguet, 3 Rue Joliot Curie 2e ét, 91190 Gif-sur-Yvette, France)
CNRS - Centre National de la Recherche Scientifique : UPR3251 (France)

2 LISN - Laboratoire Interdisciplinaire des Sciences du Numérique (Campus Universitaire bât 507, Rue du Belvédère, 91405 Orsay cedex, and Campus Universitaire bât 640, 1 rue Raimond Castaing, 91190 Gif-sur-Yvette, France)
Inria - Institut National de Recherche en Informatique et en Automatique (Domaine de Voluceau Rocquencourt - BP 105, 78153 Le Chesnay Cedex, France)
CentraleSupélec (3, rue Joliot Curie, Plateau de Moulon, 91192 Gif-sur-Yvette Cedex, France)
Université Paris-Saclay (Bâtiment Bréguet, 3 Rue Joliot Curie 2e ét, 91190 Gif-sur-Yvette, France)
CNRS - Centre National de la Recherche Scientifique : UMR9015 (France)

Anh Khoa Ngo Ho

Role: Author
PersonId: 744016
IdHAL: anh-khoa-ngo-ho
ORCID: 0000-0002-4844-5012
IdRef: 255027419

Abstract

Alignment consists of establishing a mapping between units in a bitext, which pairs a text in a source language with its translation in a target language. Alignments can be computed at several levels: between documents, between sentences, between phrases, between words, or even between smaller units when one of the languages is morphologically complex, which requires aligning fragments of words (morphemes). Alignments can also be considered between more complex linguistic structures such as trees or graphs. This is a complex, under-specified task that humans accomplish with difficulty. Its automation is a notoriously difficult problem in natural language processing, historically associated with the first probabilistic word-based translation models. The design of new models for natural language processing, based on distributed representations computed by neural networks, allows us to question and revisit the computation of these alignments. This research project therefore aims to comprehensively understand the limitations of existing statistical alignment models and to design neural models that can be learned without supervision, in order to overcome these drawbacks and to improve the state of the art in terms of alignment accuracy.

Alignment consists of matching units within bitexts, which pair a text in a source language with its translation in a target language. Alignment can be defined at several levels: between sentences, between phrases, between words, or even at a finer level when one of the languages is morphologically complex, which requires aligning word fragments (morphemes). Alignment can also be considered over more complex linguistic structures such as trees or graphs. It is a complex, under-specified task that humans perform with difficulty. Its automation is an emblematic problem in natural language processing, historically associated with the first probabilistic translation models. The maturing of new models for natural language processing, based on distributed representations computed by neural networks, makes it possible to revisit the question of how these alignments are computed. This research therefore aims to design neural models that can be learned without supervision, in order to overcome some of the limitations of statistical alignment models and to improve the state of the art in the accuracy of automatic alignments.

Keywords

Machine translation; Word alignment; Artificial neural network

Domains

Computer Science [cs]

Main file

Generative Probabilistic Alignment Models for Words and Subwords - a Systematic Exploration of the Limits and Potentials of Neural Parametrizations.pdf (4.66 MB)

Origin: Files produced by the author(s)

https://hal.science/tel-03269967

Submitted on: Thursday, 24 June 2021, 14:33:42

Last modified on: Friday, 17 May 2024, 16:36:03

Long-term archiving on: Saturday, 25 September 2021, 18:30:04

Dates and versions

tel-03269967, version 1 (24-06-2021)

Identifiers

HAL Id: tel-03269967, version 1

Cite

Anh Khoa Ngo Ho. Generative Probabilistic Alignment Models for Words and Subwords: a Systematic Exploration of the Limits and Potentials of Neural Parametrizations. Computer Science [cs]. Université Paris-Saclay, 2021. English. ⟨NNT : ⟩. ⟨tel-03269967⟩

Collections

CNRS INRIA LIMSI CENTRALESUPELEC UNIV-PARIS-SACLAY LISN GS-ENGINEERING GS-COMPUTER-SCIENCE

133 Views

200 Downloads