Publications
183 results found
Fomicheva M, Sun S, Fonseca E, et al., 2020, MLQE-PE: A multilingual quality estimation and post-editing dataset, Publisher: arXiv
We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains seven language pairs, with human labels for 9,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.
Lertvittayakumjorn P, Specia L, Toni F, 2020, FIND: Human-in-the-loop debugging deep text classifiers, 2020 Conference on Empirical Methods in Natural Language Processing, Publisher: ACL
Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND, a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).
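The core debugging step FIND enables, disabling hidden features a human judged irrelevant, can be sketched as a simple masking operation before the final classification layer. A minimal illustration, not the paper's implementation (the function name, indices, and activation values are invented):

```python
import numpy as np

def apply_feature_mask(hidden_features, disabled):
    """Zero out (disable) the hidden feature dimensions that a human
    judged irrelevant, before the final classification layer.
    Indices and values are purely illustrative."""
    mask = np.ones(len(hidden_features))
    mask[list(disabled)] = 0.0
    return np.asarray(hidden_features, dtype=float) * mask

features = [0.5, 2.0, -1.0, 0.3]                  # hypothetical hidden activations
masked = apply_feature_mask(features, disabled={1, 2})
assert list(masked) == [0.5, 0.0, 0.0, 0.3]       # features 1 and 2 disabled
```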
Caglayan O, Ive J, Haralampieva V, et al., 2020, Simultaneous machine translation with visual context, Conference on Empirical Methods in Natural Language Processing (EMNLP), Publisher: Association for Computational Linguistics, Pages: 2350-2361
Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.
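For context, a common scheduling policy that SiMT frameworks of this kind build on is wait-k: read k source tokens before producing any output, then alternate one write per read. A generic sketch of that policy (not this paper's multimodal contribution):

```python
def wait_k_schedule(src_len, tgt_len, k=2):
    """Generic wait-k simultaneous-translation policy: READ k source
    tokens first, then alternate WRITE/READ until the source is
    exhausted, after which the remaining target tokens are written."""
    actions = []
    read = write = 0
    while write < tgt_len:
        if read < min(write + k, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            write += 1
    return actions

acts = wait_k_schedule(src_len=4, tgt_len=4, k=2)
assert acts[:2] == ["READ", "READ"]   # initial wait of k=2 tokens
assert acts.count("WRITE") == 4       # every target token gets written
```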
Fomicheva M, Sun S, Yankovskaya L, et al., 2020, Unsupervised quality estimation for neural machine translation, Transactions of the Association for Computational Linguistics, Vol: 8, Pages: 539-555, ISSN: 2307-387X
Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Unlike most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
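The glass-box idea, using uncertainty information the MT system already produces as a by-product of translation, can be sketched with a toy Monte Carlo dropout-style aggregation. The helper name and the log-probability values below are illustrative, not the paper's exact features:

```python
import numpy as np

def mc_dropout_qe(token_logprob_samples):
    """Toy glass-box QE signal: aggregate the log-probabilities the
    model assigned to its own output tokens, collected over several
    stochastic (dropout-enabled) forward passes of the same output.

    token_logprob_samples: shape (n_passes, n_tokens).
    Returns (mean_score, uncertainty): a higher mean and a lower
    variance across passes both suggest higher translation quality."""
    samples = np.asarray(token_logprob_samples, dtype=float)
    sent_scores = samples.mean(axis=1)      # per-pass sentence-level score
    return sent_scores.mean(), sent_scores.var()

# Hypothetical numbers: a confident decoding pattern vs. a noisy one.
confident = [[-0.1, -0.2, -0.1], [-0.1, -0.2, -0.2]]
uncertain = [[-0.5, -2.0, -0.3], [-2.5, -0.2, -1.8]]
m1, v1 = mc_dropout_qe(confident)
m2, v2 = mc_dropout_qe(uncertain)
assert m1 > m2 and v1 < v2   # confident output scores higher, varies less
```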
Alva-Manchego F, Scarton C, Specia L, 2020, Data-driven sentence simplification: survey and benchmark, Computational Linguistics, Vol: 46, Pages: 135-187, ISSN: 0891-2017
Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output, is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common data sets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.
Specia L, Barrault L, Caglayan O, et al., 2020, Grounded Sequence to Sequence Transduction, IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, Vol: 14, Pages: 577-591, ISSN: 1932-4553
Caglayan O, Ive J, Haralampieva V, et al., 2020, Simultaneous Machine Translation with Visual Context, Conference on Empirical Methods in Natural Language Processing (EMNLP), Publisher: Association for Computational Linguistics, Pages: 2350-2361
- Citations: 7
Ive J, Specia L, Szoc S, et al., 2020, A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?, 12th International Conference on Language Resources and Evaluation (LREC), Publisher: European Language Resources Association (ELRA), Pages: 3692-3697
- Citations: 3
Okabe S, Blain F, Specia L, 2020, Multimodal Quality Estimation for Machine Translation, 58th Annual Meeting of the Association for Computational Linguistics (ACL), Publisher: Association for Computational Linguistics, Pages: 1233-1240
Fomicheva M, Specia L, Guzman F, 2020, Multi-Hypothesis Machine Translation Evaluation, 58th Annual Meeting of the Association for Computational Linguistics (ACL), Publisher: Association for Computational Linguistics, Pages: 1218-1232
- Citations: 6
Scarton C, Madhyastha P, Specia L, 2020, Deciding When, How and for Whom to Simplify, 24th European Conference on Artificial Intelligence (ECAI), Publisher: IOS Press, Pages: 2172-2179, ISSN: 0922-6389
- Citations: 1
Li Z, Fomicheva M, Specia L, 2020, Exploring Model Consensus to Generate Translation Paraphrases, 4th Workshop on Neural Generation and Translation, Publisher: Association for Computational Linguistics, Pages: 161-168
Sun S, Fomicheva M, Blain F, et al., 2020, An Exploratory Study on Multilingual Quality Estimation, 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics / 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP), Publisher: Association for Computational Linguistics, Pages: 366-377
- Citations: 2
Alva-Manchego F, Martin L, Bordes A, et al., 2020, ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations, 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Pages: 4668-4679
- Citations: 14
Sulubacak U, Caglayan O, Grönroos S-A, et al., 2019, Multimodal machine translation through visuals and speech, Publisher: arXiv
Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
Ive J, Madhyastha P, Specia L, 2019, Deep copycat networks for text-to-text generation, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Pages: 3225-3234
Most text-to-text generation tasks, for example text summarisation and text simplification, require copying words from the input to the output. We introduce Copycat, a transformer-based pointer network for such tasks which obtains competitive results in abstractive text summarisation and generates more abstractive summaries. We propose a further extension of this architecture for automatic post-editing, where generation is conditioned over two inputs (source language and machine translation), and the model is capable of deciding where to copy information from. This approach achieves competitive performance when compared to state-of-the-art automated post-editing systems. More importantly, we show that it addresses a well-known limitation of automatic post-editing - overcorrecting translations - and that our novel mechanism for copying source language words improves the results.
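The copying behaviour central to pointer networks of this kind can be sketched with the standard pointer-generator mixture: a gate interpolates between the decoder's vocabulary distribution and a copy distribution built from attention over the source. This is the generic formulation, not Copycat's exact architecture; all names and numbers are illustrative:

```python
import numpy as np

def pointer_mixture(p_gen, vocab_dist, attn_weights, src_token_ids, vocab_size):
    """Generic pointer-generator output layer: mix the decoder's
    vocabulary distribution with a copy distribution obtained by
    scattering attention weights onto the source token ids."""
    final = p_gen * np.asarray(vocab_dist, dtype=float)
    copy = np.zeros(vocab_size)
    for w, tok in zip(attn_weights, src_token_ids):
        copy[tok] += w                      # the same source word may repeat
    return final + (1.0 - p_gen) * copy

vocab_dist = [0.7, 0.2, 0.1, 0.0]           # decoder alone prefers token 0
out = pointer_mixture(0.5, vocab_dist, [0.9, 0.1], [3, 2], vocab_size=4)
assert abs(out.sum() - 1.0) < 1e-9          # result is still a distribution
assert out[3] > vocab_dist[3]               # copying boosts source token 3
```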
Li Z, Specia L, 2019, Improving neural machine translation robustness via data augmentation: beyond back-translation, Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Publisher: Association for Computational Linguistics, Pages: 328-336
Neural Machine Translation (NMT) models have proved strong when translating clean texts, but they are very sensitive to noise in the input. Improving the robustness of NMT models can be seen as a form of "domain" adaptation to noise. The recently created Machine Translation on Noisy Text task corpus provides noisy-clean parallel data for a few language pairs, but this data is very limited in size and diversity. The state-of-the-art approaches are heavily dependent on large volumes of back-translated data. This paper makes two main contributions: first, we propose new data augmentation methods to extend limited noisy data and further improve NMT robustness to noise while keeping the models small. Second, we explore the effect of utilizing noise from external data in the form of speech transcripts and show that it can help robustness.
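As a rough illustration of noise-oriented data augmentation (not the paper's specific methods), one can synthesise noisy variants of clean source sentences with character-level perturbations such as deletions and adjacent swaps:

```python
import random

def add_char_noise(sentence, p=0.1, seed=0):
    """Illustrative noise injector: with probability p/2 delete a
    non-space character, otherwise with probability p swap it with
    the next character. Deterministic for a fixed seed, so augmented
    corpora are reproducible."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p / 2 and chars[i] != " ":          # deletion
            i += 1
            continue
        if r < p and i + 1 < len(chars):           # adjacent swap
            out.extend([chars[i + 1], chars[i]])
            i += 2
            continue
        out.append(chars[i])                        # keep unchanged
        i += 1
    return "".join(out)

clean = "the model translates clean text"
assert add_char_noise(clean, p=0.0) == clean        # p=0 is the identity
assert add_char_noise(clean, p=0.3) != clean        # noise was injected
```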
Wu Z, Caglayan O, Ive J, et al., 2019, Transformer-based Cascaded Multimodal Speech Translation
This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the experiments, the MMT model varies in terms of the way of integrating the visual context (simple conditioning vs. attention), the type of visual features exploited (pooled, convolutional, action categories) and the underlying architecture. For the latter, we explore both the canonical transformer and its deliberation version with additive and cascade variants which differ in how they integrate the textual attention. Upon conducting extensive experiments, we found that (i) the explored visual integration schemes often harm the translation performance for the transformer and additive deliberation, but considerably improve the cascade deliberation; (ii) the transformer and cascade deliberation integrate the visual modality better than the additive deliberation, as shown by the incongruence analysis.
Fomicheva M, Specia L, 2019, Taking MT evaluation metrics to extremes: beyond correlation with human judgments, Computational Linguistics, Vol: 45, Pages: 515-558, ISSN: 0891-2017
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
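The paper's first finding, that metric reliability varies with translation quality, can be illustrated by contrasting a global correlation with correlations computed inside quality bands. The sketch below is a simplified stand-in for the paper's local dependency measure; the scores are invented:

```python
import numpy as np

def pearson(x, y):
    """Pearson's r between two score vectors."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def banded_correlation(metric, human, n_bands=2):
    """Split sentences into quality bands by human score and report
    Pearson's r inside each band, exposing metrics that only rank
    across, not within, quality levels."""
    order = np.argsort(human)
    bands = np.array_split(order, n_bands)
    return [pearson(np.take(metric, b), np.take(human, b)) for b in bands]

human  = [1, 2, 3, 4, 5, 6, 7, 8]     # hypothetical human quality scores
metric = [1, 3, 2, 4, 5, 6, 7, 8]     # metric is noisy at the low-quality end
low, high = banded_correlation(metric, human)
assert low < high                      # weaker agreement on poor translations
assert pearson(metric, human) > 0.9    # yet the global correlation looks fine
```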
Madhyastha P, Wang J, Specia L, 2019, End-to-end image captioning exploits multimodal distributional similarity
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the 'image' side of image captioning, and vary the input image representation but keep the RNN text generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our findings indicate that our distributional similarity hypothesis holds. We conclude that regardless of the image representation used image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.
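The distributional-similarity hypothesis can be caricatured as nearest-neighbour retrieval in a joint embedding space: if captioning reduces to matching a test image to nearby training images, cosine nearest-neighbour lookup already yields a plausible caption. The vectors and captions below are invented for illustration:

```python
import numpy as np

def nearest_training_caption(test_vec, train_vecs, train_captions):
    """Return the caption of the training image whose embedding has
    the highest cosine similarity to the test image embedding."""
    t = np.asarray(test_vec, dtype=float)
    X = np.asarray(train_vecs, dtype=float)
    sims = X @ t / (np.linalg.norm(X, axis=1) * np.linalg.norm(t))
    return train_captions[int(np.argmax(sims))]

train_vecs = [[1.0, 0.0], [0.0, 1.0]]             # toy joint-space embeddings
captions = ["a dog on grass", "a plate of food"]
caption = nearest_training_caption([0.9, 0.1], train_vecs, captions)
assert caption == "a dog on grass"
```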
Wang J, Specia L, 2019, Phrase Localization Without Paired Training Examples, Publisher: IEEE Computer Society
- Citations: 29
Lala C, Madhyastha P, Specia L, 2019, Grounded Word Sense Translation, Proceedings of the Second Workshop on Shortcomings in Vision and Language, Publisher: Association for Computational Linguistics
Chow J, Specia L, Madhyastha P, 2019, WMDO: Fluency-based Word Mover's Distance for Machine Translation Evaluation, Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, Pages: 494-500
Caglayan O, Wu Z, Madhyastha P, et al., 2019, Imperial College London Submission to VATEX Video Captioning Task
Li Z, Specia L, 2019, A Comparison on Fine-grained Pre-trained Embeddings for the WMT19 Chinese-English News Translation Task, 4th Conference on Machine Translation (WMT), Publisher: Association for Computational Linguistics, Pages: 249-256
Alva-Manchego F, Martin L, Scarton C, et al., 2019, EASSE: Easier Automatic Sentence Simplification Evaluation, 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019): Proceedings of System Demonstrations, Pages: 49-54
- Citations: 11
Wu Z, Ive J, Wang J, et al., 2019, Predicting Actions to Help Predict Translations
Wang Z, Ive J, Velupillai S, et al., 2019, Is artificial data useful for biomedical Natural Language Processing algorithms?, SIGBIOMED Workshop on Biomedical Natural Language Processing (BioNLP 2019), Pages: 240-249
- Citations: 4
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.