Imperial College London

Professor Lucia Specia

Faculty of Engineering, Department of Computing

Chair in Natural Language Processing
 
 
 

Contact

 

Email: l.specia · Website

 
 

Location

 

572a Huxley Building, South Kensington Campus



Publications


155 results found

Specia L, Wang J, Lee SJ, Ostapenko A, Madhyastha P et al., 2021, Read, spot and translate, Machine Translation, ISSN: 0922-6567

Journal article

Citamak B, Caglayan O, Kuyu M, Erdem E, Erdem A, Madhyastha P, Specia L et al., 2020, MSVD-Turkish: A comprehensive multimodal dataset for integrated vision and language research in Turkish, Publisher: arXiv

Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English-Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphologically rich and agglutinative languages.

Working paper
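As a concrete illustration of the word segmentation approaches compared in this paper, the sketch below implements a toy byte-pair encoding (BPE) learner, one standard subword method for agglutinative languages. Real pipelines typically use libraries such as subword-nmt or SentencePiece; the training words here are illustrative only.

```python
# Self-contained toy byte-pair-encoding (BPE) sketch, one of the subword
# segmentation approaches typically compared for morphologically rich
# languages like Turkish (real systems use subword-nmt or SentencePiece).
from collections import Counter

def merge_pair(spaced_word, a, b):
    # Merge adjacent symbols a b -> ab at the symbol level.
    symbols, out, i = spaced_word.split(), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return " ".join(out)

def learn_bpe(words, n_merges):
    vocab = Counter(" ".join(w) for w in words)   # words as spaced symbols
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, count in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab = Counter({merge_pair(w, a, b): c for w, c in vocab.items()})
    return merges

def segment(word, merges):
    spaced = " ".join(word)
    for a, b in merges:
        spaced = merge_pair(spaced, a, b)
    return spaced.split()

# Agglutinative forms share stems and suffixes that BPE can isolate.
train = ["evler", "evlerimiz", "evlerimizden", "okullar", "okullardan"]
merges = learn_bpe(train, n_merges=8)
print(segment("evlerimizden", merges))   # subword units, e.g. stem + suffixes
```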

Zhong Y, Xie L, Wang S, Specia L, Miao Y et al., 2020, Watch and learn: mapping language and noisy real-world videos with self-supervision, Publisher: arXiv

In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. Firstly, we define a self-supervised learning framework that captures the cross-modal information. A novel adversarial learning module is then introduced to explicitly handle the noise in the natural videos, where the subtitle sentences are not guaranteed to correspond strongly to the video snippets. For training and evaluation, we contribute a new dataset `ApartmenTour' that contains a large number of online videos and subtitles. We carry out experiments on the bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves the state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset will be released soon.

Working paper

Caglayan O, Madhyastha P, Specia L, 2020, Curious case of language generation evaluation metrics: a cautionary tale, Publisher: arXiv

Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models by demonstrating important failure cases on multiple datasets, language pairs and tasks. Our experiments show that metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, (iii) can yield surprisingly high scores when given a single sentence as system output for the entire test set.

Working paper
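Failure case (iii) above is straightforward to probe with standard tooling. The sketch below sets up the degenerate evaluation the authors describe, using sacrebleu on toy data; the toy scores will not reproduce the paper's findings, the snippet only shows the evaluation setup being probed.

```python
# Sketch of failure case (iii): score a single sentence repeated for every
# test segment. Toy reference data; the point is the setup, not the numbers.
import sacrebleu

references = [
    "the cat sat on the mat .",
    "a man is riding a horse .",
    "she bought three red apples .",
]

# Degenerate "system": one fixed output for the whole test set.
hypotheses = ["the man is on the mat ."] * len(references)

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"corpus BLEU for a constant output: {bleu.score:.1f}")
```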

Lertvittayakumjorn P, Specia L, Toni F, 2020, FIND: human-in-the-loop debugging deep text classifiers

Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).

Working paper

Fomicheva M, Sun S, Fonseca E, Blain F, Chaudhary V, Guzmán F, Lopatina N, Specia L, Martins AFT et al., 2020, MLQE-PE: A multilingual quality estimation and post-editing dataset, Publisher: arXiv

We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains seven language pairs, with human labels for 9,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.

Working paper

Lertvittayakumjorn P, Specia L, Toni F, 2020, FIND: Human-in-the-Loop Debugging Deep Text Classifiers, 2020 Conference on Empirical Methods in Natural Language Processing, Publisher: ACL

Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND, a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).

Conference paper
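The core mechanism, disabling hidden features chosen by a human debugger, can be sketched as a binary mask applied to a CNN text classifier's feature layer. The toy model below is a hypothetical stand-in, not the authors' implementation; dimensions and the masked feature index are illustrative.

```python
# Sketch of the general idea behind FIND: a human-chosen binary mask
# disables hidden features of a (toy) CNN text classifier before the final
# layer. Dimensions and the mask itself are hypothetical.
import torch
import torch.nn as nn

class MaskedCNNClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, n_filters=16, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters, n_classes)
        # 1 = keep feature, 0 = disabled by the human debugger.
        self.register_buffer("mask", torch.ones(n_filters))

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)        # (batch, emb, seq)
        feats = torch.relu(self.conv(x)).max(dim=2).values  # (batch, n_filters)
        return self.out(feats * self.mask)          # masked features only

model = MaskedCNNClassifier()
model.mask[3] = 0.0   # e.g. a feature found to track a spurious bias
logits = model(torch.randint(0, 1000, (2, 20)))
print(logits.shape)   # torch.Size([2, 2])
```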

Fomicheva M, Sun S, Yankovskaya L, Blain F, Guzmán F, Fishel M, Aletras N, Chaudhary V, Specia L et al., 2020, Unsupervised quality estimation for neural machine translation, Transactions of the Association for Computational Linguistics, Vol: 8, Pages: 539-555, ISSN: 2307-387X

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it is aimed to inform the user on the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.

Journal article
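One family of glass-box uncertainty signals exploited in this line of work is Monte Carlo dropout: scoring the same translation repeatedly with dropout active and using the spread of log-probabilities as a quality proxy. The sketch below illustrates that idea with a hypothetical stand-in scorer, score_translation, in place of a real MT model.

```python
# Sketch of one glass-box uncertainty signal: score the same translation
# several times with dropout active at test time, and use the spread of
# its log-probability as a quality proxy. `score_translation` is a
# hypothetical stand-in for a real MT model with dropout enabled.
import random
import statistics

def score_translation(source: str, hypothesis: str) -> float:
    # Placeholder: repeated calls return slightly different log-probs,
    # as they would under Monte Carlo dropout.
    return -5.0 + random.gauss(0.0, 0.5)

def mc_dropout_quality(source: str, hypothesis: str, n_samples: int = 30):
    scores = [score_translation(source, hypothesis) for _ in range(n_samples)]
    # High spread across stochastic passes signals low model confidence.
    return statistics.mean(scores), statistics.stdev(scores)

mean_lp, spread = mc_dropout_quality("ein kleines Beispiel", "a small example")
print(f"mean log-prob {mean_lp:.2f}, MC-dropout spread {spread:.2f}")
```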

Caglayan O, Ive J, Haralampieva V, Madhyastha P, Barrault L, Specia L et al., 2020, Simultaneous machine translation with visual context, Transactions of the Association for Computational Linguistics, Vol: 8, Pages: 539-555, ISSN: 2307-387X

Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.

Journal article
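For readers unfamiliar with simultaneous decoding, the wait-k policy is a common way to schedule the read/write decisions the abstract alludes to. The sketch below shows its shape with a hypothetical translate_step in place of a real incremental decoder; the frameworks evaluated in the paper are more sophisticated.

```python
# Sketch of the wait-k policy, a common simultaneous-MT schedule: read k
# source tokens before writing, then alternate one READ with one WRITE.
# `translate_step` stands in for a real incremental decoder.
def translate_step(source_prefix, target_prefix):
    # Hypothetical: emit the next target token given the source seen so far.
    return f"tok{len(target_prefix)}"

def wait_k_decode(source_stream, k=3, max_len=20):
    source_prefix, target = [], []
    for token in source_stream:          # tokens arrive one at a time
        source_prefix.append(token)      # READ
        if len(source_prefix) >= k:      # after the initial wait...
            target.append(translate_step(source_prefix, target))  # WRITE
    while len(target) < len(source_prefix) and len(target) < max_len:
        target.append(translate_step(source_prefix, target))  # finish up
    return target

print(wait_k_decode(["the", "cat", "sat", "on", "the", "mat"], k=3))
```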

Fomicheva M, Sun S, Yankovskaya L, Blain F, Guzmán F, Fishel M, Aletras N, Chaudhary V, Specia L et al., 2020, Unsupervised quality estimation for neural machine translation, Publisher: arXiv

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it is aimed to inform the user on the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.

Working paper

Specia L, Barrault L, Caglayan O, Duarte A, Elliott D, Gella S, Holzenberger N, Lala C, Lee SJ, Libovicky J, Madhyastha P, Metze F, Mulligan K, Ostapenko A, Palaskar S, Sanabria R, Wang J, Arora R et al., 2020, Grounded Sequence to Sequence Transduction, IEEE Journal of Selected Topics in Signal Processing, Vol: 14, Pages: 577-591, ISSN: 1932-4553

Journal article

Alva-Manchego F, Scarton C, Specia L, 2020, Data-driven sentence simplification: survey and benchmark, Computational Linguistics, Vol: 46, Pages: 135-187, ISSN: 0891-2017

Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output, is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common data sets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.

Journal article

Li Z, Fomicheva M, Specia L, 2020, Exploring Model Consensus to Generate Translation Paraphrases, 4th Workshop on Neural Generation and Translation, Publisher: Association for Computational Linguistics, Pages: 161-168

Conference paper

Alva-Manchego F, Martin L, Bordes A, Scarton C, Sagot B, Specia L et al., 2020, ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations, 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Pages: 4668-4679

Conference paper

Fomicheva M, Specia L, Guzmán F, 2020, Multi-Hypothesis Machine Translation Evaluation, 58th Annual Meeting of the Association for Computational Linguistics (ACL), Publisher: Association for Computational Linguistics, Pages: 1218-1232

Conference paper

Okabe S, Blain F, Specia L, 2020, Multimodal Quality Estimation for Machine Translation, 58th Annual Meeting of the Association for Computational Linguistics (ACL), Publisher: Association for Computational Linguistics, Pages: 1233-1240

Conference paper

Alva-Manchego F, Scarton C, Martin L, Specia L et al., 2020, EASSE: Easier automatic sentence simplification evaluation, EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Proceedings of System Demonstrations, Pages: 49-54

We introduce EASSE, a Python package aiming to facilitate and standardise automatic evaluation and comparison of Sentence Simplification (SS) systems. EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g. SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g. compression ratio), and standard test data for SS evaluation (e.g. TurkCorpus). Finally, EASSE generates easy-to-visualise reports on the various metrics and features above and on how a particular SS output fares against reference simplifications. Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems.

Conference paper
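A hedged usage sketch of the workflow EASSE supports is given below; the exact function names, signatures, and the shape of the reference data are assumptions that may differ across package versions, and the sentences are toy examples.

```python
# Hedged usage sketch for EASSE: function names, signatures and the shape
# of `refs_sents` are assumptions that may differ across package versions.
from easse.sari import corpus_sari
from easse.fkgl import corpus_fkgl

orig_sents = ["The cat perched itself upon the mat."]
sys_sents = ["The cat sat on the mat."]
# Assumed shape: one list of aligned sentences per reference set.
refs_sents = [["The cat sat on the mat."]]

print("SARI:", corpus_sari(orig_sents=orig_sents,
                           sys_sents=sys_sents,
                           refs_sents=refs_sents))
print("FKGL:", corpus_fkgl(sys_sents))
```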

Ive J, Specia L, Szoc S, Vanallemeersch T, van den Bogaert J, Farah E, Maroti C, Ventura A, Khalilov M et al., 2020, A post-editing dataset in the legal domain: Do we underestimate neural machine translation quality?, Pages: 3692-3697

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.

Conference paper

Scarton C, Madhyastha P, Specia L, 2020, Deciding When, How and for Whom to Simplify, 24th European Conference on Artificial Intelligence (ECAI), Publisher: IOS Press, Pages: 2172-2179, ISSN: 0922-6389

Conference paper

Sulubacak U, Caglayan O, Grönroos S-A, Rouhe A, Elliott D, Specia L, Tiedemann J et al., 2019, Multimodal machine translation through visuals and speech, Publisher: arXiv

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Working paper

Ive J, Madhyastha P, Specia L, 2019, Deep copycat networks for text-to-text generation, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Pages: 3225-3234

Most text-to-text generation tasks, for example text summarisation and text simplification, require copying words from the input to the output. We introduce Copycat, a transformer-based pointer network for such tasks which obtains competitive results in abstractive text summarisation and generates more abstractive summaries. We propose a further extension of this architecture for automatic post-editing, where generation is conditioned over two inputs (source language and machine translation), and the model is capable of deciding where to copy information from. This approach achieves competitive performance when compared to state-of-the-art automated post-editing systems. More importantly, we show that it addresses a well-known limitation of automatic post-editing - overcorrecting translations - and that our novel mechanism for copying source language words improves the results.

Conference paper
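Copycat belongs to the pointer-generator family, in which the output distribution mixes a vocabulary distribution with a copy distribution derived from attention over the source. The sketch below shows that generic mixing step, not the authors' exact architecture; all tensors are random toy data.

```python
# Generic pointer-generator mixing step (a sketch of the copy-mechanism
# family Copycat belongs to, not the authors' exact architecture).
import torch

def mix_copy_distribution(vocab_logits, attn_weights, src_ids, p_gen):
    # vocab_logits: (batch, vocab); attn_weights: (batch, src_len);
    # src_ids: (batch, src_len) vocabulary ids of source tokens;
    # p_gen: (batch, 1) probability of generating vs copying.
    vocab_dist = torch.softmax(vocab_logits, dim=-1) * p_gen
    # Scatter attention mass onto the vocabulary ids of the source tokens,
    # so the model can "copy" words it attends to.
    copy_dist = torch.zeros_like(vocab_dist)
    copy_dist.scatter_add_(1, src_ids, attn_weights * (1.0 - p_gen))
    return vocab_dist + copy_dist

vocab_logits = torch.randn(2, 100)
attn = torch.softmax(torch.randn(2, 7), dim=-1)
src_ids = torch.randint(0, 100, (2, 7))
p_gen = torch.sigmoid(torch.randn(2, 1))
probs = mix_copy_distribution(vocab_logits, attn, src_ids, p_gen)
print(probs.sum(dim=-1))   # ~1.0 per example: a valid distribution
```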

Li Z, Specia L, 2019, Improving neural machine translation robustness via data augmentation: beyond back-translation, Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Publisher: Association for Computational Linguistics, Pages: 328-336

Neural Machine Translation (NMT) models have been proved strong when translating clean texts, but they are very sensitive to noise in the input. Improving NMT models' robustness can be seen as a form of “domain” adaptation to noise. The recently created Machine Translation on Noisy Text task corpus provides noisy-clean parallel data for a few language pairs, but this data is very limited in size and diversity. The state-of-the-art approaches are heavily dependent on large volumes of back-translated data. This paper has two main contributions: firstly, we propose new data augmentation methods to extend limited noisy data and further improve NMT robustness to noise while keeping the models small. Secondly, we explore the effect of utilizing noise from external data in the form of speech transcripts and show that it could help robustness.

Conference paper
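The sketch below gives one generic instance of the synthetic-noise augmentation family this paper explores, character-level corruption of clean training sentences; the paper's own augmentation methods differ in detail.

```python
# Sketch of synthetic noise injection for robustness training: one generic
# instance of the augmentation family the paper explores (the paper's own
# methods differ in detail).
import random

def add_char_noise(sentence: str, p: float = 0.1) -> str:
    # Randomly drop, double or case-flip characters with probability p.
    out = []
    for ch in sentence:
        r = random.random()
        if r < p / 3:
            continue                    # deletion
        elif r < 2 * p / 3:
            out.append(ch + ch)         # doubling
        elif r < p:
            out.append(ch.swapcase())   # casing noise
        else:
            out.append(ch)
    return "".join(out)

random.seed(0)
clean = "this restaurant was really great"
print(add_char_noise(clean))   # a noisy variant paired with `clean`
```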

Fomicheva M, Specia L, 2019, Taking MT evaluation metrics to extremes: beyond correlation with human judgments, Computational Linguistics, Vol: 45, Pages: 515-558, ISSN: 0891-2017

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.

Journal article
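The paper's local dependency measure replaces a single global correlation with a quality-level-sensitive analysis. As a simplified stand-in for that idea, the sketch below computes Pearson correlation separately within quality bins on synthetic data, illustrating how metric reliability can vary across quality levels.

```python
# Sketch of the diagnostic idea: instead of one global correlation, check
# metric-human agreement separately at different quality levels. Binned
# Pearson is a simple stand-in for the paper's local dependency measure.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
human = rng.uniform(0, 100, 500)                    # toy human scores
metric = human + rng.normal(0, 25 - 0.2 * human)    # noisier at low quality

order = np.argsort(human)
for name, idx in [("low-quality third", order[:166]),
                  ("mid third", order[166:333]),
                  ("high-quality third", order[333:])]:
    r, _ = pearsonr(human[idx], metric[idx])
    print(f"{name}: Pearson r = {r:.2f}")
```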

Wang J, Specia L, 2019, Phrase Localization Without Paired Training Examples, Publisher: IEEE Computer Society

Working paper

Wang Z, Ive J, Velupillai S, Specia L et al., 2019, Is artificial data useful for biomedical Natural Language Processing algorithms?, SIGBioMed Workshop on Biomedical Natural Language Processing (BioNLP 2019), Pages: 240-249

Conference paper

Scarton C, Paetzold GH, Specia L, 2019, Text simplification from professionally produced corpora, Pages: 3504-3510

The lack of large and reliable datasets has been hindering progress in Text Simplification (TS). We investigate the application of the recently created Newsela corpus, the largest collection of professionally written simplifications available, in TS tasks. Using new alignment algorithms, we extract 550,644 complex-simple sentence pairs from the corpus. This data is explored in different ways: (i) we show that traditional readability metrics capture surprisingly well the different complexity levels in this corpus, (ii) we build machine learning models to classify sentences into complex vs. simple and to predict complexity levels that outperform their respective baselines, (iii) we introduce a lexical simplifier that uses the corpus to generate candidate simplifications and outperforms the state-of-the-art approaches, and (iv) we show that the corpus can be used to learn sentence simplification patterns in more effective ways than corpora used in previous work.

Conference paper
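Finding (i) above concerns traditional readability metrics. The sketch below computes one of them, Flesch-Kincaid Grade Level, with the standard formula and a crude vowel-group syllable heuristic; it is an illustration of the metric family, not the paper's exact setup.

```python
# Sketch of one traditional readability metric, Flesch-Kincaid Grade Level,
# using the standard formula with a crude vowel-group syllable heuristic.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

complex_s = "The municipality promulgated comprehensive regulations."
simple_s = "The town made new rules. They cover many things."
print(f"complex: {fkgl(complex_s):.1f}, simple: {fkgl(simple_s):.1f}")
```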

Li Z, Specia L, 2019, A Comparison on Fine-grained Pre-trained Embeddings for the WMT19 Chinese-English News Translation Task, 4th Conference on Machine Translation (WMT), Publisher: Association for Computational Linguistics, Pages: 249-256

Conference paper

Lala C, Specia L, 2019, Multimodal lexical translation, Pages: 3810-3817

Inspired by the tasks of Multimodal Machine Translation and Visual Sense Disambiguation we introduce a task called Multimodal Lexical Translation (MLT). The aim of this new task is to correctly translate an ambiguous word given its context - an image and a sentence in the source language. To facilitate the task, we introduce the MLT dataset, where each data point is a 4-tuple consisting of an ambiguous source word, its visual context (an image), its textual context (a source sentence), and its translation that conforms with the visual and textual contexts. The dataset has been created from the Multi30K corpus using word-alignment followed by human inspection for translations from English to German and English to French. We also introduce a simple heuristic to quantify the extent of the ambiguity of a word from the distribution of its translations and use it to select subsets of the MLT Dataset which are difficult to translate. These form a valuable multimodal and multilingual language resource with several potential uses including evaluation of lexical disambiguation within (Multimodal) Machine Translation systems.

Conference paper
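The "simple heuristic" for quantifying word ambiguity from its translation distribution is not fully specified in the abstract; entropy over that distribution is one natural instantiation, sketched below with toy alignment counts (the paper's exact formulation may differ).

```python
# Sketch of quantifying lexical ambiguity from a word's translation
# distribution; entropy is one natural instantiation of such a heuristic
# (the paper's exact formulation may differ). Toy counts are illustrative.
import math
from collections import Counter

def translation_entropy(translation_counts: Counter) -> float:
    total = sum(translation_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in translation_counts.values())

# "hat" (EN) aligned to German translations across a toy corpus:
unambiguous = Counter({"Hut": 48, "Mütze": 2})
# "bank" (EN), split across several senses:
ambiguous = Counter({"Bank": 20, "Ufer": 18, "Sitzbank": 12})

print(f"low ambiguity:  {translation_entropy(unambiguous):.2f} bits")
print(f"high ambiguity: {translation_entropy(ambiguous):.2f} bits")
```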

Scarton C, Paetzold GH, Specia L, 2019, Simpa: A sentence-level simplification corpus for the public administration domain, Pages: 4333-4338

We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process. Firstly, annotators were asked to simplify only words and phrases (lexical simplification). Each sentence was simplified by three annotators. Secondly, one lexically simplified version of each original sentence was further simplified at the syntactic level. In its current version there are 3,300 lexically simplified sentences plus 1,100 syntactically simplified sentences. The corpus will be used for evaluation of text simplification approaches in the scope of the EU H2020 SIMPATICO project - which focuses on accessibility of e-services in the PA domain - and beyond. The main advantage of this corpus is that lexical and syntactic simplifications can be analysed and used in isolation. The lexically simplified corpus is also multi-reference (three different simplifications per original sentence). This is an ongoing effort and our final aim is to collect manual simplifications for the entire set of original sentences, with over 10K sentences.

Conference paper

Madhyastha P, Wang J, Specia L, 2019, End-to-end image captioning exploits multimodal distributional similarity

We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the 'image' side of image captioning, and vary the input image representation but keep the RNN text generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our findings indicate that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation used, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.

Conference paper

