Citamak B, Caglayan O, Kuyu M, et al., 2020, MSVD-Turkish: A comprehensive multimodal dataset for integrated vision and language research in Turkish, Publisher: arXiv
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large-scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English-Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphologically rich and agglutinative languages.
Zhong Y, Xie L, Wang S, et al., 2020, Watch and learn: mapping language and noisy real-world videos withself-supervision, Publisher: arXiv
In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. Firstly, we define a self-supervised learning framework that captures the cross-modal information. A novel adversarial learning module is then introduced to explicitly handle the noise in the natural videos, where the subtitle sentences are not guaranteed to strongly correspond to the video snippets. For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles. We carry out experiments on the bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset will be released soon.
Caglayan O, Madhyastha P, Specia L, 2020, Curious case of language generation evaluation metrics: a cautionary tale, Publisher: arXiv
Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community to consider more carefully how they automatically evaluate their models, by demonstrating important failure cases on multiple datasets, language pairs and tasks. Our experiments show that metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, and (iii) can yield surprisingly high scores when given a single sentence as system output for the entire test set.
Lertvittayakumjorn P, Specia L, Toni F, 2020, FIND: human-in-the-loop debugging deep text classifiers
Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).
Fomicheva M, Sun S, Fonseca E, et al., 2020, MLQE-PE: A multilingual quality estimation and post-editing dataset, Publisher: arXiv
We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains seven language pairs, with human labels for 9,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.
Lertvittayakumjorn P, Specia L, Toni F, 2020, FIND: Human-in-the-Loop Debugging Deep Text Classifiers, 2020 Conference on Empirical Methods in Natural Language Processing, Publisher: ACL
Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).
Fomicheva M, Sun S, Yankovskaya L, et al., 2020, Unsupervised quality estimation for neural machine translation, Transactions of the Association for Computational Linguistics, Vol: 8, Pages: 539-555, ISSN: 2307-387X
Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
Caglayan O, Ive J, Haralampieva V, et al., 2020, Simultaneous machine translation with visual context, Transactions of the Association for Computational Linguistics, Vol: 8, Pages: 539-555, ISSN: 2307-387X
Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.
Fomicheva M, Sun S, Yankovskaya L, et al., 2020, Unsupervised quality estimation for neural machine translation, Publisher: arXiv
Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
Alva-Manchego F, Scarton C, Specia L, 2020, Data-driven sentence simplification: survey and benchmark, Computational Linguistics, Vol: 46, Pages: 135-187, ISSN: 0891-2017
Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output, is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common data sets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.
Li Z, Fomicheva M, Specia L, 2020, Exploring Model Consensus to Generate Translation Paraphrases, 4th Workshop on Neural Generation and Translation, Publisher: ASSOC COMPUTATIONAL LINGUISTICS-ACL, Pages: 161-168
Alva-Manchego F, Martin L, Bordes A, et al., 2020, ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations, 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), Pages: 4668-4679
Fomicheva M, Specia L, Guzman F, 2020, Multi-Hypothesis Machine Translation Evaluation, 58th Annual Meeting of the Association-for-Computational-Linguistics (ACL), Publisher: ASSOC COMPUTATIONAL LINGUISTICS-ACL, Pages: 1218-1232
Okabe S, Blain F, Specia L, 2020, Multimodal Quality Estimation for Machine Translation, 58th Annual Meeting of the Association-for-Computational-Linguistics (ACL), Publisher: ASSOC COMPUTATIONAL LINGUISTICS-ACL, Pages: 1233-1240
Alva-Manchego F, Scarton C, Martin L, et al., 2020, EASSE: Easier automatic sentence simplification evaluation, EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Proceedings of System Demonstrations, Pages: 49-54
We introduce EASSE, a Python package aiming to facilitate and standardise automatic evaluation and comparison of Sentence Simplification (SS) systems. EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g. SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g. compression ratio), and standard test data for SS evaluation (e.g. TurkCorpus). Finally, EASSE generates easy-to-visualise reports on the various metrics and features above and on how a particular SS output fares against reference simplifications. Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems.
Ive J, Specia L, Szoc S, et al., 2020, A post-editing dataset in the legal domain: Do we underestimate neural machine translation quality?, Pages: 3692-3697
We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.
Scarton C, Madhyastha P, Specia L, 2020, Deciding When, How and for Whom to Simplify, 24th European Conference on Artificial Intelligence (ECAI), Publisher: IOS PRESS, Pages: 2172-2179, ISSN: 0922-6389
Sulubacak U, Caglayan O, Grönroos S-A, et al., 2019, Multimodal machine translation through visuals and speech, Publisher: arXiv
Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
Ive J, Madhyastha P, Specia L, 2019, Deep copycat networks for text-to-text generation., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Pages: 3225-3234
Most text-to-text generation tasks, for example text summarisation and text simplification, require copying words from the input to the output. We introduce Copycat, a transformer-based pointer network for such tasks which obtains competitive results in abstractive text summarisation and generates more abstractive summaries. We propose a further extension of this architecture for automatic post-editing, where generation is conditioned over two inputs (source language and machine translation), and the model is capable of deciding where to copy information from. This approach achieves competitive performance when compared to state-of-the-art automated post-editing systems. More importantly, we show that it addresses a well-known limitation of automatic post-editing - overcorrecting translations - and that our novel mechanism for copying source language words improves the results.
Li Z, Specia L, 2019, Improving neural machine translation robustness via data augmentation: beyond back-translation, Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Publisher: Association for Computational Linguistics, Pages: 328-336
Neural Machine Translation (NMT) models have proven strong when translating clean texts, but they are very sensitive to noise in the input. Improving the robustness of NMT models can be seen as a form of “domain” adaptation to noise. The recently created Machine Translation on Noisy Text task corpus provides noisy-clean parallel data for a few language pairs, but this data is very limited in size and diversity. The state-of-the-art approaches are heavily dependent on large volumes of back-translated data. This paper has two main contributions: firstly, we propose new data augmentation methods to extend limited noisy data and further improve NMT robustness to noise while keeping the models small. Secondly, we explore the effect of utilizing noise from external data in the form of speech transcripts and show that it can help robustness.
Fomicheva M, Specia L, 2019, Taking MT evaluation metrics to extremes: beyond correlation with human judgments, Computational Linguistics, Vol: 45, Pages: 515-558, ISSN: 0891-2017
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
Wang J, Specia L, 2019, Phrase Localization Without Paired Training Examples, Publisher: IEEE COMPUTER SOC
Wang Z, Ive J, Velupillai S, et al., 2019, Is artificial data useful for biomedical Natural Language Processing algorithms?, SIGBIOMED WORKSHOP ON BIOMEDICAL NATURAL LANGUAGE PROCESSING (BIONLP 2019), Pages: 240-249
Scarton C, Paetzold GH, Specia L, 2019, Text simplification from professionally produced corpora, Pages: 3504-3510
The lack of large and reliable datasets has been hindering progress in Text Simplification (TS). We investigate the application of the recently created Newsela corpus, the largest collection of professionally written simplifications available, in TS tasks. Using new alignment algorithms, we extract 550,644 complex-simple sentence pairs from the corpus. This data is explored in different ways: (i) we show that traditional readability metrics capture surprisingly well the different complexity levels in this corpus, (ii) we build machine learning models to classify sentences into complex vs. simple and to predict complexity levels that outperform their respective baselines, (iii) we introduce a lexical simplifier that uses the corpus to generate candidate simplifications and outperforms the state-of-the-art approaches, and (iv) we show that the corpus can be used to learn sentence simplification patterns in more effective ways than corpora used in previous work.
Li Z, Specia L, 2019, A Comparison on Fine-grained Pre-trained Embeddings for the WMT19 Chinese-English News Translation Task, 4th Conference on Machine Translation (WMT), Publisher: ASSOC COMPUTATIONAL LINGUISTICS-ACL, Pages: 249-256
Lala C, Specia L, 2019, Multimodal lexical translation, Pages: 3810-3817
Inspired by the tasks of Multimodal Machine Translation and Visual Sense Disambiguation we introduce a task called Multimodal Lexical Translation (MLT). The aim of this new task is to correctly translate an ambiguous word given its context - an image and a sentence in the source language. To facilitate the task, we introduce the MLT dataset, where each data point is a 4-tuple consisting of an ambiguous source word, its visual context (an image), its textual context (a source sentence), and its translation that conforms with the visual and textual contexts. The dataset has been created from the Multi30K corpus using word-alignment followed by human inspection for translations from English to German and English to French. We also introduce a simple heuristic to quantify the extent of the ambiguity of a word from the distribution of its translations and use it to select subsets of the MLT Dataset which are difficult to translate. These form a valuable multimodal and multilingual language resource with several potential uses including evaluation of lexical disambiguation within (Multimodal) Machine Translation systems.
Scarton C, Henrique Paetzold G, Specia L, 2019, Simpa: A sentence-level simplification corpus for the public administration domain, Pages: 4333-4338
We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process. Firstly, annotators were asked to simplify only words and phrases (lexical simplification). Each sentence was simplified by three annotators. Secondly, one lexically simplified version of each original sentence was further simplified at the syntactic level. In its current version there are 3,300 lexically simplified sentences plus 1,100 syntactically simplified sentences. The corpus will be used for evaluation of text simplification approaches in the scope of the EU H2020 SIMPATICO project - which focuses on accessibility of e-services in the PA domain - and beyond. The main advantage of this corpus is that lexical and syntactic simplifications can be analysed and used in isolation. The lexically simplified corpus is also multi-reference (three different simplifications per original sentence). This is an ongoing effort and our final aim is to collect manual simplifications for the entire set of original sentences, with over 10K sentences.
Madhyastha P, Wang J, Specia L, 2019, End-to-end image captioning exploits multimodal distributional similarity
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the 'image' side of image captioning, and vary the input image representation but keep the RNN text generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our findings indicate that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation used, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.