Publications

Conference paper

Sanguedolce G, Naylor PA, Geranmayeh F, 2023,

Uncovering the potential for a weakly supervised end-to-end model in recognising speech from patient with post-stroke aphasia

, 5th Clinical Natural Language Processing Workshop, Publisher: Association for Computational Linguistics, Pages: 182-190

Post-stroke speech and language deficits (aphasia) significantly impact patients' quality of life. Many with mild symptoms remain undiagnosed, and the majority do not receive the intensive doses of therapy recommended, due to healthcare costs and/or inadequate services. Automatic Speech Recognition (ASR) may help overcome these difficulties by improving diagnostic rates and providing feedback during tailored therapy. However, its performance is often unsatisfactory due to the high variability in speech errors and scarcity of training datasets. This study assessed the performance of Whisper, a recently released end-to-end model, in patients with post-stroke aphasia (PWA). We tuned its hyperparameters to achieve the lowest word error rate (WER) on aphasic speech. WER was significantly higher in PWA compared to age-matched controls (38.5% vs 10.3%, p < 0.001). We demonstrated that worse WER was related to more severe aphasia as measured by expressive (overt naming and spontaneous speech production) and receptive (written and spoken comprehension) language assessments. Stroke lesion size did not affect the performance of Whisper. Linear mixed models accounting for demographic factors, therapy duration, and time since stroke confirmed worse Whisper performance with left hemispheric frontal lesions. We discuss the implications of these findings for how future ASR can be improved in PWA.

Conference paper

Gaskell A, Miao Y, Toni F, Specia L et al., 2022,

Logically consistent adversarial attacks for soft theorem provers

, 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence, Publisher: International Joint Conferences on Artificial Intelligence, Pages: 4129-4135

Recent efforts within the AI community have yielded impressive results towards “soft theorem proving” over natural language sentences using language models. We propose a novel, generative adversarial framework for probing and improving these models’ reasoning capabilities. Adversarial attacks in this domain suffer from the logical inconsistency problem, whereby perturbations to the input may alter the label. Our Logically consistent AdVersarial Attacker, LAVA, addresses this by combining a structured generative process with a symbolic solver, guaranteeing logical consistency. Our framework successfully generates adversarial attacks and identifies global weaknesses common across multiple target models. Our analyses reveal naive heuristics and vulnerabilities in these models’ reasoning capabilities, exposing an incomplete grasp of logical deduction under logic programs. Finally, in addition to effective probing of these models, we show that training on the generated samples improves the target model’s performance.

Journal article

Liu Z, Peach R, Lawrance E, Noble A, Ungless M, Barahona M et al., 2021,

Listening to mental health crisis needs at scale: using Natural Language Processing to understand and evaluate a mental health crisis text messaging service

, Frontiers in Digital Health, Vol: 3, Pages: 1-14, ISSN: 2673-253X

The current mental health crisis is a growing public health issue requiring a large-scale response that cannot be met with traditional services alone. Digital support tools are proliferating, yet most are not systematically evaluated, and we know little about their users and their needs. Shout is a free mental health text messaging service run by the charity Mental Health Innovations, which provides support for individuals in the UK experiencing mental or emotional distress and seeking help. Here we study a large data set of anonymised text message conversations and post-conversation surveys compiled through Shout. This data provides an opportunity to hear at scale from those experiencing distress; to better understand mental health needs for people not using traditional mental health services; and to evaluate the impact of a novel form of crisis support. We use natural language processing (NLP) to assess the adherence of volunteers to conversation techniques and formats, and to gain insight into demographic user groups and their behavioural expressions of distress. Our textual analyses achieve accurate classification of conversation stages (weighted accuracy = 88%), behaviours (1-hamming loss = 95%) and texter demographics (weighted accuracy = 96%), exemplifying how the application of NLP to frontline mental health data sets can aid with post-hoc analysis and evaluation of quality of service provision in digital mental health services.

Conference paper

Kotonya N, Spooner T, Magazzeni D, Toni F et al., 2021,

Graph reasoning with context-aware linearization for interpretable fact extraction and verification

, FEVER 2021, Publisher: Association for Computational Linguistics, Pages: 21-30

This paper presents an end-to-end system for fact extraction and verification using textual and tabular evidence, the performance of which we demonstrate on the FEVEROUS dataset. We experiment with both a multi-task learning paradigm to jointly train a graph attention network for both the task of evidence extraction and veracity prediction, as well as a single objective graph model for solely learning veracity prediction and separate evidence extraction. In both instances, we employ a framework for per-cell linearization of tabular evidence, thus allowing us to treat evidence from tables as sequences. The templates we employ for linearizing tables capture the context as well as the content of table data. We furthermore provide a case study to show the interpretability of our approach. Our best performing system achieves a FEVEROUS score of 0.23 and 53% label accuracy on the blind test data.

Conference paper

Zylberajch H, Lertvittayakumjorn P, Toni F, 2021,

HILDIF: interactive debugging of NLI models using influence functions

, 1st Workshop on Interactive Learning for Natural Language Processing (InterNLP), Publisher: Association for Computational Linguistics, Pages: 1-6

Biases and artifacts in training data can cause unwelcome behavior in text classifiers (such as shallow pattern matching), leading to lack of generalizability. One solution to this problem is to include users in the loop and leverage their feedback to improve models. We propose a novel explanatory debugging pipeline called HILDIF, enabling humans to improve deep text classifiers using influence functions as an explanation method. We experiment on the Natural Language Inference (NLI) task, showing that HILDIF can effectively alleviate artifact problems in fine-tuned BERT models and result in increased model generalizability.

Journal article

Lertvittayakumjorn P, Toni F, 2021,

Explanation-based human debugging of NLP models: a survey

, Transactions of the Association for Computational Linguistics, Vol: 9, Pages: 1508-1528, ISSN: 2307-387X

Debugging a machine learning model is hard since the bug usually involves the training data and the learning process. This becomes even harder for an opaque deep learning model if we have no clue about how the model actually works. In this survey, we review papers that exploit explanations to enable humans to give feedback and debug NLP models. We call this problem explanation-based human debugging (EBHD). In particular, we categorize and discuss existing work along three dimensions of EBHD (the bug context, the workflow, and the experimental setting), compile findings on how EBHD components affect the feedback providers, and highlight open problems that could be future research directions.

Conference paper

Kotonya N, Toni F, 2020,

Explainable Automated Fact-Checking: A Survey

, Barcelona, Spain, 28th International Conference on Computational Linguistics (COLING 2020), Publisher: International Committee on Computational Linguistics, Pages: 5430-5443

A number of exciting advances have been made in automated fact-checking thanks to increasingly larger datasets and more powerful systems, leading to improvements in the complexity of claims which can be accurately fact-checked. However, despite these advances, there are still desirable functionalities missing from the fact-checking pipeline. In this survey, we focus on the explanation functionality -- that is fact-checking systems providing reasons for their predictions. We summarize existing methods for explaining the predictions of fact-checking systems and we explore trends in this topic. Further, we consider what makes for good explanations in this specific domain through a comparative analysis of existing fact-checking explanations against some desirable properties. Finally, we propose further research directions for generating fact-checking explanations, and describe how these may lead to improvements in the research area.

Conference paper

Kotonya N, Toni F, 2020,

Explainable Automated Fact-Checking for Public Health Claims

, 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Publisher: ACL, Pages: 7740-7754

Fact-checking is the task of verifying the veracity of claims by assessing their assertions against credible evidence. The vast majority of fact-checking studies focus exclusively on political claims. Very little research explores fact-checking for other topics, specifically subject matters for which expertise is required. We present the first study of explainable fact-checking for claims which require specific expertise. For our case study we choose the setting of public health. To support this case study we construct a new dataset PUBHEALTH of 11.8K claims accompanied by journalist crafted, gold standard explanations (i.e., judgments) to support the fact-check labels for claims. We explore two tasks: veracity prediction and explanation generation. We also define and evaluate, with humans and computationally, three coherence properties of explanation quality. Our results indicate that, by training on in-domain data, gains can be made in explainable, automated fact-checking for claims which require specific expertise.

Conference paper

Lertvittayakumjorn P, Specia L, Toni F, 2020,

FIND: Human-in-the-loop debugging deep text classifiers

, 2020 Conference on Empirical Methods in Natural Language Processing, Publisher: ACL

Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND, a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).

Book chapter

Cocarascu O, Toni F, 2020,

Deploying Machine Learning Classifiers for Argumentative Relations “in the Wild”

, Argumentation Library, Pages: 269-285

Argument Mining (AM) aims at automatically identifying arguments and components of arguments in text, as well as at determining the relations between these arguments, on various annotated corpora using machine learning techniques (Lippi & Torroni, 2016).

Conference paper

Cocarascu O, Cabrio E, Villata S, Toni F et al., 2020,

Dataset Independent Baselines for Relation Prediction in Argument Mining.

, Publisher: IOS Press, Pages: 45-52

Journal article

Oehmichen A, Hua K, Lopez JAD, Molina-Solana M, Gomez-Romero J, Guo Y et al., 2019,

Not all lies are equal. A study into the engineering of political misinformation in the 2016 US presidential election

, IEEE Access, Vol: 7, Pages: 126305-126314, ISSN: 2169-3536

We investigated whether and how political misinformation is engineered using a dataset of four months worth of tweets related to the 2016 presidential election in the United States. The data contained tweets that achieved a significant level of exposure and was manually labelled into misinformation and regular information. We found that misinformation was produced by accounts that exhibit different characteristics and behaviour from regular accounts. Moreover, the content of misinformation is more novel, polarised and appears to change through coordination. Our findings suggest that engineering of political misinformation seems to exploit human traits such as reciprocity and confirmation bias. We argue that investigating how misinformation is created is essential to understand human biases, diffusion and ultimately better produce public policy.

Conference paper

Lertvittayakumjorn P, Toni F, 2019,

Human-grounded evaluations of explanation methods for text classification

, 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Publisher: ACL Anthology, Pages: 5195-5205

Due to the black-box nature of deep learning models, methods for explaining the models’ results are crucial to gain trust from humans and support collaboration between AIs and humans. In this paper, we consider several model-agnostic and model-specific explanation methods for CNNs for text classification and conduct three human-grounded evaluations, focusing on different purposes of explanations: (1) revealing model behavior, (2) justifying model predictions, and (3) helping humans investigate uncertain predictions. The results highlight dissimilar qualities of the various explanation methods we consider and show the degree to which these methods could serve for each purpose.

Journal article

Čyras K, Birch D, Guo Y, Toni F, Dulay R, Turvey S, Greenberg D, Hapuarachchi T et al., 2019,

Explanations by arbitrated argumentative dispute

, Expert Systems with Applications, Vol: 127, Pages: 141-156, ISSN: 0957-4174

Explaining outputs determined algorithmically by machines is one of the most pressing and studied problems in Artificial Intelligence (AI) nowadays, but the equally pressing problem of using AI to explain outputs determined by humans is less studied. In this paper we advance a novel methodology integrating case-based reasoning and computational argumentation from AI to explain outcomes, determined by humans or by machines, indifferently, for cases characterised by discrete (static) features and/or (dynamic) stages. At the heart of our methodology lies the concept of arbitrated argumentative disputes between two fictitious disputants arguing, respectively, for or against a case's output in need of explanation, and where this case acts as an arbiter. Specifically, in explaining the outcome of a case in question, the disputants put forward as arguments relevant cases favouring their respective positions, with arguments/cases conflicting due to their features, stages and outcomes, and the applicability of arguments/cases arbitrated by the features and stages of the case in question. We in addition use arbitrated dispute trees to identify the excess features that help the winning disputant to win the dispute and thus complement the explanation. We evaluate our novel methodology theoretically, proving desirable properties thereof, and empirically, in the context of primary legislation in the United Kingdom (UK), concerning the passage of Bills that may or may not become laws. High-level factors underpinning a Bill's passage are its content-agnostic features such as type, number of sponsors, ballot order, as well as the UK Parliament's rules of conduct. Given high numbers of proposed legislation (hundreds of Bills a year), it is hard even for legal experts to explain on a large scale why certain Bills pass or not. We show how our methodology can address this problem by automatically providing high-level explanations of why Bills pass or not, based on the given Bills and the

Journal article

Schaub MT, Delvenne JC, Lambiotte R, Barahona M et al., 2019,

Multiscale dynamical embeddings of complex networks

, Physical Review E, Vol: 99, Pages: 062308-1-062308-18, ISSN: 1539-3755

Complex systems and relational data are often abstracted as dynamical processes on networks. To understand, predict, and control their behavior, a crucial step is to extract reduced descriptions of such networks. Inspired by notions from control theory, we propose a time-dependent dynamical similarity measure between nodes, which quantifies the effect a node-input has on the network. This dynamical similarity induces an embedding that can be employed for several analysis tasks. Here we focus on (i) dimensionality reduction, i.e., projecting nodes onto a low-dimensional space that captures dynamic similarity at different timescales, and (ii) how to exploit our embeddings to uncover functional modules. We exemplify our ideas through case studies focusing on directed networks without strong connectivity and signed networks. We further highlight how certain ideas from community detection can be generalized and linked to control theory, by using the here developed dynamical perspective.

Journal article

Altuncu MT, Mayer E, Yaliraki SN, Barahona M et al., 2019,

From free text to clusters of content in health records: An unsupervised graph partitioning approach

, Applied Network Science, Vol: 4, ISSN: 2364-8228

Electronic Healthcare records contain large volumes of unstructured data in different forms. Free text constitutes a large portion of such data, yet this source of richly detailed information often remains under-used in practice because of a lack of suitable methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to the analysis of free text in Hospital Patient Incident reports in the English National Health Service, to find clusters of reports in an unsupervised manner and at different levels of resolution based directly on the free text descriptions contained within them. To do so, we combine recently developed deep neural network text-embedding methodologies based on paragraph vectors with multi-scale Markov Stability community detection applied to a similarity graph of documents obtained from sparsified text vector similarities. We showcase the approach with the analysis of incident reports submitted in Imperial College Healthcare NHS Trust, London. The multiscale community structure reveals levels of meaning with different resolution in the topics of the dataset, as shown by relevant descriptive terms extracted from the groups of records, as well as by comparing a posteriori against hand-coded categories assigned by healthcare personnel. Our content communities exhibit good correspondence with well-defined hand-coded categories, yet our results also provide further medical detail in certain areas as well as revealing complementary descriptors of incidents beyond the external classification. We also discuss how the method can be used to monitor reports over time and across different healthcare providers, and to detect emerging trends that fall outside of pre-existing categories.

Conference paper

Kotonya N, Toni F, 2019,

Gradual Argumentation Evaluation for Stance Aggregation in Automated Fake News Detection

, 6th Workshop on Argument Mining (ArgMining), Publisher: Association for Computational Linguistics, Pages: 156-166

Journal article

Cocarascu O, Toni F, 2018,

Combining deep learning and argumentative reasoning for the analysis of social media textual content using small datasets

, Computational Linguistics, Vol: 44, Pages: 833-858, ISSN: 0891-2017

The use of social media has become a regular habit for many and has changed the way people interact with each other. In this article, we focus on analysing whether news headlines support tweets and whether reviews are deceptive by analysing the interaction or the influence that these texts have on the others, thus exploiting contextual information. Concretely, we define a deep learning method for Relation-based Argument Mining to extract argumentative relations of attack and support. We then use this method for determining whether news articles support tweets, a useful task in fact-checking settings, where determining agreement towards a statement is a useful step towards determining its truthfulness. Furthermore we use our method for extracting Bipolar Argumentation Frameworks from reviews to help detect whether they are deceptive. We show experimentally that our method performs well in both settings. In particular, in the case of deception detection, our method contributes a novel argumentative feature that, when used in combination with other features in standard supervised classifiers, outperforms the latter even on small datasets.

Journal article

Amador Diaz Lopez JC, Collignon-Delmar S, Benoit K, Matsuo A et al., 2017,

Predicting the Brexit Vote by Tracking and Classifying Public Opinion Using Twitter Data

, Statistics, Politics and Policy, Vol: 8, ISSN: 2151-7509

We use 23M Tweets related to the EU referendum in the UK to predict the Brexit vote. In particular, we use user-generated labels known as hashtags to build training sets related to the Leave/Remain campaign. Next, we train SVMs in order to classify Tweets. Finally, we compare our results to Internet and telephone polls. This approach not only reduces the time spent hand-coding data to create a training set, but also achieves a high level of correlation with Internet polls. Our results suggest that Twitter data may be a suitable substitute for Internet polls and may be a useful complement for telephone polls. We also discuss the reach and limitations of this method.

Conference paper

Cocarascu O, Toni F, 2017,

Identifying attack and support argumentative relations using deep learning

, 2017 Conference on Empirical Methods in Natural Language Processing, Publisher: Association for Computational Linguistics, Pages: 1374-1379

We propose a deep learning architecture to capture argumentative relations of attack and support from one piece of text to another, of the kind that naturally occur in a debate. The architecture uses two (unidirectional or bidirectional) Long Short-Term Memory networks and (trained or non-trained) word embeddings, and considerably improves upon existing techniques that use syntactic features and supervised classifiers for the same form of (relation-based) argument mining.
