Li Y, Bellotti A, Adams N, Issues using logistic regression with class imbalance, with a case study from credit risk modelling, Foundations of Data Science
The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
Plasse J, Adams N, 2019, Multiple changepoint detection in categorical data streams, Statistics and Computing, Vol: 29, Pages: 1109-1125, ISSN: 0960-3174
The need for efficient tools is pressing in the era of big data, particularly in streaming data applications. As data streams are ubiquitous, the ability to accurately detect multiple changepoints, without affecting the continuous flow of data, is an important issue. Change detection for categorical data streams is understudied, and existing work commonly introduces fixed control parameters while providing little insight into how they may be chosen. This is ill-suited to the streaming paradigm, motivating the need for an approach that introduces few parameters which may be set without requiring any prior knowledge of the stream. This paper introduces such a method, which can accurately detect changepoints in categorical data streams with fixed storage and computational requirements. The detector relies on the ability to adaptively monitor the category probabilities of a multinomial distribution, where temporal adaptivity is introduced using forgetting factors. A novel adaptive threshold is also developed which can be computed given a desired false positive rate. This method is then compared to sequential and nonsequential change detectors in a large simulation study which verifies the usefulness of our approach. A real data set consisting of nearly 40 million events from a computer network is also investigated.
Lu X, Adams N, Kantas N, 2019, On adaptive estimation for dynamic Bernoulli bandits, Foundations of Data Science, Vol: 1, Pages: 197-225, ISSN: 2639-8001
The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total rewards for a gambler by sequentially pulling an arm from a multi-armed slot machine where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome the shortcoming of slow response to change by deploying adaptive estimation in the standard methods and propose a new family of algorithms, which are adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios and the results show solid improvements of our algorithms in dynamic environments.
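As a rough illustration of the idea in the abstract above, the sketch below replaces the sample mean in $\epsilon$-Greedy with an exponentially discounted mean, so that old rewards are gradually forgotten. It is not the authors' algorithm: the fixed forgetting factor `lam`, the class name `ForgettingMean` and the function `epsilon_greedy_ff` are all assumptions made for this example, and the abstract does not say how the paper's adaptive estimators set or tune such a factor.

```python
import random


class ForgettingMean:
    """Discounted mean with a fixed forgetting factor lam in (0, 1].

    lam = 1 recovers the ordinary sample mean; smaller values
    down-weight older observations more aggressively.
    (Hypothetical helper, not taken from the paper.)
    """

    def __init__(self, lam=0.95):
        self.lam = lam
        self.m = 0.0  # discounted sum of observations
        self.w = 0.0  # discounted number of observations

    def update(self, x):
        self.m = self.lam * self.m + x
        self.w = self.lam * self.w + 1.0

    @property
    def value(self):
        return self.m / self.w if self.w > 0 else 0.0


def epsilon_greedy_ff(reward_probs, epsilon=0.1, lam=0.95, seed=0):
    """Run epsilon-Greedy with discounted means on a Bernoulli bandit.

    reward_probs[t][k] is the success probability of arm k at time t,
    so the reward process may change over time.
    Returns (total reward, list of arms pulled).
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs[0])
    estimates = [ForgettingMean(lam) for _ in range(n_arms)]
    total, pulls = 0, []
    for probs in reward_probs:
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)       # explore
        else:
            values = [e.value for e in estimates]
            arm = values.index(max(values))   # exploit (ties: lowest index)
        reward = 1 if rng.random() < probs[arm] else 0
        estimates[arm].update(reward)
        total += reward
        pulls.append(arm)
    return total, pulls
```

With `lam=1.0` the estimator reduces to the sample mean used by the standard method, which makes the effect of forgetting easy to compare on a simulated stream whose best arm switches partway through.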
Bakoben M, Bellotti A, Adams N, Identification of credit risk based on cluster analysis of account behaviours, Journal of the Operational Research Society, ISSN: 0160-5682
Assessment of risk levels for existing credit accounts is important to the implementation of bank policies and offering financial products. This paper uses cluster analysis of behaviour of credit card accounts to help assess credit risk level. Account behaviour is modelled parametrically and we then implement the behavioural cluster analysis using a recently proposed dissimilarity measure of statistical model parameters. The advantage of this new measure is the explicit exploitation of uncertainty associated with parameters estimated from statistical models. Interesting clusters of real credit card behaviours data are obtained, in addition to superior prediction and forecasting of account default based on the clustering outcomes.
Lau D-H, Adams NM, The importance of analysing data from instrumented infrastructure, International Conference on Smart Infrastructure and Construction
Mikhailova A, Adams N, Hallsworth C, et al., Unsupervised deep learning for instrumented infrastructure: a case study, International Conference on Smart Infrastructure and Construction
Ward S, Cohen E, Adams N, 2019, Fusing multimodal microscopy data for improved cell boundary estimation and fluorophore localization of Pseudomonas aeruginosa, Asilomar Conference on Signals, Systems and Computers, Publisher: IEEE
With advances in experimental technologies, the use of biological imaging has grown rapidly and there is a need for procedures to combine data arising from different modalities. We propose a procedure to combine yellow fluorescence protein excitation and differential interference contrast microscopy time lapse videos to better estimate the cellular boundary of Pseudomonas aeruginosa (P. aeruginosa) and localization of its type VI secretion system (T6SS). By approximating the shape by an ellipse, we construct a penalized objective function which accounts for both sources; the minimum of which provides an elliptical approximation to their cellular boundaries. Our approach suggests improved localization of the T6SS on the estimated cell boundary of P. aeruginosa constructed using both sources of data compared to using each in isolation.
Evangelou M, 2019, An anomaly detection framework for cyber-security data, Publisher: NA
Data-driven anomaly detection systems have unrivalled potential as complementary defence systems to existing signature-based tools as the number of cyber attacks increases. In this manuscript we present an anomaly detection system that detects any abnormal deviations from the normal behaviour of an individual device. Device behaviour is defined as the number of network traffic events involving the device of interest observed within a pre-specified time period. The behaviour of each device at normal state is modelled to depend on its observed historic behaviour. A number of statistical and machine learning approaches are explored for modelling this relationship and through a comparative study, the Quantile Regression Forests approach is found to have the best predictive power. Based on the prediction intervals of the Quantile Regression Forests an anomaly detection system is proposed that characterises as abnormal, any observed behaviour outside of these intervals. Through a series of experiments the proposed anomaly detection system is found to outperform two other detection systems. The presented work has been conducted on two enterprise networks.
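The decision rule described in the abstract above (flag any observed behaviour that falls outside a prediction interval) can be sketched in a few lines. This sketch swaps in plain empirical quantiles of a device's past event counts in place of Quantile Regression Forests, so it illustrates only the interval rule, not the paper's model. The function names, the `coverage` parameter and the example counts are assumptions made for illustration.

```python
import math


def empirical_quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def is_anomalous(history, observed, coverage=0.95):
    """Flag `observed` if it lies outside the central `coverage` interval
    of the historic counts. Returns (flag, (lower, upper)).

    Stand-in for a model-based prediction interval: here the interval
    comes from empirical quantiles of `history` only.
    """
    s = sorted(history)
    alpha = 1.0 - coverage
    lower = empirical_quantile(s, alpha / 2.0)
    upper = empirical_quantile(s, 1.0 - alpha / 2.0)
    return (observed < lower or observed > upper), (lower, upper)


# Hypothetical event counts per time period for one device.
past_counts = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
flag, interval = is_anomalous(past_counts, observed=40)
```

A regression-based interval, as in the abstract, would condition on recent history rather than pooling all past periods; the rule applied to the resulting interval is the same.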
Hogan J, Adams NM, 2018, A Study of Data Fusion for Predicting Novel Activity in Enterprise Cyber-Security, 16th Annual IEEE International Conference on Intelligence and Security Informatics (IEEE ISI), Publisher: IEEE, Pages: 37-42
Riddle-Workman E, Evangelou M, Adams N, Adaptive Anomaly Detection on Network Data Streams, IEEE Conference on Intelligence and Security Informatics (ISI) 2018, Publisher: IEEE
As the number of cyber-attacks increases, there has been increasing emphasis on developing complementary methods of detection to the existing signature-based approaches. This work builds upon a previously discovered persistent structure within the Los Alamos National Laboratory network data sources, to develop a regression based streaming anomaly detection mechanism that can adapt to the network behaviour over time. The methodology has also been applied to a new data set of the same network to assess the extent of its pertinence in time.
Heard N, Adams N, Rubin-Delanchy P, 2018, Data Science for Cyber-Security, Publisher: Wspc (Europe), ISBN: 9781786345639
This book provides insight into a range of data science techniques for addressing these pressing concerns. The application of statistical and broader data science techniques provides an exciting growth area in the design of cyber defences.
Noble J, Adams N, 2018, Real-time dynamic network anomaly detection, IEEE Intelligent Systems, Vol: 33, Pages: 5-18, ISSN: 1541-1672
Methodology for statistical analysis of enterprise network data is becoming increasingly important in cyber-security. The volume and velocity of enterprise network data sources put a premium on streaming analytics that pass over the data once, while handling temporal variation in the process. In this paper we introduce ReTiNA: a framework for streaming network anomaly detection. This procedure first detects anomalies in the correlation processes on individual edges of the network graph. Second, anomalies across multiple edges are combined and scored to give network-wide situational awareness. The approach is tested in simulation and demonstrated on two real Netflow datasets.
Lau D, Adams N, Girolami M, et al., 2018, The role of statistics in data-centric engineering, Statistics and Probability Letters, Vol: 136, Pages: 58-62, ISSN: 0167-7152
We explore the role of statistics for Big Data analysis arising from the emerging field of Data-Centric Engineering. Using examples related to sensor-instrumented bridges, we highlight a number of issues and challenges. These are broadly categorised as relating to uncertainty, latent-structure modelling, and the synthesis of statistical models and abstract physical models. Keywords: Big Data, Data-Centric Engineering, Digital Twin, Fibre-optic sensor, Instrumented infrastructure, Statistics
Lau D, Butler L, Adams N, et al., Real-time Statistical Modelling of Data Generated from Self-Sensing Bridges, Proceedings of the Institution of Civil Engineers - Civil Engineering, ISSN: 0965-089X
Hogan J, Cohen EAK, Adams NM, 2017, Devising a fairer method for adjusting target scores in interrupted one-day international cricket, Electronic Journal of Applied Statistical Analysis, Vol: 10, Pages: 745-758, ISSN: 2070-5948
One-day international cricket matches face the problem of weather interruption. In such circumstances, a so-called rain rule is used to decide the outcome. A variety of approaches for constructing such rules has been proposed, with the Duckworth-Lewis method being preferred in the sport. There are a number of issues to consider in reasoning about the effectiveness of a rain rule, notably accuracy (does the rule make the right decision?) and fairness (are both teams treated equally?). We develop an approach that is a hybrid of resource-based and so-called probability-preserving approaches and provide empirical evidence that this hybrid method is superior in terms of fairness while competitive in terms of accuracy.
Schon C, Adams NM, Evangelou M, Clustering and monitoring edge behaviour in enterprise network traffic, IEEE International Conference on Intelligence and Security Informatics, Publisher: IEEE
This paper takes an unsupervised learning approach for monitoring edge activity within an enterprise computer network. Using NetFlow records, features are gathered across the active connections (edges) in 15-minute time windows. Then, edges are grouped into clusters using the k-means algorithm. This process is repeated over contiguous windows. A series of informative indicators are derived by examining the relationship of edges with the observed cluster structure. This leads to an intuitive method for monitoring network behaviour and a temporal description of edge behaviour at global and local levels.
Rubin-Delanchy P, Heard NA, 2016, Disassortativity of computer networks, IEEE International Conference on Intelligence and Security Informatics, Publisher: IEEE
Network data is ubiquitous in cyber-security applications. Accurately modelling such data allows discovery of anomalous edges, subgraphs or paths, and is key to many signature-free cyber-security analytics. We present a recurring property of graphs originating from cyber-security applications, often considered a ‘corner case’ in the main literature on network data analysis, that greatly affects the performance of standard ‘off-the-shelf’ techniques. This is the property that similarity, in terms of network behaviour, does not imply connectivity, and in fact the reverse is often true. We call this disassortativity. The phenomenon is illustrated using network flow data collected on an enterprise network, and we show how Big Data analytics designed to detect unusual connectivity patterns can be improved.
Evangelou M, Adams N, 2016, Predictability of NetFlow data, IEEE International Conference on Intelligence and Security Informatics, Publisher: IEEE
The behaviour of individual devices connected to an enterprise network can vary dramatically, as a device’s activity depends on the user operating the device as well as on all behind the scenes operations between the device and the network. Being able to understand and predict a device’s behaviour in a network can work as the foundation of an anomaly detection framework, as devices may show abnormal activity as part of a cyber attack. The aim of this work is the construction of a predictive regression model for a device’s behaviour at normal state. The behaviour of a device is presented by a quantitative response and modelled to depend on historic data recorded by NetFlow.
Whitehouse M, Evangelou M, Adams N, 2016, Activity-based temporal anomaly detection in enterprise-cyber security, IEEE International Big Data Analytics for Cybersecurity computing (BDAC'16) Workshop, IEEE International Conference on Intelligence and Security Informatics, Publisher: IEEE
Statistical anomaly detection is emerging as an important complement to signature-based methods for enterprise network defence. In this paper, we isolate a persistent structure in two different enterprise network data sources. This structure provides the basis of a regression-based anomaly detection method. The procedure is demonstrated on a large public domain data set.
Bodenham DA, Adams NM, 2016, Continuous monitoring for changepoints in data streams using adaptive estimation, Statistics and Computing, Vol: 27, Pages: 1257-1270, ISSN: 1573-1375
Data streams are characterised by a potentially unending sequence of high-frequency observations which are subject to unknown temporal variation. Many modern streaming applications demand the capability to sequentially detect changes as soon as possible after they occur, while continuing to monitor the stream as it evolves. We refer to this problem as continuous monitoring. Sequential algorithms such as CUSUM, EWMA and their more sophisticated variants usually require a pair of parameters to be selected for practical application. However, the choice of parameter values is often based on the anticipated size of the changes and a given choice is unlikely to be optimal for the multiple change sizes which are likely to occur in a streaming data context. To address this critical issue, we introduce a changepoint detection framework based on adaptive forgetting factors that, instead of multiple control parameters, only requires a single parameter to be selected. Simulated results demonstrate that this framework has utility in a continuous monitoring setting. In particular, it reduces the burden of selecting parameters in advance. Moreover, the methodology is demonstrated on real data arising from Foreign Exchange markets.
Adams N, Heard N, 2016, Dynamic networks and cyber-security, ISBN: 9781786340740
© 2016 by World Scientific Publishing Europe Ltd. All rights reserved. As an under-studied area of academic research, the analysis of computer network traffic data is still in its infancy. However, the challenge of detecting and mitigating malicious or unauthorised behaviour through the lens of such data is becoming an increasingly prominent issue. This collection of papers by leading researchers and practitioners synthesises cutting-edge work in the analysis of dynamic networks and statistical aspects of cyber security. The book is structured in such a way as to keep security application at the forefront of discussions. It offers readers easy access into the area of data analysis for complex cyber-security applications, with a particular focus on temporal and network aspects. Chapters can be read as standalone sections and provide rich reviews of the latest research within the field of cyber-security. Academic readers will benefit from state-of-the-art descriptions of new methodologies and their extension to real practical problems while industry professionals will appreciate access to more advanced methodology than ever before.
Bakoben M, Bellotti AG, Adams NM, 2016, Improving clustering performance by incorporating uncertainty, Pattern Recognition Letters, Vol: 77, Pages: 28-34, ISSN: 1872-7344
In more challenging problems the input to a clustering problem is not raw data objects, but rather parametric statistical summaries of the data objects. For example, time series of different lengths may be clustered on the basis of estimated parameters from autoregression models. Such summary procedures usually provide estimates of uncertainty for parameters, and ignoring this source of uncertainty affects the recovery of the true clusters. This paper is concerned with the incorporation of this source of uncertainty in the clustering procedure. A new dissimilarity measure is developed based on geometric overlap of confidence ellipsoids implied by the uncertainty estimates. In extensive simulation studies and a synthetic time series benchmark dataset, this new measure is shown to yield improved performance over standard approaches.
Noble J, Adams NM, 2016, Correlation-based Streaming Anomaly Detection in Cyber-Security, 16th IEEE International Conference on Data Mining (ICDM), Publisher: IEEE, Pages: 311-318, ISSN: 2375-9232
Plasse J, Adams N, 2016, Handling Delayed Labels in Temporally Evolving Data Streams, 4th IEEE International Conference on Big Data (Big Data), Publisher: IEEE, Pages: 2416-2424
Bakoben M, Adams N, Bellotti A, 2016, Uncertainty aware clustering for behaviour in enterprise networks, 16th IEEE International Conference on Data Mining (ICDM), Publisher: IEEE, Pages: 269-272, ISSN: 2375-9232
Adams NM, Bodenham DA, 2015, A comparison of efficient approximations for a weighted sum of chi-squared random variables, Statistics and Computing, Vol: 26, Pages: 917-928, ISSN: 0960-3174
In many applications, the cumulative distribution function (cdf) $F_{Q_N}$ of a positively weighted sum of N i.i.d. chi-squared random variables $Q_N$ is required. Although there is no known closed-form solution for $F_{Q_N}$, there are many good approximations. When computational efficiency is not an issue, Imhof’s method provides a good solution. However, when both the accuracy of the approximation and the speed of its computation are a concern, there is no clear preferred choice. Previous comparisons between approximate methods could be considered insufficient. Furthermore, in streaming data applications where the computation needs to be both sequential and efficient, only a few of the available methods may be suitable. Streaming data problems are becoming ubiquitous and provide the motivation for this paper. We develop a framework to enable a much more extensive comparison between approximate methods for computing the cdf of weighted sums of an arbitrary random variable. Utilising this framework, a new and comprehensive analysis of four efficient approximate methods for computing $F_{Q_N}$ is performed. This analysis procedure is much more thorough and statistically valid than previous approaches described in the literature. A surprising result of this analysis is that the accuracy of these approximate methods increases with N.
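To make the quantity in the abstract above concrete, the following sketch approximates the cdf of a positively weighted sum of chi-squared variables by classical two-moment matching to a scaled chi-squared distribution (often called the Satterthwaite-Welch approximation). The abstract does not name the four methods it compares, so this should be read as one well-known example of an efficient approximation, not as the paper's method. The sketch assumes each variable has one degree of freedom, and the function names are made up for this example.

```python
import math


def reg_lower_gamma(a, x, tol=1e-12, max_terms=10_000):
    """Regularised lower incomplete gamma function P(a, x), series form.

    P(a, x) = exp(-x) * x**a / Gamma(a + 1) * sum_n x**n / ((a+1)...(a+n)).
    Adequate for a sketch; convergence is slow when x is much larger than a.
    """
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return math.exp(-x + a * math.log(x) - math.lgamma(a + 1.0)) * total


def weighted_chisq_cdf_sw(weights, x):
    """Approximate P(sum_i w_i * X_i <= x), X_i i.i.d. chi-squared(1), w_i > 0.

    Matches mean (sum w) and variance (2 * sum w^2) to g * chi-squared(h):
        g = sum(w^2) / sum(w),  h = sum(w)^2 / sum(w^2).
    """
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    g = s2 / s1
    h = s1 * s1 / s2
    # cdf of chi-squared(h) at x / g is P(h / 2, x / (2 g))
    return reg_lower_gamma(h / 2.0, x / (2.0 * g))
```

When all weights equal one, the matched distribution is exactly chi-squared with N degrees of freedom, so the approximation is exact in that special case; this gives a simple sanity check.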
Weston DJ, Russell RA, Batty E, et al., 2015, New quantitative approaches reveal the spatial preference of nuclear compartments in mammalian fibroblasts, Journal of the Royal Society Interface, Vol: 12, ISSN: 1742-5689
Hand DJ, Adams NM, 2014, Selection bias in credit scorecard evaluation, Journal of the Operational Research Society, Vol: 65, Pages: 408-415, ISSN: 0160-5682
Bodenham DA, Adams NM, 2014, Adaptive change detection for relay-like behaviour, IEEE Joint Intelligence and Security Informatics Conference (JISIC 2014), Publisher: IEEE, Pages: 252-255
Adams N, Heard N, 2014, Data analysis for network cyber-security, ISBN: 9781783263745
© 2014 by Imperial College Press. There is increasing pressure to protect computer networks against unauthorized intrusion, and some work in this area is concerned with engineering systems that are robust to attack. However, no system can be made invulnerable. Data Analysis for Network Cyber-Security focuses on monitoring and analyzing network traffic data, with the intention of preventing, or quickly identifying, malicious activity. Such work involves the intersection of statistics, data mining and computer science. Fundamentally, network traffic is relational, embodying a link between devices. As such, graph analysis approaches are a natural candidate. However, such methods do not scale well to the demands of real problems, and the critical aspect of the timing of communications events is not accounted for in these approaches. This book gathers papers from leading researchers to provide both background to the problems and a description of cutting-edge methodology. The contributors are from diverse institutions and areas of expertise and were brought together at a workshop held at the University of Bristol in March 2013 to address the issues of network cyber security. The workshop was supported by the Heilbronn Institute for Mathematical Research.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.