Sanna Passino F, Adams N, Cohen E, et al., 2023, Statistical cybersecurity: a brief discussion of challenges, data structures, and future directions, Harvard Data Science Review, Vol: 5, Pages: 1-10, ISSN: 2644-2353
Hallgren KL, Heard NA, Adams NM, 2022, Changepoint detection in non-exchangeable data, Statistics and Computing, Vol: 32, Pages: 1-19, ISSN: 0960-3174
Changepoint models typically assume the data within each segment are independent and identically distributed conditional on some parameters that change across segments. This construction may be inadequate when data are subject to local correlation patterns, often resulting in many more changepoints fitted than preferable. This article proposes a Bayesian changepoint model that relaxes the assumption of exchangeability within segments. The proposed model supposes data within a segment are m-dependent for some unknown m⩾0 that may vary between segments, resulting in a model suitable for detecting clear discontinuities in data that are subject to different local temporal correlations. The approach is suited to both continuous and discrete data. A novel reversible jump Markov chain Monte Carlo algorithm is proposed to sample from the model; in particular, a detailed analysis of the parameter space is exploited to build proposals for the orders of dependence. Two applications demonstrate the benefits of the proposed model: computer network monitoring via change detection in count data, and segmentation of financial time series.
Shlomovich L, Cohen E, Adams N, 2022, A parameter estimation method for multivariate binned Hawkes processes, Statistics and Computing, Vol: 32, ISSN: 0960-3174
It is often assumed that events cannot occur simultaneously when modelling data with point processes. This raises a problem as real-world data often contains synchronous observations due to aggregation or rounding, resulting from limitations on recording capabilities and the expense of storing high volumes of precise data. In order to gain a better understanding of the relationships between processes, we consider modelling the aggregated event data using multivariate Hawkes processes, which offer a description of mutually-exciting behaviour and have found wide applications in areas including seismology and finance. Here we generalise existing methodology on parameter estimation of univariate aggregated Hawkes processes to the multivariate case using a Monte Carlo Expectation-Maximization (MCEM) algorithm and through a simulation study illustrate that alternative approaches to this problem can be severely biased, with the multivariate MCEM method outperforming them in terms of MSE in all considered cases.
A key difficulty that arises from real event data is imprecision in the recording of event time-stamps. In many cases, retaining event times with a high precision is expensive due to the sheer volume of activity. Combined with practical limits on the accuracy of measurements, binned data is common. In order to use point processes to model such event data, tools for handling parameter estimation are essential. Here we consider parameter estimation of the Hawkes process, a type of self-exciting point process that has found application in the modeling of financial stock markets, earthquakes and social media cascades. We develop a novel optimization approach to parameter estimation of binned Hawkes processes using a modified Expectation-Maximization algorithm, referred to as Binned Hawkes Expectation Maximization (BH-EM). Through a detailed simulation study, we demonstrate that existing methods are capable of producing severely biased and highly variable parameter estimates and that our novel BH-EM method significantly outperforms them in all studied circumstances. We further illustrate the performance on network flow (NetFlow) data between devices in a real large-scale computer network, to characterize triggering behavior. These results highlight the importance of correct handling of binned data.
Yahze L, Adams N, Bellotti A, 2022, A relabeling approach to handling the class imbalance problem for logistic regression, Journal of Computational and Graphical Statistics, Vol: 31, Pages: 241-253, ISSN: 1061-8600
Logistic regression is a standard procedure for real-world classification problems. The challenge of class imbalance arises in two-class classification problems when the minority class is observed much less than the majority class. This characteristic is endemic in many domains. Work by Owen has shown that cluster structure among the minority class may be a specific problem in highly imbalanced logistic regression. In this paper, we propose a novel relabeling approach to handle the class imbalance problem when using logistic regression, which essentially assigns new labels to the minority class observations. An Expectation-Maximization algorithm is formalized to serve as a tool for efficiently computing this relabeling. Modeling on such relabeled data can lead to improved predictive performance. We demonstrate the effectiveness of this approach with detailed experiments on real data sets.
Holtegebaum H, Adams N, Lau D-H, 2021, Unsupervised streaming anomaly detection for instrumented infrastructure, Annals of Applied Statistics, Vol: 15, Pages: 1101-1125, ISSN: 1932-6157
Structural Health Monitoring (SHM) often involves instrumenting structures with distributed sensor networks. These networks typically provide high frequency data describing the spatio-temporal behaviour of the assets. A main objective of SHM is to reason about changes in structures’ behaviour using sensor data. We construct a streaming anomaly detection method for data from a railway bridge instrumented with a fibre-optic sensor network. The data exhibits trend over time, which may be partially attributable to environmental factors, calling for temporally adaptive estimation. Exploiting a latent structure present in the data motivates a quantity of interest for anomaly detection. This quantity is estimated sequentially and adaptively using a new formulation of streaming Principal Component Analysis. Anomaly detection for this quantity is then provided using Conformal Prediction. Like all streaming methods, the proposed method has free control parameters which are set using simulations based on bridge data. Experiments demonstrate that this method can operate at the sampling frequency of the data while providing accurate tracking of the target quantity. Further, the anomaly detection is able to detect train passage events. Finally, the method reveals a previously unreported cyclic structure present in the data.
Adams N, Riddle-Workman E, Evangelou M, 2021, Multi-Type relational clustering for enterprise cyber-security networks, Pattern Recognition Letters, Vol: 149, Pages: 172-178, ISSN: 0167-8655
Several cyber-security data sources are collected in enterprise networks providing relational information between different types of nodes in the network, namely computers, users and ports. This relational data can be expressed as adjacency matrices detailing inter-type relationships corresponding to relations between nodes of different types and intra-type relationships showing relationships between nodes of the same type. In this paper, we propose an extension of Non-Negative Matrix Tri-Factorisation (NMTF) to simultaneously cluster nodes based on their intra and inter-type relationships. Existing NMTF based clustering methods suffer from long computational times due to large matrix multiplications. In our approach, we enforce stricter cluster indicator constraints on the factor matrices to circumvent these issues. Additionally, to make our proposed approach less susceptible to variation in results due to random initialisation, we propose a novel initialisation procedure based on Non-Negative Double Singular Value Decomposition for multi-type relational clustering. Finally, a new performance measure suitable for assessing clustering performance on unlabelled multi-type relational data sets is presented. Our algorithm is assessed on both a simulated and real computer network against standard approaches showing its strong performance.
Plasse J, Helfer Hoeltgebaum H, Adams N, 2021, Streaming changepoint detection for transition matrices, Data Mining and Knowledge Discovery, Vol: 35, Pages: 1287-1316, ISSN: 1384-5810
Sequentially detecting multiple changepoints in a data stream is a challenging task. Difficulties relate to both computational and statistical aspects, and in the latter, specifying control parameters is a particular problem. Choosing control parameters typically relies on unrealistic assumptions, such as the distributions generating the data, and their parameters, being known. This is implausible in the streaming paradigm, where several changepoints will exist. Further, current literature is mostly concerned with streams of continuous-valued observations, and focuses on detecting a single changepoint. There is a dearth of literature dedicated to detecting multiple changepoints in transition matrices, which arise from a sequence of discrete states. This paper makes the following contributions: a complete framework is developed for adaptively and sequentially estimating a Markov transition matrix in the streaming data setting. A change detection method is then developed, using a novel moment matching technique, which can effectively monitor for multiple changepoints in a transition matrix. This adaptive detection and estimation procedure for transition matrices, referred to as ADEPT-M, is compared to several change detectors on synthetic data streams, and is implemented on two real-world data streams – one consisting of over nine million HTTP web requests, and the other being a well-studied electricity market data set.
Plasse J, Hoeltgebaum H, Adams NM, 2021, Streaming changepoint detection for transition matrices (Apr, 10.1007/s10618-021-00747-7, 2021), Data Mining and Knowledge Discovery, Vol: 35, Pages: 1-1, ISSN: 1384-5810
Mikhailova A, Adams N, Hallsworth C, et al., 2021, Unsupervised deep learning-powered anomaly detection for instrumented infrastructure, Proceedings of the Institution of Civil Engineers - Smart Infrastructure and Construction, Vol: 172, Pages: 135-147, ISSN: 2397-8759
Deep learning methods have recently shown great success in numerous fields, including finance, healthcare, linguistics, robotics and even cybersports. Unsupervised learning methods identify the dominant patterns of variability that shape a data set. Such patterns may correspond to well-understood processes, previously unknown clusters or anomalies. This paper presents a case study where a state-of-the-art family of unsupervised deep learning models called variational autoencoder (VAE) is applied to data accrued from a network of fibre-optic sensors installed within a composite steel–concrete half-through railway bridge. The goals were (a) to characterise automatically the behaviour of the bridge based on sensor measurements and, (b) based on this characterisation, to determine when a train passes across a bridge. Based on the VAE model, an algorithm is presented to identify automatically the ‘train event’ points in an unsupervised setting. Two architectures for the VAE model are compared with commonly used baselines. The architecture tailored for modelling sequential data is shown to outperform other methods considered, on both seen and unseen data. No special hyperparameter optimisation is required. This study illustrates how state-of-the-art deep learning methods can be applied to a civil infrastructure engineering problem without directly modelling the physics of the objects or performing tedious hyperparameter optimisation.
Helfer Hoeltgebaum H, Adams N, Fernandes C, 2021, Estimation, forecasting and anomaly detection for nonstationary streams using adaptive estimation, IEEE Transactions on Cybernetics, Vol: 52, Pages: 7956-7967, ISSN: 1083-4419
Streaming data provides substantial challenges for data analysis. From a computational standpoint, these challenges arise from constraints related to computer memory and processing speed. Statistically, the challenges relate to constructing procedures that can handle so-called concept drift, the tendency of future data to have different underlying properties to current and historic data. The issue of handling structure, such as trend and periodicity, remains a difficult problem for streaming estimation. We propose the real-time adaptive component (RAC), a penalized-regression modeling framework that satisfies the computational constraints of streaming data, and provides the capability for dealing with concept drift. At the core of the estimation process are techniques from adaptive filtering. The RAC procedure adopts a specified basis to handle local structure, along with a least absolute shrinkage operator-like penalty procedure to handle overfitting. We enhance the RAC estimation procedure with a streaming anomaly detection capability. The experiments with simulated data suggest the procedure can be considered as a competitive tool for a variety of scenarios, and an illustration with real cyber-security data further demonstrates the promise of the method.
Ward S, Cohen E, Adams N, 2021, Testing for complete spatial randomness on three dimensional bounded convex shapes, Spatial Statistics, Vol: 41, ISSN: 2211-6753
There is currently a gap in theory for point patterns that lie on the surface of objects, with researchers focusing on patterns that lie in a Euclidean space, typically planar and spatial data. Methodology for planar and spatial data thus relies on Euclidean geometry and is therefore inappropriate for analysis of point patterns observed in non-Euclidean spaces. Recently, there have been extensions to the analysis of point patterns on a sphere; however, many other shapes are left unexplored. This is in part due to the challenge of defining the notion of stationarity for a point process existing on such a space due to the lack of rotational and translational isometries. Here, we construct functional summary statistics for Poisson processes defined on convex shapes in three dimensions. Using the Mapping Theorem, a Poisson process can be transformed from any convex shape to a Poisson process on the unit sphere which has rotational symmetries that allow for functional summary statistics to be constructed. We present the first and second order properties of such summary statistics and demonstrate how they can be used to construct a test statistic to determine whether an observed pattern exhibits complete spatial randomness or spatial preference on the original convex space. We compare this test statistic with one constructed from an analogue L-function for inhomogeneous point processes on the sphere. The Type I and II errors of our test statistics are explored through simulations on ellipsoids of varying dimensions.
Evangelou M, Adams N, 2020, An anomaly detection framework for cyber-security data, Computers and Security, Vol: 97, Pages: 1-10, ISSN: 0167-4048
Data-driven anomaly detection systems have unrivalled potential as complementary defence systems to existing signature-based tools as the number of cyber attacks increases. In this manuscript an anomaly detection system is presented that detects any abnormal deviations from the normal behaviour of an individual device. Device behaviour is defined as the number of network traffic events involving the device of interest observed within a pre-specified time period. The behaviour of each device at normal state is modelled to depend on its observed historic behaviour. A number of statistical and machine learning approaches are explored for modelling this relationship and through a comparative study, the Quantile Regression Forests approach is found to have the best predictive power. Based on the prediction intervals of the Quantile Regression Forests an anomaly detection system is proposed that characterises as abnormal any observed behaviour outside of these intervals. A series of experiments for contaminating normal device behaviour are presented for examining the performance of the anomaly detection system. Through the conducted analysis the proposed anomaly detection system is found to outperform two other detection systems. The presented work has been conducted on two enterprise networks.
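The interval-based flagging rule described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' system: plain empirical percentiles of a device's historic counts stand in for Quantile Regression Forest prediction intervals, and the function name `flag_anomalies` and the 5%/95% levels are assumptions made for the example.

```python
from statistics import quantiles


def flag_anomalies(history, observed, lower_pct=5, upper_pct=95):
    """Flag observed event counts that fall outside an empirical interval.

    Assumption: empirical percentiles of `history` replace the Quantile
    Regression Forest prediction intervals used in the paper.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(history, n=100, method="inclusive")
    lower, upper = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # A count is flagged as abnormal when it lies outside [lower, upper].
    return [not (lower <= count <= upper) for count in observed]
```

In the paper's setting the interval would be conditional on each device's recent behaviour; here it is a single unconditional interval per device, which keeps the sketch short.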
Bakoben M, Bellotti A, Adams N, 2020, Identification of credit risk based on cluster analysis of account behaviours, Journal of the Operational Research Society, Vol: 71, Pages: 775-783, ISSN: 0160-5682
Assessment of risk levels for existing credit accounts is important to the implementation of bank policies and offering financial products. This paper uses cluster analysis of behaviour of credit card accounts to help assess credit risk level. Account behaviour is modelled parametrically and we then implement the behavioural cluster analysis using a recently proposed dissimilarity measure of statistical model parameters. The advantage of this new measure is the explicit exploitation of uncertainty associated with parameters estimated from statistical models. Interesting clusters of real credit card behaviours data are obtained, in addition to superior prediction and forecasting of account default based on the clustering outcomes.
Li Y, Bellotti A, Adams N, 2019, Issues using logistic regression with class imbalance, with a case study from credit risk modelling, Foundations of Data Science
The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
Plasse J, Adams N, 2019, Multiple changepoint detection in categorical data streams, Statistics and Computing, Vol: 29, Pages: 1109-1125, ISSN: 0960-3174
The need for efficient tools is pressing in the era of big data, particularly in streaming data applications. As data streams are ubiquitous, the ability to accurately detect multiple changepoints, without affecting the continuous flow of data, is an important issue. Change detection for categorical data streams is understudied, and existing work commonly introduces fixed control parameters while providing little insight into how they may be chosen. This is ill-suited to the streaming paradigm, motivating the need for an approach that introduces few parameters which may be set without requiring any prior knowledge of the stream. This paper introduces such a method, which can accurately detect changepoints in categorical data streams with fixed storage and computational requirements. The detector relies on the ability to adaptively monitor the category probabilities of a multinomial distribution, where temporal adaptivity is introduced using forgetting factors. A novel adaptive threshold is also developed which can be computed given a desired false positive rate. This method is then compared to sequential and nonsequential change detectors in a large simulation study which verifies the usefulness of our approach. A real data set consisting of nearly 40 million events from a computer network is also investigated.
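The idea of monitoring multinomial category probabilities with a forgetting factor can be sketched as below. This is a hedged illustration under stated assumptions: the forgetting factor `lam` is held fixed, the function name `forgetting_factor_probabilities` is hypothetical, and the paper's adaptive threshold and change detector are not reproduced.

```python
def forgetting_factor_probabilities(stream, categories, lam=0.95):
    """Track category probabilities with exponentially discounted counts.

    Assumption: `lam` in (0, 1] is fixed; lam = 1 recovers the ordinary
    relative frequencies, smaller values forget old data faster.
    """
    counts = {c: 0.0 for c in categories}  # discounted count per category
    weight = 0.0                           # discounted number of observations
    estimates = []
    for x in stream:
        # Down-weight everything seen so far, then add the new observation.
        for c in counts:
            counts[c] *= lam
        counts[x] += 1.0
        weight = lam * weight + 1.0
        estimates.append({c: counts[c] / weight for c in counts})
    return estimates
```

Storage is one number per category plus one weight, so memory and per-observation cost stay fixed however long the stream runs.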
Lu X, Adams N, Kantas N, 2019, On adaptive estimation for dynamic Bernoulli bandits, Foundations of Data Science, Vol: 1, Pages: 197-225, ISSN: 2639-8001
The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total rewards for a gambler by sequentially pulling an arm from a multi-armed slot machine where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome the shortcoming of slow response to change by deploying adaptive estimation in the standard methods and propose a new family of algorithms, which are adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios and the results show solid improvements of our algorithms in dynamic environments.
Lau D-H, Adams NM, 2019, The importance of analysing data from instrumented infrastructure, International Conference on Smart Infrastructure and Construction
Mikhailova A, Adams N, Hallsworth C, et al., 2019, Unsupervised deep learning for instrumented infrastructure: a case study, International Conference on Smart Infrastructure and Construction
Ward S, Cohen E, Adams N, 2019, Fusing multimodal microscopy data for improved cell boundary estimation and fluorophore localization of Pseudomonas aeruginosa, Asilomar Conference on Signals, Systems and Computers, Publisher: IEEE
With advances in experimental technologies, the use of biological imaging has grown rapidly and there is need for procedures to combine data arising from different modalities. We propose a procedure to combine yellow fluorescence protein excitation and differential interference contrast microscopy time lapse videos to better estimate the cellular boundary of Pseudomonas aeruginosa (P. aeruginosa) and localization of its type VI secretion system (T6SS). By approximating the shape by an ellipse, we construct a penalized objective function which accounts for both sources; the minimum of which provides an elliptical approximation to their cellular boundaries. Our approach suggests improved localization of the T6SS on the estimated cell boundary of P. aeruginosa constructed using both sources of data compared to using each in isolation.
Evangelou M, 2019, An anomaly detection framework for cyber-security data, Publisher: NA
Data-driven anomaly detection systems have unrivalled potential as complementary defence systems to existing signature-based tools as the number of cyber attacks increases. In this manuscript we present an anomaly detection system that detects any abnormal deviations from the normal behaviour of an individual device. Device behaviour is defined as the number of network traffic events involving the device of interest observed within a pre-specified time period. The behaviour of each device at normal state is modelled to depend on its observed historic behaviour. A number of statistical and machine learning approaches are explored for modelling this relationship and through a comparative study, the Quantile Regression Forests approach is found to have the best predictive power. Based on the prediction intervals of the Quantile Regression Forests an anomaly detection system is proposed that characterises as abnormal any observed behaviour outside of these intervals. Through a series of experiments the proposed anomaly detection system is found to outperform two other detection systems. The presented work has been conducted on two enterprise networks.
Riddle-Workman E, Evangelou M, Adams N, 2018, Adaptive anomaly detection on network data streams, IEEE Conference on Intelligence and Security Informatics (ISI) 2018, Publisher: IEEE
As the number of cyber-attacks increases, there has been increasing emphasis on developing complementary methods of detection to the existing signature-based approaches. This work builds upon a previously discovered persistent structure within the Los Alamos National Laboratory network data sources, to develop a regression based streaming anomaly detection mechanism that can adapt to the network behaviour over time. The methodology has also been applied to a new data set of the same network to assess the extent of its pertinence in time.
Hogan J, Adams NM, 2018, A Study of Data Fusion for Predicting Novel Activity in Enterprise Cyber-Security, 16th Annual IEEE International Conference on Intelligence and Security Informatics (IEEE ISI), Publisher: IEEE, Pages: 37-42
Heard N, Adams N, Rubin-Delanchy P, 2018, Data Science for Cyber-Security, Publisher: Wspc (Europe), ISBN: 9781786345639
This book provides insight into a range of data science techniques for addressing these pressing concerns. The application of statistical and broader data science techniques provides an exciting growth area in the design of cyber defences.
Noble J, Adams N, 2018, Real-time dynamic network anomaly detection, IEEE Intelligent Systems, Vol: 33, Pages: 5-18, ISSN: 1541-1672
Methodology for statistical analysis of enterprise network data is becoming increasingly important in cyber-security. The volume and velocity of enterprise network data sources put a premium on streaming analytics that pass over the data once, while handling temporal variation in the process. In this paper we introduce ReTiNA: a framework for streaming network anomaly detection. This procedure first detects anomalies in the correlation processes on individual edges of the network graph. Second, anomalies across multiple edges are combined and scored to give network-wide situational awareness. The approach is tested in simulation and demonstrated on two real Netflow datasets.
Lau D, Adams N, Girolami M, et al., 2018, The role of statistics in data-centric engineering, Statistics and Probability Letters, Vol: 136, Pages: 58-62, ISSN: 0167-7152
We explore the role of statistics for Big Data analysis arising from the emerging field of Data-Centric Engineering. Using examples related to sensor-instrumented bridges, we highlight a number of issues and challenges. These are broadly categorised as relating to uncertainty, latent-structure modelling, and the synthesis of statistical models and abstract physical models. Keywords: Big Data, Data-Centric Engineering, Digital Twin, Fibre-optic sensor, Instrumented infrastructure, Statistics
Lau D, Butler L, Adams N, et al., 2018, Real-time Statistical Modelling of Data Generated from Self-Sensing Bridges, Proceedings of the Institution of Civil Engineers - Civil Engineering, ISSN: 0965-089X
Hogan J, Cohen EAK, Adams NM, 2017, Devising a fairer method for adjusting target scores in interrupted one-day international cricket, Electronic Journal of Applied Statistical Analysis, Vol: 10, Pages: 745-758, ISSN: 2070-5948
One-day international cricket matches face the problem of weather interruption. In such circumstances, a so-called rain rule is used to decide the outcome. A variety of approaches for constructing such rules has been proposed, with the Duckworth-Lewis method being preferred in the sport. There are a number of issues to consider in reasoning about the effectiveness of a rain rule, notably accuracy (does the rule make the right decision?) and fairness (are both teams treated equally?). We develop an approach that is a hybrid of resource-based and so-called probability-preserving approaches and provide empirical evidence that this hybrid method is superior in terms of fairness while competitive in terms of accuracy.
Bodenham DA, Adams NM, 2017, Continuous monitoring for changepoints in data streams using adaptive estimation, Statistics and Computing, Vol: 27, Pages: 1257-1270, ISSN: 0960-3174
Data streams are characterised by a potentially unending sequence of high-frequency observations which are subject to unknown temporal variation. Many modern streaming applications demand the capability to sequentially detect changes as soon as possible after they occur, while continuing to monitor the stream as it evolves. We refer to this problem as continuous monitoring. Sequential algorithms such as CUSUM, EWMA and their more sophisticated variants usually require a pair of parameters to be selected for practical application. However, the choice of parameter values is often based on the anticipated size of the changes and a given choice is unlikely to be optimal for the multiple change sizes which are likely to occur in a streaming data context. To address this critical issue, we introduce a changepoint detection framework based on adaptive forgetting factors that, instead of multiple control parameters, only requires a single parameter to be selected. Simulated results demonstrate that this framework has utility in a continuous monitoring setting. In particular, it reduces the burden of selecting parameters in advance. Moreover, the methodology is demonstrated on real data arising from Foreign Exchange markets.
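To make concrete the pair of parameters that this abstract says CUSUM-type detectors require, here is a sketch of the textbook one-sided CUSUM for an upward mean shift. It is given only for illustration and is not the adaptive forgetting-factor framework of the paper; the function name `cusum_detect` and the assumption of a known target mean are choices made for the example.

```python
def cusum_detect(stream, target_mean, k, h):
    """One-sided CUSUM for an upward shift in the mean.

    k: allowance subtracted at every step (tuned to the expected shift size).
    h: decision threshold on the cumulative statistic.
    Returns the index of the first alarm, or None if no alarm is raised.
    Assumption: the in-control mean `target_mean` is known in advance.
    """
    s = 0.0
    for t, x in enumerate(stream):
        # Accumulate evidence of an upward shift, never dropping below zero.
        s = max(0.0, s + (x - target_mean - k))
        if s > h:
            return t
    return None
```

Both `k` and `h` have to be chosen before the stream is observed, and a choice that suits one change size may respond slowly to another, which is the difficulty the abstract describes.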
Schon C, Adams NM, Evangelou M, 2017, Clustering and monitoring edge behaviour in enterprise network traffic, IEEE International Conference on Intelligence and Security Informatics, Publisher: IEEE, Pages: 31-36
This paper takes an unsupervised learning approach for monitoring edge activity within an enterprise computer network. Using NetFlow records, features are gathered across the active connections (edges) in 15-minute time windows. Then, edges are grouped into clusters using the k-means algorithm. This process is repeated over contiguous windows. A series of informative indicators are derived by examining the relationship of edges with the observed cluster structure. This leads to an intuitive method for monitoring network behaviour and a temporal description of edge behaviour at global and local levels.
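The per-window clustering step can be sketched with a small implementation of Lloyd's k-means algorithm. This is a minimal sketch under stated assumptions, not the authors' pipeline: each edge is represented by a hypothetical numeric feature tuple, centroids are initialised at the first `k` points, and an empty cluster keeps its previous centroid.

```python
def kmeans(points, k, max_iter=100):
    """Lloyd's k-means on feature vectors given as tuples of numbers.

    Assumptions: centroids start at the first k points; an empty cluster
    keeps its previous centroid. Returns (labels, centroids).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

Running this once per 15-minute window on the edge features and comparing the resulting labels across contiguous windows gives a rough analogue of the repeated clustering the abstract describes.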