Publications

Bodenham D, Adams N, 2023, Dean Bodenham and Niall Adams's contribution to the Discussion of 'the Discussion Meeting on Probabilistic and statistical aspects of machine learning', Journal of the Royal Statistical Society Series B: Statistical Methodology, ISSN: 1369-7412

Journal article

Bodenham D, 2023, eummd: Efficient Univariate Maximum Mean Discrepancy

Computes maximum mean discrepancy two-sample test for univariate data using the Laplacian kernel, as described in Bodenham and Kawahara (2023) <doi:10.1007/s11222-023-10271-x>. The p-value is computed using permutations. Also includes implementation for computing the robust median difference statistic 'Q_n' from Croux and Rousseeuw (1992) <doi:10.1007/978-3-662-26811-7_58> based on Johnson and Mizoguchi (1978) <doi:10.1137/0207013>.
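For orientation, the Q_n scale estimator can be computed directly from its definition as an order statistic of the pairwise absolute differences. The sketch below is a naive quadratic-time Python illustration, not the package's implementation (the description above cites a faster approach based on Johnson and Mizoguchi). The function name `qn_naive` and the `scale` argument are hypothetical; the default constant 2.2219 is assumed to be the usual consistency factor for normally distributed data, and any finite-sample correction is omitted.

```python
from itertools import combinations
from math import comb


def qn_naive(values, scale=2.2219):
    """Naive Q_n: scaled k-th smallest pairwise absolute difference.

    Uses k = C(h, 2) with h = floor(n / 2) + 1. The default scale is an
    assumed consistency constant for Gaussian data; no finite-sample
    correction is applied.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = comb(h, 2)
    # All pairwise absolute differences |x_i - x_j| with i < j, sorted ascending.
    diffs = sorted(abs(a - b) for a, b in combinations(values, 2))
    return scale * diffs[k - 1]
```

Sorting all pairwise differences costs quadratic memory and O(n^2 log n) time, which is why a faster algorithm matters for large samples.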

Software

Bodenham DA, Kawahara Y, 2023, euMMD: efficiently computing the MMD two-sample test statistic for univariate data, Statistics and Computing, Vol: 33, ISSN: 0960-3174

The maximum mean discrepancy (MMD) test is a nonparametric kernelised two-sample test that, when using a characteristic kernel, can detect any distributional change between two samples. However, when the total number of $d$-dimensional observations is $n$, direct computation of the test statistic is $O(dn^2)$. While approximations with lower computational complexity are known, more efficient methods for computing the exact test statistic are unknown. This paper provides an exact method for computing the MMD test statistic for the univariate case in $O(n \log n)$ using the Laplacian kernel. Furthermore, this exact method is extended to an approximate method for $d$-dimensional real-valued data also with complexity log-linear in the number of observations. Experiments show that this approximate method can have good statistical performance when compared to the exact test, particularly in cases where $d > n$.
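To make the quadratic cost concrete, the following Python sketch computes the direct (biased) estimate of the squared MMD for univariate samples with a Laplacian kernel, together with a permutation p-value. It is the baseline computation that the abstract contrasts with, not the paper's log-linear algorithm. The kernel parameterisation exp(-beta * |x - y|), the choice of the biased estimator, and all function names are assumptions made for illustration.

```python
import math
import random


def laplacian_kernel(x, y, beta=1.0):
    """Laplacian kernel exp(-beta * |x - y|); this parameterisation is assumed."""
    return math.exp(-beta * abs(x - y))


def mmd2_biased(xs, ys, beta=1.0):
    """Direct biased estimate of squared MMD; quadratic in the sample sizes."""
    n, m = len(xs), len(ys)
    kxx = sum(laplacian_kernel(a, b, beta) for a in xs for b in xs) / (n * n)
    kyy = sum(laplacian_kernel(a, b, beta) for a in ys for b in ys) / (m * m)
    kxy = sum(laplacian_kernel(a, b, beta) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy


def permutation_pvalue(xs, ys, beta=1.0, num_perm=200, seed=0):
    """Permutation p-value: share of label shuffles with statistic >= observed."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys, beta)
    pooled = list(xs) + list(ys)
    n = len(xs)
    count = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)
        if mmd2_biased(pooled[:n], pooled[n:], beta) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (num_perm + 1)
```

Each permutation repeats the full quadratic computation, so the cost of the test grows quickly with the sample size; this is the setting in which a faster exact statistic is most useful.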

Journal article

Llinares-López F, Papaxanthos L, Roqueiro D, Bodenham D, Borgwardt K et al., 2019, CASMAP: detection of statistically significant combinations of SNPs in association mapping, Bioinformatics, Vol: 35, Pages: 2680-2682, ISSN: 1367-4803

Combinatorial association mapping aims to assess the statistical association of higher-order interactions of genetic markers with a phenotype of interest. This article presents combinatorial association mapping (CASMAP), a software package that leverages recent advances in significant pattern mining to overcome the statistical and computational challenges that have hindered combinatorial association mapping. CASMAP can be used to perform region-based association studies and to detect higher-order epistatic interactions of genetic variants. Most importantly, unlike other existing significant pattern mining-based tools, CASMAP allows for the correction of categorical covariates such as age or gender, making it suitable for genome-wide association studies.

Journal article

Bodenham DA, Adams NM, 2017, Continuous monitoring for changepoints in data streams using adaptive estimation, Statistics and Computing, Vol: 27, Pages: 1257-1270, ISSN: 0960-3174

Data streams are characterised by a potentially unending sequence of high-frequency observations which are subject to unknown temporal variation. Many modern streaming applications demand the capability to sequentially detect changes as soon as possible after they occur, while continuing to monitor the stream as it evolves. We refer to this problem as continuous monitoring. Sequential algorithms such as CUSUM, EWMA and their more sophisticated variants usually require a pair of parameters to be selected for practical application. However, the choice of parameter values is often based on the anticipated size of the changes and a given choice is unlikely to be optimal for the multiple change sizes which are likely to occur in a streaming data context. To address this critical issue, we introduce a changepoint detection framework based on adaptive forgetting factors that, instead of multiple control parameters, only requires a single parameter to be selected. Simulated results demonstrate that this framework has utility in a continuous monitoring setting. In particular, it reduces the burden of selecting parameters in advance. Moreover, the methodology is demonstrated on real data arising from Foreign Exchange markets.
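As background for the idea of a forgetting factor, the Python sketch below maintains an exponentially weighted running mean with a single fixed factor. It only illustrates how older observations are down-weighted in a stream; the adaptive scheme described in the abstract, which tunes the factor automatically, and the changepoint decision rule are not reproduced here. The class name `ForgettingFactorMean` and the parameter `lam` are hypothetical.

```python
class ForgettingFactorMean:
    """Running mean with a fixed forgetting factor lam in [0, 1].

    lam = 1 recovers the ordinary sample mean; lam = 0 keeps only the
    most recent observation. This fixed-factor version is an assumption
    made for illustration, not the adaptive method from the paper.
    """

    def __init__(self, lam=0.95):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lam must lie in [0, 1]")
        self.lam = lam
        self.weighted_sum = 0.0   # discounted sum of observations
        self.total_weight = 0.0   # discounted count of observations

    def update(self, x):
        """Fold in one observation and return the current mean estimate."""
        self.weighted_sum = self.lam * self.weighted_sum + x
        self.total_weight = self.lam * self.total_weight + 1.0
        return self.weighted_sum / self.total_weight
```

Each update uses constant time and memory, which is the property that makes this kind of estimator suitable for monitoring a stream continuously.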

Journal article

Llinares-López F, Papaxanthos L, Bodenham D, Roqueiro D, Borgwardt K et al., 2017, Genome-wide genetic heterogeneity discovery with categorical covariates, Bioinformatics, Vol: 33, Pages: 1820-1828, ISSN: 1367-4803

Motivation: Genetic heterogeneity is the phenomenon that distinct genetic variants may give rise to the same phenotype. The recently introduced algorithm Fast Automatic Interval Search (FAIS) enables the genome-wide search of candidate regions for genetic heterogeneity in the form of any contiguous sequence of variants, and achieves high computational efficiency and statistical power. Although FAIS can test all possible genomic regions for association with a phenotype, a key limitation is its inability to correct for confounders such as gender or population structure, which may lead to numerous false-positive associations. Results: We propose FastCMH, a method that overcomes this problem by properly accounting for categorical confounders, while still retaining statistical power and computational efficiency. Experiments comparing FastCMH with FAIS and multiple kinds of burden tests on simulated data, as well as on human and Arabidopsis samples, demonstrate that FastCMH can drastically reduce genomic inflation and discover associations that are missed by standard burden tests. Availability and Implementation: An R package fastcmh is available on CRAN and the source code can be found at: https://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/fastcmh.html

Journal article

Llinares-López F, Grimm DG, Bodenham DA, Gieraths U, Sugiyama M, Rowan B, Borgwardt K et al., 2015, Genome-wide detection of intervals of genetic heterogeneity associated with complex traits, Publisher: Oxford University Press (OUP), Pages: i240-i249, ISSN: 1367-4803

Motivation: Genetic heterogeneity, the fact that several sequence variants give rise to the same phenotype, is a phenomenon that is of the utmost interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: (i) they require the definition of an exact interval in the genome that is to be tested for genetic heterogeneity, potentially missing intervals of high relevance, or (ii) they suffer from an enormous multiple hypothesis testing problem due to the large number of potential candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals. Results: Here, we present an approach that overcomes both problems: it allows one to automatically find all contiguous sequences of single nucleotide polymorphisms in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana genome-wide association study data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping. Conclusions: Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes. Availability and implementation: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html. Contact: felipe.llinares@bsse.ethz.ch

Conference paper

Adams NM, Bodenham DA, 2015, A comparison of efficient approximations for a weighted sum of chi-squared random variables, Statistics and Computing, Vol: 26, Pages: 917-928, ISSN: 0960-3174

In many applications, the cumulative distribution function (cdf) $F_{Q_N}$ of a positively weighted sum $Q_N$ of $N$ i.i.d. chi-squared random variables is required. Although there is no known closed-form solution for $F_{Q_N}$, there are many good approximations. When computational efficiency is not an issue, Imhof's method provides a good solution. However, when both the accuracy of the approximation and the speed of its computation are a concern, there is no clear preferred choice. Previous comparisons between approximate methods could be considered insufficient. Furthermore, in streaming data applications where the computation needs to be both sequential and efficient, only a few of the available methods may be suitable. Streaming data problems are becoming ubiquitous and provide the motivation for this paper. We develop a framework to enable a much more extensive comparison between approximate methods for computing the cdf of weighted sums of an arbitrary random variable. Utilising this framework, a new and comprehensive analysis of four efficient approximate methods for computing $F_{Q_N}$ is performed. This analysis procedure is much more thorough and statistically valid than previous approaches described in the literature. A surprising result of this analysis is that the accuracy of these approximate methods increases with $N$.
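As a point of reference for the quantity being approximated, the Python sketch below estimates the cdf of a positively weighted sum of independent chi-squared variables by plain Monte Carlo, and computes the exact mean and variance that moment-matching approximations typically start from. It is not one of the four methods compared in the paper. The function names, the default of one degree of freedom per term, and the sample size are assumptions made for illustration.

```python
import random


def weighted_chisq_moments(weights, df=1):
    """Exact mean and variance of sum_i w_i * X_i with X_i ~ chi-squared(df)."""
    mean = df * sum(weights)
    variance = 2 * df * sum(w * w for w in weights)
    return mean, variance


def weighted_chisq_cdf_mc(x, weights, df=1, num_samples=20000, seed=0):
    """Monte Carlo estimate of P(Q <= x) for Q = sum_i w_i * X_i.

    Each chi-squared(df) draw is generated as Gamma(shape=df/2, scale=2).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        q = sum(w * rng.gammavariate(df / 2.0, 2.0) for w in weights)
        if q <= x:
            hits += 1
    return hits / num_samples
```

The Monte Carlo error shrinks only with the square root of the number of samples, so this approach is far too slow for sequential use; that gap is what fast deterministic approximations are meant to close.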

Journal article

Bodenham DA, Adams NM, 2014, Adaptive change detection for relay-like behaviour, IEEE Joint Intelligence and Security Informatics Conference (JISIC 2014), Publisher: IEEE, Pages: 252-255

Citations: 1

Conference paper

Bodenham DA, Adams NM, 2013, Continuous monitoring of a computer network using multivariate adaptive estimation, IEEE 13th International Conference on Data Mining (ICDM), Publisher: IEEE, Pages: 311-318, ISSN: 2375-9232

Citations: 10

Conference paper

Dr Dean Bodenham