Newcombe PJ, Ali HR, Blows FM, et al., 2017, Weibull regression with Bayesian variable selection to identify prognostic tumour markers of breast cancer survival, STATISTICAL METHODS IN MEDICAL RESEARCH, Vol: 26, Pages: 414-436, ISSN: 0962-2802
As data-rich medical datasets are becoming routinely collected, there is a growing demand for regression methodology that facilitates variable selection over a large number of predictors. Bayesian variable selection algorithms offer an attractive solution, whereby a sparsity inducing prior allows inclusion of sets of predictors simultaneously, leading to adjusted effect estimates and inference of which covariates are most important. We present a new implementation of Bayesian variable selection, based on a Reversible Jump MCMC algorithm, for survival analysis under the Weibull regression model. A realistic simulation study is presented comparing against an alternative LASSO-based variable selection strategy in datasets of up to 20,000 covariates. Across half the scenarios, our new method achieved identical sensitivity and specificity to the LASSO strategy, and a marginal improvement otherwise. Runtimes were comparable for both approaches, taking approximately a day for 20,000 covariates. Subsequently, we present a real data application in which 119 protein-based markers are explored for association with breast cancer survival in a case cohort of 2287 patients with oestrogen receptor-positive disease. Evidence was found for three independent prognostic tumour markers of survival, one of which is novel. Our new approach demonstrated the best specificity.
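The Weibull model underlying this approach has a closed-form log-likelihood for right-censored survival data. As an illustrative sketch only (not the authors' Reversible Jump MCMC implementation; function and parameter names here are hypothetical), the proportional-hazards parameterisation can be written as:

```python
import math

def weibull_loglik(times, events, X, beta, shape, scale):
    # Log-likelihood of a Weibull proportional-hazards model with
    # hazard h(t|x) = shape * scale * t**(shape-1) * exp(x'beta) and
    # cumulative hazard H(t|x) = scale * t**shape * exp(x'beta).
    # events[i] = 1 for an observed death, 0 for right censoring.
    ll = 0.0
    for t, d, x in zip(times, events, X):
        lp = sum(b * xi for b, xi in zip(beta, x))  # linear predictor
        log_h = (math.log(shape) + math.log(scale)
                 + (shape - 1.0) * math.log(t) + lp)
        H = scale * t ** shape * math.exp(lp)
        ll += d * log_h - H  # censored subjects contribute only -H
    return ll
```

In a Bayesian variable selection scheme, a log-likelihood of this form would be evaluated for each visited subset of covariates, with `beta` restricted to the currently included predictors.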
Greene D, Richardson S, Turro E, et al., 2016, Phenotype Similarity Regression for Identifying the Genetic Determinants of Rare Diseases, AMERICAN JOURNAL OF HUMAN GENETICS, Vol: 98, Pages: 490-499, ISSN: 0002-9297
Rare genetic disorders, which can now be studied systematically with affordable genome sequencing, are often caused by high-penetrance rare variants. Such disorders are often heterogeneous and characterized by abnormalities spanning multiple organ systems ascertained with variable clinical precision. Existing methods for identifying genes with variants responsible for rare diseases summarize phenotypes with unstructured binary or quantitative variables. The Human Phenotype Ontology (HPO) allows composite phenotypes to be represented systematically but association methods accounting for the ontological relationship between HPO terms do not exist. We present a Bayesian method to model the association between an HPO-coded patient phenotype and genotype. Our method estimates the probability of an association together with an HPO-coded phenotype characteristic of the disease. We thus formalize a clinical approach to phenotyping that is lacking in standard regression techniques for rare disease research. We demonstrate the power of our method by uncovering a number of true associations in a large collection of genome-sequenced and HPO-coded cases with rare diseases.
Lewin A, Saadi H, Peters JE, et al., 2016, MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues, BIOINFORMATICS, Vol: 32, Pages: 523-532, ISSN: 1367-4803
MOTIVATION: Analysing the joint association between a large set of responses and predictors is a fundamental statistical task in integrative genomics, exemplified by numerous expression Quantitative Trait Loci (eQTL) studies. Of particular interest are the so-called 'hotspots', important genetic variants that regulate the expression of many genes. Recently, attention has focussed on whether eQTLs are common to several tissues, cell-types or, more generally, conditions or whether they are specific to a particular condition. RESULTS: We have implemented MT-HESS, a Bayesian hierarchical model that analyses the association between a large set of predictors, e.g. SNPs, and many responses, e.g. gene expression, in multiple tissues, cells or conditions. Our Bayesian sparse regression algorithm goes beyond 'one-at-a-time' association tests between SNPs and responses and uses a fully multivariate model search across all linear combinations of SNPs, coupled with a model of the correlation between condition/tissue-specific responses. In addition, we use a hierarchical structure to leverage shared information across different genes, thus improving the detection of hotspots. We show the increase of power resulting from our new approach in an extensive simulation study. Our analysis of two case studies highlights new hotspots that would remain undetected by standard approaches and shows how greater prediction power can be achieved when several tissues are jointly considered. AVAILABILITY AND IMPLEMENTATION: C++ source code and documentation including compilation instructions are available under GNU licence at http://www.mrc-bsu.cam.ac.uk/software/.
Mattei F, Liverani S, Guida F, et al., 2016, Multidimensional analysis of the effect of occupational exposure to organic solvents on lung cancer risk: the ICARE study, OCCUPATIONAL AND ENVIRONMENTAL MEDICINE, Vol: 73, Pages: 368-377, ISSN: 1351-0711
BACKGROUND: The association between lung cancer and occupational exposure to organic solvents remains debated. Since different solvents are often used simultaneously, it is difficult to assess the role of individual substances. OBJECTIVES: The present study is focused on an in-depth investigation of the potential association between lung cancer risk and occupational exposure to a large group of organic solvents, taking into account the well-known risk factors for lung cancer, tobacco smoking and occupational exposure to asbestos. METHODS: We analysed data from the Investigation of occupational and environmental causes of respiratory cancers (ICARE) study, a large French population-based case-control study, set up between 2001 and 2007. A total of 2276 male cases and 2780 male controls were interviewed, and lifelong occupational history was collected. In order to overcome the analytical difficulties created by multiple correlated exposures, we carried out a novel type of analysis based on Bayesian profile regression. RESULTS: After analysis with conventional logistic regression methods, none of the 11 solvents examined were associated with lung cancer risk. Through a profile regression approach, we did not observe any significant association between solvent exposure and lung cancer. However, we identified clusters at high risk that are related to occupations known to be at risk of developing lung cancer, such as painters. CONCLUSIONS: Organic solvents do not appear to be substantial contributors to the occupational risk of lung cancer for the occupations known to be at risk.
Papathomas M, Richardson S, et al., 2016, Exploring dependence between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms, JOURNAL OF STATISTICAL PLANNING AND INFERENCE, Vol: 173, Pages: 47-63, ISSN: 0378-3758
This manuscript is concerned with relating two approaches that can be used to explore complex dependence structures between categorical variables, namely Bayesian partitioning of the covariate space incorporating a variable selection procedure that highlights the covariates that drive the clustering, and log-linear modelling with interaction terms. We derive theoretical results on this relation and discuss if they can be employed to assist log-linear model determination, demonstrating advantages and limitations with simulated and real data sets. The main advantage concerns sparse contingency tables. Inferences from clustering can potentially reduce the number of covariates considered and, subsequently, the number of competing log-linear models, making the exploration of the model space feasible. Variable selection within clustering can inform on marginal independence in general, thus allowing for a more efficient exploration of the log-linear model space. However, we show that the clustering structure is not informative on the existence of interactions in a consistent manner. This work is of interest to those who utilize log-linear models, as well as practitioners such as epidemiologists that use clustering models to reduce the dimensionality in the data and to reveal interesting patterns on how covariates combine.
Geneletti S, O'Keeffe AG, Sharples LD, et al., 2015, Bayesian regression discontinuity designs: incorporating clinical knowledge in the causal analysis of primary care data, STATISTICS IN MEDICINE, Vol: 34, Pages: 2334-2352, ISSN: 0277-6715
The regression discontinuity (RD) design is a quasi-experimental design that estimates the causal effects of a treatment by exploiting naturally occurring treatment rules. It can be applied in any context where a particular treatment or intervention is administered according to a pre-specified rule linked to a continuous variable. Such thresholds are common in primary care drug prescription where the RD design can be used to estimate the causal effect of medication in the general population. Such results can then be contrasted to those obtained from randomised controlled trials (RCTs) and inform prescription policy and guidelines based on a more realistic and less expensive context. In this paper, we focus on statins, a class of cholesterol-lowering drugs, however, the methodology can be applied to many other drugs provided these are prescribed in accordance to pre-determined guidelines. Current guidelines in the UK state that statins should be prescribed to patients with 10-year cardiovascular disease risk scores in excess of 20%. If we consider patients whose risk scores are close to the 20% risk score threshold, we find that there is an element of random variation in both the risk score itself and its measurement. We can therefore consider the threshold as a randomising device that assigns statin prescription to individuals just above the threshold and withholds it from those just below. Thus, we are effectively replicating the conditions of an RCT in the area around the threshold, removing or at least mitigating confounding. We frame the RD design in the language of conditional independence, which clarifies the assumptions necessary to apply an RD design to data, and which makes the links with instrumental variables clear. We also have context-specific knowledge about the expected sizes of the effects of statin prescription and are thus able to incorporate this into Bayesian models by formulating informative priors on our causal parameters.
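The idea of treating the 20% risk-score threshold as a local randomising device can be illustrated with the simplest (local-constant) RD estimator: compare mean outcomes just above and just below the cut-off within a bandwidth. This is an illustrative sketch only, not the Bayesian analysis with informative priors developed in the paper, and all names are hypothetical:

```python
def rd_estimate(scores, outcomes, threshold, bandwidth):
    # Local-constant regression discontinuity estimate: the difference
    # in mean outcome between units just above and just below the
    # treatment threshold, restricted to a +/- bandwidth window.
    above = [y for s, y in zip(scores, outcomes)
             if threshold <= s < threshold + bandwidth]
    below = [y for s, y in zip(scores, outcomes)
             if threshold - bandwidth < s < threshold]
    return sum(above) / len(above) - sum(below) / len(below)
```

In the statins example, `scores` would be 10-year cardiovascular risk scores, `threshold` would be 20, and `outcomes` a post-prescription measure such as LDL cholesterol; in practice a local-linear fit and a data-driven bandwidth would be preferred.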
Hastie DI, Liverani S, Richardson S, et al., 2015, Sampling from Dirichlet process mixture models with unknown concentration parameter: mixing issues in large data implementations, STATISTICS AND COMPUTING, Vol: 25, Pages: 1023-1037, ISSN: 0960-3174
We consider the question of Markov chain Monte Carlo sampling from a general stick-breaking Dirichlet process mixture model, with concentration parameter α. This paper introduces a Gibbs sampling algorithm that combines the slice sampling approach of Walker (Communications in Statistics - Simulation and Computation 36:45-54, 2007) and the retrospective sampling approach of Papaspiliopoulos and Roberts (Biometrika 95(1):169-186, 2008). Our general algorithm is implemented as efficient open source C++ software, available as an R package, and is based on a blocking strategy similar to that suggested by Papaspiliopoulos (A note on posterior sampling from Dirichlet mixture models, 2008) and implemented by Yau et al. (Journal of the Royal Statistical Society, Series B (Statistical Methodology) 73:37-57, 2011). We discuss the difficulties of achieving good mixing in MCMC samplers of this nature in large data sets and investigate sensitivity to initialisation. We additionally consider the challenges when an additional layer of hierarchy is added such that joint inference is to be made on α. We introduce a new label-switching move and compute the marginal partition posterior to help to surmount these difficulties. Our work is illustrated using a profile regression (Molitor et al. Biostatistics 11(3):484-498, 2010) application, where we demonstrate good mixing behaviour for both synthetic and real examples.
Liverani S, Hastie DI, Azizi L, et al., 2015, PReMiuM: An R Package for Profile Regression Mixture Models Using Dirichlet Processes, JOURNAL OF STATISTICAL SOFTWARE, Vol: 64, Pages: 1-30, ISSN: 1548-7660
PReMiuM is a recently developed R package for Bayesian clustering using a Dirichlet process mixture model. This model is an alternative to regression models, non-parametrically linking a response vector to covariate data through cluster membership (Molitor, Papathomas, Jerrett, and Richardson 2010). The package allows binary, categorical, count and continuous response, as well as continuous and discrete covariates. Additionally, predictions may be made for the response, and missing values for the covariates are handled. Several samplers and label switching moves are implemented along with diagnostic tools to assess convergence. A number of R functions for post-processing of the output are also provided. In addition to fitting mixtures, it may additionally be of interest to determine which covariates actively drive the mixture components. This is implemented in the package as variable selection.
Vallejos CA, Marioni JC, Richardson S, et al., 2015, BASiCS: Bayesian Analysis of Single-Cell Sequencing Data, PLOS COMPUTATIONAL BIOLOGY, Vol: 11, Pages: e1004333-e1004333, ISSN: 1553-734X
Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell's lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated with high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable supports the efficacy of our approach.
Wallace C, Cutler AJ, Pontikos N, et al., 2015, Dissection of a Complex Disease Susceptibility Region Using a Bayesian Stochastic Search Approach to Fine Mapping, PLOS GENETICS, Vol: 11, Pages: e1005272-e1005272, ISSN: 1553-7404
Identification of candidate causal variants in regions associated with risk of common diseases is complicated by linkage disequilibrium (LD) and multiple association signals. Nonetheless, accurate maps of these variants are needed, both to fully exploit detailed cell specific chromatin annotation data to highlight disease causal mechanisms and cells, and for design of the functional studies that will ultimately be required to confirm causal mechanisms. We adapted a Bayesian evolutionary stochastic search algorithm to the fine mapping problem, and demonstrated its improved performance over conventional stepwise and regularised regression through simulation studies. We then applied it to fine map the established multiple sclerosis (MS) and type 1 diabetes (T1D) associations in the IL-2RA (CD25) gene region. For T1D, both stepwise and stochastic search approaches identified four T1D association signals, with the major effect tagged by the single nucleotide polymorphism, rs12722496. In contrast, for MS, the stochastic search found two distinct competing models: a single candidate causal variant, tagged by rs2104286 and reported previously using stepwise analysis; and a more complex model with two association signals, one of which was tagged by the major T1D associated rs12722496 and the other by rs56382813. There is low to moderate LD between rs2104286 and both rs12722496 and rs56382813 (r² ≃ 0.3) and our two SNP model could not be recovered through a forward stepwise search after conditioning on rs2104286. Both signals in the two variant model for MS affect CD25 expression on distinct subpopulations of CD4+ T cells, which are key cells in the autoimmune process. The results support a shared causal variant for T1D and MS. Our study illustrates the benefit of using a purposely designed model search strategy for fine mapping and the advantage of combining disease and protein expression data.
Chadeau-Hyam M, Tubert-Bitter P, Guihenneuc-Jouyaux C, et al., 2014, Dynamics of the Risk of Smoking-Induced Lung Cancer: A Compartmental Hidden Markov Model for Longitudinal Analysis, EPIDEMIOLOGY, Vol: 25, Pages: 28-34, ISSN: 1044-3983
BACKGROUND: To account for the dynamic aspects of carcinogenesis, we propose a compartmental hidden Markov model in which each person is healthy, asymptomatically affected, diagnosed, or deceased. Our model is illustrated using the example of smoking-induced lung cancer. METHODS: The model was fitted on a case-control study nested in the European Prospective Investigation into Cancer and Nutrition study, including 757 incident cases and 1524 matched controls. Estimation was done through a Markov Chain Monte Carlo algorithm, and simulations based on the posterior estimates of the parameters were used to provide measures of model fit. We performed sensitivity analyses to assess robustness of our findings. RESULTS: After adjusting for its impact on exposure duration, age was not found to independently drive the risk of lung carcinogenesis, whereas age at starting smoking in ever-smokers and time since cessation in former smokers were found to be influential. Our data did not support an age-dependent time to diagnosis. The estimated time between onset of malignancy and clinical diagnosis ranged from 2 to 4 years. Our approach yielded good performance in reconstructing individual trajectories in both cases (sensitivity >90%) and controls (sensitivity >80%). CONCLUSION: Our compartmental model enabled us to identify time-varying predictors of risk and provided us with insights into the dynamics of smoking-induced lung carcinogenesis. Its flexible and general formulation enables the future incorporation of disease states, as measured by intermediate markers, into the modeling of the natural history of cancer, suggesting a large range of applications in chronic disease epidemiology.
Chen L, Kostadima M, Martens JHA, et al., 2014, Transcriptional diversity during lineage commitment of human blood progenitors, SCIENCE, Vol: 345, Pages: 1580-+, ISSN: 0036-8075
Blood cells derive from hematopoietic stem cells through stepwise fating events. To characterize gene expression programs driving lineage choice, we sequenced RNA from eight primary human hematopoietic progenitor populations representing the major myeloid commitment stages and the main lymphoid stage. We identified extensive cell type-specific expression changes: 6711 genes and 10,724 transcripts, enriched in non-protein-coding elements at early stages of differentiation. In addition, we found 7881 novel splice junctions and 2301 differentially used alternative splicing events, enriched in genes involved in regulatory processes. We demonstrated experimentally cell-specific isoform usage, identifying nuclear factor I/B (NFIB) as a regulator of megakaryocyte maturation-the platelet precursor. Our data highlight the complexity of fating events in closely related progenitor populations, the understanding of which is essential for the advancement of transplantation and regenerative medicine.
Li G, Haining R, Richardson S, et al., 2014, Space-time variability in burglary risk: A Bayesian spatio-temporal modelling approach, SPATIAL STATISTICS, Vol: 9, Pages: 180-191, ISSN: 2211-6753
Molitor J, Brown IJ, Chan Q, et al., 2014, Blood Pressure Differences Associated With Optimal Macronutrient Intake Trial for Heart Health (OMNIHEART)-Like Diet Compared With a Typical American Diet, HYPERTENSION, Vol: 64, Pages: 1198-U86, ISSN: 0194-911X
The Dietary Approaches to Stop Hypertension-Sodium (DASH-Sodium) trial demonstrated beneficial effects on blood pressure (BP) of the DASH diet with lower sodium intake when compared with typical American diet. The subsequent Optimal Macronutrient Intake Trial for Heart Health (OMNIHEART) trial reported additional BP benefits from replacing carbohydrate in the DASH diet with either protein or monounsaturated fats. The primary aim of this study is to assess possible BP benefits of an OMNIHEART-like diet in free-living Americans using cross-sectional US population data of the International Study of Macronutrients, Micronutrients and Blood Pressure (INTERMAP) study. The INTERMAP data include four 24-hour dietary recalls, 2 timed 24-hour urine collections, 8 BP readings for 2195 individuals aged 40 to 59 years from 8 US INTERMAP population samples. Analyses are conducted using 2 approaches: (1) regression of BP on a linear OMNIHEART nutrient score calculated for each individual and (2) a Bayesian approach comparing estimated BP levels of an OMNIHEART-like nutrient profile with a typical American nutrient profile. After adjustment for potential confounders, an OMNIHEART score higher by 1 point was associated with systolic/diastolic BP differences of -1.0/-0.5 mm Hg (both P<0.001). Mean systolic/diastolic BPs were 111.3/68.4 and 115.2/70.6 mm Hg for Bayesian OMNIHEART and Control profiles, respectively, after controlling for possible confounders, with BP differences of -3.9/-2.2 mm Hg, P(difference≤0)=0.98/0.96. Findings were comparable for men and women, for nonhypertensive participants, and with adjustment for antihypertensive treatment. Our findings from data on US population samples indicate broad generalizability of OMNIHEART results beyond the trial setting and support recommendations for an OMNIHEART-style diet for prevention/control of population-wide adverse BP levels.
Pettit J-B, Tomer R, Achim K, et al., 2014, Identifying Cell Types from Spatially Referenced Single-Cell Expression Datasets, PLOS COMPUTATIONAL BIOLOGY, Vol: 10, Pages: e1003824-e1003824, ISSN: 1553-734X
Complex tissues, such as the brain, are composed of multiple different cell types, each of which have distinct and important roles, for example in neural function. Moreover, it has recently been appreciated that the cells that make up these sub-cell types themselves harbour significant cell-to-cell heterogeneity, in particular at the level of gene expression. The ability to study this heterogeneity has been revolutionised by advances in experimental technology, such as Wholemount in Situ Hybridizations (WiSH) and single-cell RNA-sequencing. Consequently, it is now possible to study gene expression levels in thousands of cells from the same tissue type. After generating such data one of the key goals is to cluster the cells into groups that correspond to both known and putatively novel cell types. Whilst many clustering algorithms exist, they are typically unable to incorporate information about the spatial dependence between cells within the tissue under study. When such information exists it provides important insights that should be directly included in the clustering scheme. To this end we have developed a clustering method that uses a Hidden Markov Random Field (HMRF) model to exploit both quantitative measures of expression and spatial information. To accurately reflect the underlying biology, we extend current HMRF approaches by allowing the degree of spatial coherency to differ between clusters. We demonstrate the utility of our method using simulated data before applying it to cluster single cell gene expression data generated by applying WiSH to study expression patterns in the brain of the marine annelid Platynereis dumerilii. Our approach allows known cell types to be identified as well as revealing new, previously unexplored cell types within the brain of this important model system.
Zucknick M, Richardson S, 2014, MCMC algorithms for Bayesian variable selection in the logistic regression model for large-scale genomic applications
In large-scale genomic applications vast numbers of molecular features are scanned in order to find a small number of candidates which are linked to a particular disease or phenotype. This is a variable selection problem in the "large p, small n" paradigm where many more variables than samples are available. Additionally, a complex dependence structure is often observed among the markers/genes due to their joint involvement in biological processes and pathways. Bayesian variable selection methods that introduce sparseness through additional priors on the model size are well suited to the problem. However, the model space is very large and standard Markov chain Monte Carlo (MCMC) algorithms such as a Gibbs sampler sweeping over all p variables in each iteration are often computationally infeasible. We propose to employ the dependence structure in the data to decide which variables should always be updated together and which are nearly conditionally independent and hence do not need to be considered together. Here, we focus on binary classification applications. We follow the implementation of the Bayesian probit regression model by Albert and Chib (1993) and the Bayesian logistic regression model by Holmes and Held (2006), which both lead to marginal Gaussian distributions. We investigate several MCMC samplers using the dependence structure in different ways. The mixing and convergence performances of the resulting Markov chains are evaluated and compared to standard samplers in two simulation studies and in an application to a real gene expression data set.
Bottolo L, Chadeau-Hyam M, Hastie DI, et al., 2013, GUESS-ing Polygenic Associations with Multiple Phenotypes Using a GPU-Based Evolutionary Stochastic Search Algorithm, PLOS GENETICS, Vol: 9, Pages: e1003657-e1003657, ISSN: 1553-7404
Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex Linkage Disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants.
Geneletti S, Best N, Toledano MB, et al., 2013, Uncovering selection bias in case-control studies using Bayesian post-stratification, STATISTICS IN MEDICINE, Vol: 32, Pages: 2555-2570, ISSN: 0277-6715
Case-control studies are particularly prone to selection bias, which can affect odds ratio estimation. Approaches to discovering and adjusting for selection bias have been proposed in the literature using graphical and heuristic tools as well as more complex statistical methods. The approach we propose is based on a survey-weighting method termed Bayesian post-stratification and follows from the conditional independences that characterise selection bias. We use our approach to perform a selection bias sensitivity analysis by using ancillary data sources that describe the target case-control population to re-weight the odds ratio estimates obtained from the study. The method is applied to two case-control studies, the first investigating the association between exposure to electromagnetic fields and acute lymphoblastic leukaemia in children and the second investigating the association between maternal occupational exposure to hairspray and a congenital anomaly in male babies called hypospadias. In both case-control studies, our method showed that the odds ratios were only moderately sensitive to selection bias.
Hansell AL, Blangiardo M, Fortunato L, et al., 2013, Aircraft noise and cardiovascular disease near Heathrow airport in London: small area study, BMJ-BRITISH MEDICAL JOURNAL, Vol: 347, Pages: f5432-f5432, ISSN: 1756-1833
OBJECTIVE: To investigate the association of aircraft noise with risk of stroke, coronary heart disease, and cardiovascular disease in the general population. DESIGN: Small area study. SETTING: 12 London boroughs and nine districts west of London exposed to aircraft noise related to Heathrow airport in London. POPULATION: About 3.6 million residents living near Heathrow airport. Risks for hospital admissions were assessed in 12 110 census output areas (average population about 300 inhabitants) and risks for mortality in 2378 super output areas (about 1500 inhabitants). MAIN OUTCOME MEASURES: Risk of hospital admissions for, and mortality from, stroke, coronary heart disease, and cardiovascular disease, 2001-05. RESULTS: Hospital admissions showed statistically significant linear trends (P<0.001 to P<0.05) of increasing risk with higher levels of both daytime (average A weighted equivalent noise 7 am to 11 pm, LAeq,16h) and night time (11 pm to 7 am, Lnight) aircraft noise. When areas experiencing the highest levels of daytime aircraft noise were compared with those experiencing the lowest levels (>63 dB v ≤ 51 dB), the relative risk of hospital admissions for stroke was 1.24 (95% confidence interval 1.08 to 1.43), for coronary heart disease was 1.21 (1.12 to 1.31), and for cardiovascular disease was 1.14 (1.08 to 1.20) adjusted for age, sex, ethnicity, deprivation, and a smoking proxy (lung cancer mortality) using a Poisson regression model including a random effect term to account for residual heterogeneity. Corresponding relative risks for mortality were of similar magnitude, although with wider confidence limits. Admissions for coronary heart disease and cardiovascular disease were particularly affected by adjustment for South Asian ethnicity, which needs to be considered in interpretation. All results were robust to adjustment for particulate matter (PM10) air pollution, and road traffic noise, possible for London boroughs (population about 2.6 million).
Hastie DI, Liverani S, Azizi L, et al., 2013, A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer, BMC MEDICAL RESEARCH METHODOLOGY, Vol: 13, ISSN: 1471-2288
BACKGROUND: A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population-based case-control study. METHODS: Our study includes 4658 males (1995 cases, 2663 controls) with full smoking history (intensity, duration, time since cessation, pack-years) from the ICARE multi-centre study conducted from 2001-2007. We extend Bayesian clustering techniques to explore predictive risk surfaces for covariate profiles of interest. RESULTS: We were able to partition the population into 12 clusters with different smoking profiles and lung cancer risk. Our results confirm that when compared to intensity, duration is the predominant driver of risk. On the other hand, using pack-years of cigarette smoking as a single summary leads to a considerable loss of information. CONCLUSIONS: Our method estimates the disease risk associated with a specific exposure profile by robustly accounting for the different dimensions of exposure, and will be helpful more generally in giving further insight into the effect of exposures that are accumulated through different time patterns.
Kirk P, Witkover A, Bangham CRM, et al., 2013, Balancing the Robustness and Predictive Performance of Biomarkers, JOURNAL OF COMPUTATIONAL BIOLOGY, Vol: 20, Pages: 979-989, ISSN: 1066-5277
Recent studies have highlighted the importance of assessing the robustness of putative biomarkers identified from experimental data. This has given rise to the concept of stable biomarkers, which are ones that are consistently identified regardless of small perturbations to the data. Since stability is not by itself a useful objective, we present a number of strategies that combine assessments of stability and predictive performance in order to identify biomarkers that are both robust and diagnostically useful. Moreover, by wrapping these strategies around logistic regression classifiers regularized by the elastic net penalty, we are able to assess the effects of correlations between biomarkers upon their perceived stability. We use a synthetic example to illustrate the properties of our proposed strategies. In this example, we find that: (i) assessments of stability can help to reduce the number of false-positive biomarkers, although potentially at the cost of missing some true positives; (ii) combining assessments of stability with assessments of predictive performance can improve the true positive rate; and (iii) correlations between biomarkers can have adverse effects on their stability and hence must be carefully taken into account when undertaking biomarker discovery. We then apply our strategies in a proteomics context to identify a number of robust candidate biomarkers for the human disease HTLV1-associated myelopathy/tropical spastic paraparesis (HAM/TSP).
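The abstract's first strategy, assessing selection stability by repeatedly perturbing the data, can be illustrated with a minimal numpy sketch. This is not the paper's code: a simple correlation-based top-k selector stands in for the elastic-net-regularized logistic regression, and all names and thresholds are illustrative.

```python
import numpy as np

def select_top_k(X, y, k):
    """Toy base selector: top-k features by |correlation| with the outcome."""
    scores = np.abs((X - X.mean(0)).T @ (y - y.mean())) / (X.std(0) * y.std() * len(y))
    return set(np.argsort(scores)[-k:])

def selection_frequencies(X, y, k=5, n_boot=50, rng=None):
    """Stability via subsampling: how often is each feature selected
    across random half-samples of the data?"""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.choice(n, size=n // 2, replace=False)
        for j in select_top_k(X[idx], y[idx], k):
            counts[j] += 1
    return counts / n_boot

# Synthetic example: 3 informative features out of 30
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, 0] + X[:, 1] - X[:, 2] + 0.5 * rng.standard_normal(200)
freq = selection_frequencies(X, y, k=5, rng=1)
stable = np.where(freq >= 0.8)[0]   # "stable" candidate biomarkers
print(stable)
```

Features selected in at least 80% of subsamples are declared stable; the paper's strategies then intersect such stability assessments with cross-validated predictive performance.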
Li G, Haining R, Richardson S, et al., 2013, Evaluating the No Cold Calling zones in Peterborough, England: application of a novel statistical method for evaluating neighbourhood policing policies, ENVIRONMENT AND PLANNING A, Vol: 45, Pages: 2012-2026, ISSN: 0308-518X
Some police forces in the UK institute “No Cold Calling” (NCC) zones to reduce cold calling (unsolicited visits to sell products or services), which is often associated with rogue trading and distraction burglary. This paper evaluates the NCC targeted areas chosen in 2005-6 in Peterborough and reports whether they experienced a measurable impact on their burglary rates in the period up to 2008. Time series data for burglary at the Census Output Area level are analyzed using a Bayesian hierarchical modelling approach to address issues of data sparsity and lack of randomized allocation of areas to treatment groups that are often encountered in small area quantitative policy evaluation. To ensure internal validity, we employ the interrupted time series quasi-experimental design embedded within a matched case-control framework. Results reveal a positive impact of NCC zones on reducing burglary rates in the targeted areas compared to the control areas.
Ancelet S, Abellan JJ, Vilas VJDR, et al., 2012, Bayesian shared spatial-component models to combine and borrow strength across sparse disease surveillance sources, BIOMETRICAL JOURNAL, Vol: 54, Pages: 385-404, ISSN: 0323-3847
When analyzing the geographical variations of disease risk, one common problem is data sparseness. In such a setting, we investigate the possibility of using Bayesian shared spatial component models to strengthen inference and correct for any spatially structured sources of bias, when distinct data sources on one or more related diseases are available. Specifically, we apply our models to analyze the spatial variation of risk of two forms of scrapie infection affecting sheep in Wales (UK) using three surveillance sources on each disease. We first model each disease separately from the combined data sources and then extend our approach to jointly analyze diseases and data sources. We assess the predictive performances of several nested joint models through pseudo cross-validatory predictive model checks.
Astle W, De Iorio M, Richardson S, et al., 2012, A Bayesian Model of NMR Spectra for the Deconvolution and Quantification of Metabolites in Complex Biological Mixtures, JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, Vol: 107, Pages: 1259-1271, ISSN: 0162-1459
Nuclear magnetic resonance (NMR) spectra are widely used in metabolomics to obtain profiles of metabolites dissolved in biofluids such as cell supernatants. Methods for estimating metabolite concentrations from these spectra are presently confined to manual peak fitting and to binning procedures for integrating resonance peaks. Extensive information on the patterns of spectral resonance generated by human metabolites is now available in online databases. By incorporating this information into a Bayesian model, we can deconvolve resonance peaks from a spectrum and obtain explicit concentration estimates for the corresponding metabolites. Spectral resonances that cannot be deconvolved in this way may also be of scientific interest; so, we model them jointly using wavelets. We describe a Markov chain Monte Carlo algorithm that allows us to sample from the joint posterior distribution of the model parameters, using specifically designed block updates to improve mixing. The strong prior on resonance patterns allows the algorithm to identify peaks corresponding to particular metabolites automatically, eliminating the need for manual peak assignment. We assess our method for peak alignment and concentration estimation. Except in cases when the target resonance signal is very weak, alignment is unbiased and precise. We compare the Bayesian concentration estimates with those obtained from a conventional numerical integration method and find that our point estimates have six-fold lower mean squared error. Finally, we apply our method to a spectral dataset taken from an investigation of the metabolic response of yeast to recombinant protein expression. We estimate the concentrations of 26 metabolites and compare with manual quantification by five expert spectroscopists. We discuss the reason for discrepancies and the robustness of our method's concentration estimates. This article has supplementary materials online. © 2012 American Statistical Association.
Bergersen LC, Ahmed I, Frigessi A, et al., 2012, Safe preselection in lasso-type problems by cross-validation freezing
We propose a new approach to safe variable preselection in high-dimensional penalized regression, such as the lasso. Preselection - to start with a manageable set of covariates - has often been implemented without clear appreciation of its potential bias. Based on sequential implementation of the lasso with increasing lists of predictors, we find a new property of the set of corresponding cross-validation curves, a pattern that we call freezing. It allows us to determine a subset of covariates with which we reach the same lasso solution as would be obtained using the full set of covariates. Freezing has not been characterized before and is different from recently discussed safe rules for discarding predictors. We demonstrate by simulation that ranking predictors by their univariate correlation with the outcome leads in a majority of cases to early freezing, giving a safe and efficient way of focusing the lasso analysis on a smaller and manageable number of predictors. We illustrate the applicability of our strategy in the context of a GWAS analysis and on microarray genomic data. Freezing offers great potential for extending the applicability of penalized regressions to ultra high-dimensional data sets. Its applicability is not limited to the standard lasso but is a generic property of many penalized approaches.
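The freezing diagnostic itself operates on cross-validation curves, which are omitted here; the following numpy sketch (hypothetical names and tuning values, not the paper's code) illustrates only the endpoint the diagnostic certifies: the lasso fit on a correlation-ranked subset of predictors coinciding exactly with the full-data fit.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=100):
    """Lasso via cyclic coordinate descent:
    minimise 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                        # residual, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            z = X[:, j] @ r + col_sq[j] * beta[j]
            new = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                r += X[:, j] * (beta[j] - new)
                beta[j] = new
    return beta

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.standard_normal(n)

# Rank predictors by absolute univariate correlation with the outcome
order = np.argsort(-np.abs(X.T @ (y - y.mean())))

lam = 60.0
beta_full = lasso_cd(X, y, lam)

# Fit on the top-m ranked predictors only; once the ranked subset contains
# the full model's active set, the two solutions coincide
m = 25
beta_sub = lasso_cd(X[:, order[:m]], y, lam)
agree = np.allclose(beta_full[order[:m]], beta_sub, atol=1e-4)
print(agree)
```

The agreement follows from the lasso KKT conditions: if the subset contains every active predictor of the full solution, that solution already solves the restricted problem.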
Clark SJ, Falchi M, Olsson B, et al., 2012, Association of Sirtuin 1 (SIRT1) Gene SNPs and Transcript Expression Levels With Severe Obesity, OBESITY, Vol: 20, Pages: 178-185, ISSN: 1930-7381
Recent studies have reported associations of sirtuin 1 (SIRT1) single nucleotide polymorphisms (SNPs) to both obesity and BMI. This study was designed to investigate association between SIRT1 SNPs, SIRT1 gene expression and obesity. Case-control analyses were performed using 1,533 obese subjects (896 adults, BMI >40 kg/m² and 637 children, BMI >97th percentile for age and sex) and 1,237 nonobese controls, all French Caucasians. Two SNPs (in high linkage disequilibrium (LD), r² = 0.96) were significantly associated with adult obesity, rs33957861 (P value = 0.003, odds ratio (OR) = 0.75, confidence interval (CI) = 0.61-0.92) and rs11599176 (P value = 0.006, OR = 0.74, CI = 0.61-0.90). Expression of SIRT1 mRNA was measured in BMI-discordant siblings from 154 Swedish families. Transcript expression was significantly correlated to BMI in the lean siblings (r² = 0.13, P value = 3.36 × 10⁻⁷) and lower SIRT1 expression was associated with obesity (P value = 1.56 × 10⁻³⁵). There was also an association between four SNPs (rs11599176, rs12413112, rs33957861, and rs35689145) and BMI (P values: 4 × 10⁻⁴, 6 × 10⁻⁴, 4 × 10⁻⁴, and 2 × 10⁻³) with the rare allele associated with a lower BMI. However, no SNP was associated with SIRT1 transcript expression level. In summary, both SNPs and SIRT1 gene expression are associated with severe obesity.
Ginestet CE, Best NG, Richardson S, et al., 2012, Classification loss function for parameter ensembles in Bayesian hierarchical models, STATISTICS & PROBABILITY LETTERS, Vol: 82, Pages: 859-863, ISSN: 0167-7152
Parameter ensembles or sets of point estimates constitute one of the cornerstones of modern statistical practice. This is especially the case in Bayesian hierarchical models, where different decision-theoretic frameworks can be deployed to summarize such parameter ensembles. The estimation of these parameter ensembles may thus substantially vary depending on which inferential goals are prioritised by the modeller. In this note, we consider the problem of classifying the elements of a parameter ensemble above or below a given threshold. Two threshold classification losses (TCLs), weighted and unweighted, are formulated. The weighted TCL can be used to emphasize the estimation of false positives over false negatives or the converse. We prove that the weighted and unweighted TCLs are optimized by the ensembles of unit-specific posterior quantiles and posterior medians, respectively. In addition, we relate these classification loss functions on parameter ensembles to the concepts of posterior sensitivity and specificity. Finally, we find some relationships between the unweighted TCL and the absolute value loss, which explain why both functions are minimized by posterior medians.
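Given posterior draws for each ensemble element, the optimal classifiers described in the abstract are easy to compute. A minimal numpy sketch (not the paper's code; the rule below is the standard Bayes decision rule for weighted 0/1 loss, equivalent to thresholding a unit-specific posterior quantile, and the weight convention is an assumption):

```python
import numpy as np

def classify_above(samples, threshold, w_fp=1.0, w_fn=1.0):
    """Classify each element of a parameter ensemble as above `threshold`
    under a weighted threshold classification loss. Declare "above" when
    the posterior probability of exceeding the threshold is larger than
    w_fp / (w_fp + w_fn); with equal weights this reduces to checking
    whether the posterior median exceeds the threshold.
    samples: posterior draws, shape (n_draws, n_units)."""
    p_above = (samples > threshold).mean(axis=0)
    return p_above > w_fp / (w_fp + w_fn)

rng = np.random.default_rng(0)
true_means = np.array([0.5, 1.4, 2.0])
samples = true_means + 0.3 * rng.standard_normal((4000, 3))

unweighted = classify_above(samples, 1.5)                  # posterior-median rule
fn_averse = classify_above(samples, 1.5, w_fp=1.0, w_fn=9.0)
print(unweighted, fn_averse)
```

Making false negatives nine times as costly lowers the evidence required to flag a unit, so the borderline second unit (mean 1.4, just below the threshold) is flagged under the weighted rule but not under the unweighted one.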
Li G, Best N, Hansell AL, et al., 2012, BaySTDetect: detecting unusual temporal patterns in small area data via Bayesian model choice, BIOSTATISTICS, Vol: 13, Pages: 695-710, ISSN: 1465-4644
Space-time modeling of small area data is often used in epidemiology for mapping chronic disease rates and by government statistical agencies for producing local estimates of, for example, unemployment or crime rates. Although there is typically a general temporal trend, which affects all areas similarly, abrupt changes may occur in a particular area, e.g. due to emergence of localized predictors/risk factor(s) or impact of a new policy. Detection of areas with "unusual" temporal patterns is therefore important as a screening tool for further investigations. In this paper, we propose BaySTDetect, a novel detection method for short-time series of small area data using Bayesian model choice between two competing space-time models. The first model is a multiplicative decomposition of the area effect and the temporal effect, assuming one common temporal pattern across the whole study region. The second model estimates the time trends independently for each area. For each area, the posterior probability of belonging to the common trend model is calculated, which is then used to classify the local time trend as unusual or not. Crucial to any detection method, we provide a Bayesian estimate of the false discovery rate (FDR). A comprehensive simulation study has demonstrated the consistent good performance of BaySTDetect in detecting various realistic departure patterns in addition to estimating well the FDR. The proposed method is applied retrospectively to mortality data on chronic obstructive pulmonary disease (COPD) in England and Wales between 1990 and 1997 (a) to test a hypothesis that a government policy increased the diagnosis of COPD and (b) to perform surveillance. While results showed no evidence supporting the hypothesis regarding the policy, an identified unusual district (Tower Hamlets in inner London) was later recognized to have higher than national rates of hospital readmission and mortality due to COPD by the National Health Service, which initia
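The detection step described above, flagging areas by their posterior probability of following the common trend and reporting a Bayesian FDR, can be sketched as follows. This is a generic Bayesian FDR estimate (average posterior "null" probability among flagged areas) with made-up probabilities, not BaySTDetect's exact estimator.

```python
import numpy as np

def bayes_fdr(p_common, cutoff):
    """Flag areas whose posterior probability of belonging to the
    common-trend model falls below `cutoff`, and estimate the Bayesian
    FDR of that rule as the mean posterior null probability among
    the flagged areas."""
    flagged = p_common < cutoff
    fdr = p_common[flagged].mean() if flagged.any() else 0.0
    return flagged, fdr

# Hypothetical posterior probabilities of the common-trend model
p_common = np.array([0.95, 0.90, 0.10, 0.02, 0.60, 0.05])
flagged, fdr = bayes_fdr(p_common, cutoff=0.2)
print(np.flatnonzero(flagged), round(fdr, 3))
```

In practice the cutoff is chosen so that the estimated FDR stays below a target level (say 0.05), trading off how many unusual areas are passed on for further investigation.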
Mason A, Richardson S, Best N, et al., 2012, Two-pronged Strategy for Using DIC to Compare Selection Models with Non-Ignorable Missing Responses, BAYESIAN ANALYSIS, Vol: 7, Pages: 109-146, ISSN: 1931-6690
Data with missing responses generated by a non-ignorable missingness mechanism can be analysed by jointly modelling the response and a binary variable indicating whether the response is observed or missing. Using a selection model factorisation, the resulting joint model consists of a model of interest and a model of missingness. In the case of non-ignorable missingness, model choice is difficult because the assumptions about the missingness model are never verifiable from the data at hand. For complete data, the Deviance Information Criterion (DIC) is routinely used for Bayesian model comparison. However, when an analysis includes missing data, DIC can be constructed in different ways and its use and interpretation are not straightforward. In this paper, we present a strategy for comparing selection models by combining information from two measures taken from different constructions of the DIC. A DIC based on the observed data likelihood is used to compare joint models with different models of interest but the same model of missingness, and a comparison of models with the same model of interest but different models of missingness is carried out using the model of missingness part of a conditional DIC. This strategy is intended for use within a sensitivity analysis that explores the impact of different assumptions about the two parts of the model, and is illustrated by examples with simulated missingness and an application which compares three treatments for depression using data from a clinical trial. We also examine issues relating to the calculation of the DIC based on the observed data likelihood.
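The paper's observed-data and conditional DIC constructions are specific to selection models; as background, the standard complete-data DIC they build on can be computed from MCMC output as follows. A minimal numpy sketch with a toy normal model (all names hypothetical):

```python
import numpy as np

def dic(loglik_draws, loglik_at_post_mean):
    """Deviance Information Criterion from MCMC output.
    loglik_draws: log-likelihood of the data at each posterior draw.
    loglik_at_post_mean: log-likelihood at the posterior mean parameters.
    DIC = Dbar + pD, with Dbar the posterior mean deviance and
    pD = Dbar - D(theta_bar) the effective number of parameters."""
    d_bar = -2.0 * loglik_draws.mean()
    p_d = d_bar - (-2.0 * loglik_at_post_mean)
    return d_bar + p_d, p_d

rng = np.random.default_rng(0)
n = 50
y = rng.standard_normal(n) + 2.0               # data: N(2, 1)
# Flat-prior posterior for the mean of N(mu, 1): N(ybar, 1/n)
mu_draws = y.mean() + rng.standard_normal(20000) / np.sqrt(n)

def loglik(mu):
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y - mu) ** 2).sum()

ll_draws = np.array([loglik(m) for m in mu_draws])
dic_val, p_d = dic(ll_draws, loglik(mu_draws.mean()))
print(round(p_d, 2))   # effective number of parameters, close to 1
```

With missing data the ambiguity the paper addresses is precisely which likelihood to plug into `loglik_draws`: the observed-data likelihood (integrating out the missing responses) or a conditional one.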
Mason A, Richardson S, Plewis I, et al., 2012, Strategy for Modelling Nonrandom Missing Data Mechanisms in Observational Studies Using Bayesian Methods, JOURNAL OF OFFICIAL STATISTICS, Vol: 28, Pages: 279-302, ISSN: 0282-423X