Publications

Pennica C, Sternberg M, Islam S, David A et al., 2023, Missense3D-PPI: a web resource to predict the impact of missense variants at protein interfaces using 3D structural data, Journal of Molecular Biology, Vol: 435, Pages: 1-9, ISSN: 0022-2836

In 2019, we released Missense3D, which identifies stereochemical features that are disrupted by a missense variant, such as the introduction of a buried charge. Missense3D analyses the effect of a missense variant on a single structure and thus may fail to identify as damaging those surface variants that disrupt a protein interface, i.e., a protein–protein interaction (PPI) site. Here we present Missense3D-PPI, designed to predict the impact of missense variants at PPI interfaces. Our development dataset comprised 1,279 missense variants (pathogenic n = 733, benign n = 546) in 434 proteins and 545 experimental structures of PPI complexes. Benchmarking of Missense3D-PPI was performed after dividing the dataset into training (320 benign and 320 pathogenic variants) and testing (226 benign and 413 pathogenic variants) sets. Structural features affecting PPI, such as disruption of interchain bonds and introduction of unbalanced charged interface residues, were analysed to assess the impact of the variant at PPI. The performance of Missense3D-PPI was superior to that of Missense3D: sensitivity 44% versus 8% and accuracy 58% versus 40%, p = 4.23 × 10^−16. However, the specificity of Missense3D-PPI was lower than that of Missense3D (84% versus 98%). On our dataset, Missense3D-PPI’s accuracy was superior to that of BeAtMuSiC (p = 3.4 × 10^−5), mCSM-PPI2 (p = 1.5 × 10^−12) and MutaBind2 (p = 0.0025). Missense3D-PPI represents a valuable tool for predicting the structural effect of missense variants on biological protein networks and is available at the Missense3D web portal (http://missense3d.bc.ic.ac.uk).

Journal article
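
The Missense3D-PPI figures in the abstract above can be cross-checked with simple arithmetic. The sketch below assumes that the reported sensitivity and specificity refer to the testing split (413 pathogenic, 226 benign) and that the percentages are rounded; the function and variable names are illustrative and do not come from the paper.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by sensitivity, specificity and class sizes."""
    true_pos = sensitivity * n_pos   # pathogenic variants called damaging
    true_neg = specificity * n_neg   # benign variants called neutral
    return (true_pos + true_neg) / (n_pos + n_neg)

# Assumption: rates quoted in the abstract apply to the testing split.
N_PATHOGENIC, N_BENIGN = 413, 226

ppi_accuracy = accuracy_from_rates(0.44, 0.84, N_PATHOGENIC, N_BENIGN)
m3d_accuracy = accuracy_from_rates(0.08, 0.98, N_PATHOGENIC, N_BENIGN)

print(f"Missense3D-PPI: {ppi_accuracy:.2f}, Missense3D: {m3d_accuracy:.2f}")
```

Under these assumptions the implied accuracies are about 0.58 and 0.40, in line with the 58% and 40% quoted in the abstract. The pathogenic-heavy test set means that the gain in sensitivity outweighs the loss in specificity.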

Mathews DH, Casadio R, Sternberg MJE, 2023, Computational Resources for Molecular Biology 2023, Journal of Molecular Biology, Vol: 435

Journal article

David A, Sternberg MJE, 2023, Protein structure-based evaluation of missense variants: Resources, challenges and future directions, Current Opinion in Structural Biology, Vol: 80, Pages: 1-8, ISSN: 0959-440X

We provide an overview of the methods that can be used for protein structure-based evaluation of missense variants. The algorithms can be broadly divided into those that calculate the difference in free energy (ΔΔG) between the wild type and variant structures and those that use structural features to predict the damaging effect of a variant without providing a ΔΔG. A wide range of machine learning approaches have been employed to develop those algorithms. We also discuss challenges and opportunities for variant interpretation in view of the recent breakthrough in three-dimensional structural modelling using deep learning.

Journal article

Hanna G, Khanna T, Islam S, David A, Sternberg M et al., 2023, Missense3D-TM: predicting the effect of missense variants in helical transmembrane protein regions using 3D protein structures, Journal of Molecular Biology, Vol: 436, ISSN: 0022-2836

Variant effect predictors assess whether a substitution is pathogenic or benign. Most predictors, including those that are structure-based, are designed for globular proteins in aqueous environments and do not account for a variant residue being located within the membrane. We report Missense3D-TM, which provides a structure-based assessment of the impact of a missense variant located within a membrane. On a dataset of 2,078 pathogenic and 1,060 benign variants, spanning 711 proteins from 706 structures, Missense3D-TM achieved an accuracy of 66%, Matthews correlation coefficient of 0.37, sensitivity of 58% and specificity of 81%. Missense3D-TM performed similarly to mCSM-membrane: accuracy 66% vs 61% (p = 0.02) on an unbalanced test set and 70% vs 67% (p = 0.20) on a balanced test set. The Missense3D-TM website provides an analysis of the structural effects of the variant along with its predicted position within the membrane. The web server is available at http://missense3d.bc.ic.ac.uk/.

Journal article
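
As a rough consistency check on the Missense3D-TM abstract above, the Matthews correlation coefficient can be recomputed from the reported sensitivity, specificity and class sizes. This sketch assumes that the rounded percentages apply to the full set of 2,078 pathogenic and 1,060 benign variants; the confusion-matrix counts it derives are therefore approximations, not values taken from the paper, and the helper names are illustrative.

```python
import math

def confusion_counts(sensitivity, specificity, n_pos, n_neg):
    """Approximate (tp, fn, tn, fp) from rounded rates and class sizes."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn

def matthews_corrcoef(tp, fn, tn, fp):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Assumption: the reported rates describe the whole dataset in the abstract.
tp, fn, tn, fp = confusion_counts(0.58, 0.81, n_pos=2078, n_neg=1060)
tm_mcc = matthews_corrcoef(tp, fn, tn, fp)
tm_accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"MCC: {tm_mcc:.2f}, accuracy: {tm_accuracy:.2f}")
```

Under these assumptions the script gives an MCC of about 0.37 and an accuracy of about 0.66, matching the values quoted in the abstract. MCC is informative here because the dataset is unbalanced, with roughly twice as many pathogenic as benign variants.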

Ittisoponpisan S, Yahangkiakan S, Sternberg MJE, David A et al., 2022, The SARS-CoV-2 infections in Thailand: Analysis of spike mutations complemented by protein structure insights, Songklanakarin Journal of Science and Technology, Vol: 44, Pages: 1201-1208, ISSN: 0125-3395

Thailand was the first country outside China to officially report COVID-19 cases. With a large number of SARS-CoV-2 sequences collected from patients, the effects of many genetic variations, especially those unique to Thai strains, are yet to be elucidated. In this study, we analyzed 439,197 sequences of the SARS-CoV-2 spike protein collected from NCBI and GISAID databases. 595 sequences were from Thailand and contained 52 amino acid mutations, of which 6 had not been observed outside Thailand (p.T51N, p.P57T, p.I68R, p.S205T, p.K278T, p.G832C). These mutations were not predicted to be of concern. We demonstrate that p.D614G became the prevalent strain during the second outbreak, and the most common spike mutations detected in Thailand (p.A829T, p.S459F and p.S939F) do not appear to cause any major structural change to the spike trimer or the spike-ACE2 interaction. Among the spike mutations identified in Thailand was p.N501T. This mutation was not predicted to increase SARS-CoV-2 binding, in contrast to the spike mutation of interest p.N501Y. In conclusion, Thailand-specific mutations are unlikely to increase the fitness of SARS-CoV-2. The insights obtained from this study could aid in prioritizing SARS-CoV-2 variants and in strain surveillance.

Journal article

Malladi S, Powell HR, David A, Islam SA, Copeland MM, Kundrotas PJ, Sternberg MJE, Vakser I et al., 2022, GWYRE: A resource for mapping variants onto experimental and modeled structures of human protein complexes, Journal of Molecular Biology, Vol: 434, ISSN: 0022-2836

Rapid progress in structural modeling of proteins and their interactions is powered by advances in knowledge-based methodologies along with better understanding of physical principles of protein structure and function. The pool of structural data for modeling of proteins and protein–protein complexes is constantly increasing due to the rapid growth of protein interaction databases and the Protein Data Bank. The GWYRE (Genome Wide PhYRE) project capitalizes on these developments by advancing and applying new powerful modeling methodologies to structural modeling of protein–protein interactions and genetic variation. The methods integrate knowledge-based tertiary structure prediction using Phyre2 and quaternary structure prediction using template-based docking by a full-structure alignment protocol to generate models for binary complexes. The predictions are incorporated in a comprehensive public resource for structural characterization of the human interactome and the location of human genetic variants. The GWYRE resource facilitates better understanding of principles of protein interaction and structure/function relationships. The resource is available at http://www.gwyre.org.

Journal article

Casadio R, Mathews DH, Sternberg MJE, 2022, Computational Resources for Molecular Biology 2022, Journal of Molecular Biology, Vol: 434, ISSN: 0022-2836

Journal article

McGreig J, Uri H, Antczak M, Michaelis M, Sternberg M, Wass M et al., 2022, 3DLigandSite: Structure-based prediction of protein-ligand binding sites, Nucleic Acids Research, Vol: 50, Pages: W13-W20, ISSN: 0305-1048

3DLigandSite is a web tool for the prediction of ligand-binding sites in proteins. Here, we report a significant update since the first release of 3DLigandSite in 2010. The overall methodology remains the same, with candidate binding sites in proteins inferred using known binding sites in related protein structures as templates. However, the initial structural modelling step now uses the newly available structures from the AlphaFold database or alternatively Phyre2 when AlphaFold structures are not available. Further, a sequence-based search using HHSearch has been introduced to identify template structures with bound ligands that are used to infer the ligand-binding residues in the query protein. Finally, we introduced a machine learning element as the final prediction step, which improves the accuracy of predictions and provides a confidence score for each residue predicted to be part of a binding site. Validation of 3DLigandSite on a set of 6416 binding sites obtained 92% recall at 75% precision for non-metal binding sites and 52% recall at 75% precision for metal binding sites. 3DLigandSite is available at https://www.wass-michaelislab.org/3dligandsite. Users submit either a protein sequence or structure. Results are displayed in multiple formats including an interactive Mol* molecular visualization of the protein and the predicted binding sites.

Journal article

David A, Parkinson N, Peacock TP, Pairo-Castineira E, Khanna T, Cobat A, Tenesa A, Sancho-Shimizu V, Casanova J-L, Abel L, Barclay WS, Baillie JK, Sternberg MJE et al., 2022, A common TMPRSS2 variant has a protective effect against severe COVID-19, Current Research in Translational Medicine, Vol: 70, ISSN: 2452-3186

Background: The human protein transmembrane protease serine type 2 (TMPRSS2) plays a key role in SARS-CoV-2 infection, as it is required to activate the virus’s spike protein, facilitating entry into target cells. We hypothesized that naturally occurring TMPRSS2 human genetic variants affecting the structure and function of the TMPRSS2 protein may modulate the severity of SARS-CoV-2 infection. Methods: We focused on the only common TMPRSS2 non-synonymous variant predicted to be damaging (rs12329760 C>T, p.V160M), which has a minor allele frequency ranging from 0.14 in Ashkenazi Jewish to 0.38 in East Asians. We analysed the association between rs12329760 and COVID-19 severity in 2,244 critically ill patients with COVID-19 from 208 UK intensive care units recruited as part of the GenOMICC (Genetics Of Mortality In Critical Care) study. Logistic regression analyses were adjusted for sex, age and deprivation index. For in vitro studies, HEK293 cells were co-transfected with ACE2 and either TMPRSS2 wild type or mutant (TMPRSS2V160M). A SARS-CoV-2 pseudovirus entry assay was used to investigate the ability of TMPRSS2V160M to promote viral entry. Results: We show that the T allele of rs12329760 is associated with a reduced likelihood of developing severe COVID-19 (OR 0.87, 95% CI: 0.79-0.97, p = 0.01). This association was stronger in homozygous individuals when compared to the general population (OR 0.65, 95% CI: 0.50-0.84, p = 1.3 × 10^−3). We demonstrate in vitro that this variant, which causes the amino acid substitution valine to methionine, affects the catalytic activity of TMPRSS2 and is less able to support SARS-CoV-2 spike-mediated entry into cells. Conclusion: TMPRSS2 rs12329760 is a common variant associated with a significantly decreased risk of severe COVID-19. Further studies are needed to assess the expression of TMPRSS2 across different age groups. Moreover, our results identify TMPRSS2 as a promising drug target, with a potential role for

Journal article

Varadi M, Anyango S, Armstrong D, Berrisford J, Choudhary P, Deshpande M, Nadzirin N, Nair SS, Pravda L, Tanweer A, Al-Lazikani B, Andreini C, Barton GJ, Bednar D, Berka K, Blundell T, Brock KP, Carazo JM, Damborsky J, David A, Dey S, Dunbrack R, Recio JF, Fraternali F, Gibson T, Helmer-Citterich M, Hoksza D, Hopf T, Jakubec D, Kannan N, Krivak R, Kumar M, Levy ED, London N, Macias JR, Srivatsan MM, Marks DS, Martens L, McGowan SA, McGreig JE, Modi V, Parra RG, Pepe G, Piovesan D, Prilusky J, Putignano V, Radusky LG, Ramasamy P, Rausch AO, Reuter N, Rodriguez LA, Rollins NJ, Rosato A, Rubach P, Serrano L, Singh G, Skoda P, Sorzano COS, Stourac J, Sulkowska JI, Svobodova R, Tichshenko N, Tosatto SCE, Vranken W, Wass MN, Xue D, Zaidman D, Thornton J, Sternberg M, Orengo C, Velankar S et al., 2021, PDBe-KB: collaboratively defining the biological context of structural data, Nucleic Acids Research, Vol: 50, Pages: D534-D542, ISSN: 0305-1048

The Protein Data Bank in Europe – Knowledge Base (PDBe-KB, https://pdbe-kb.org) is an open collaboration between world-leading specialist data resources contributing functional and biophysical annotations derived from or relevant to the Protein Data Bank (PDB). The goal of PDBe-KB is to place macromolecular structure data in their biological context by developing standardised data exchange formats and integrating functional annotations from the contributing partner resources into a knowledge graph that can provide valuable biological insights. Since we described PDBe-KB in 2019, there have been significant improvements in the variety of available annotation data sets and user functionality. Here, we provide an overview of the consortium, highlighting the addition of annotations such as predicted covalent binders, phosphorylation sites, effects of mutations on the protein structure and energetic local frustration. In addition, we describe a library of reusable web-based visualisation components and introduce new features such as a bulk download data service and a novel superposition service that generates clusters of superposed protein chains weekly for the whole PDB archive.

Journal article

David A, Islam S, Tankhilevich E, Sternberg MJE et al., 2021, The AlphaFold database of protein structures: a biologist’s guide, Journal of Molecular Biology, Vol: 434, Pages: 167336-167336, ISSN: 0022-2836

AlphaFold, the deep learning algorithm developed by DeepMind, recently released the three-dimensional models of the whole human proteome to the scientific community. Here we discuss the advantages, limitations and the still unsolved challenges of the AlphaFold models from the perspective of a biologist, who may not be an expert in structural biology.

Journal article

Kelley LA, Powell HR, Sternberg MJE, 2021, Le mieux est l'enemi du bon; homology modelling with Phyre2 in a deep learning world, XXV General Assembly and Congress of the International Union of Crystallography - IUCr 2021, Publisher: INT UNION CRYSTALLOGRAPHY, Pages: C95-C95, ISSN: 2053-2733

Conference paper

Casadio R, Lenhard B, Sternberg MJE, 2021, Computational Resources for Molecular Biology 2021, Journal of Molecular Biology, Vol: 433, ISSN: 0022-2836

Journal article

David A, Khanna T, Hanna G, Sternberg M et al., 2021, Missense3D-DB web catalogue: an atom-based analysis and repository of 4M human protein-coding genetic variants, Human Genetics, Vol: 140, Pages: 805-812, ISSN: 0340-6717

The interpretation of human genetic variation is one of the greatest challenges of modern genetics. New approaches are urgently needed to prioritize variants, especially those that are rare or lack a definitive clinical interpretation. We examined 10,136,597 human missense genetic variants from GnomAD, ClinVar and UniProt. We were able to perform large-scale atom-based mapping and phenotype interpretation of 3,960,015 of these variants onto 18,874 experimental and 84,818 in-house predicted three-dimensional coordinates of the human proteome. We demonstrate that 14% of amino acid substitutions from the GnomAD database that could be structurally analysed are predicted to affect protein structure (n = 568,548, of which 566,439 are rare or extremely rare) and may, therefore, have a yet unknown disease-causing effect. The same is true for 19.0% (n = 6266) of variants of unknown clinical significance or conflicting interpretation reported in the ClinVar database. The results of the structural analysis are available in the dedicated web catalogue Missense3D-DB (http://missense3d.bc.ic.ac.uk/). For each of the 4 M variants, the results of the structural analysis are presented in a friendly, concise format that can be included in clinical genetic reports. A detailed report of the structural analysis is also available for non-experts in structural biology. Population frequency and predictions from SIFT and PolyPhen are included for a more comprehensive variant interpretation. This is the first large-scale atom-based structural interpretation of human genetic variation and offers geneticists and the biomedical community a new approach to genetic variant interpretation.

Journal article

David A, Barbié V, Attimonelli M, Preste R, Makkonen E, Marjonen H, Lindstedt M, Kristiansson K, Hunt SE, Cunningham F, Lappalainen I, Sternberg MJE et al., 2021, Annotation and curation of human genomic variations: an ELIXIR Implementation Study [version 1; peer review: 1 approved with reservations], F1000Research, Vol: 9, Pages: 1-11, ISSN: 2046-1402

Background: ELIXIR is an intergovernmental organization, primarily based around European countries, established to host life science resources, including databases, software tools, training material and cloud storage for the scientific community under a single infrastructure. Methods: In 2018, ELIXIR commissioned an international survey on the usage of databases and tools for annotating and curating human genomic variants with the aim of improving ELIXIR resources. The 27-question survey was made available on-line between September and December 2018 to rank the importance and explore the usage and limitations of a wide range of databases and tools for annotating and curating human genomic variants, including resources specific for next generation sequencing, research into mitochondria and protein structure. Results: Eighteen countries participated in the survey and a total of 92 questionnaires were collected and analysed. Most respondents (89%, n=82) were from academia or a research environment. 51% (n=47) of respondents gave answers on behalf of a small research group (<10 people), 33% (n=30) in relation to individual work and 16% (n=15) on behalf of a large group (>10 people). The survey showed that the scientific community considers several resources supported by ELIXIR crucial or very important. Moreover, it showed that the work done by ELIXIR is greatly valued. In particular, most respondents acknowledged the importance of key features and benefits promoted by ELIXIR, such as the verified scientific quality and maintenance of ELIXIR-approved resources. Conclusions: ELIXIR is a “one-stop-shop” that helps researchers identify the most suitable, robust and well-maintained bioinformatics resources for delivering their research tasks.

Journal article

Singh A, Dauzhenka T, Kundrotas PJ, Sternberg MJE, Vakser IA et al., 2020, Application of docking methodologies to modeled proteins, Proteins: Structure, Function, and Bioinformatics, Vol: 88, Pages: 1180-1188, ISSN: 0887-3585

Protein docking is essential for structural characterization of protein interactions. Besides providing the structure of protein complexes, modeling of proteins and their complexes is important for understanding the fundamental principles and specific aspects of protein interactions. The accuracy of protein modeling, in general, is still less than that of the experimental approaches. Thus, it is important to investigate the applicability of docking techniques to modeled proteins. We present new comprehensive benchmark sets of protein models for the development and validation of protein docking, as well as a systematic assessment of free and template‐based docking techniques on these sets. As opposed to previous studies, the benchmark sets reflect the real case modeling/docking scenario where the accuracy of the models is assessed by the modeling procedure, without reference to the native structure (which would be unknown in practical applications). We also expanded the analysis to include docking of protein pairs where proteins have different structural accuracy. The results show that, in general, the template‐based docking is less sensitive to the structural inaccuracies of the models than the free docking. The near‐native docking poses generated by the template‐based approach, typically, also have higher ranks than those produced by the free docking (although the free docking is indispensable in modeling the multiplicity of protein interactions in a crowded cellular environment). The results show that docking techniques are applicable to protein models in a broad range of modeling accuracy. The study provides clear guidelines for practical applications of docking to protein models.

Journal article

Wodak SJ, Velankar S, Sternberg MJE, 2020, Modeling protein interactions and complexes in CAPRI: 7th CAPRI evaluation meeting, April 3-5, EMBL-EBI, Hinxton UK, Proteins: Structure, Function, and Bioinformatics, Vol: 88, Pages: 913-915, ISSN: 0887-3585

Journal article

Mancini A, Howard SR, Marelli F, Cabrera CP, Barnes MR, Sternberg MJ, Leprovots M, Hadjidemetriou I, Monti E, David A, Wehkalampi K, Oleari R, Lettieri A, Vezzoli V, Vassart G, Cariboni A, Bonomi M, Garcia MI, Guasti L, Dunkel L et al., 2020, LGR4 deficiency results in delayed puberty through impaired Wnt/β-catenin signaling, JCI insight, Vol: 5, Pages: 1-17, ISSN: 2379-3708

The initiation of puberty is driven by an upsurge in hypothalamic gonadotropin-releasing hormone (GnRH) secretion. In turn, GnRH secretion upsurge depends on the development of a complex GnRH neuroendocrine network during embryonic life. Although delayed puberty (DP) affects up to 2% of the population, is highly heritable, and is associated with adverse health outcomes, the genes underlying DP remain largely unknown. We aimed to discover regulators by whole-exome sequencing of 160 individuals of 67 multigenerational families in our large, accurately phenotyped DP cohort. LGR4 was the only gene remaining after analysis that was significantly enriched for potentially pathogenic, rare variants in 6 probands. Expression analysis identified specific Lgr4 expression at the site of GnRH neuron development. LGR4 mutant proteins showed impaired Wnt/β-catenin signaling, owing to defective protein expression, trafficking, and degradation. Mice deficient in Lgr4 had significantly delayed onset of puberty and fewer GnRH neurons compared with WT, whereas lgr4 knockdown in zebrafish embryos prevented formation and migration of GnRH neurons. Further, genetic lineage tracing showed strong Lgr4-mediated Wnt/β-catenin signaling pathway activation during GnRH neuron development. In conclusion, our results show that LGR4 deficiency impairs Wnt/β-catenin signaling with observed defects in GnRH neuron development, resulting in a DP phenotype.

Journal article

David A, Sternberg M, 2020, Structure, function and variants analysis of the androgen-regulated TMPRSS2, a drug target candidate for COVID-19 infection, bioRxiv

Journal article

Lenhard B, Sternberg MJE, 2020, Computational Resources for Molecular Biology: Special Issue 2020, Journal of Molecular Biology, Vol: 432, Pages: 3361-3363, ISSN: 0022-2836

Journal article

David A, 2020, A polygenic biomarker to identify patients with severe hypercholesterolemia of polygenic origin, Molecular Genetics and Genomic Medicine, Vol: 8, Pages: 1-9, ISSN: 2324-9269

Background: Severe hypercholesterolemia (HC, LDL‐C > 4.9 mmol/L) affects over 30 million people worldwide. In this study, we validated a new polygenic risk score (PRS) for LDL‐C. Methods: Summary statistics from the Global Lipid Genome Consortium and genotype data from two large populations were used. Results: A 36‐SNP PRS was generated using data for 2,197 white Americans. In a replication cohort of 4,787 Finns, the PRS was strongly associated with the LDL‐C trait and explained 8% of its variability (p = 10^−41). After risk categorization, the risk of having HC was higher in the high‐ versus low‐risk group (RR = 4.17, p < 1 × 10^−7). Compared to a 12‐SNP LDL‐C raising score (currently used in the United Kingdom), the PRS explained more LDL‐C variability (8% vs. 6%). Among Finns with severe HC, 53% (66/124) versus 44% (55/124) were classified as high risk by the PRS and LDL‐C raising score, respectively. Moreover, 54% of individuals with severe HC defined as low risk by the LDL‐C raising score were reclassified to intermediate or high risk by the new PRS. Conclusion: The new PRS has a better predictive role in identifying HC of polygenic origin compared to the currently available method and can better stratify patients into diagnostic and therapeutic algorithms.

Journal article

Singh A, Dauzhenka T, Kundrotas P, Sternberg MJE, Vakser I et al., 2020, Application of Docking to Protein Models, 64th Annual Meeting of the Biophysical Society, Publisher: Cell Press, Pages: 360A-360A, ISSN: 0006-3495

Conference paper

PDBe-KB consortium, 2020, PDBe-KB: a community-driven resource for structural and functional annotations, Nucleic Acids Research, Vol: 48, Pages: D344-D353, ISSN: 0305-1048

The Protein Data Bank in Europe-Knowledge Base (PDBe-KB, https://pdbe-kb.org) is a community-driven, collaborative resource for literature-derived, manually curated and computationally predicted structural and functional annotations of macromolecular structure data, contained in the Protein Data Bank (PDB). The goal of PDBe-KB is two-fold: (i) to increase the visibility and reduce the fragmentation of annotations contributed by specialist data resources, and to make these data more findable, accessible, interoperable and reusable (FAIR) and (ii) to place macromolecular structure data in their biological context, thus facilitating their use by the broader scientific community in fundamental and applied research. Here, we describe the guidelines of this collaborative effort, the current status of contributed data, and the PDBe-KB infrastructure, which includes the data exchange format, the deposition system for added value annotations, the distributable database containing the assembled data, and programmatic access endpoints. We also describe a series of novel web pages, the PDBe-KB aggregated views of structure data, which combine information on macromolecular structures from many PDB entries. We have recently released the first set of pages in this series, which provide an overview of available structural and functional information for a protein of interest, referenced by a UniProtKB accession.

Journal article

Waman VP, Blundell TL, Buchan DWA, Gough J, Jones D, Kelley L, Murzin A, Pandurangan AP, Sillitoe I, Sternberg M, Torres P, Orengo C et al., 2020, The Genome3D Consortium for Structural Annotations of Selected Model Organisms, Protein Structure Prediction, 4th Edition, Vol: 2165, Pages: 27-67, ISSN: 1064-3745

Journal article

Leal Ayala LG, David A, Jarvelin MR, Sebert S, Ruddock M, Karhunen V, Seaby E, Hoggart C, Sternberg MJE et al., 2019, Identification of disease-associated loci using machine learning for genotype and network data integration, Bioinformatics, Vol: 35, Pages: 5182-5190, ISSN: 1367-4803

Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs.

Journal article

Sillitoe I, Andreeva A, Blundell TL, Buchan DWA, Finn RD, Gough J, Jones D, Kelley LA, Paysan-Lafosse T, Lam SD, Murzin AG, Pandurangan AP, Salazar GA, Skwark MJ, Sternberg MJE, Velankar S, Orengo C et al., 2019, Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation, Nucleic Acids Research, Vol: 48, Pages: D314-D319, ISSN: 0305-1048

Genome3D (https://www.genome3d.eu) is a freely available resource that provides consensus structural annotations for representative protein sequences taken from a selection of model organisms. Since the last NAR update in 2015, the method of data submission has been overhauled, with annotations now being 'pushed' to the database via an API. As a result, contributing groups are now able to manage their own structural annotations, making the resource more flexible and maintainable. The new submission protocol brings a number of additional benefits, including providing instant validation of data and avoiding the requirement to synchronise releases between resources. It also makes it possible to implement the submission of these structural annotations as an automated part of existing internal workflows. In turn, these improvements facilitate Genome3D being opened up to new prediction algorithms and groups. For the latest release of Genome3D (v2.1), the underlying dataset of sequences used as prediction targets has been updated using the latest reference proteomes available in UniProtKB. A number of new reference proteomes of particular interest to the wider scientific community have also been added: cow, pig, wheat and Mycobacterium tuberculosis. These additions, along with improvements to the underlying predictions from contributing resources, have ensured that the number of annotations in Genome3D has nearly doubled since the last NAR update article. The new API has also been used to facilitate the dissemination of Genome3D data into InterPro, thereby widening the visibility of both the annotation data and annotation algorithms.

Journal article

Lenhard B, Sternberg MJE, 2019, Computation resources for molecular biology: Special issue 2019, Journal of Molecular Biology, Vol: 431, Pages: 2395-2397, ISSN: 0022-2836

Journal article

Ittisoponpisan S, Islam S, Khanna T, Alhuzimi E, David A, Sternberg M et al., 2019, Can predicted protein 3D-structures provide reliable insights into whether missense variants are disease-associated?, Journal of Molecular Biology, Vol: 431, Pages: 2197-2212, ISSN: 0022-2836

Knowledge of protein structure can be used to predict the phenotypic consequence of a missense variant. Since structural coverage of the human proteome can be roughly tripled to over 50% of the residues if homology-predicted structures are included in addition to experimentally determined coordinates, it is important to assess the reliability of using predicted models when analyzing missense variants. Accordingly, we assess whether a missense variant is structurally damaging by using experimental and predicted structures. We considered 606 experimental structures and show that 40% of the 1965 disease-associated missense variants analyzed have a structurally damaging change in the mutant structure. Only 11% of the 2134 neutral variants are structurally damaging. Importantly, similar results are obtained when 1052 structures predicted using the Phyre2 algorithm were used, even when the model shares low (< 40%) sequence identity with the template. Thus, structure-based analysis of the effects of missense variants can be effectively applied to homology models. Our in-house pipeline, Missense3D, for structurally assessing missense variants was made available at http://www.sbg.bio.ic.ac.uk/~missense3d

Journal article

Professor Michael Sternberg
