Data sharing and data management plans for grants
FAIR principles
Data arising from research projects should adhere to FAIR principles. It should be:
- Findable
- Accessible
- Interoperable
- Reusable/Reproducible
Many funding bodies, including RCUK funders, the Wellcome Trust and a number of other charities, now require a Data Sharing or Data Management Plan as part of grant proposals. Even if it is not mandatory for your funder, a data management plan is a useful starting point for a checklist of what data you will generate (types, volumes), how and where you will store it (and for how long), and how you will share and disseminate it (to whom, which parts, and when). The Data Sharing/Management plan can help you to write an effective impact statement, because it prompts you to consider all the ways in which you will disseminate the results of your research. It can also help with grant costing, because it prompts you to consider the cost of storing your data for the period beyond the end of the grant that the funder expects, the cost of building infrastructure for managing and sharing your data, and the time required to prepare and submit it to suitable repositories.
Different funders, different requirements, same principles
Although funders may have different practical requirements for data sharing/management plans (mandatory or optional, free-text pages or a form), they all ask for answers to the same basic questions. The aim is to ensure that the data arising from your research project are stored appropriately and securely, and are disseminated effectively to the public domain in standard formats, together with sufficient supporting information (metadata) to ensure that they are re-usable.
A checklist table showing the basic requirements of the funders most commonly used by College researchers is available from the Digital Curation Centre. All RCUK funders, the Wellcome Trust and a number of others offer specific guidance for their proposals, and some (e.g. MRC) provide a template to complete. The JISC-funded Digital Curation Centre also offers an online tool, DMP-Online, with standardised templates for all major funders, pre-loaded with the questions to address for each funder. It now also offers customised versions for Imperial College with some general policy fields pre-completed. We are very familiar with DMP-Online forms and can help you to complete them for projects generating biological or biomedical data and software.
The College has recently produced a great deal of information on research data management and data sharing plans on the College web-site and we refer you there for general information.
Services we offer
We can help you to prepare individually tailored data management plans for your grant proposals. To do this effectively, however, we will need to see your case for support. When requesting help with data plans, don’t forget to send the following information:
- Funder
- Project duration and initial start date
- Submission deadline
- Your internal submission deadlines
- A draft case for support so we have an idea of the project as a whole and can glean the types of data that you will generate, and projected volumes
- A draft data management plan if you have started to complete one
Once we have heard from you, we will be able to help with:
- data volume calculations based on your experimental design and types
- suggestions of suitable standard data formats for sharing/ storage/ publication depending on your experimental types
- suggestions of suitable metadata standards (and later on, how to use/adhere to them) for your experimental types
- suggestions of suitable public repositories for submission of your data (and associated costs for preparing and submitting your data if required)
- costs associated with storage of your active data during project lifetime (based on volume) and later costs for longer term storage and/or archiving if required
- general information on hardware security for our systems (physical security, redundancy, back-up policies etc.)
- suggestions (and associated costs if required) for additional mechanisms for data dissemination e.g. hosting project-based web-site with dynamic data searching and/or visualisation
- suggestions and costings for project-based methods for organising, searching and sharing complex project-based data internally, or later externally – e.g. a project database (if required). This will also include specific security information for that system’s design – e.g. different authority levels, passwords, encryption etc.
- advice on other College resources that may be useful for your project
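The data volume and storage cost calculations mentioned in the list above can be sketched roughly as follows. This is a minimal illustration only: the assay names, sample counts, per-run sizes, the `derived_factor` multiplier and the cost per TB per year are all hypothetical placeholders, not College figures, and should be replaced with values from your own experimental design and current storage pricing.

```python
from dataclasses import dataclass


@dataclass
class Assay:
    """One experimental type in the project (all numbers are assumptions)."""
    name: str
    samples: int                 # number of samples in the experimental design
    replicates: int              # replicates per sample
    gb_per_run: float            # assumed raw output per run, in GB
    derived_factor: float = 1.0  # assumed size of processed files relative to raw

    def total_gb(self) -> float:
        # raw data plus processed/derived data for every run
        runs = self.samples * self.replicates
        return runs * self.gb_per_run * (1.0 + self.derived_factor)


def storage_cost(total_gb: float, years: float, cost_per_tb_year: float) -> float:
    """Cost of keeping total_gb for a number of years (1 TB taken as 1000 GB)."""
    return (total_gb / 1000.0) * years * cost_per_tb_year


# Hypothetical project: replace with your own design.
assays = [
    Assay("RNA-Seq", samples=24, replicates=3, gb_per_run=5.0, derived_factor=0.5),
    Assay("Confocal imaging", samples=40, replicates=2, gb_per_run=2.0, derived_factor=0.25),
]

total = sum(a.total_gb() for a in assays)
active = storage_cost(total, years=3, cost_per_tb_year=100.0)   # during the grant
archive = storage_cost(total, years=10, cost_per_tb_year=50.0)  # after the grant

print(f"Estimated volume: {total:.0f} GB")
print(f"Active storage cost: {active:.2f}")
print(f"Archive storage cost: {archive:.2f}")
```

Separating active storage from longer-term archiving mirrors the two cost lines in the list above, since funders often expect data to be retained well beyond the end of the grant.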
FAIR principles
As outlined above, data arising from research projects should adhere to the FAIR principles: Findable, Accessible, Interoperable and Reusable/Reproducible. Your data sharing plan should help you to achieve this, by ensuring that your data are appropriately annotated with the necessary metadata, stored in standard formats and submitted (where appropriate) to a public repository in a timely fashion. You can find more information about the FAIR Data Guiding Principles from the FAIR Data Publishing Group.
A number of technical areas in a data management plan require specific input: the types and volumes of data that you intend to generate, the formats you will store the data in, the metadata standards you will adhere to so that your data are understandable and fully re-usable, and the public repositories that you may use to disseminate your data in the public domain. These areas tend to be highly discipline-specific. The Life Sciences and Biomedical data areas are particularly rich in public repositories, data standards and recognised data formats. If you are not sure which ones are appropriate for your datasets, we can help.
Information on standards, databases and ontologies for the Life Sciences is collated and regularly updated under the BioSharing banner, which draws on several sources including the Nucleic Acids Research Database issue and the MIBBI set of common data standards. You may find this a useful resource, but it is dense and can appear complex at first sight: it also contains information on deprecated standards, and its descriptions currently appear more geared towards developers than users.
Some of the most commonly used databases, formats and standards for particular experimental types are outlined below. This is just an example selection so if you need help with your specific projects, please get in touch.
Top tips
Write early - When writing a grant, it is usually faster to start filling in the data management plan as you write the case-for-support, as you will need much of the same information about the data you expect to generate for both documents. Leaving it to the last minute tends to lead to missed opportunities both in terms of remembering to include costs for data management/storage/archiving/data curation and submission, and maximising impact statements with respect to disseminating your data/results.
Software is ‘data’ too - There will be some cases where your project is not generating new data as such, for instance a Wellcome Trust Bio-resources or a BBSRC Tools and Resources Development Fund project that is producing new infrastructure and/or software. In these cases you will still need to explain your plans for storing any third party data and keeping it secure, for instance input files submitted to a web server by users. New software and databases produced as output from a project should be referenced in the data management plan, and information on their storage and distribution should be included (‘Software as Data’). More information on ways of sharing and publishing software and models is included in a separate section.
What are ‘metadata’ anyway? In this context, metadata are the additional structured information about your dataset that explains what the dataset ‘is’, and allows it to be understood and re-used by others. Metadata are context-specific and can be minimal or very rich, depending on what is required for that dataset. For instance they may contain information about:
- the bio-specimen from which a sample is generated (e.g. species, taxonomy, gender, age, tissue, cell-type and the growth conditions)
- the experimental protocol used to extract a sample on which to work (e.g. standard operating procedure used, chemical vendor/batch, conditions)
- the experimental design of an assay
- auto-generated information from the instrumentation used to make a measurement/generate a dataset (e.g. vendor, model, version, software version), possibly together with manually added experimental parameters/conditions for the assay
- analysis methods - quality assurance methods used, normalisation methods, software used (versions, parameters)
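The categories listed above can be captured as a structured record stored alongside the dataset. The sketch below is one possible layout, not a recognised standard: every field name and value is a hypothetical example, and a real submission should use the field names required by the relevant metadata standard and repository.

```python
import json

# Hypothetical metadata record; all keys and values are illustrative only.
metadata = {
    "biospecimen": {
        "species": "Mus musculus",
        "tissue": "liver",
        "age_weeks": 8,
        "growth_conditions": "standard chow diet",
    },
    "sample_protocol": {
        "sop": "RNA extraction SOP v2",
        "reagent_vendor": "ExampleVendor",
        "reagent_batch": "B1234",
    },
    "experimental_design": {
        "conditions": ["control", "treated"],
        "replicates_per_condition": 3,
    },
    "instrument": {
        "vendor": "ExampleInstruments",
        "model": "X100",
        "software_version": "1.2.0",
    },
    "analysis": {
        "normalisation": "quantile",
        "software": [{"name": "example-pipeline", "version": "0.9"}],
    },
}


def to_json(record: dict) -> str:
    """Serialise the record as readable, key-sorted JSON text."""
    return json.dumps(record, indent=2, sort_keys=True)


sidecar_text = to_json(metadata)
print(sidecar_text)
```

Keeping the record in a plain-text, machine-readable form such as JSON means it can travel with the data files and later be mapped onto whichever standard the target repository requires.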
Metadata standards
| Type of experiment/dataset | Standard name | Acronym | More information... |
|---|---|---|---|
| microarray | Minimum information (MI) about a microarray experiment | MIAME | MIAME |
| proteomics | MI about a proteomics experiment | MIAPE (e.g. MIAPE-MS, MIAPE-Quant) | Different MIAPE extensions for different proteomics methods; see full listing |
| metabolomics | Core Information for Metabolomics Reporting | CIMR | CIMR |
| RNAi | MI about an RNAi experiment | MIARE | MIARE |
| Genome/metagenome | MI about a Genome (or metagenome) experiment | MIGS/MIMS | MIGS/MIMS |
| Generic NGS including RNA-Seq, ChIP-Seq | MI about a high throughput SEQuencing Experiment | MINSEQE | MINSEQE |
| Glycomics array | MI required for a glycomics experiment | MIRAGE | MIRAGE |
| Simulation | MI about a simulation project | MIASE | MIASE |
| Bio-Model | MI Required In the Annotation of Models | MIRIAM | MIRIAM |
So-called minimum metadata reporting standards aim to list the most important metadata fields needed to accompany a dataset of a certain experimental type in order to make it understandable and re-usable. In many cases, your dataset will need to be compliant with an appropriate data standard before it can be submitted to a public repository (i.e. a public database). One of the oldest established standards is the MIAME standard for microarray data (Minimum Information About a Microarray Experiment).
There are currently over 80 minimum metadata standards for different types of biological data, and some are more mature (and stable) than others. The good news is that a relatively small number of standards cover most of the more common biological experiment types.
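Checking a dataset description against a minimum reporting checklist before submission can be automated in a few lines. The sketch below assumes a simplified checklist whose field names are loosely modelled on the kinds of items a microarray standard asks for; it is not the official MIAME checklist, and `REQUIRED_FIELDS`, `missing_fields` and the example record are hypothetical.

```python
from typing import Iterable, List, Mapping

# Illustrative checklist only; consult the actual standard for the real fields.
REQUIRED_FIELDS = {
    "raw_data_files",
    "processed_data_files",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
}


def missing_fields(record: Mapping[str, object], required: Iterable[str]) -> List[str]:
    """Return the required fields that are absent or empty, in sorted order."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or value == "" or value == []:
            missing.append(field)
    return sorted(missing)


# Hypothetical dataset description with two gaps.
record = {
    "raw_data_files": ["chip01.cel"],
    "processed_data_files": ["normalised.txt"],
    "sample_annotation": "",
    "experimental_design": "2 conditions x 3 replicates",
    "protocols": ["RNA extraction SOP v2"],
}

gaps = missing_fields(record, REQUIRED_FIELDS)
if gaps:
    print("Not yet ready for submission; missing:", ", ".join(gaps))
else:
    print("All checklist fields are present.")
```

Running a check like this while the data are being generated, rather than at submission time, makes it easier to capture missing annotation while it is still available.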
File formats
A large number of different file formats are in common use for biological data, and some are more stable than others. In general you should store data in open standard file formats rather than proprietary ones (i.e. those from commercial vendors), since proprietary formats may require access to specific versions of commercial software in order to be readable, and this may not be possible in the longer term. Public biological data repositories will generally only accept data submitted in a specific data format. A few common file formats are shown below, together with the type of experiment they originate from. Example file formats are also covered in more detail in our help pages.
| Experiment type | Description | Format | Type/use |
|---|---|---|---|
| Microarray | Affymetrix | cel | Tab-delimited text |
| Microarray | Other microarray data formats | mev, Stanford | Can contain data from single or many chips. Tab-delimited text, but different column orders, degree of commenting |
| Microarray | Simple Omnibus Format in Text | SOFT | GEO microarray data exchange format – line based plain text |
| Next generation sequencing | Binary alignment | BAM | Compressed (binary) version of SAM |
| Next generation sequencing | Sequence alignment/map | SAM | Created by alignment programs |
| Next generation sequencing | Defining annotation lines on a reference sequence | BED | For visualising annotations in genome browser |
| Next generation sequencing | ‘wiggle’ format for continuous-valued data in a track format, also binary compressed version (BigWIG) | WIG, BIGWIG | e.g. visualisation of GC percent, probability scores, and transcriptome data on genome sequence |
| Next generation sequencing | Contains sequence and quality scores | FASTQ | Fasta format sequence and quality data |
| Next generation sequencing | Variant calling format (variant positions in genome) | VCF/BCF | Text (VCF) or binary (BCF) format |
| Next generation sequencing | Reference-based compression | CRAM | Tuneable binary format for multiple sequences |
| Next generation sequencing | General feature format | GFF | Placing features on a genome (reference) sequence |
| Medical imaging | Open file format for medical imaging | DICOM | |
| Confocal microscopy | Tagged image file format (Generic) | TIFF | Information not changed when format created |
| Confocal microscopy | Joint Photographers Experts Group image format | JPEG | Uses lossy image compression – different compression ratios available |
| Confocal microscopy | Multipage TIFF with OME XML data block | OME-TIFF | Encodes additional metadata |
| Confocal microscopy | Proprietary image formats containing microscope-specific metadata | Zeiss LSM, Leica LEI | Instrument or software-specific |
| Super-resolution microscopy | Tagged spot file format | tsf | Binary format for methods that generate images by locating the position of single fluorescent emitters |
| Metabolomics - mass spectrometry | Network Common Data Format | netCDF | Machine independent array-oriented binary data format |
| Metabolomics - mass spectrometry | MS and MS/MS proteomics data | mzML | Open data format for storage and exchange of mass spectrometry data |
| Metabolomics - mass spectrometry | Proprietary examples – Thermo, Bruker, ABI/Agilent | RAW, Baf, wiff | |
| Metabolomics - NMR | Self-defining Text Archival and Retrieval format | NMR-STAR | Chemical shift file |
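One practical benefit of open text formats is that they can be read without vendor software. As an illustration, the sketch below reads FASTQ records, which store each read as four lines (an `@` header, the sequence, a `+` separator and a quality string). It assumes unwrapped four-line records and Phred+33 quality encoding; the function name `read_fastq` and the example read are hypothetical, and established libraries should be preferred for real analyses.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (this sketch does not tolerate blank lines)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


# Hypothetical single-read example held in memory instead of a file.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
records = list(read_fastq(example))

for read_id, seq, scores in records:
    mean_q = sum(scores) / len(scores)
    print(f"{read_id}: {len(seq)} bases, mean quality {mean_q:.1f}")
```

The same few lines would not work for a proprietary binary format, which is the practical reason for preferring open formats for long-term storage.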
Public bio-data repositories and databases
The NAR online Molecular Biology Database Collection currently lists more than 1550 different databases.
Some are organism (or even gene-) specific, some contain secondary data – mined from literature and curated, and there are repositories/databases available for data arising from genomics (and metagenomics), transcriptomics, proteomics, metabolomics, protein structure and imaging studies. There are also public repositories for some types of bio-models. If you are not sure which repository/database is right for your data, or would like help in preparing your datasets for submission, please get in touch.
| Repository | Primary use | Home |
|---|---|---|
| European Nucleotide Archive (ENA) | DNA sequence with/without annotation | https://www.ebi.ac.uk/ena http://www.ebi.ac.uk/ena/submit |
| Sequence Read Archive (SRA, part of ENA) | NGS raw data (reads) | http://www.ebi.ac.uk/ena/submit |
| ArrayExpress | Transcriptomics – array based and RNA-Seq | https://www.ebi.ac.uk/arrayexpress/ |
| GEO | Transcriptomics – array based and RNA-Seq | http://www.ncbi.nlm.nih.gov/geo/ |
| UniProt | Protein sequence with annotation | http://www.uniprot.org/ |
| European Genome Phenome Archive (EGA) | Genomic studies where access to datasets is controlled by a data access committee | https://www.ebi.ac.uk/ega/home |
| dbSNP | Small genetic variations (SNP) | http://www.ncbi.nlm.nih.gov/snp |
| PRIDE | Proteomics data | http://www.ebi.ac.uk/pride/ |
| MetaboLights | Metabolomics and related data | http://www.ebi.ac.uk/metabolights/ |
| BioModels | Computational models of biological processes | https://www.ebi.ac.uk/biomodels-main/ |
A word on DOIs - Some types of bio-data still have no established public repository (e.g. kinetic data). These datasets can still be published with a persistent identifier (a DOI) on a more general repository, as well as being embedded within the supplementary materials for a publication, which may be persistent but are often difficult to search. A DOI can give a stable identifier to a dataset, in addition to its more common use for publications. The College can now mint DOIs for datasets centrally, as part of its Open Access Policy, but this service is still relatively new. If you would like to explore how to use it for your biological datasets, please get in touch as we may be able to help.
Other relevant Central College resources - The College version of Symplectic has recently been extended to allow you to track your dataset publications, and can be linked to your ORCID identifier to assist with unambiguous searching and identification. If you are not already familiar with ORCID and the updates to Symplectic, we recommend that you check the web pages on ORCID.
The College offers help pages on using a number of general repositories, including Figshare and Zenodo, for bio-datasets that are not suitable for an established biological data repository.