Organising your data
To get the best out of the Research Data Store, take some time to plan the organisation of your data. As a guiding principle, try to map each significant data set and project activity to a separate allocation. Allocations have several features and restrictions that you should take into account.
- Sharing: Each allocation has a single access control list of users who may access it. Within that, users may be granted read-only or read-write privileges at the file level. If you need more complex control, you must create additional allocations.
- Sustainability: Each allocation needs to be supported by a funding source, either an active Project or GL account. An allocation can be reassigned to a different funding source at any time, and pre-payments can be accepted. Any allocation that is persistently without funding support will become read-only and eventually be archived.
- Safety: A standard allocation provides snapshots, so that you may access earlier versions of files or retrieve deleted ones. There is, however, only one copy of your data, held in a single location. For data assets that are of critical importance to your research, we recommend you store them in a DR allocation, which includes a second copy held off-site for disaster recovery.
- Sensitivity: If you are processing sensitive, personally-identifiable data that is subject to regulatory restriction, it should be placed in a secure allocation rather than a standard one. Secure allocations are functionally equivalent, but are subject to higher levels of audit and are more restricted in the locations they may be accessed from.
- Long-term archive: Project allocations can be archived for long-term storage once they are no longer in active use. Try to organise your work such that a whole allocation can be archived at once, rather than requiring re-organisation first.
- Naming: All RDS project allocations live in a single flat namespace. Try to use systematic, descriptive names for allocations, for example Smithlab-reference-genomes-2018, not MyData.
Example scenarios for organising data in the RDS
Dr Jones is an early-stage researcher with a single EPSRC grant. Her work involves developing numerical simulation software on the College HPC systems. She creates an RDS project allocation, JonesLab-software, to keep the source code of the simulation software that she is developing. As this software is the core of her research, she makes it a DR allocation. (She also creates a repository for her software source code in the College's .) The allocation is funded by her EPSRC grant. Testing this software produces large amounts of output data, but this is generally post-processed shortly after production. This data can be kept in the project allocation's Ephemeral space, where an unlimited amount of data can be held for up to 30 days without charge. By the end of her grant, she has 5TB of code and data in the allocation, of which only the use over 2TB is chargeable. This costs her £8.40/month, which is billed to her project code.
Jones receives a second EPSRC grant that will start a few months after her first finishes. She prepays for those months' use with the remaining funds in her first grant, then, when the new grant starts, transfers the allocation to it for continued funding. Along with the grant comes a PhD student, Smith, who will be using her software to investigate the properties of a new widget. For this project, she creates a new allocation, JonesLab-SmithPhD, and gives her student read/write access to it as well as read-only access to the JonesLab-software allocation. Both of her allocations are associated with the same project code, so only the first benefits from the free 2TB.
Smith spends a pleasant three years simulating widgets, and publishes a couple of papers before moving on to a postdoc position elsewhere. Smith published some of the reduced data supporting those publications in the College's Data Repository, where it can be accessed via a DOI, but much remains in the RDS allocation. Jones doesn't expect to have to refer to Smith's data again, but the terms of her grant oblige her to keep a long-term copy of the data produced and, besides, it might be useful in the future. Jones tidies up the allocation, and then requests that it be archived. There is 20TB of data, which at £100/TB costs £2000 to archive for a decade. The allocation then becomes read-only, and is automatically relocated to cheaper archival storage. Jones can still get to the data if necessary, but understands it won't be as fast to read as before.
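The archive fee in this scenario is simple arithmetic, and a minimal Python sketch can help when budgeting for it in a grant application. The function name and the assumption that the fee scales linearly with size (with no rounding up to whole terabytes) are illustrative only; the £100/TB rate for a decade of archive is the figure quoted in the scenario.

```python
from decimal import Decimal

# One-off archive fee per TB for a decade, as quoted in the scenario.
ARCHIVE_RATE_GBP_PER_TB = Decimal("100")


def archive_cost_gbp(size_tb, rate_gbp_per_tb=ARCHIVE_RATE_GBP_PER_TB):
    """Estimate the one-off archive fee in pounds for an allocation.

    Assumption (not confirmed by the RDS documentation): the fee is
    strictly proportional to the allocation size in terabytes.
    """
    size = Decimal(str(size_tb))
    if size < 0:
        raise ValueError("size_tb must not be negative")
    return size * rate_gbp_per_tb


# Smith's allocation: 20TB at £100/TB.
print(archive_cost_gbp(20))  # 2000
```

Decimal is used rather than float so that fractional sizes such as 2.5TB do not pick up binary rounding noise in a cost estimate.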
A new centre starts a College Facility, providing genome sequencing services. This Facility uses an RDS allocation, SequencerFacility-working, to store the data generated by its sequencers. The raw data is stored in the allocation's ephemeral space, as it is only required temporarily until the genome assembly is complete. That assembly requires the use of reference genome datasets, which are kept in another allocation, SequencerFacility-referencesets. Both of these allocations are associated with the Facility's GL account code. The data is passed back to the commissioning group by putting it into either a new purpose-made RDS allocation, SequencerFacility-Customer-XYZ, or an existing one that Facility staff are given access to, as the researcher prefers. The Facility builds the cost of the RDS storage into the FEC model it uses to price its services.
The Taylor Lab works with genome data for several different species of bug. These datasets are long-term reference sets, and are used by several of the Lab's research projects. They are mostly read-only, but are occasionally added to with new data, or with re-analyses of existing data. Each of these datasets is kept in a separate RDS allocation, TaylorLab-genome-species-X, so that access to them can be granted selectively. Since these datasets are irreplaceable, they are held in DR allocations. These allocations are mostly associated with grant awards for work on those organisms. Some of them aren't the subject of active research, but Taylor expects to return to them soon, so she has prepaid for storage using some remaining funding from one grant, and then transferred them to her discretionary F account.
Taylor creates a new allocation for each significant project the group undertakes, limiting access to the members of her (large) group that are involved. Sometimes the research is not fruitful, and the allocations can be erased. Often though, it leads to publication. Taylor mostly archives the supporting data in a third-party, subject-specific data repository, but also uses the College's Data Repository. Once the group finishes a project, the RDS allocations are archived for a one-off fee, using funds requested for the purpose by Taylor in her grant application. This costs £100/TB for a decade of archive. Archived allocations become read-only and slow to access, but the data can be retrieved on demand if it's ever required. Sometimes the project's output is of permanent, long-term use, and that data gets copied back to the species' reference data allocation.
Dr Brown has just moved her group to the College, and has 50TB of data on a NAS appliance that she bought some years ago for all of her group to use. Much of this data is in use, with different data used by different members of her group. Some of the data is old work that is still on the NAS because she has nowhere better to put it. Brown begins by creating a standard RDS allocation, Brown-archive-2010-2018, into which she copies all of the old data, and then immediately requests that it be archived. The remaining data is a mix of personal scratch space used by her group members and project-oriented data. For each of the projects she creates a new allocation, RDS-Brown-project-XX, associating each with the account for the corresponding research project, with the plan that these can be archived individually as each project comes to a close. After discussion, the group realise that the RDS's Individual allocations, which they get by being active users of the College's HPC, will be sufficient for their personal working scratch space.
RDS allocations should only be used for the purpose stated when they are made. This information will be recorded in the College's Information Asset Register.
It is your responsibility to ensure that the data storage and processing you use complies with any requirements or restrictions imposed by any data licensor. If you have any concerns, please contact the Research Computing Service Manager.