Working with Data and Software at Imperial College
Resources and information on working with data and software at Imperial
Welcome to our guide on working with data and software for students, researchers and software engineers at Imperial. On these pages we provide a variety of information and pointers to resources across the College website, and beyond, to help you make the most of the available tools and services and work reliably, efficiently and securely with software and data.
Data and information, in a variety of types and formats, underpins all research in some way. Over several decades, software has also become an increasingly important element of research to the point where software is now present in almost all areas of research. Software provides a way to undertake and automate tasks that would be impractical or even impossible for humans to realistically handle themselves. As technology advances, the ability to capture ever-larger quantities of more complex, higher resolution data is more easily accessible to individual researchers. Software is now a vital element in being able to efficiently process, analyse and extract knowledge from this data, supporting the development of higher quality research outputs.
This material provides a top-level reference to highlight and help you find all the resources, groups and individuals around Imperial that can assist you in working effectively and efficiently with data and software.
Software and Data-related Information and Resources
Storing Research Data
Research data includes not only data generated as outputs from, or inputs to, experiments but also various other forms of data and materials that relate to your research, e.g. slides, questionnaires, protocols, lab notebooks, videos, correspondence etc. With volumes of data growing constantly, it can be challenging to find a storage location that is both secure (protected against data loss and against unauthorised access) and easily accessible to the individuals and software that need the data for processing.
Things to consider:
- Sensitivity: Is the data sensitive or personally identifiable? Does my chosen storage location provide the required level of security? Are special certifications required and are these present? (e.g. for certain types of medical data)
- Sustainability: What retention policies apply to data at my chosen storage location? Will my data be removed after some period? Does this matter? Do I need a long-term backup elsewhere?
- Sharing: Can I share the data with others? Can I easily move my data from its storage location to the location(s) where I want to process it? (e.g. Is the data set too large to transfer to a remote processing location in a reasonable amount of time?)
- Safety: Do backups of data need to be stored? If so, where? Might I need to access earlier versions of files?
- Naming: Have I decided on appropriate descriptive and systematic names for my files? (e.g. using appropriate and consistent time and date formats for timestamps used in file names)
- Funder requirements: Have I reviewed any requirements mandated by my funder regarding data storage, long-term archiving, etc?
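As an illustration of the "Naming" point above, here is a minimal Python sketch of one possible systematic file-naming scheme. The function name, the field order and the example project are hypothetical and not an Imperial convention; the only firm idea is that a consistent UTC timestamp in ISO 8601 basic format makes file names sort chronologically.

```python
from datetime import datetime, timezone
import re


def make_data_filename(project, description, when, version, extension):
    """Build a name such as 'project_description_20240131T093000Z_v02.csv'.

    Hypothetical helper: the field order and separators are one possible
    convention, chosen only for illustration.
    """
    def slug(text):
        # Lower-case the text and replace runs of other characters with '-'
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # 'when' should be timezone-aware; it is converted to UTC so that
    # names created in different time zones still sort consistently
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{slug(project)}_{slug(description)}_{stamp}_v{version:02d}.{extension}"


example = make_data_filename(
    "Soil Survey",
    "pH readings",
    datetime(2024, 1, 31, 9, 30, 0, tzinfo=timezone.utc),
    version=2,
    extension="csv",
)
print(example)  # soil-survey_ph-readings_20240131T093000Z_v02.csv
```

Whatever scheme you choose, the important part is to agree it with your collaborators and apply it consistently.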
Data storage options available at Imperial:
- Imperial's Research Data Store for storing and archiving research data
- OneDrive for Business - for storing personal work-related files
- SharePoint - for sharing and collaborating within a group
Data management plans
A data management plan (DMP) describes how you will manage and look after your data throughout the lifecycle of your research project and beyond. Many funding bodies now expect or require researchers to submit a DMP when applying for funding or to have one in place before data collection begins. Having a DMP is also good research practice. As well as satisfying funder requirements, a DMP can help you clarify ethical or legal responsibilities, prevent data loss and protect data confidentiality, make it easier for you to find and keep track of your data during your project, and prepare your data for archiving and sharing at the end of your project.
Available resources and information:
- How to complete a data management plan - online guidance from the College’s Research Data Management team
- DMPonline - a free to use, web-based data management planning tool (see here for details of how to register)
- One-to-one consultations - book a DMP consultation with the Research Data Management team
- DMP reviews and feedback - send a draft copy of your DMP to firstname.lastname@example.org and the RDM team will review it and give feedback
- What does my funder want? – a web page with summaries of funder data management policies and links to relevant policy documents, including DMP templates.
Research data management resources at Imperial
Imperial's Research Data Management team provide a wealth of online resources relating to working with, storing, sharing and archiving data, in addition to information on policies, budgeting for research data management and writing data management plans. You can also contact them directly by sending an email to email@example.com.
The following links are available to help you at different lifecycle stages of your project:
Before you start:
- What is research data?
- Why manage research data?
- What does my funder require?
- How to complete a data management plan
During your project:
Finishing your project:
Data archiving and sharing
Many funders and an increasing number of journal publishers now expect data that supports published findings to be archived and made widely available with as few restrictions as possible. The easiest way to do this is to deposit your data with a reputable data repository. Depositing with a data repository not only ensures the long-term preservation and accessibility of your data, it also encourages others to reuse and cite your data and enables you to get credit for your data, just as you would for any other published research output.
Things to consider:
- Which repository is best for your data? Are there repositories that are widely used in your research domain? Does your funder recommend a specific repository?
- Which data should you keep? Which data does your funder or journal expect you to archive and/or make available?
- Will there be restrictions on sharing your data (e.g. to protect data confidentiality)? What measures can you take to ensure that any data sharing complies with legal and ethical obligations?
- How will you promote the discovery and reuse of your data?
Available resources and information:
- Finding a research data repository
- What to keep
- Sharing sensitive data
- How to write a data access statement
- What does my funder want?
Working with/storing sensitive data
Working with sensitive data presents a number of challenges around secure storage, access and use. Such data is not uncommon in research work, particularly in areas such as medical research, and various groups at Imperial have extensive experience of working with and managing such data.
- The Scholarly Communication team provide some general advice on storing sensitive and personal data.
- Each Department or Division has a local Data Protection officer who should be a first point of contact for all data protection queries.
- ICT provide some guidance on how to encrypt and protect your data. This guidance is useful, but it is not aimed specifically at large-scale research data, for which you will probably want to explore other options for protecting your data.
Research data-related training opportunities within Imperial
The Research Data Management team run two data management training sessions for PhD students, offered through the Graduate School at various times throughout the academic year:
- Information Landscape: Data Management - an introduction to research data management
- Information Landscape: Research Data Management Plans - a workshop on how to write a data management plan using DMPonline
Also relevant for data management are the Graduate School’s Research Computing & Data Science Skills Courses.
The RDM team can also deliver bespoke training workshops to research staff and students as well as academic support staff. Email firstname.lastname@example.org for further details.
Building and preserving research software
Research software (whether a collection of scripts or a major software project) is a fundamental part of research and, along with your data, has to be developed and preserved according to good practices and guidelines. Furthermore, many funding bodies and journals expect software to be made publicly available upon publication.
The following sections highlight some things to think about and some links to tools and further information to help you at different stages of a software project.
Before you start:
Successfully building a piece of software (or just a simple script) requires some planning. The process involves a bit more than knowing how to program. Here are some essential considerations:
- What is the software going to do? Is it going to be used for a single task or will it handle multiple processes?
- What are the inputs and outputs? What input parameters and data formats do you need to handle?
- Where can you reuse modules or libraries that others have developed? The rule of thumb is: do not reinvent the wheel! Ask your colleagues or members of the Research Software Engineering community if you'd like some advice about which packages work well for a given task.
- Once you decide on the big picture, start thinking about the structure of your code (see below for more detail).
- Familiarise yourself with good practice for writing code (below) and take training courses. The Graduate School’s Research Computing and Data Science Programme (RCDS) provides courses in over 20 fundamental topics that are open to everyone at the College. The course materials are also available for self-study.
- Read about the Research Software Engineers at the College in the "RSE resources" tab. If you are enthusiastic about coding, you'll find many kindred spirits. The community organises events, provides many resources as well as offering 1-2-1 guidance.
Developing code - advice and best practices:
The following guidelines will help you write good quality code that is shareable, reproducible and sustainable. There are a number of different areas covering essential practices. The fundamentals are highlighted below. If you are a research student, consider starting by viewing a few short videos on the Introduction to RCDS Good Practice page prepared by the Research Computing and Data Science training team at the Graduate School.
- Plan to build your code in incremental steps, structuring it into classes and/or functions/methods as necessary. Ideally each small task will be done using a separate function. Where practical, functions should be generic enough to be reusable. More complicated tasks may result in functions that call several smaller functions. Each function should have a defined input and output that can be tested separately.
- Strike a balance between getting something working and trying to get the best possible code structure. Things will change, so don't get stuck with unnecessary complexity right at the start. Use an iterative process; you can always "refactor" later.
- If you are working on a larger project, plan to group similar functions/classes together into modules/namespaces. The goal here is to produce well-organised modular software.
- If you're not already familiar with object-oriented (OO) development but you're an experienced programmer, think about adopting this approach, especially if you're already working in an OO language. It can offer significant benefits in software design, especially in larger projects.
- Think about code formatting - concise comments and syntax, meaningful function and variable names - see the "Best practices" tab for more on formatting conventions
- Use a version control system such as Git to manage your files - this offers an efficient way to handle incremental development without storing every version in a separate file (and much more)
- Export your local Git repository to a web-based service such as GitHub - this enables collaboration and sharing (and provides backup)
- Use a suitable editor or an Integrated Development Environment (IDE) (for example Visual Studio Code (VSCode))
- Write tests for your functions and run them regularly (invaluable for finding bugs that break the code or (worse!) make it produce the wrong results)
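To make the first and last points in this list concrete, here is a minimal Python sketch of a task split into small functions, each with a defined input and output that can be checked separately. The function names and the "label,value" text format are hypothetical, chosen only for illustration.

```python
def parse_reading(line):
    """Convert one 'label,value' text line into a (label, float) pair."""
    label, value = line.strip().split(",")
    return label, float(value)


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def summarise(lines):
    """A larger task built from the smaller functions: parse, then average."""
    readings = [parse_reading(line) for line in lines]
    return mean([value for _, value in readings])


# Each function can be checked on its own, and then in combination
assert parse_reading("sample-a,2.5\n") == ("sample-a", 2.5)
assert mean([1.0, 2.0, 3.0]) == 2.0
assert summarise(["a,1.0", "b,3.0"]) == 2.0
```

Because `mean` knows nothing about the text format, it can be reused elsewhere, and a bug in the parsing can be found without touching the arithmetic.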
If you are ready for more detail, check out the "Best practices" tab on this page.
Publishing and archiving software:
This is an essential step that will:
- Enable you to receive credit for the use of your software (and track reuse)
- Help others to reproduce and replicate your results
- Aid collaborations
To learn more, read the concise guide to sharing and publishing software from the Scholarly Communication team.
Best practices for developing research software
There are a number of best practices that can help to ensure that the research software you develop is reliable, sustainable and maintainable. This is not an exhaustive list but we've focused on highlighting practices that we consider to be particularly important and, where relevant, where resources are available within Imperial to support them.
Use a version control system
Using a version control system (VCS) helps you to effectively manage your software development work. Git is currently one of the most widely used version control systems. Other widely used VCSs include Subversion and Mercurial. Unless you're collaborating on a project using another VCS, we recommend using Git. GitHub provides a full web-based source code management environment in addition to hosting Git repositories. It also provides software project management tooling including an issue tracker and wiki. Imperial has a GitHub organisation which all members of Imperial can join (see the "How to join the Imperial College GitHub.com organisation" section on the Working with GitHub.com page). You can create public or private repositories within the Imperial GitHub organisation and collaborate with other GitHub users both within and outside the organisation on projects hosted within the Imperial GitHub organisation. In addition, Imperial also has a local GitHub Enterprise deployment for on-premises storage of code/data. The College website's GitHub page provides further details of the benefits that the College GitHub licence offers.
Write tests for your code (and run them!)
Writing tests for your code is a really important way to ensure that your code works correctly and continues to work. Don't just assume that because something works now, it will always work in the future. Changes you make to one part of your code may affect the output of other parts of your code, for example, where you change a function that is used by several other elements of the code. It's very easy to forget all the places where a piece of code is used within an application as it grows. Writing tests can be tedious, and it's easy to get bored or choose not to bother. However, you'll be grateful for your tests when the codebase gets larger and you need to change something! There are a variety of approaches to writing tests, such as test-driven development (TDD), where you write tests first to describe the functionality you want to achieve and then write the code to make those tests pass. All widely used languages have testing frameworks available (often many of them), e.g. pytest for Python, minitest for Ruby, JUnit for Java, Boost.Test for C++, etc. There are many resources on the web for help with testing and the use of different testing libraries. Test coverage tools can also tell you how much of the code in a project is covered by unit tests. In many cases, they'll also tell you specifically which lines of code in your project are not being tested. For example, if you develop software in Python, take a look at Coverage.py.
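As a small illustration of what such tests can look like, here is a sketch in Python. It uses the standard library's unittest module rather than pytest, purely so that it runs without installing anything; the function under test, `normalise`, is a hypothetical example. Note that two of the three tests cover edge cases (identical values, empty input), which are the situations most likely to break when code is changed later.

```python
import unittest


def normalise(values):
    """Scale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)  # min() raises ValueError if empty
    if lowest == highest:
        # Avoid dividing by zero when every value is identical
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


class TestNormalise(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(normalise([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_identical_values_do_not_divide_by_zero(self):
        self.assertEqual(normalise([3, 3, 3]), [0.0, 0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            normalise([])


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalise)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

In a real project the tests would normally live in their own file (for example a hypothetical `test_normalise.py`) and be run by the test framework's command-line runner rather than from within the module.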
Use coding conventions for formatting your code to make collaboration and debugging easier
Use Continuous Integration (CI) tooling to automate integration and testing of code updates.
Continuous Integration is the process of frequently merging or integrating changes to a project, potentially from multiple contributors, into a main version of the code. CI frameworks support this process by automatically running various processes each time you commit updates to your code repository. This might include, for example, running a suite of tests and notifying you if any fail, or generating updated software packages from the codebase. Continuous Deployment (CD) involves the automated, continuous deployment of updates to a runtime environment. Some widely used CI/CD frameworks include GitHub Actions, Jenkins, Travis CI and CircleCI.
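The frameworks listed above each have their own configuration format, which is not shown here. As a rough, framework-independent sketch of the kind of step a CI job performs, the Python script below runs a list of checks and reports which ones failed. The `run_checks` helper and the two placeholder commands are hypothetical; a real project would substitute its own test and build commands.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each (name, command) pair in order; return the names that failed."""
    failed = []
    for name, command in commands:
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode != 0:
            # A non-zero exit code is the usual signal that a check failed
            failed.append(name)
    return failed


# Placeholder checks for illustration only; both are expected to pass
checks = [
    ("syntax", [sys.executable, "-c", "compile('x = 1', 'demo', 'exec')"]),
    ("tests", [sys.executable, "-c", "assert 1 + 1 == 2"]),
]

failed_checks = run_checks(checks)
print("failed checks:", failed_checks)
```

A CI service would typically treat the job as failed, and notify you, if a script like this exited with a non-zero status, for example by ending with `sys.exit(1 if failed_checks else 0)`.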
Use web-based collaboration resources
Use tools such as issues and pull requests to track and document bugs, feature requests, and updates to your code. As highlighted above, this functionality is available in GitHub as well as in a number of other tools and software project management platforms.
Optimise your release process
Don't rely on manually having to create/build software packages every time you create a new release of your code. Use online tooling integrated with your code repository, such as GitHub Actions, to trigger building of packages when you set up a new release or merge new commits into your mainline code. Use tools such as CMake/CPack (C/C++) or setuptools (Python) to help with the process of building packages.
Research Software Engineering (RSE) resources
Research Software Engineering (RSE) has developed over recent years to represent individuals working in the research community whose work focuses on building software rather than other research-related tasks. First coined back in 2012, the term Research Software Engineer is increasingly found as a job title for roles focusing on software development within research organisations. Research Software Engineering teams/groups have now been set up at a number of institutions around the UK and internationally, and the importance of the role continues to grow as demand for software development expertise in the research community increases.
Here we provide some links to research software resources relevant to members of the College:
Imperial Research Software Community
Imperial's Research Software Community was set up in 2015 and continues to run events and training, and to provide information and advice to researchers and RSEs at Imperial. With over 200 people on the community's mailing list and Slack workspace, we're always happy to hear from new members and to receive suggestions on events you'd like to see us run or ways that we can enhance our support for research software at Imperial.
The community produces a monthly newsletter which you can view online. The newsletter is also sent out to our mailing list. If you'd like to find out more about the Research Software Community or have any questions, you can contact the community's management committee. A series of useful research computing tips are available from the central RSE team and members of the RSE community.
Imperial's Research Software Team
Imperial has a Research Software Team based within the Research Computing Service in ICT. The RSE team includes a group of experienced research software developers with research backgrounds in a variety of different domains. In addition to developing high-quality code for research projects, the team understand the research lifecycle and are experienced at collaborating with researchers to develop technical solutions to research challenges. If you have a project that you're interested in working with the team on, request an informal consultation with the team to discuss this.
Imperial Research Software Directory
Imperial's Research Software Directory is a directory of open source software projects developed at, or in collaboration with, Imperial. While the directory is growing and already contains an impressive list of projects, we're aware that this list is still likely to represent a small proportion of the open source software developed by individuals based at Imperial. If you'd like to have your project included in the directory, get in touch with the RSE Team to have your project added.
Getting credit for research software outputs
One of the reasons behind the development of RSE and the research software engineering community has been the relative lack of recognition afforded to software outputs within research, and to the individuals who produce them. This has, in turn, often resulted in a lack of career opportunities and sustainability for researchers and software engineers undertaking RSE work, despite its importance to research outputs.
The College provides some resources to help address this challenge and provide some guidance on how you can better promote and get recognition for your software outputs. Imperial is a signatory of DORA, the San Francisco Declaration on Research Assessment (see more on this page on Research Evaluation), which also helps to highlight and work towards addressing issues in the way that research is assessed, something that it is hoped will lead to better understanding and recognition of software as a valuable research output. The Research Data Management team also provide this very useful resource on "Making research software open and shareable" which provides a number of tips on how to make your software discoverable and shareable, helping to provide more opportunities to get credit for your research software outputs.
An important element within the RSE community is training. There are many opportunities to get research software training that can help you to produce better, more sustainable, reproducible and reliable software. The Carpentries provide a number of training courses covering core skills in building software and working with data. Software Carpentry workshops are run regularly at Imperial covering programming in Python and R. See the RCS's Training page for more details on these and other courses run at Imperial.
In addition, the Graduate School's Research Computing and Data Science Skills courses are now available to researchers as well as graduate students.
For further details on where to look for training courses, see the "Training" section under the "External resources" tab.
This section contains some links to external resources related to research software and research data which you might find useful.
The list will be updated with new links as and when relevant resources are discovered. If you have something you'd like to see added to this page, please email the RSE Community Committee with the subject "Software and Data Resources links" to have one or more links included in this list.
- The Turing Way - highly recommended guide to reproducible, ethical, inclusive and collaborative data science
- Data Carpentry - data skills training material from The Carpentries
- Software Carpentry - training in fundamentals of research computing
- CodeRefinery - a set of training courses covering a variety of research software-related topics including testing, reproducibility and using CMake to build portable code.
- The Research Data Alliance - community-driven initiative for development of solutions to data sharing problems
- JISC provide this information for institutions on managing research data
- EDiNA's MANTRA Research Data Management Training - free online course
- The Digital Curation Centre (DCC)
- The UK Data Archive - home to the UK's largest collection of social, economic and population data