AMCG employs a full-time operations manager (Dr. Tim Greaves) to oversee all aspects of its computational infrastructure and to ensure researchers can use their time effectively.

Machines

Linux network

AMCG regularly purchases new technology in the form of user workstations. Our current specification for desktop systems is quad-core, hyperthreaded Intel E3 (Sandy Bridge) CPUs with 16GB RAM, NVIDIA Quadro K600 GPUs, and widescreen monitors. These provide a good base for development and a convenient platform from which to migrate jobs onto the central College compute clusters.

Workstations communicate over a fast ethernet network, picking up authentication and user information from a group LDAP server, and home and scratch filespace over NFS. Each workstation runs a base Ubuntu LTS installation, on top of which a heavily customised environment is installed from the group package repository. The workstation build is streamlined such that a new workstation can be fully (re)installed and brought online in a very short time with an environment identical to that of all other workstations.

As satellites to the Linux workstation network, many users and developers have laptops set up with a Linux installation matching that on the main workstations.

Imperial College HPC clusters

AMCG makes extensive use of the centrally supported Imperial College HPC clusters, particularly the large heterogeneous Dell PC cluster and the SGI Altix ICE 8200 EX system, which at 32.14 TFlops was recently ranked 439th in the Top500 list.

We primarily use the Dell PC cluster for development work and testing, and have a dedicated queue, running on our own hardware, for the exclusive use of our Buildbot build and regression testing. For production work with runs on 64 or more cores we use the Altix cluster.

Visualisation and diagnostics

The open-source ParaView package (http://www.paraview.org) is our primary means of visualising data on workstations, especially for larger data sets resulting from parallel simulations. We also make use of Mayavi (http://code.enthought.com/projects/mayavi), for which we have developed a number of filters and diagnostic tools for processing data.
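
Much of this visualisation can be scripted for batch use. Below is a minimal sketch, assuming simulation output in VTK XML format and ParaView's paraview.simple scripting interface (run under pvpython); the file names are illustrative.

    # Run with pvpython (ParaView's Python interpreter); file names are illustrative.
    from paraview.simple import OpenDataFile, Show, ResetCamera, Render, SaveScreenshot

    reader = OpenDataFile('simulation_10.vtu')  # hypothetical VTK XML output file
    Show(reader)                                # add the dataset to the active render view
    ResetCamera()                               # frame the dataset in the view
    Render()                                    # draw the scene
    SaveScreenshot('simulation_10.png')         # write an image for reports or debugging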

Build, unit and regression testing

With more than 20 active developers regularly committing to the central source code repository, it is important that code quality and correctness are maintained. We use the open-source buildbot software (http://buildbot.net/) to fulfil this requirement.

GitHub triggers couple the repository to the buildbot, which automatically runs a comprehensive set of serial and parallel build, unit, verification and validation (V&V) tests using the GCC 4.8 and Intel 2015 compilers every time a change is committed to the repository.
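
Because buildbot masters are themselves configured in Python, the builders are defined in the master.cfg file. The fragment below is a minimal sketch assuming a Buildbot 0.8-series master; the repository URL, worker name, builder names and make targets are illustrative assumptions, and in practice the GCC and Intel builders would use separate factories with different compiler environments.

    # Illustrative excerpt from a Buildbot 0.8-series master.cfg.
    from buildbot.process.factory import BuildFactory
    from buildbot.steps.source.git import Git
    from buildbot.steps.shell import ShellCommand
    from buildbot.config import BuilderConfig

    c = BuildmasterConfig = {}

    build = BuildFactory()
    build.addStep(Git(repourl='https://github.com/example/model.git',  # placeholder URL
                      mode='incremental'))
    build.addStep(ShellCommand(name='configure', command=['./configure']))
    build.addStep(ShellCommand(name='build', command=['make']))
    build.addStep(ShellCommand(name='unit-tests', command=['make', 'unittest']))
    build.addStep(ShellCommand(name='regression-tests', command=['make', 'test']))

    c['builders'] = [
        BuilderConfig(name='gcc-4.8', slavenames=['cx1-worker'], factory=build),
        BuilderConfig(name='intel-2015', slavenames=['cx1-worker'], factory=build),
    ]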

A large number of tests which have analytical solutions, for which benchmark data are available in the literature, or which are constructed using the method of manufactured solutions (http://dx.doi.org/10.1115/1.1436090) are used for model V&V purposes.
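
In the method of manufactured solutions, a solution is chosen a priori, substituted into the governing operator, and the resulting residual is added to the model as a source term so that the chosen solution becomes exact; the numerical error can then be measured directly. The sketch below is a hypothetical example for a 1D advection-diffusion operator, using SymPy to derive the source term; it is not one of our actual test cases.

    # Hypothetical method-of-manufactured-solutions example for the operator
    # L(u) = a*du/dx - k*d2u/dx2, using SymPy to derive the required source term.
    import sympy as sp

    x, a, k = sp.symbols('x a k')
    u = sp.sin(sp.pi * x)                              # manufactured (chosen) solution
    source = a * sp.diff(u, x) - k * sp.diff(u, x, 2)  # source term such that L(u) = source
    print(sp.simplify(source))                         # forcing to impose: a*pi*cos(pi*x) + k*pi**2*sin(pi*x)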

All users and developers are required to add testing code specifically for their area of application and development. Tests can be added retrospectively, and a simple script performs recursive bisection to locate the point in the source history at which a fault was introduced. Using this system, faults are generally detected within minutes of being committed to the GitHub repositories, and as the number of tests grows the coverage, and hence the reliability, of the software improves over time.
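
A minimal sketch of such a bisection, in Python, is given below; the revision-checkout and test commands are illustrative assumptions rather than the group's actual script.

    # Binary search over an ordered list of revisions for the first failing one.
    import subprocess

    def run_test(rev):
        """Check out a revision and run the test of interest (illustrative commands)."""
        subprocess.check_call(['git', 'checkout', rev])
        return subprocess.call(['make', 'test']) == 0

    def first_bad(revisions, run_test=run_test):
        """Return the first failing revision, assuming revisions are ordered oldest
        to newest, the first one passes and the last one fails."""
        lo, hi = 0, len(revisions) - 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if run_test(revisions[mid]):
                lo = mid    # still passing: the fault was introduced later
            else:
                hi = mid    # already failing: the fault was introduced here or earlier
        return revisions[hi]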

As part of the test suite, profiling information is collected to benchmark the performance of the code over time. We collect this data in a database, allowing us to query the time spent in individual subroutines for a specific problem as a function of revision number. This data is exploited to guide optimisation and ensure that the code becomes progressively faster and scales better.
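
As a minimal sketch of the kind of query involved, assuming purely for illustration an SQLite database with one row per revision, test problem and subroutine, the timing history of a subroutine could be extracted as follows; the schema and file name are assumptions, not our actual database.

    # Illustrative schema and query for per-revision profiling data (SQLite).
    import sqlite3

    conn = sqlite3.connect('profiling.db')   # hypothetical database file
    conn.execute("""CREATE TABLE IF NOT EXISTS timings
                    (revision INTEGER, test TEXT, subroutine TEXT, seconds REAL)""")

    def timing_history(test, subroutine):
        """Return (revision, seconds) pairs for one subroutine in one test problem."""
        return conn.execute(
            "SELECT revision, seconds FROM timings"
            " WHERE test = ? AND subroutine = ? ORDER BY revision",
            (test, subroutine)).fetchall()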

The buildbot network is managed from a dedicated system, which submits serial and parallel jobs to our dedicated queue on the central CX1 cluster.

To view the output from buildbot, see http://buildbot.ese.ic.ac.uk:8080.

Options management and preprocessing

The interfaces by which users specify the scenarios to be simulated by scientific computer models are frequently primitive, under-documented, ad hoc text files, which make the model in question difficult and error-prone to use and significantly increase its development cost. We have developed a model-independent system, Spud, which formalises the specification of model input formats in terms of formal grammars. This is combined with an automated graphical user interface, which guides users to create valid model inputs based on the grammar provided, and a generic options-reading module, which minimises the development cost of adding model options.

Together, this provides a user-friendly, well-documented, self-validating user interface which is applicable to a wide range of scientific models and which minimises the developer input required to maintain and extend the model interface.
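
Options files produced in this way can also be read and modified programmatically. Below is a minimal sketch assuming the libspud Python bindings; the file name and option path are illustrative rather than taken from a real options file.

    # Minimal sketch using the libspud Python bindings; names are illustrative.
    import libspud

    libspud.load_options('simulation.xml')                  # parse the options tree
    dt = libspud.get_option('/timestepping/timestep')       # read a scalar option
    libspud.set_option('/timestepping/timestep', dt / 2.0)  # halve the timestep
    libspud.write_options('simulation_halved.xml')          # write the modified tree out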

Further details may be found in the following paper:

Ham DA, Farrell PE, Gorman GJ, Maddison JR, Wilson CR, Kramer SC, Shipton J, Collins GS, Cotter CJ, Piggott MD. Spud 1.0: generalising and automating the user interfaces of scientific computer models. Geoscientific Model Development, submitted. http://www.geosci-model-dev-discuss.net/1/125/2008/gmdd-1-125-2008.html

Good practice

AMCG operates a hierarchical structure for processor time and data storage.

At the core of the network are key servers providing ssh access to the network; web services via the IC CMS, the IC Wiki, and GitHub Issues for committing and disseminating information; RAIDed file services via NFS for redundant storage of data; and IC LDAP services for authentication and user information. Critical data (particularly buildbot and other service configurations) are automatically checked in to SVN on a daily basis.

Data storage exists at multiple discrete levels depending on the value of the data.

At the most critical level, where data cannot be regenerated for less than the per-user time cost of its initial creation, data is checked in to SVN and backed up centrally by the College. SVN repositories exist for system data, group data (central repositories for all group programming), and personal data (every user can obtain a personal SVN repository for committing papers, reports, and key files associated with daily work).

At the next level, valuable data which would be very computationally time-consuming to regenerate, or which would take a moderate amount of per-user time to recreate, is stored on a robust RAID-5 fileserver which is mirrored onto a separate system as a nightly incremental backup. The key file storage at this level is user home directories.

Bulk data storage is primarily provided by the IC HPC service which operates a safe bulk data store with large volumes available to users. Data which users choose to keep on personal systems is the responsibility of those users to organise and usually takes the form of directly attached storage on workstations. Users are made aware that this data is not assured unless they make their own backups, and it is intended that this data should be of a useful but non-critical nature, such as development model run output which is useful for debugging and testing but not project-critical.

At the lowest level, scratch space common to and mounted across all systems is provided for short-term data storage and dissemination, for data of minimal value and short required lifetime but potentially high throughput.

A similar hierarchy of computer resources is available to users.

At the production level, the central College compute clusters (CX1, AX1) provide major resources for production code runs which require either large or high-availability resources. Users are encouraged to make full use of these resources for production model runs, and can also use the highly redundant data repositories connected to these clusters for storing large suites of model output files.

At the development level, user workstations are installed with the same software environment as the central clusters, and users can make use of this for ground-level development where they require extensive interactive model runs or build verification. Once users are running jobs to completion, they are strongly encouraged to graduate to the cluster environment.