The CLIMB-BIG-DATA computational infrastructure relies on virtualisation, where physical computing hardware is re-purposed into a scalable system of multiple independent virtual machines (VMs) run on OpenStack (a free open-source platform for cloud computing), with access to the Ceph platform to implement object storage. Through Bryn, a user-friendly web portal, users gain instant, free access to their own VMs, preconfigured for microbial genome analysis with powerful but user-friendly resources such as the Genomics Virtual Laboratory and Galaxy. CLIMB-BIG-DATA users gain root access on their VMs, so they can install their own software. Over the last five years, our users have fired up over 4900 VMs!
The CLIMB system comprises over 7,500 CPU cores. This probably makes it the largest single system dedicated to microbial bioinformatics research anywhere in the world.
To provide users with local, high-performance storage, we have deployed IBM GPFS at each of the 4 sites, providing 500TB of local storage. This storage is connected to our servers using InfiniBand.
Unlike most supercomputers, the CLIMB-BIG-DATA system has been designed to provide large amounts of RAM, in order to meet the challenge of processing large, rich biological datasets.
The CLIMB-BIG-DATA system provides a pool of CPU cores and RAM for microbial bioinformatics research. The system has been designed to support over 1,000 VMs running simultaneously, enough to serve most of the microbial bioinformatics community within the UK.
For long-term data storage, to share datasets and VMs, and to provide block storage for running VMs, we deploy a storage solution based on Ceph. Each site has 27 Dell R730XD servers, with each server containing 16x 4TB HDDs, giving a total raw storage capacity of 6912TB across the 4 sites. All data stored in this system is replicated 3x, which gives us a usable capacity of 2304TB.
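The capacity figures above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and it assumes the quoted total covers all 4 sites mentioned earlier, which is the reading that reproduces the 6912TB figure.

```python
# Figures taken from the text; the variable names are illustrative only.
SITES = 4              # assumption: the quoted total spans all 4 sites
SERVERS_PER_SITE = 27  # Dell R730XD servers per site
DISKS_PER_SERVER = 16  # HDDs per server
DISK_TB = 4            # capacity of each HDD in TB
REPLICAS = 3           # every object is stored 3 times

raw_per_site_tb = SERVERS_PER_SITE * DISKS_PER_SERVER * DISK_TB
raw_total_tb = raw_per_site_tb * SITES
usable_tb = raw_total_tb // REPLICAS

print(raw_per_site_tb, raw_total_tb, usable_tb)  # 1728 6912 2304
```

Each site therefore contributes 1728TB of raw capacity, and three-way replication leaves one third of the raw total available for data.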
OpenStack is an open-source software platform for cloud computing. The software platform controls large pools of processing, storage, and networking resources throughout a data center. Users access the resource through a web interface. As OpenStack is open-source software, anyone who chooses to can access the source code, make any changes or modifications they need, and freely share these changes back out to the community at large.
Ceph is a scalable, software-defined storage platform delivering unified object and block storage, making it ideal for cloud-scale environments like OpenStack. It uses an algorithm called CRUSH (Controlled Replication Under Scalable Hashing) to ensure that data is evenly distributed across the cluster and that all cluster nodes are able to retrieve data quickly without any centralised bottlenecks.
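The idea of placing data by computation rather than by consulting a central lookup table can be illustrated with a much simpler scheme. The sketch below is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a stand-in, whereas CRUSH also takes a hierarchical cluster map and placement rules into account. The node names, the object name, and the `score` and `place` helpers are hypothetical.

```python
import hashlib


def score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random weight for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick the `replicas` nodes with the highest score for this object.

    Any client that knows the node list computes the same answer,
    so no central directory has to be queried.
    """
    ranked = sorted(nodes, key=lambda n: score(object_name, n), reverse=True)
    return ranked[:replicas]


# Hypothetical node names; 27 mirrors the number of servers per site.
nodes = [f"node-{i:02d}" for i in range(27)]
placement = place("sample.fastq", nodes)
print(placement)
```

Because every client derives the same placement from the object name and the node list alone, reads and writes can go straight to the responsible nodes.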
GPFS is IBM’s parallel, shared-disk file system for cluster computers. It provides high performance by allowing data to be accessed from multiple computers at once. It achieves higher input/output performance by “striping” blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel. GPFS offers high scalability, good performance, and fault tolerance (i.e. machines can go down, and the filesystem is still accessible to others).
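Striping can be pictured as dealing the blocks of a file out to disks in round-robin order. The sketch below shows only that layout idea and is not how GPFS is implemented. It performs no parallel I/O, and the block size, disk count, and the `stripe` and `reassemble` helpers are hypothetical.

```python
def stripe(data: bytes, num_disks: int, block_size: int) -> list[list[bytes]]:
    """Split data into fixed-size blocks and deal them out round-robin."""
    disks: list[list[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def reassemble(disks: list[list[bytes]]) -> bytes:
    """Read the blocks back in their original order."""
    num_disks = len(disks)
    total_blocks = sum(len(blocks) for blocks in disks)
    # Block k lives on disk k % num_disks, at position k // num_disks.
    return b"".join(
        disks[k % num_disks][k // num_disks] for k in range(total_blocks)
    )


data = b"ABCDEFGHIJ"
layout = stripe(data, num_disks=3, block_size=2)
print(layout)  # [[b'AB', b'GH'], [b'CD', b'IJ'], [b'EF']]
```

Since consecutive blocks sit on different disks, a real parallel file system can issue the reads or writes for them at the same time, which is where the throughput gain comes from.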