CLIMB-BIG-DATA | Cloud Infrastructure for Genomics

Bioinformatics relies on a vast number of tools (packages, electronic notebooks, programming languages and their libraries) that bioinformaticians need to be able to install, manage and run. A growing challenge is organising data inputs and outputs, particularly as genomic datasets continue to expand.

This one-day training workshop introduces key concepts and working practices that address these challenges and are rapidly being adopted in industry, including:

Containers (such as Docker and Singularity) – currently the easiest method for managing and deploying software, which also makes code easier to share and pipelines more reproducible.
Workflow languages (Nextflow DSL2) – workflow managers provide a framework for running analyses. They intrinsically provide a degree of data provenance and make it easy to re-run analyses with different datasets or parameters in a range of computing environments.
The GNU/Linux command line

Prerequisites

You will need a basic understanding of navigating the GNU/Linux command line. You should be able to use commands such as cd, ls, cat and grep.
You will need a basic understanding of microbial genomics.
You will need a stable internet connection and a web browser.

Outcomes

By the end of the workshop:

You will learn how bioinformaticians organise their data and analysis.
You will learn how to deploy bioinformatics software through Linux containers.
You will be introduced to chaining bioinformatics software to run in a “pipeline” via Nextflow.
You will be introduced to writing your own workflows using existing Nextflow modules.
You will learn how to use these frameworks to run regular bioinformatics analyses such as assembling a microbial genome, creating a phylogenetic tree, and running basic genotyping.