Containers 3: Building images
In the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as its ID. An image in Docker is based on a number of read-only layers, where each layer contains the differences from the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference from the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.
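If you still have the Ubuntu image from the previous section you can inspect its layers yourself. The command below lists each layer together with the instruction that created it; the exact output depends on which version of the image you pulled:

docker image history ubuntu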
Docker provides a convenient way to describe how to go from a base image to the image we want by using a “Dockerfile”. This is a simple text file containing the instructions for how to generate each layer. Docker images are typically quite large, often several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.
We will be looking at a Dockerfile called Dockerfile_slim that is located in your containers directory (where you should hopefully be standing already). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.
Understanding Dockerfiles
Here are the first few lines of Dockerfile_slim. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments. A full specification of the format, together with best practices, can be found in the official Docker documentation.
FROM condaforge/mambaforge
LABEL description = "Minimal image for the NBIS reproducible research course."
MAINTAINER "John Sundh" john.sundh@scilifelab.se
Here we use the instructions FROM, LABEL and MAINTAINER. While LABEL and MAINTAINER are just metadata that can be used for organizing your various Docker components, the important one is FROM, which specifies the base image we want to start from. Because we want to use mamba to install packages we will start from an image from the conda-forge community that has mamba pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to Docker Hub. The conda-forge community keeps that Dockerfile in a Git repository, where you can view it. You will see that it starts from an official Ubuntu image (check the first line with the FROM instruction), followed by code to install various packages including mamba.
Many roads to Rome
When it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment through a Jupyter notebook. You could then start from one of the rocker images for R, a Mambaforge image, or a Jupyter image. Or you could just start from one of the low-level official images and set up everything from scratch.
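To make this concrete, here is a hypothetical sketch of what the first line of such Dockerfiles could look like. The image names exist on Docker Hub, but the tags are only examples and you should check the respective registries for current versions:

# Pick one base image depending on your needs (tags are examples only)
FROM rocker/rstudio:4.3.1          # R with RStudio Server
FROM condaforge/mambaforge:latest  # Conda/Mamba pre-installed
FROM jupyter/base-notebook:latest  # Jupyter pre-installed
FROM ubuntu:22.04                  # minimal official base, set everything up yourself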
Let’s take a look at the next section of Dockerfile_slim.
# Use bash as shell
SHELL ["/bin/bash", "--login", "-c"]
# Set workdir
WORKDIR /course
# Set time zone
ENV TZ="Europe/Stockholm"
ENV DEBIAN_FRONTEND=noninteractive
SHELL simply sets which shell to use and WORKDIR determines the directory the container should start in. The ENV instruction is used to set environment variables, and here we use it to set the time zone by declaring a TZ variable. The DEBIAN_FRONTEND=noninteractive line means that we force the subsequent installation to not prompt us to set the time zone manually.
The next few lines introduce the important RUN instruction, which is used for executing shell commands:
# Install package for setting time zone
RUN apt-get update && apt-get install -y tzdata && apt-get clean
# Configure Conda/Mamba
RUN mamba init bash && conda config --set channel_priority strict && \
conda config --append channels bioconda && \
conda config --append channels r && \
conda config --set subdir linux-64
The first RUN command installs the tzdata package for managing local time settings in the container. This may not always be required for your Dockerfile but it’s added here because some R packages used in the course require it.
Note
While installing things with apt-get inside Dockerfiles is relatively common practice, it’s important to note that this may affect reproducibility, since it’s not common to specify an exact version. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself. While not critical, it’s worth keeping in mind from a reproducibility perspective.
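If you do want tighter control, apt-get can pin an exact version with the package=version syntax. The snippet below is only a sketch: the version string is a placeholder and would need to match a version actually available in the Ubuntu repositories at build time:

# Hypothetical example: pin an exact apt package version (placeholder version string)
RUN apt-get update && \
    apt-get install -y tzdata=2023c-0ubuntu0.22.04.2 && \
    apt-get clean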
Next, we run mamba init bash to initialize the bash shell inside the image, meaning we can use mamba activate in containers that run from the image. In the same RUN statement we also configure the strict channel priority and add appropriate channels with conda config. You’ll probably recognize this from the pre-course setup. The last part sets the somewhat obscure subdir config parameter, pointing to the linux-64 architecture of conda channels.
As a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN command should both retrieve the program, install it and perform any necessary clean up. This is due to how layers work and how Docker decides what needs to be rerun between builds. More on this later.
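As a purely hypothetical sketch of this principle (the URL and program name are made up, and it assumes wget and tar are available in the image), downloading, installing and cleaning up could all happen in a single RUN instruction so that the downloaded archive never ends up in a layer of its own:

# Hypothetical example: retrieve, install and clean up within one layer
RUN wget https://example.com/some_tool.tar.gz && \
    tar -xzf some_tool.tar.gz -C /opt/ && \
    rm some_tool.tar.gz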
Next up is:
# Open port for running Jupyter Notebook
EXPOSE 8888
# Start Bash shell by default
CMD /bin/bash
EXPOSE opens up the port 8888, so that we can later run a Jupyter Notebook server on that port. CMD is an interesting instruction. It sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE] without the additional [COMMAND] [ARG]. It can be used for example for printing some information on how to use the image or, as here, to start a Bash shell for the user. If the purpose of your image is to accompany a publication then CMD could be to run the workflow that generates the paper figures from raw data, e.g. CMD snakemake -s Snakefile -c 1 generate_figures.
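To make this concrete: once you have built the image in the next step, anything you put after the image name replaces the default CMD. For example (the command to run is just an illustration):

# Overrides CMD: runs 'cat /etc/os-release' instead of the default Bash shell
docker run --rm my_docker_image cat /etc/os-release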
Building from Dockerfiles
Now we understand how a Dockerfile works. Constructing the image itself from the Dockerfile can be done as follows - try it out:
Important
If your computer is a Mac with the M1 chip, you may have to add --platform linux/x86_64 to the docker build command.
docker build -f Dockerfile_slim -t my_docker_image .
This should result in something similar to this:
[+] Building 2.2s (7/7) FINISHED
=> [internal] load build definition from Dockerfile_slim 0.0s
=> => transferring dockerfile: 667B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/condaforge/mambaforge:latest 0.0s
=> [1/3] FROM docker.io/condaforge/mambaforge 0.0s
=> CACHED [2/3] WORKDIR /course 0.0s
=> [3/3] RUN mamba init bash && conda config --set channel_priority strict && conda config --append channels bioconda && conda config --append channels r && conda config --set subdir 2.1s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:53e6efeaa063eadf44c509c770d887af5e222151f08312e741aecc687e6e8981 0.0s
=> => naming to docker.io/library/my_docker_image
Exactly how the output looks depends on which version of Docker you are using. The -f flag sets which Dockerfile to use and -t tags the image with a name. This name is how you will refer to the image later. Lastly, the . is the path to where the image should be built (. means the current directory). This had no real impact in this case, but matters if you want to import files. Validate with docker image ls that you can see your new image.
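If you want to convince yourself that the instructions in the Dockerfile took effect, you can also start a container from the image and look around; the checks below are just suggestions:

# Start an interactive container from the new image (type 'exit' to leave it)
docker run -it --rm my_docker_image

# Inside the container:
pwd                         # should print /course (the WORKDIR)
echo $TZ                    # should print Europe/Stockholm (set with ENV)
conda config --show subdir  # should print subdir: linux-64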
Creating your own Dockerfile
Now it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. We will later package and run the whole RNA-seq workflow in a Docker container, but for now we keep it simple to reduce the size and time required.
The Conda tutorial uses a shell script, run_qc.sh, for downloading and running the analysis. A copy of this file should also be available in your current directory. If we want to use the same script we need to include it in the image. A basic outline of what we need to do is:
- Create a file called Dockerfile_conda
- Start the image from the my_docker_image we just built
- Install the package fastqc which is required for the analysis.
- Add the run_qc.sh script to the image
- Set the default command of the image to run the run_qc.sh script.
We’ll now go through these steps in more detail. Try to add the corresponding code to Dockerfile_conda on your own, and if you get stuck you can click to reveal the solution below under “Click to show solution”.
Set image starting point
To set the starting point of the new image, use the FROM instruction and point to my_docker_image that we built in the previous Building from Dockerfiles step.
Install packages
Use the RUN instruction to install the package fastqc=0.11.9 with Mamba. Here there are several options available. For instance we could add an environment file, e.g. environment.yml from the Conda tutorial, and use mamba env create to create an environment from that file. Or we could create an environment directly with mamba create. We’ll try the latter option here, so add a line that will create an environment named project_mrsa containing the fastqc package, and also clean up packages and cache after installation. Use the -y flag to mamba create to avoid the prompt that expects an interaction from the user.
In order to have the project_mrsa environment activated upon start-up we need to add two more lines to the Dockerfile. First we need to use a RUN instruction to run echo "source activate project_mrsa" >> ~/.bashrc, and then we need to use the ENV instruction to set the $PATH variable inside the image to /opt/conda/envs/project_mrsa/bin:$PATH.
Add the analysis script
Use the COPY instruction to add run_qc.sh to the image. The syntax is COPY SOURCE TARGET. In this case SOURCE is the run_qc.sh script and TARGET is a path inside the image; for simplicity it can be specified with ./.
Set default command
Use the CMD instruction to set the default command for the image to bash run_qc.sh.
Click to show solution
FROM my_docker_image
RUN mamba create -y -n project_mrsa -c bioconda fastqc=0.11.9 && mamba clean -a
RUN echo "source activate project_mrsa" >> ~/.bashrc
ENV PATH=/opt/conda/envs/project_mrsa/bin:$PATH
COPY run_qc.sh .
CMD bash run_qc.sh
Build the image and tag it my_docker_conda (remember to add --platform linux/x86_64 to the build command if you are using a Mac with the Apple chip).
docker build -t my_docker_conda -f Dockerfile_conda .
Verify that the image was built using docker image ls.
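If you want to take the new image for a spin you can also run a container from it. Since the default CMD is bash run_qc.sh, this will download the data and run FastQC inside the container (it requires an internet connection, and the results stay inside the container unless you mount a host directory):

docker run --rm my_docker_conda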
Quick recap
In this section we’ve learned:
- How the keywords FROM, LABEL, MAINTAINER, RUN, ENV, SHELL, WORKDIR, and CMD can be used when writing a Dockerfile.
- How to use docker build to construct and tag an image from a Dockerfile.
- How to create your own Dockerfile.