cfncluster + anaconda

CfnCluster is a tool for deploying and managing high performance clusters on AWS. It uses AWS CloudFormation to allow you to quickly configure and deploy a cluster using EC2 instances. The cluster automatically scales the number of workers based on the number of jobs in the task queue, starting new instances as required (to a preconfigured maximum) and shutting down idle nodes. The entire cluster can be shutdown and restarted easily which is great for heavy but intermittent workloads. Instances run a custom AMI of Amazon Linux (other options are available). I've been using CfnCluster extensively over the last year to run water resource simulations built in Python (with the Anaconda Python distribution). My workflow often involves running a few c4.8xlarge nodes (36 cores each!) flat out for a few days of simulation, then shutting the whole thing down for a couple of weeks. Although there are some things I don't like about it, CfnCluster has been much faster and easier than trying to roll my own solution.

This post gives some instructions on setting up a cluster using cfncluster and miniconda. It assumes you're already familiar with the basics of AWS and EC2.

Installation and basic configuration

CfnCluster is available via pip (or development version from GitHub awslabs/cfncluster):

pip install cfncluster

The first time you use cfncluster you'll need to configure it. This can be done with an interactive script (cfncluster configure), or manually. I suggest using the interactive prompt then modifying the resulting file.

The config file is located at ~/.cfncluster/config (linux) or %USERPROFILE%\.cfncluster\config (windows). An example is given below.

[aws]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_KEY
aws_region_name = eu-west-1

[cluster default]
key_name = ec2-eu-west-1
vpc_settings = public

master_instance_type = c4.large
compute_instance_type = c4.8xlarge

initial_queue_size = 0
max_queue_size = 1

scheduler = sge

cluster_type = spot
spot_price = 0.6

post_install = URL_TO_POST_INSTALL_SCRIPT

ebs_settings = custom

[ebs custom]
ebs_volume_id = SHARED_VOLUME_ID

[vpc public]
vpc_id = vpc-284dce4d
master_subnet_id = subnet-283a8471

[global]
cluster_template = default
update_check = true
sanity_check = true

Subnets, etc.

TODO: How to find VPC ID & SUBNET ID?

Shared volume

Create a volume to be shared between the master/workers. This is a persistent storage.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-volume.html

EBS Volume ID looks like this: vol-06e25286b0ddbfddd

This will be mounted as /shared on all instances. It's actually mounted on the master node, and shared via NFS with the workers.

On Demand vs Spot instances

Using spot instances instead of on demand instances can cost significantly less, although you risk your instances being terminated at a moment's notice. When using spot instances you must ensure your tasks are sufficiently robust to being terminated like this. The master node is always run as an on demand instance to prevent the whole cluster falling apart.

You can use the spot price history tool available in the AWS console to identify a suitable price for the instance type you're using. Set cluster_type = spot in the config and set spot_price = X where X is the maximum hourly rate you're willing to pay in USD.

Queue size

The queue size is the number of compute instances and does not include the master node. When the cluster is started an initial number of instances are started. The number of running instances will automatically scale based on the number of pending jobs in the task queue. Idle instances will terminate just before the end of their 1-hour charging slot.

Both the maximum and desired number of node can be edited manually via the AWS console. Go to: AUTO SCALING > Auto Scaling Groups > Edit > Desired / Min / Max. If the cluster is stopped / restarted then these numbers will be reset to those in the configuration.

Managing the cluster

To create a new cluster using the default configuration (~\.cfncluster\config), use:

cfncluster create mycluster

To stop it:

cfncluster stop mycluster

To restart it after stopping:

cfncluster start mycluster

And to destroy it:

cfncluster delete mycluster

When the cluster is stopped you will only be charged for data storage of the master instance drive (and any shared drive) and for the associated elastic IP address.

Login with SSH

To login to the master node you need to use your EC2 SSH key.

ssh -i YOUR_KEY.pem ec2-user@YOUR_MASTER_IP_ADDRESS

This is simplified by configuring a host in your SSH config, $HOME/.ssh/config (linux) or %USERPROFILE%\.ssh\config (windows).

Host ec2master
Hostname YOUR_MASTER_IP_ADDRESS
User ec2-user
IdentityFile ~/.ssh/ec2-eu-west-1.pem

Replace YOUR_MASTER_IP_ADDRESS with the IP address of the master node. You can find this in the AWS console, or as the MasterServer output of cfncluster start.

Then you can:

ssh ec2master

Similarly, you can upload files onto the shared drive. The example below copies some_file.zip to the shared drive mounted at /shared.

scp some_file.zip ec2master:/shared/

Installing Miniconda with a post install script

CfnCluster uses a custom AMI based on Amazon Linux which includes the basics you need to manage the cluster. If you need additional software you will need to install it. This can be done using yum on the master node, but these changes will not automatically propagate to the compute nodes where the tasks are executed. One solution would be to further customise the AMI to include the packages you need, but this is burdonsom as you'll need to update the image every time cfncluster updates. Instead you can use the post install script to install additional software. This script is automatically run when instances start, after cfncluster has done the usual setup tasks.

Downloading and installing a large number of packages on conda can be slow. To reduce the time it takes instances to start up a compromise is found by installing Miniconda once (on the master node), creating a .tar.bz2 file of the install directory, then simply extracting it onto the new instances when they are created. This can be done using the post install script. The install is done when the worker nodes start, if if you need to update packages you will need to restart the nodes. There is probably a more elegant solution to this issue, but this was good enough for my needs.

The post install script is specified in the config as post_install = ???. I recommend keeping this script as short and simple as possible. If it raises an error the instance will terminate before it finishes starting up.

An example post_install.sh is given below, which extracts miniconda3.tar.bz2 into /opt and symlinks it. This will be run on all of the workers.

#!/bin/bash

cd /opt
mkdir -p /opt/miniconda3
chown ec2-user:ec2-user miniconda3
sudo -u ec2-user tar xfj /shared/miniconda3.tar.bz2
sudo -u ec2-user ln -f -s /opt/miniconda3 /home/ec2-user/miniconda3

Once you've written the script you should upload it to S3, make it publicly accessible, and put the URL into the config. e.g.:

https://s3-eu-west-1.amazonaws.com/snorf-eu-west-1/post_install.sh

The script used to create the miniconda.tar.bz2 is given below.

#!/bin/bash

# quit immediately on error
set -e

# install miniconda
rm -rf /opt/miniconda3
mkdir -p /opt/miniconda3
ln -f -s /opt/miniconda3 ~/miniconda3
CONDA_SCRIPT="Miniconda3-latest-Linux-x86_64.sh"
wget https://repo.continuum.io/miniconda/${CONDA_SCRIPT} -O ${CONDA_SCRIPT}
bash ${CONDA_SCRIPT} -b -p /opt/miniconda3 -f

# activate root environment
source $HOME/miniconda3/bin/activate

# add channels
conda config --add channels conda-forge

# upgrade everything
conda upgrade --all --quiet --yes

# INSTALL YOUR CUSTOM MODULES HERE

# create tarball
conda clean --all --yes
cd /opt
chown -R ec2-user:ec2-user miniconda3
rm -rf /opt/miniconda3/pkgs/*
tar -cvjSf /shared/miniconda3.tar.bz2 miniconda3

Managing jobs with Sun Grid Engine (SGE), now Orange Grid Engine

There are lots of tutorials already on the use of SGE, including the Beginner's Guide to Oracle Grid Engine.

View nodes in the cluster:

qhost

An example output showing 3 running compute instances is shown below.

HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-13-176        lx-amd64        1    1    1    1  0.01  995.4M  137.1M     0.0     0.0
ip-172-31-3-231         lx-amd64        1    1    1    1  0.04  995.4M  135.9M     0.0     0.0
ip-172-31-6-55          lx-amd64        1    1    1    1  0.01  995.4M  136.3M     0.0     0.0

To submit a job (which may have multiple tasks):

qsub myjob.sh

Job script is a shell script with some additional header information. Don't forget to activate your conda environment.

Set PYTHONUNBUFFERED=1 or call Python with the -u flag to ensure output is written to log files immediately.

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#

echo "Running on ${HOSTNAME}"

# activate conda
source ${HOME}/miniconda3/bin/activate

# run model
cd /shared/demo
python -u simple_model.py

Another example is given below, with 100 tasks (1-100) in a single job. The task ID is passed as an argument to the Python script. It's the responsibility of the script to do something useful with this.

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -o /shared/out
#$ -t 1:100:1
#

echo "Task ${SGE_TASK_ID} running on ${HOSTNAME}"

# activate conda
source ${HOME}/miniconda3/bin/activate

# run model
cd /shared/demo
python -u simple_model.py --run-id ${SGE_TASK_ID}

To view details about jobs in the queue (both running and waiting):

qstat

Some useful scripts

This section includes a couple of useful scripts for use from the master node.

Restarting Ganglia

Sometimes Ganglia crashes with the error message:

Error: There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused

You can use the following script to restart ganglia from the master node.

#!/bin/sh

# restart ganglia and httpd
# Error: There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused

service gmond stop
service gmetad stop
service httpd stop

service gmond start
service gmetad start
service httpd start

Delete all jobs

Sometimes things go wrong and you need to delete all of the jobs!

#!/bin/bash

# delete all jobs in the schedule
qstat | awk '{print $1}' | tail -n +3 | xargs qdel