CfnCluster is a tool for deploying and managing high performance clusters on AWS. It uses AWS CloudFormation to let you quickly configure and deploy a cluster of EC2 instances. The cluster automatically scales the number of workers based on the number of jobs in the task queue, starting new instances as required (up to a preconfigured maximum) and shutting down idle nodes. The entire cluster can be shut down and restarted easily, which is great for heavy but intermittent workloads. Instances run a custom AMI of Amazon Linux (other options are available). I’ve been using CfnCluster extensively over the last year to run water resource simulations built in Python (with the Anaconda Python distribution). My workflow often involves running a few c4.8xlarge nodes (36 cores each!) flat out for a few days of simulation, then shutting the whole thing down for a couple of weeks. Although there are some things I don’t like about it, CfnCluster has been much faster and easier than trying to roll my own solution.
This post gives some instructions on setting up a cluster using CfnCluster and Miniconda. It assumes you’re already familiar with the basics of AWS and EC2.
Installation and basic configuration
CfnCluster is available via pip (or as a development version from GitHub: awslabs/cfncluster):
pip install cfncluster
The first time you use cfncluster you’ll need to configure it. This can be done with an interactive script (cfncluster configure), or manually. I suggest using the interactive prompt then modifying the resulting file.
The config file is located at ~/.cfncluster/config (Linux) or %USERPROFILE%\.cfncluster\config (Windows). An example is given below.
[aws]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_KEY
aws_region_name = eu-west-1
[cluster default]
key_name = ec2-eu-west-1
vpc_settings = public
master_instance_type = c4.large
compute_instance_type = c4.8xlarge
initial_queue_size = 0
max_queue_size = 1
scheduler = sge
cluster_type = spot
spot_price = 0.6
post_install = URL_TO_POST_INSTALL_SCRIPT
ebs_settings = custom
[ebs custom]
ebs_volume_id = SHARED_VOLUME_ID
[vpc public]
vpc_id = vpc-284dce4d
master_subnet_id = subnet-283a8471
[global]
cluster_template = default
update_check = true
sanity_check = true
Access keys
See the AWS documentation on configuring access keys.
Subnets, etc.
You need the IDs of an existing VPC and a public subnet within it for the vpc_id and master_subnet_id settings. You can find these in the AWS console under VPC > Your VPCs and VPC > Subnets.
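If you have the AWS CLI installed, you can also list them from the command line; the query expressions below are just one way of formatting the output:
aws ec2 describe-vpcs --query "Vpcs[].{ID:VpcId,CIDR:CidrBlock}" --output table
aws ec2 describe-subnets --query "Subnets[].{ID:SubnetId,VPC:VpcId,AZ:AvailabilityZone}" --output table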
Shared volume
Create a volume to be shared between the master and the workers. This gives you persistent storage that outlives individual compute instances. See:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-volume.html
The EBS volume ID looks like this: vol-06e25286b0ddbfddd, and goes into the ebs_volume_id setting. The volume will be mounted as /shared on all instances. It’s actually mounted on the master node and shared via NFS with the workers.
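The volume can be created in the console as described in the documentation above, or with the AWS CLI. The size, type and availability zone below are only examples, and the volume must be in the same availability zone as the master subnet:
aws ec2 create-volume --size 100 --volume-type gp2 --availability-zone eu-west-1a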
On Demand vs Spot instances
Using spot instances instead of on demand instances can cost significantly less, although you risk your instances being terminated at a moment’s notice. When using spot instances you must ensure your tasks are sufficiently robust to being terminated like this. The master node is always run as an on demand instance to prevent the whole cluster from falling apart.
You can use the spot price history tool in the AWS console to identify a suitable price for the instance type you’re using. Set cluster_type = spot in the config and spot_price = X, where X is the maximum hourly rate you’re willing to pay in USD.
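The AWS CLI can also query the spot price history, for example (the instance type, start date and product description are placeholders to adapt):
aws ec2 describe-spot-price-history --instance-types c4.8xlarge --product-descriptions "Linux/UNIX" --start-time 2017-01-01 --max-items 20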
Queue size
The queue size is the number of compute instances and does not include the master node. When the cluster is started an initial number of instances is started. The number of running instances will automatically scale based on the number of pending jobs in the task queue. Idle instances are terminated just before the end of their 1-hour charging slot.
Both the maximum and desired number of nodes can be edited manually via the AWS console (AUTO SCALING > Auto Scaling Groups > Edit > Desired / Min / Max), or scripted as shown below. If the cluster is stopped and restarted these numbers will be reset to those in the configuration.
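With the AWS CLI, look up the auto scaling group name first (it is generated by CloudFormation) and then update it; the group name, desired capacity and maximum below are only examples:
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[].AutoScalingGroupName"
aws autoscaling update-auto-scaling-group --auto-scaling-group-name YOUR_COMPUTE_ASG --desired-capacity 3 --max-size 5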
Managing the cluster
To create a new cluster using the default configuration (~/.cfncluster/config), use:
cfncluster create mycluster
To stop it:
cfncluster stop mycluster
To restart it after stopping:
cfncluster start mycluster
And to destroy it:
cfncluster delete mycluster
When the cluster is stopped you will only be charged for storage of the master instance’s drive (and any shared drive) and for the associated Elastic IP address.
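cfncluster also has list and status commands, which are handy for checking what is running and what state it is in:
cfncluster list
cfncluster status mycluster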
Login with SSH
To log in to the master node you need to use your EC2 SSH key.
ssh -i YOUR_KEY.pem ec2-user@YOUR_MASTER_IP_ADDRESS
This is simplified by configuring a host in your SSH config, $HOME/.ssh/config (Linux) or %USERPROFILE%\.ssh\config (Windows).
Host ec2master
Hostname YOUR_MASTER_IP_ADDRESS
User ec2-user
IdentityFile ~/.ssh/ec2-eu-west-1.pem
Replace YOUR_MASTER_IP_ADDRESS with the IP address of the master node. You can find this in the AWS console, or as the MasterServer output of cfncluster start.
Then you can:
ssh ec2master
Similarly, you can upload files onto the shared drive. The example below copies some_file.zip to the shared drive mounted at /shared.
scp some_file.zip ec2master:/shared/
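For whole directories, rsync over the same SSH host alias is convenient; the paths here are just examples:
rsync -avz demo/ ec2master:/shared/demo/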
Installing Miniconda with a post install script
CfnCluster uses a custom AMI based on Amazon Linux which includes the basics you need to manage the cluster. If you need additional software you will need to install it yourself. This can be done using yum on the master node, but those changes will not automatically propagate to the compute nodes where the tasks are executed. One solution would be to further customise the AMI to include the packages you need, but this is burdensome as you’ll need to rebuild the image every time cfncluster updates. Instead you can use the post install script to install additional software. This script is run automatically when instances start, after cfncluster has done the usual setup tasks.
Downloading and installing a large number of packages with conda can be slow. To reduce the time it takes instances to start up, a compromise is to install Miniconda once (on the master node), create a .tar.bz2 file of the install directory, then simply extract it onto new instances when they are created. This can be done using the post install script. The extraction happens when the worker nodes start, so if you need to update packages you will need to restart the nodes. There is probably a more elegant solution to this issue, but this was good enough for my needs.
The post install script is specified in the config with the post_install setting (URL_TO_POST_INSTALL_SCRIPT in the example above). I recommend keeping this script as short and simple as possible. If it raises an error, the instance will terminate before it finishes starting up.
An example post_install.sh is given below, which extracts miniconda3.tar.bz2 into /opt and symlinks it into the ec2-user home directory. This will be run on all of the workers.
#!/bin/bash
# runs on every instance at boot, after cfncluster's own setup
cd /opt
# create the install directory and hand it to ec2-user
mkdir -p /opt/miniconda3
chown ec2-user:ec2-user miniconda3
# unpack the pre-built Miniconda install from the shared drive into /opt
sudo -u ec2-user tar xfj /shared/miniconda3.tar.bz2
# expose it as ~/miniconda3 for ec2-user
sudo -u ec2-user ln -f -s /opt/miniconda3 /home/ec2-user/miniconda3
Once you’ve written the script, upload it to S3, make it publicly accessible, and put the URL into the config, e.g.:
https://s3-eu-west-1.amazonaws.com/snorf-eu-west-1/post_install.sh
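If you use the AWS CLI, the upload and the public permission can be done in one step (the bucket name is a placeholder for your own bucket):
aws s3 cp post_install.sh s3://YOUR_BUCKET/post_install.sh --acl public-read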
The script used to create miniconda3.tar.bz2 is given below.
#!/bin/bash
# quit immediately on error
set -e
# install miniconda
rm -rf /opt/miniconda3
mkdir -p /opt/miniconda3
ln -f -s /opt/miniconda3 ~/miniconda3
CONDA_SCRIPT="Miniconda3-latest-Linux-x86_64.sh"
wget https://repo.continuum.io/miniconda/${CONDA_SCRIPT} -O ${CONDA_SCRIPT}
bash ${CONDA_SCRIPT} -b -p /opt/miniconda3 -f
# activate root environment
source $HOME/miniconda3/bin/activate
# add channels
conda config --add channels conda-forge
# upgrade everything
conda upgrade --all --quiet --yes
# INSTALL YOUR CUSTOM MODULES HERE
# create tarball
conda clean --all --yes
cd /opt
chown -R ec2-user:ec2-user miniconda3
rm -rf /opt/miniconda3/pkgs/*
tar -cvjSf /shared/miniconda3.tar.bz2 miniconda3
Managing jobs with Sun Grid Engine (SGE), now Oracle Grid Engine
There are lots of tutorials already on the use of SGE, including the Beginner’s Guide to Oracle Grid Engine.
View nodes in the cluster:
qhost
An example output showing 3 running compute instances is shown below.
HOSTNAME          ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
--------------------------------------------------------------------------------------
global            -            -    -    -    -     -       -       -       -       -
ip-172-31-13-176  lx-amd64     1    1    1    1  0.01  995.4M  137.1M     0.0     0.0
ip-172-31-3-231   lx-amd64     1    1    1    1  0.04  995.4M  135.9M     0.0     0.0
ip-172-31-6-55    lx-amd64     1    1    1    1  0.01  995.4M  136.3M     0.0     0.0
To submit a job (which may have multiple tasks):
qsub myjob.sh
A job script is a shell script with some additional header information in the #$ comment lines. Don’t forget to activate your conda environment.
Set PYTHONUNBUFFERED=1 or call Python with the -u flag to ensure output is written to the log files immediately.
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
echo "Running on ${HOSTNAME}"
# activate conda
source ${HOME}/miniconda3/bin/activate
# run model
cd /shared/demo
python -u simple_model.py
Another example is given below, with 100 tasks (1-100) in a single job. The task ID is passed as an argument to the Python script. It’s the responsibility of the script to do something useful with this.
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -o /shared/out
#$ -t 1:100:1
#
echo "Task ${SGE_TASK_ID} running on ${HOSTNAME}"
# activate conda
source ${HOME}/miniconda3/bin/activate
# run model
cd /shared/demo
python -u simple_model.py --run-id ${SGE_TASK_ID}
To view details about jobs in the queue (both running and waiting):
qstat
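A couple of standard variations I find useful are the full queue-by-host view and watching the queue as jobs progress:
qstat -f
watch -n 30 qstat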
Some useful scripts
This section includes a couple of useful scripts for use from the master node.
Restarting Ganglia
Sometimes Ganglia crashes with the error message:
Error: There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused
You can use the following script to restart ganglia from the master node.
#!/bin/sh
# restart ganglia and httpd
# Error: There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused
service gmond stop
service gmetad stop
service httpd stop
service gmond start
service gmetad start
service httpd start
Delete all jobs
Sometimes things go wrong and you need to delete all of the jobs!
#!/bin/bash
# delete all jobs in the schedule
qstat | awk '{print $1}' | tail -n +3 | xargs qdel
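If all of the jobs belong to one user (ec2-user here), qdel’s user filter does the same thing more directly:
qdel -u ec2-user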