Getting started with Elastic Cloud Compute Cluster on EGI Cloud with the Command Line Interface

This guide shows how to deploy a sample SLURM cluster, which you can then easily adapt to create other kinds of clusters.

Getting started

We will use Docker to run EC3; direct installation is also possible and is described in the EC3 documentation. First, get the Docker image:

docker pull grycap/ec3

And check that you can run a simple command:

$ docker run grycap/ec3 list
 name  state  IP  nodes

For convenience we will create a directory to keep the deployment configuration and status together.

mkdir ec3-test
cd ec3-test

You can list the available templates for clusters with the templates command:

$ docker run grycap/ec3 templates
          name              kind                                         summary
          blcr            component Tool for checkpointing applications.
           sge              main    Install and configure a cluster SGE from distribution repositories.
          slurm             main    Install and configure a cluster using the grycap.slurm ansible role.
       slurm-repo           main    Install and configure a cluster SLURM from distribution repositories.

We will use the slurm template for configuring our cluster.

Site details

EC3 needs some information on the site that you are planning to use to deploy your cluster:

  1. authentication information
  2. network identifiers
  3. VM image identifiers

We will use egicli to discover all the needed details. Set your credentials (Check-in client id, client secret and refresh token) as shown in the authentication guide, then start by listing the available sites:

$ egicli endpoint list
Site                type                URL
------------------  ------------------  ------------------------------------------------
IFCA-LCG2           org.openstack.nova
IN2P3-IRES          org.openstack.nova
CETA-GRID           org.openstack.nova
UA-BITP             org.openstack.nova
RECAS-BARI          org.openstack.nova
CLOUDIFIN           org.openstack.nova
IISAS-GPUCloud      org.openstack.nova
IISAS-FedCloud      org.openstack.nova
UNIV-LILLE          org.openstack.nova
INFN-PADOVA-STACK   org.openstack.nova
CYFRONET-CLOUD      org.openstack.nova
SCAI                org.openstack.nova
CESNET-MCC          org.openstack.nova
INFN-CATANIA-STACK  org.openstack.nova
CESGA               org.openstack.nova
100IT               org.openstack.nova
NCG-INGRID-PT       org.openstack.nova    org.openstack.nova
Kharkov-KIPT-LCG2   org.openstack.nova

We will use CESGA; its URL is shown in the URL column of the listing. Get the available projects at the site:

$  egicli endpoint projects --site CESGA
id                                Name              enabled    site
--------------------------------  ----------------  ---------  ------
3a8e9d966e644405bf19b536adf7743d  True       CESGA

Using the project id and the site name, you can create the authorisation files needed for ec3:

egicli endpoint ec3 --site CESGA --project-id 3a8e9d966e644405bf19b536adf7743d

This will generate an auth.dat file with your credentials to access the site and a templates/refresh.radl file with a token refresh mechanism, which allows long-running clusters to be managed on the infrastructure.
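Before moving on, it can help to confirm that the generated files are actually in place. The following is a minimal sketch; check_ec3_files is a hypothetical helper name, not part of egicli or EC3, and it only tests that each file exists and is not empty.

```shell
# Hypothetical helper (not part of egicli or EC3): report every listed file
# that is missing or empty, and return non-zero if any was found.
check_ec3_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, run from the ec3-test directory:
#   check_ec3_files auth.dat templates/refresh.radl
```

The helper says nothing about whether the credentials inside auth.dat are still valid; it only catches the case where the generation step did not write the files where EC3 will look for them.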

Let's also set up a working OpenStack environment:

eval "$(egicli endpoint env --site CESGA --project-id 3a8e9d966e644405bf19b536adf7743d)"

Now, get the available networks. We will need both a public and a private network:

$ openstack network list
| ID                                   | Name                 | Subnets                              |
| 12ffb5f7-3e54-433f-86d0-8ffa43b52025 | | 754342b1-92df-4fc8-9499-2ee8b668141f |
| 6174db12-932f-4ee3-bb3e-7a0ca070d8f2 | public00             | 6af8c4f3-8e2e-405d-adea-c0b374c5bd99 |

Then, get the list of images available:

$  openstack image list
| ID                                   | Name                                                     | Status |
| 9d22cb3b-e6a3-4467-801a-a68214338b22 | Image for CernVM3 [CentOS/6/QEMU-KVM]                    | active |
| b03e8720-d88a-4939-b93d-23289b8eed6c | Image for CernVM4 [CentOS/7/QEMU-KVM]                    | active |
| 06cd7256-de22-4e9d-a1cf-997b5c44d938 | Image for Chipster [Ubuntu/16.04/KVM]                    | active |
| 8c4e2568-67a2-441a-b696-ac1b7c60de9c | Image for EGI CentOS 7 [CentOS/7/VirtualBox]             | active |
| abc5ebd8-f65c-4af9-8e54-a89e3b5587a3 | Image for EGI Docker [Ubuntu/18.04/VirtualBox]           | active |
| 22064e93-6af9-430b-94a1-e96473c5a72b | Image for EGI Ubuntu 16.04 LTS [Ubuntu/16.04/VirtualBox] | active |
| d5040b3e-ef33-4959-bb88-5505e229f579 | Image for EGI Ubuntu 18.04 [Ubuntu/18.04/VirtualBox]     | active |
| 79fadf3f-6092-4bb7-ab78-9a322f0aad33 | cirros                                                   | active |

For our example we will use the EGI CentOS 7 image, with id 8c4e2568-67a2-441a-b696-ac1b7c60de9c.
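Copying ids by hand from the listing is error-prone, so the lookup can be scripted. The sketch below assumes the table layout shown above (ID in the first column, Name in the second, separated by | characters); id_by_name is a hypothetical helper name.

```shell
# Hypothetical helper: read an `openstack ... list` table on stdin and print
# the ID of every row whose Name column is exactly equal to $1.
id_by_name() {
    awk -F'|' -v name="$1" '
        NF >= 3 {
            id = $2; n = $3
            gsub(/^[ \t]+|[ \t]+$/, "", id)   # trim padding around the ID
            gsub(/^[ \t]+|[ \t]+$/, "", n)    # trim padding around the Name
            if (n == name) print id
        }'
}

# Demo on two rows copied from the listing above.
id_by_name 'Image for EGI CentOS 7 [CentOS/7/VirtualBox]' <<'EOF'
| ID                                   | Name                                                     | Status |
| 8c4e2568-67a2-441a-b696-ac1b7c60de9c | Image for EGI CentOS 7 [CentOS/7/VirtualBox]             | active |
| 79fadf3f-6092-4bb7-ab78-9a322f0aad33 | cirros                                                   | active |
EOF
# prints 8c4e2568-67a2-441a-b696-ac1b7c60de9c
```

Against a live site the same function can read the command output directly, for example openstack image list | id_by_name 'cirros'. The same approach works for openstack network list, since its table also has ID and Name as the first two columns.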

Finally, with all this information we can create the images template for EC3 that specifies the site configuration for our deployment. Save this file as templates/centos.radl:

description centos-cesga (
    kind = 'images' and
    short = 'centos7-cesga' and
    content = 'CentOS7 image at CESGA'
)

network public (
    provider_id = 'public00' and
    outports contains '22/tcp'
)

network private (provider_id = '')

system front (
    cpu.arch = 'x86_64' and
    cpu.count >= 2 and
    memory.size >= 2048 and
    disk.0.os.name = 'linux' and
    disk.0.image.url = 'ost://<hostname>/8c4e2568-67a2-441a-b696-ac1b7c60de9c' and
    disk.0.os.credentials.username = 'centos'
)

system wn (
    cpu.arch = 'x86_64' and
    cpu.count >= 2 and
    memory.size >= 2048 and
    ec3_max_instances = 5 and # maximum number of worker nodes in the cluster
    disk.0.os.name = 'linux' and
    disk.0.image.url = 'ost://<hostname>/8c4e2568-67a2-441a-b696-ac1b7c60de9c' and
    disk.0.os.credentials.username = 'centos'
)

Note that we have used public00 as the public network and opened port 22 to allow ssh access. The private network is identified by its name as returned by openstack network list. Almost every deployment has two kinds of VMs: the front, which runs the batch system, and the wn, which executes the jobs. In our example, both use the same CentOS image, which is specified with the disk.0.image.url line. Its value has the form ost://<hostname>/<image id>: ost refers to OpenStack, <hostname> is the hostname of the URL obtained above with egicli endpoint list, and 8c4e2568-67a2-441a-b696-ac1b7c60de9c is the id of the image in OpenStack. The size of each VM is also specified, through the cpu.count and memory.size requirements.
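As an illustration of how the image URL is put together, the following sketch derives it from a site endpoint URL and an image id. ost_image_url is a hypothetical helper name, and keystone.example.org is a placeholder, not the real CESGA endpoint; use the URL reported by egicli endpoint list for your site.

```shell
# Hypothetical helper: build the value for disk.0.image.url from a site
# endpoint URL and an OpenStack image id.
ost_image_url() {
    endpoint="$1"
    image_id="$2"
    host="${endpoint#*://}"   # drop the scheme (https://)
    host="${host%%/*}"        # drop the path (/v3)
    host="${host%%:*}"        # drop the port (:5000), leaving only the hostname
    printf 'ost://%s/%s\n' "$host" "$image_id"
}

# Placeholder endpoint; replace it with the URL of your site.
ost_image_url "https://keystone.example.org:5000/v3" "8c4e2568-67a2-441a-b696-ac1b7c60de9c"
# prints ost://keystone.example.org/8c4e2568-67a2-441a-b696-ac1b7c60de9c
```

Only POSIX parameter expansion is used, so the function needs no external tools. It does not handle IPv6 literal addresses, which is assumed to be acceptable for site endpoints given as host names.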

Launch cluster

We are now ready to deploy the cluster with ec3 (this can take several minutes):

$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 launch mycluster slurm centos refresh -a auth.dat
Creating infrastructure
Infrastructure successfully created with ID: 74fde7be-edee-11ea-a6e9-da8b0bbd7c73
Front-end configured with IP
Transferring infrastructure
Front-end ready!

We can check the status of the deployment:

$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 list
   name       state           IP        nodes
 mycluster  configured    0

Once the cluster is configured, ssh to the front node. The is_cluster_ready command reports whether the cluster is fully configured:

$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 ssh mycluster
Warning: Permanently added '' (ECDSA) to the list of known hosts.
Last login: Thu Sep  3 14:07:46 2020 from
$ bash
cloudadm@slurmserver:~$ is_cluster_ready
Cluster configured!
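Since configuration can take several minutes, the readiness check can be repeated automatically instead of by hand. The following is a minimal sketch of a generic retry loop; retry is a hypothetical helper name, and the attempt count and delay are arbitrary choices.

```shell
# Hypothetical helper: run a command up to $1 times, waiting $2 seconds
# between attempts, and stop as soon as it succeeds.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while :; do
        "$@" && return 0                      # success: stop retrying
        [ "$i" -ge "$attempts" ] && return 1  # out of attempts: give up
        i=$((i + 1))
        sleep "$delay"
    done
}
```

On the front node it could be used as retry 30 60 sh -c 'is_cluster_ready | grep -q "Cluster configured!"', which checks once a minute for up to half an hour. Matching on the message shown above, rather than on the exit status of is_cluster_ready, is a deliberate choice: the exit status of that command is not documented here, so it is not relied upon.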

EC3 will deploy CLUES, a cluster management system that will power on/off nodes as needed depending on the load. Initially all the nodes will be off:

cloudadm@slurmserver:~$ clues status
node                          state    enabled   time stable   (cpu,mem) used   (cpu,mem) total
wn1                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn2                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn3                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn4                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn5                             off    enabled     00h03'55"      0,0.0            1,1073741824.0

SLURM will also report nodes as down:

cloudadm@slurmserver:~$ sinfo
debug*       up   infinite      5  down* wn[1-5]

When we submit a first job, some nodes will be powered on to meet the request. You can also start them manually with clues poweron.

cloudadm@slurmserver:~$ srun hostname
srun: Required node not available (down, drained or reserved)
srun: job 2 queued and waiting for resources
srun: job 2 has been allocated resources
cloudadm@slurmserver:~$ clues status
node                          state    enabled   time stable   (cpu,mem) used   (cpu,mem) total
wn1                            idle    enabled     00h07'45"      0,0.0            1,1073741824.0
wn2                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
wn3                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
wn4                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
wn5                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
cloudadm@slurmserver:~$ sinfo
debug*       up   infinite      4  down* wn[2-5]
debug*       up   infinite      1   idle wn1

Destroying the cluster

Once you are done with the cluster and want to destroy it, you can use the destroy command. If your cluster was created more than one hour ago, your credentials to access the site will have expired and need to be refreshed first with egicli endpoint ec3-refresh:

$ egicli endpoint ec3-refresh # refresh your auth.dat
$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 list # list your clusters
   name       state           IP        nodes
 mycluster  configured    0
$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 destroy mycluster -a auth.dat -y
WARNING: you are going to delete the infrastructure (including frontend and nodes).
Success deleting the cluster!
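To avoid a failed destroy because of expired credentials, the age of auth.dat can be checked first. This sketch assumes that the modification time of auth.dat reflects the last time the credentials were generated or refreshed; auth_is_stale is a hypothetical helper name. It relies on find -mmin, which GNU and BSD find provide but which is not part of POSIX.

```shell
# Hypothetical helper: succeed when the file is missing or was last modified
# more than 60 minutes ago (assumption: mtime == time of the last refresh).
auth_is_stale() {
    file="${1:-auth.dat}"
    [ ! -e "$file" ] || [ -n "$(find "$file" -mmin +60)" ]
}

if auth_is_stale auth.dat; then
    echo "auth.dat is missing or older than one hour; run: egicli endpoint ec3-refresh"
fi
```

The check is only a heuristic: a recent modification time does not prove that the token inside the file is still valid, so refreshing before destroy remains the safe default for clusters that have been running for a while.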

Last modified November 22, 2020: EC3 review (#151) (88ebfa6)