EC3 CLI

Getting started with the Elastic Cloud Computing Cluster (EC3) on the EGI Cloud using the Command Line Interface

This page documents how to deploy a sample SLURM cluster, which you can then easily adapt to create other kinds of clusters.

Getting started

We will use Docker to run EC3; a direct installation is also possible and is described in the EC3 documentation. First, get the Docker image:

docker pull grycap/ec3

And check that you can run a simple command:

$ docker run grycap/ec3 list
 name  state  IP  nodes
------------------------

For convenience we will create a directory to keep the deployment configuration and status together.

mkdir ec3-test
cd ec3-test
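
The docker run commands that manage a cluster below mount this directory into the container (-v $PWD:/root/ -w /root) so that EC3 stores the cluster state and finds our templates here; for example, the list command then becomes:

docker run -it -v $PWD:/root/ -w /root grycap/ec3 list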

You can list the available templates for clusters with the templates command:

$ docker run grycap/ec3 templates
          name              kind                                         summary
----------------------------------------------------------------------------------------------------------------------
          blcr            component Tool for checkpointing applications.
[...]
           sge              main    Install and configure a cluster SGE from distribution repositories.
          slurm             main    Install and configure a cluster using the grycap.slurm ansible role.
       slurm-repo           main    Install and configure a cluster SLURM from distribution repositories.
[...]

We will use the slurm template for configuring our cluster.

Site details

EC3 needs some information about the site where you plan to deploy your cluster:

  1. authentication information
  2. network identifiers
  3. VM image identifiers

We will use egicli to discover all the needed details. First, set your credentials (Check-in client id, client secret and refresh token) as shown in the authentication guide; a minimal sketch follows.
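
For illustration, assuming the environment variable names from the authentication guide (replace the placeholders with your own values):

export CHECKIN_CLIENT_ID=<your Check-in client id>
export CHECKIN_CLIENT_SECRET=<your Check-in client secret>
export CHECKIN_REFRESH_TOKEN=<your Check-in refresh token>

Then start by listing the available sites: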

$ egicli endpoint list
Site                type                URL
------------------  ------------------  ------------------------------------------------
IFCA-LCG2           org.openstack.nova  https://api.cloud.ifca.es:5000/v3/
IN2P3-IRES          org.openstack.nova  https://sbgcloud.in2p3.fr:5000/v3
CETA-GRID           org.openstack.nova  https://controller.ceta-ciemat.es:5000/v3/
UA-BITP             org.openstack.nova  https://openstack.bitp.kiev.ua:5000/v3
RECAS-BARI          org.openstack.nova  https://cloud.recas.ba.infn.it:5000/v3
CLOUDIFIN           org.openstack.nova  https://cloud-ctrl.nipne.ro:443/v3
IISAS-GPUCloud      org.openstack.nova  https://keystone3.ui.savba.sk:5000/v3/
IISAS-FedCloud      org.openstack.nova  https://nova.ui.savba.sk:5000/v3/
UNIV-LILLE          org.openstack.nova  https://thor.univ-lille.fr:5000/v3
INFN-PADOVA-STACK   org.openstack.nova  https://egi-cloud.pd.infn.it:443/v3
CYFRONET-CLOUD      org.openstack.nova  https://panel.cloud.cyfronet.pl:5000/v3/
SCAI                org.openstack.nova  https://fc.scai.fraunhofer.de:5000/v3
CESNET-MCC          org.openstack.nova  https://identity.cloud.muni.cz/v3
INFN-CATANIA-STACK  org.openstack.nova  https://stack-server.ct.infn.it:35357/v3
CESGA               org.openstack.nova  https://fedcloud-osservices.egi.cesga.es:5000/v3
100IT               org.openstack.nova  https://cloud-egi.100percentit.com:5000/v3/
NCG-INGRID-PT       org.openstack.nova  https://stratus.ncg.ingrid.pt:5000/v3
fedcloud.srce.hr    org.openstack.nova  https://cloud.cro-ngi.hr:5000/v3/
Kharkov-KIPT-LCG2   org.openstack.nova  https://cloud.kipt.kharkov.ua:5000/v3

We will use CESGA, whose URL is https://fedcloud-osservices.egi.cesga.es:5000/v3. Get the available projects at the site:

$  egicli endpoint projects --site CESGA
id                                Name              enabled    site
--------------------------------  ----------------  ---------  ------
3a8e9d966e644405bf19b536adf7743d  vo.access.egi.eu  True       CESGA

Using the project id and the site name, you can create the authorisation files needed for ec3:

egicli endpoint ec3 --site CESGA --project-id 3a8e9d966e644405bf19b536adf7743d

This will generate an auth.dat file with your credentials to access the site, and a templates/refresh.radl with a token refresh mechanism that allows long-running clusters to be managed on the infrastructure.

Let’s also get a working OpenStack setup:

eval "$(egicli endpoint env --site CESGA --project-id 3a8e9d966e644405bf19b536adf7743d)"

Now get the available networks; we will need both a public and a private network:

$ openstack network list
+--------------------------------------+----------------------+--------------------------------------+
| ID                                   | Name                 | Subnets                              |
+--------------------------------------+----------------------+--------------------------------------+
| 12ffb5f7-3e54-433f-86d0-8ffa43b52025 | net-vo.access.egi.eu | 754342b1-92df-4fc8-9499-2ee8b668141f |
| 6174db12-932f-4ee3-bb3e-7a0ca070d8f2 | public00             | 6af8c4f3-8e2e-405d-adea-c0b374c5bd99 |
+--------------------------------------+----------------------+--------------------------------------+
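
If it is not obvious which network is the public one, you can inspect its details (the exact fields shown depend on your OpenStack client version):

openstack network show public00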

Then, get the list of images available:

$  openstack image list
+--------------------------------------+----------------------------------------------------------+--------+
| ID                                   | Name                                                     | Status |
+--------------------------------------+----------------------------------------------------------+--------+
| 9d22cb3b-e6a3-4467-801a-a68214338b22 | Image for CernVM3 [CentOS/6/QEMU-KVM]                    | active |
| b03e8720-d88a-4939-b93d-23289b8eed6c | Image for CernVM4 [CentOS/7/QEMU-KVM]                    | active |
| 06cd7256-de22-4e9d-a1cf-997b5c44d938 | Image for Chipster [Ubuntu/16.04/KVM]                    | active |
| 8c4e2568-67a2-441a-b696-ac1b7c60de9c | Image for EGI CentOS 7 [CentOS/7/VirtualBox]             | active |
| abc5ebd8-f65c-4af9-8e54-a89e3b5587a3 | Image for EGI Docker [Ubuntu/18.04/VirtualBox]           | active |
| 22064e93-6af9-430b-94a1-e96473c5a72b | Image for EGI Ubuntu 16.04 LTS [Ubuntu/16.04/VirtualBox] | active |
| d5040b3e-ef33-4959-bb88-5505e229f579 | Image for EGI Ubuntu 18.04 [Ubuntu/18.04/VirtualBox]     | active |
| 79fadf3f-6092-4bb7-ab78-9a322f0aad33 | cirros                                                   | active |
+--------------------------------------+----------------------------------------------------------+--------+

For our example we will use the EGI CentOS 7 image with id 8c4e2568-67a2-441a-b696-ac1b7c60de9c.
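
EC3 will select a flavor that satisfies the cpu.count and memory.size requirements in the template below, so it is worth checking which flavors the site offers:

openstack flavor list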

Finally, with all this information we can create the images template for EC3 that specifies the site configuration for our deployment. Save this file as templates/centos.radl:

description centos-cesga (
    kind = 'images' and
    short = 'centos7-cesga' and
    content = 'CentOS7 image at CESGA'
)

network public (
    provider_id = 'public00' and
    outports contains '22/tcp'
)

network private (provider_id = 'net-vo.access.egi.eu')

system front (
    cpu.arch = 'x86_64' and
    cpu.count >= 2 and
    memory.size >= 2048 and
    disk.0.os.name = 'linux' and
    disk.0.image.url = 'ost://fedcloud-osservices.egi.cesga.es/8c4e2568-67a2-441a-b696-ac1b7c60de9c' and
    disk.0.os.credentials.username = 'centos'
)

system wn (
    cpu.arch = 'x86_64' and
    cpu.count >= 2 and
    memory.size >= 2048 and
    ec3_max_instances = 5 and # maximum number of worker nodes in the cluster
    disk.0.os.name = 'linux' and
    disk.0.image.url = 'ost://fedcloud-osservices.egi.cesga.es/8c4e2568-67a2-441a-b696-ac1b7c60de9c' and
    disk.0.os.credentials.username = 'centos'
)

Note that we have used public00 as the public network and opened port 22 to allow SSH access. The private network uses net-vo.access.egi.eu. Almost every deployment has two kinds of VMs: the front, which runs the batch system, and the wn, which will execute the jobs. In our example, both use the same CentOS image, specified with the disk.0.image.url = 'ost://fedcloud-osservices.egi.cesga.es/8c4e2568-67a2-441a-b696-ac1b7c60de9c' line: ost refers to OpenStack, fedcloud-osservices.egi.cesga.es is the hostname of the URL obtained above with egicli endpoint list, and 8c4e2568-67a2-441a-b696-ac1b7c60de9c is the id of the image in OpenStack. The size of the VMs (CPU and memory) is also specified.
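
If your cluster needs to expose additional services on the front node, you can extend the outports list of the public network. A sketch, assuming you also want to open port 80 (the comma-separated list follows the format used in the EC3 sample templates):

network public (
    provider_id = 'public00' and
    outports contains '22/tcp,80/tcp'
)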

Launch cluster

We are now ready to deploy the cluster with EC3 (this can take several minutes):

$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 launch mycluster slurm centos refresh -a auth.dat
Creating infrastructure
Infrastructure successfully created with ID: 74fde7be-edee-11ea-a6e9-da8b0bbd7c73
Front-end configured with IP 193.144.46.234
Transferring infrastructure
Front-end ready!
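
EC3 also has a show subcommand that prints the RADL description stored for the cluster, which can be handy for debugging; a sketch, assuming it is available in the grycap/ec3 image you pulled:

docker run -it -v $PWD:/root/ -w /root grycap/ec3 show mycluster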

We can check the status of the deployment:

$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 list
   name       state           IP        nodes
----------------------------------------------
 mycluster  configured  193.144.46.234    0

And once configured, ssh to the front node. The is_cluster_ready command will report whether the cluster is fully configured or not:

$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 ssh mycluster
Warning: Permanently added '193.144.46.234' (ECDSA) to the list of known hosts.
Last login: Thu Sep  3 14:07:46 2020 from torito.i3m.upv.es
$ bash
cloudadm@slurmserver:~$ is_cluster_ready
Cluster configured!
cloudadm@slurmserver:~$

EC3 deploys CLUES, a cluster management system that powers nodes on and off as needed depending on the load. Initially all the nodes will be off, as reported by clues status:

cloudadm@slurmserver:~$ clues status
node                          state    enabled   time stable   (cpu,mem) used   (cpu,mem) total
-----------------------------------------------------------------------------------------------
wn1                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn2                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn3                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn4                             off    enabled     00h03'55"      0,0.0            1,1073741824.0
wn5                             off    enabled     00h03'55"      0,0.0            1,1073741824.0

SLURM will also report the nodes as down:

cloudadm@slurmserver:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      5  down* wn[1-5]

When we submit a first job, some nodes will be powered on to meet the request (you can also start them manually with clues poweron):

cloudadm@slurmserver:~$ srun hostname
srun: Required node not available (down, drained or reserved)
srun: job 2 queued and waiting for resources
srun: job 2 has been allocated resources
wn1.localdomain
cloudadm@slurmserver:~$ clues status
node                          state    enabled   time stable   (cpu,mem) used   (cpu,mem) total
-----------------------------------------------------------------------------------------------
wn1                            idle    enabled     00h07'45"      0,0.0            1,1073741824.0
wn2                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
wn3                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
wn4                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
wn5                             off    enabled     00h52'25"      0,0.0            1,1073741824.0
cloudadm@slurmserver:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      4  down* wn[2-5]
debug*       up   infinite      1   idle wn1
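
For illustration, a minimal batch job (plain SLURM, nothing EC3-specific) that asks for two nodes, so CLUES has to power on an additional worker; the script name test.slurm is arbitrary:

cat > test.slurm << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
srun hostname
EOF
sbatch test.slurm
squeue         # the job waits in the queue while CLUES powers on another node
sinfo          # a second worker node should eventually appear as idle or allocated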

Destroying the cluster

Once you are done with the cluster and want to destroy it, you can use the destroy command. If your cluster was created more than one hour ago, your credentials to access the site will have expired and need to be refreshed first with egicli endpoint ec3-refresh:

$ egicli endpoint ec3-refresh # refresh your auth.dat
$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 list # list your clusters
   name       state           IP        nodes
----------------------------------------------
 mycluster  configured  193.144.46.234    0
$ docker run -it -v $PWD:/root/ -w /root grycap/ec3 destroy mycluster -a auth.dat -y
WARNING: you are going to delete the infrastructure (including frontend and nodes).
Success deleting the cluster!
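
You can list the clusters again to confirm that mycluster is gone:

docker run -it -v $PWD:/root/ -w /root grycap/ec3 list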
