EGI Architecture

The architecture of the EGI Federation

The EGI Federated Cloud (FedCloud) is a multi-national cloud system that integrates community, private and/or public clouds into a scalable computing platform for research. The Federation pools resources from a heterogeneous set of cloud providers using a single authentication and authorisation framework that allows workloads to be ported across multiple providers and enables bringing computing to data. The current implementation focuses on Infrastructure-as-a-Service (IaaS) services, but the same approach can be applied to the Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS) layers.

Each resource centre of the federated infrastructure operates a Cloud Management Framework (CMF) according to its own preferences and constraints and joins the federation by integrating this CMF with components of the EGI service portfolio. CMFs must at least be integrated with the EGI Authentication and Authorization Infrastructure (AAI) so users can access services with a single identity. Integration with other components, and the APIs to be provided, are agreed with the community the resource centre provides services to.

EGI follows a Service Integration and Management (SIAM) approach to manage the federation, with processes that cover the different aspects of IT Service Management. Providers in the federation keep complete control of their services and resources. EGI creates Virtual Organizations (VOs) for each research community, and EGI VO Operational Level Agreements (OLAs) establish a reliable, trust-based communication channel between the community and the providers by agreeing on the services, their levels and the types of support.

Federated IaaS

The EGI FedCloud IaaS resource centres deploy a Cloud Management Framework (CMF) that provides users with an API-based service for managing Virtual Machines (VMs), the associated Block Storage that enables persistence, and the Networks that connect the VMs to each other and to third-party resources.

The IaaS federation is a thin layer that brings the providers together with:

The IaaS capabilities (VM, block storage, network management, etc.) must be provided via community agreed APIs (OpenStack is supported at the moment) that allow integration with EGI Check-in for authentication and authorisation of users.

Users and Community platforms built on top of the EGI IaaS can interact with the cloud providers at three different layers:

  • Directly using the IaaS APIs or CLIs to manage individual resources. This option is recommended for preexisting use cases with requirements on specific APIs.
  • Using federated access tools that allow managing the complexity of dealing with different providers in a uniform way. These tools include:
    • Provisioning systems allow users to define infrastructure as code, then manage and combine resources from different providers, thus enabling the portability of application deployments between them (e.g. Infrastructure Manager or Terraform), and
    • Cloud brokers provide matchmaking for workloads to available providers (e.g. the INDIGO-DataCloud Orchestrator).
  • Using the VMOps dashboard.

EGI provides ready-to-use software components to enable the federation for OpenStack. These components rely on public APIs of the IaaS system and use Check-in accounts for authenticating with the provider.

Implementation

Authentication and authorization

Federated identity ensures that users of the federation can use a single account for accessing the resources.

OpenID Connect

Providers of the EGI Cloud support authentication with OAuth2 tokens issued by the Check-in OpenID Connect identity provider. Support builds on the AAI guide for SPs, with detailed configuration provided in the EGI IaaS Service providers documentation.

The integration relies on the OpenStack Keystone OS-FEDERATION API.
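As an illustration of this flow, the sketch below builds the HTTP request that exchanges a Check-in access token for an unscoped Keystone token through the OS-FEDERATION API. It only constructs the request and does not send it. The Keystone URL, the identity provider name `egi.eu` and the protocol name `openid` are assumptions used for illustration; each provider chooses these names when configuring Keystone, and the function name is hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request


def build_federated_auth_request(keystone_url, access_token,
                                 idp="egi.eu", protocol="openid"):
    """Build (but do not send) a Keystone OS-FEDERATION token request.

    `idp` and `protocol` are the names registered in the provider's
    Keystone; the defaults are assumptions, not values fixed by EGI.
    """
    url = (
        f"{keystone_url.rstrip('/')}/v3/OS-FEDERATION/identity_providers/"
        f"{quote(idp, safe='')}/protocols/{quote(protocol, safe='')}/auth"
    )
    # The Check-in access token travels as a standard OAuth2 bearer token.
    return Request(url, method="POST",
                   headers={"Authorization": f"Bearer {access_token}"})


# Hypothetical endpoint and placeholder token, for illustration only.
req = build_federated_auth_request("https://keystone.example.org:5000",
                                   "ACCESS_TOKEN")
print(req.get_method(), req.full_url)
```

When such a request succeeds, Keystone returns the unscoped token in the `X-Subject-Token` response header; that token can then be scoped to a project with the regular Keystone token API.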

Information discovery

The Configuration Database contains the list of resource centres and their endpoints, while the AppDB Information System collects this information in a central service for discovery, providing a real-time view of the actual capabilities of federation participants that can be used by both human users and machine services.

Configuration Database

The EGI Configuration Database is used to catalogue the static information of the production infrastructure topology (e.g. the list of resource centres and their endpoints).

To allow resource providers to expose IaaS federation endpoints, the following service types are available:

  • org.openstack.horizon
  • org.openstack.nova
  • org.openstack.swift
  • eu.egi.cloud.accounting
  • eu.egi.cloud.vm-metadata.marketplace

All providers must enter cloud service endpoints into the Configuration Database to enable integration with EGI.
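A minimal sketch of how a site could check its endpoint declarations against the service types listed above before registering them. The `(service_type, url)` pair layout and the function name are hypothetical and are not part of the Configuration Database API.

```python
# Service types listed above for IaaS federation endpoints.
IAAS_SERVICE_TYPES = frozenset({
    "org.openstack.horizon",
    "org.openstack.nova",
    "org.openstack.swift",
    "eu.egi.cloud.accounting",
    "eu.egi.cloud.vm-metadata.marketplace",
})


def unknown_service_types(endpoints):
    """Return declared service types that are not IaaS federation types.

    `endpoints` is a hypothetical list of (service_type, url) pairs.
    """
    declared = {service_type for service_type, _url in endpoints}
    return sorted(declared - IAAS_SERVICE_TYPES)


# Example with made-up URLs; the second entry has a typo in its type.
endpoints = [
    ("org.openstack.nova", "https://cloud.example.org:8774/v2.1"),
    ("org.openstack.keystone-typo", "https://cloud.example.org:5000/v3"),
]
print(unknown_service_types(endpoints))
```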

The Cloud Info Provider extracts information from the resource centres using their native APIs and formats it following GLUE, an OGF recommended standard. This information is pushed to the Argo Messaging System and consumed by AppDB to provide a central information discovery service that aggregates several other sources of information about the infrastructure.

Virtual Machine Image management

In a distributed, federated IaaS service, users need solutions for efficiently managing and distributing their VM images across multiple resource providers. EGI provides a catalogue of VM images (VMIs) that allows any user to share their VMI, and communities to select those VMIs relevant for distribution across providers. These images are automatically replicated at the providers supporting the community and converted as needed to ensure the correct instantiation when used.

AppDB includes a Virtual Appliance Marketplace supporting Virtual Appliances (VAs), which are clean and lean virtual machine images designed to run on a virtualisation platform, that provide a software solution out-of-the-box, ready to be used with minimal or no set-up.

AppDB allows representatives of research communities (VOs) to generate a VM image list that resource centres subscribe to. The subscription enables the periodic download, conversion and storage of those images to the image repository of the indicated resource centres, using HEPiX image list format. cloudkeeper provides this automated synchronisation between AppDB and the cloud provider.

Accounting

Federated Accounting provides an integrated view of resource/service usage: it pulls together usage information from the federated sites and services, integrates the data, and presents them so that both individual users and whole communities can monitor their own resource/service usage across the whole federation.

Usage of resources is gathered centrally in the EGI Accounting repository and is available for visualisation at the EGI Accounting portal.

Cloud Usage Record

The federated cloud task force has agreed on a Cloud Usage Record, which inherits from the OGF Usage Record. This record defines the data that resource providers must send to EGI’s central Accounting repository.

Version 0.4 of the Cloud Accounting Usage Record was agreed at the FedCloud Face to Face in Amsterdam in January 2015. A summary table of the format is shown below:

| Cloud Usage Record Property | Type | Null | Definition |
| --- | --- | --- | --- |
| VMUUID | varchar(255) | No | Virtual Machine's Universally Unique Identifier: concatenation of CurrentTime, SiteName and MachineName |
| SiteName | varchar(255) | No | GOCDB SiteName - GOCDB now has cloud service types and a cloud-only site is allowed. |
| CloudComputeService (NEW) | varchar(255) | | Name identifying cloud resource within the site. Allows multiple cloud resources within a site, i.e. a level of granularity. |
| MachineName | varchar(255) | No | VM ID - the site name for the VM |
| LocalUserId | varchar(255) | | Local username |
| LocalGroupId | varchar(255) | | Local group name |
| GlobalUserName | varchar(255) | | Global identity of user (certificate DN) |
| FQAN | varchar(255) | | Use if VOs part of authorization mechanism |
| Status | varchar(255) | | Completion status - completed, started or suspended |
| StartTime | datetime | | Must be set when Status = started |
| EndTime | datetime | | Set to NULL until Status = completed |
| SuspendDuration | datetime | | Set when Status = suspended (Timestamp) |
| WallDuration | int | | WallClock time - actual time used |
| CpuDuration | int | | CPU time consumed (Duration) |
| CpuCount | int | | Number of CPUs allocated |
| NetworkType | varchar(255) | | Needs clarifying |
| NetworkInbound | int | | GB received |
| NetworkOutbound | int | | GB sent |
| PublicIPCount (NEW) | int | | Number of public IP addresses assigned to VM. Not used. |
| Memory | int | | Memory allocated to the VM |
| Disk | int | | Size in GB allocated to the VM |
| BenchmarkType (NEW) | varchar(255) | | Name of benchmark used for normalization of times (e.g. HEPSPEC06) |
| Benchmark (NEW) | Decimal | | Value of benchmark of VM using ServiceLevelType benchmark |
| StorageRecordId | varchar(255) | | Link to other associated storage record. Need to check feasibility. |
| ImageId | varchar(255) | | Every image has a unique ID associated with it. For images from the EGI FedCloud AppDB this should be VMCATCHER_EVENT_AD_MPURI; for images from other repositories it should be a vmcatcher equivalent; for local images, the local identifier of the image. |
| CloudType | varchar(255) | | Type of cloud infrastructure: OpenNebula; OpenStack; Synnefo; etc. |
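The constraints stated in the table (the three fields marked Null = No, and the rules tying StartTime and EndTime to Status) can be checked before a record is sent. The sketch below is one possible check, assuming the record is held as a Python dict keyed by the property names above; the function name and the sample values are hypothetical, and this is not code from APEL or cASO.

```python
from datetime import datetime, timezone

# Fields marked Null = No in the table above.
REQUIRED_FIELDS = ("VMUUID", "SiteName", "MachineName")
VALID_STATUSES = ("started", "completed", "suspended")


def validate_cloud_record(record):
    """Return a list of problems found in a Cloud Usage Record dict."""
    problems = [f"{name} is required" for name in REQUIRED_FIELDS
                if not record.get(name)]
    status = record.get("Status")
    if status is not None and status not in VALID_STATUSES:
        problems.append(f"unknown Status {status!r}")
    if status == "started" and record.get("StartTime") is None:
        problems.append("StartTime must be set when Status = started")
    if status in ("started", "suspended") and record.get("EndTime") is not None:
        problems.append("EndTime must stay NULL until Status = completed")
    return problems


# Made-up sample record for a VM that has just started.
record = {
    "VMUUID": "2024-01-01T00:00:00Z EXAMPLE-SITE vm-001",
    "SiteName": "EXAMPLE-SITE",
    "MachineName": "vm-001",
    "Status": "started",
    "StartTime": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "EndTime": None,
}
print(validate_cloud_record(record))  # prints []
```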

Public IP Usage Record

The FedCloud task force has agreed on an IP Usage Record. The format uses many of the same fields as the Cloud Usage Record. The Usage Record should be a "snapshot" of the number of IPs currently assigned to a user. A table defining v0.2 of the format is shown below:

| Public IP Usage Record Property | Type | Null | Definition | Notes |
| --- | --- | --- | --- | --- |
| MeasurementTime | datetime | No | The time the usage was recorded. | In the message format, must be a UNIX timestamp, i.e. the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970. |
| SiteName | varchar(255) | No | The GOCDB site assigning the IP | |
| CloudComputeService | varchar(255) | Yes | See Cloud Usage Record | |
| CloudType | varchar(255) | No | See Cloud Usage Record | |
| LocalUser | varchar(255) | No | See Cloud Usage Record | |
| LocalGroup | varchar(255) | No | See Cloud Usage Record | |
| GlobalUserName | varchar(255) | No | See Cloud Usage Record | |
| FQAN | varchar(255) | No | See Cloud Usage Record | |
| IPVersion | byte | No | 4 or 6 | |
| IPCount | int(11) | No | The number of IP addresses of IPVersion this user currently has assigned to them | |

A JSON schema defining a valid Public IP Usage message can be found at: https://github.com/apel/apel/blob/9476bd86424f6162c3b87b6daf6b4270ceb8fea6/apel/db/__init__.py
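To make the "snapshot" idea concrete, the sketch below assembles a single Public IP Usage Record using the property names from the table, with MeasurementTime as a UNIX timestamp. It builds one record only; the surrounding message envelope is defined by the JSON schema and is not reproduced here. The function name and all sample values are hypothetical.

```python
import json
import time


def public_ip_record(site, cloud_type, local_user, local_group,
                     global_user, fqan, ip_version, ip_count,
                     cloud_compute_service=None, measurement_time=None):
    """Build one Public IP Usage Record dict from the table's fields."""
    if ip_version not in (4, 6):
        raise ValueError("IPVersion must be 4 or 6")
    if ip_count < 0:
        raise ValueError("IPCount cannot be negative")
    if measurement_time is None:
        measurement_time = time.time()  # snapshot taken now
    return {
        "MeasurementTime": int(measurement_time),  # seconds since the epoch
        "SiteName": site,
        "CloudComputeService": cloud_compute_service,  # the only nullable field
        "CloudType": cloud_type,
        "LocalUser": local_user,
        "LocalGroup": local_group,
        "GlobalUserName": global_user,
        "FQAN": fqan,
        "IPVersion": ip_version,
        "IPCount": ip_count,
    }


# Made-up values for illustration.
snapshot = public_ip_record(
    "EXAMPLE-SITE", "OpenStack", "user01", "group01",
    "example-global-user", "/vo.example.org/Role=NULL/Capability=NULL",
    ip_version=4, ip_count=3, measurement_time=1700000000,
)
print(json.dumps(snapshot, indent=2))
```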

GPU Usage Record

The FedCloud task force has agreed on a GPU Usage Record. The format uses many of the same fields as the Cloud Usage Record. A table defining Draft 4 (24/02/2021) is shown below:

| GPU Usage Record Property | Type | Null | Definition |
| --- | --- | --- | --- |
| MeasurementMonth | int | No | The month/year the reported usage should be assigned to. If the month/year is the current month/year, the usage should be up to the point of reporting. |
| MeasurementYear | int | No | |
| AssociatedRecordType | varchar(255) | No | The context in which the reported usage was used, e.g. "cloud" for an accelerator attached to a VM. |
| AssociatedRecord | varchar(255) | No | VMUUID if AssociatedRecordType is "cloud" |
| GlobalUserName | varchar(255) | Yes | See the definition of your AssociatedRecordType |
| FQAN | varchar(255) | No | See the definition of your AssociatedRecordType |
| SiteName | varchar(255) | No | See the definition of your AssociatedRecordType |
| Count | decimal | No | A count of the Accelerators attached to the VM. At the moment Accelerators are not shared among VMs, but this will change when Accelerator virtualization is applied, so the field is decimal rather than integer (e.g. Count = 0.5 when an Accelerator is shared between two VMs). |
| Cores | int(11) | Yes | Total number of cores. For example, if an Accelerator has 64 cores and a VM has 2 such Accelerators attached, the record reports Count=2 and Cores=128. |
| ActiveDuration | int(11) | Yes | Actual usage duration of the Accelerator, in seconds, for the given month/year (for systems that can report actual usage). At the moment ActiveDuration is the same as AvailableDuration because of limitations of the technologies currently used (Accelerator utilization cannot be obtained from outside the VM, and an Accelerator cannot be hot-plugged into a running VM), but this may change in the near future, so the two fields are kept separate. Set to AvailableDuration if ActiveDuration is omitted from the record. |
| AvailableDuration | int(11) | No | Time the accelerator was available, in seconds, for the given month/year: the (wall) time that a GPU was attached to a VM. |
| BenchmarkType | varchar(255) | Yes | Name of benchmark used for normalization of times |
| Benchmark | decimal | Yes | Value of benchmark of Accelerator |
| Type | varchar(255) | No | High level description of accelerator, e.g. GPU, FPGA, Other |
| Model | varchar(255) | Yes | Model number, spec, or other attribute by which two Accelerators with the same number of cores might differ |
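The arithmetic behind Count, Cores and the two duration fields can be sketched as follows. The function and parameter names are hypothetical. The draft does not say how Cores should be reported for a fractional Count, so the example sticks to whole accelerators.

```python
def gpu_usage_fields(count, cores_per_accelerator, available_seconds,
                     active_seconds=None):
    """Derive Count, Cores and the duration fields of a GPU Usage Record.

    Cores is the total over all attached accelerators. ActiveDuration
    falls back to AvailableDuration when the site cannot measure it,
    as described in the table above.
    """
    if active_seconds is None:
        active_seconds = available_seconds
    return {
        "Count": count,
        "Cores": count * cores_per_accelerator,
        "AvailableDuration": available_seconds,
        "ActiveDuration": active_seconds,
    }


# The example from the table: two 64-core accelerators attached to one VM,
# available for one hour, with no separate utilisation measurement.
fields = gpu_usage_fields(count=2, cores_per_accelerator=64,
                          available_seconds=3600)
print(fields["Cores"], fields["ActiveDuration"])  # prints: 128 3600
```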

APEL and accounting portal

Once generated, records are delivered to the central accounting repository using APEL SSM (Secure STOMP Messenger). SSM client packages can be obtained at https://apel.github.io. A Cloud Accounting Summary Usage Record has also been defined: summaries are created daily from all the accounting records received from the Resource Providers and are sent to the EGI Accounting Portal. The Accounting Portal also runs SSM to receive these summaries and provides a web view of the accounting data received from the Resource Providers.

cASO delivers an implementation of the extractor probes for OpenStack.

Monitoring

The endpoints published in the Configuration Database are monitored via ARGO. Specific probes to check functionality and availability of services must be provided by service developers.

The current set of probes used for monitoring IaaS resources consists of:

  • Accounting probe (eu.egi.cloud.APEL-Pub): Checks if the cloud resource is publishing data to the Accounting repository
  • TCP checks (org.nagios.Broker-TCP, org.nagios.CDMI-TCP, and org.nagios.CloudBDII-Check): Basic TCP checks for services.
  • VM Marketplace probe (eu.egi.cloud.AppDB-Update): gets a predetermined image list from AppDB and checks its update interval.
  • PERUN probe (eu.egi.cloud.Perun-Check): connects to the server and checks its status using the internal PERUN interface.
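To show what a basic TCP check amounts to, here is a minimal sketch in the style of a Nagios plugin, where exit code 0 means OK and 2 means CRITICAL. It is not the code of the probes listed above; the function name, host and port are placeholders.

```python
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def tcp_check(host, port, timeout=5.0):
    """Return (exit_code, message) for a basic TCP reachability check."""
    try:
        # Success only means the port accepts connections; it says
        # nothing about whether the service behind it works correctly.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} accepts connections"
    except OSError as exc:  # refused, timed out, or name not resolved
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

A wrapper script would typically print the message and pass the code to `sys.exit`, for example after calling `tcp_check("cloud.example.org", 5000)` with the endpoint taken from the Configuration Database.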

Roadmap

The TCB-Cloud board defines the roadmap for the technical evolution of the EGI Cloud. All the components are continuously maintained to:

  • Improve their programmability, providing complete API specifications in a format that facilitates the generation of clients (e.g. following the OpenAPI initiative and Swagger).
  • Lower the barriers to integrating and operating resource centres in the federation by a) minimizing the number of components used; b) contributing code to upstream distributions; and c) using only public APIs of the Cloud Management Frameworks.

Currently, the EGI FedCloud TaskForce is focused on moving to a central operations model, in which providers only need to integrate their system with EGI Check-in. Instead of deploying and configuring the different tools (accounting, discovery, VMI management, etc.) locally, they delegate this to a central EGI team.