Documentation – Regional Operator on Duty (ROD)

Providers: Overview

Mon, 01 Jan 0001 00:00:00 +0000

Introduction

This page was created to help you, as a ROD, start with ROD duties.

The section How to become a ROD member describes the steps which need to be taken before starting work as a ROD. The section ROD duties documents all the tasks that make up the ROD duties. The section Important to read introduces the documents which concern this activity and which should be read at the beginning. The section Operational Tools describes the tools used by ROD teams. Finally, the Contact section explains how to contact others.

How to become a ROD member

There are a few actions which need to be taken before you start your work:

Get a valid grid certificate delivered by a Certificate Authority (CA) - this step is important because most of the tools used during the shift require a certificate. Find EUGRIDPMA members.
Register to the Dteam VO. Dteam membership will allow you to test sites and debug problems.
Register in the GGUS tool as support staff. GGUS is a ticketing system which is used for operational purposes within EGI. With the support staff role you will be able to reply to and update recorded tickets.
Register in the Configuration Database, a central database which contains all the information about the EGI Infrastructure (sites and people). To be a ROD member you have to be registered in this database. It will allow you to perform step 5.
Request the Regional Staff role in the Configuration Database. Thanks to this role you will be recognized automatically in operations tools as a ROD member. It gives you several privileges in the database as well as in other tools.
Contact your NGI manager - you need to contact your NGI manager to be approved as Regional Staff and to be added to the ROD mailing list in your NGI (this mailing list is a contact point to the whole ROD team within the NGI).
Get familiar with the ROD documentation, a single place where you will find all information relevant to your work as a ROD.

To see how to perform all those actions, please watch the video How to become a ROD member, which also covers the 7 steps that should be completed to become a ROD member.

ROD duties

The Regional Operations team is responsible for detecting problems, coordinating the diagnosis, and monitoring the problems through to a resolution. It monitors sites in its region, reacts to problems identified by the monitors, either directly or indirectly, provides support to sites as needed, adds to the knowledge base, and provides information flow to oversight bodies in cases of non-reactive or non-responsive sites. ROD is a team responsible for solving problems on the infrastructure according to agreed procedures. It ensures that problems are properly recorded and progress according to specified timelines, and that the necessary information is available to all parties. The team is provided by each Operation Centre and requires procedural knowledge of the process (rather than technical skills) for its work.

All duties listed are mandatory for the ROD team:

Handling incidents. The main responsibility of ROD is to deal with incidents at sites in the region. This includes making sure that the tickets are opened and handled properly. The procedure for handling tickets is described in the EGI Infrastructure Oversight escalation procedure.
Propagating actions from EGI Operations down to sites. ROD is responsible for ensuring that decisions taken on the EGI Operations level are propagated to sites.
Putting a site in downtime or suspending it for urgent matters. In general, ROD can place a site in downtime (in the Configuration Database) if it is either requested by the site, or ROD sees an urgent need to put the site into downtime. ROD may also suspend a site, under exceptional circumstances, without going through all the steps of the escalation procedure. For example, if a security hazard occurs, ROD must suspend the site on the spot. It is important to know that EGI Operations can also suspend a site in the case of an emergency, e.g. security incidents or lack of response.
Notifying EGI Operations about core or urgent matters. ROD should create Helpdesk tickets to EGI Operations in the case of core or urgent matters.

Important to read

Before you start your duties you should get familiar with the following documents:

EGI Infrastructure Oversight escalation procedure. This document defines the escalation procedure for operational problems. It describes the steps and timelines which the ROD team should follow.
Dashboard How-Tos and Training Guides. A collection of How-Tos and guides for EGI Operations. It includes a Dashboard How-To, Training Guides which can be used as a presentation for training staff and quick sheets.
ROD FAQ. Frequently Asked Questions related to ROD work.

It is also important to watch the video tutorials prepared for ROD teams. They will walk you through several topics which are important for your work.

Operational Tools

ROD uses several operational tools to perform its duties (Operations tools video):

Operations Portal. The Dashboard tool on the Operations Portal is the main tool used by ROD teams. All actions concerning incidents (alarms and tickets) should be performed using this tool.
Service Monitoring (ARGO) is the official EGI monitoring system based on Nagios. It checks the availability of the services and creates alarms visible on the Operations Portal dashboard when a failure occurs.
Helpdesk is the EGI central helpdesk system designed for reporting and tracking problems.
Configuration Database is a central database which contains all static information about the infrastructure (sites and people).

Contact

Each ROD team is supposed to provide its own mailing list as a contact point to the team. The list of people responsible for ROD in a given NGI and contact points can be found in the EGI Configuration Database.

All ROD mailing lists are subscribed to the “all-central-operator-on-duty AT mailman.egi.eu” mailing list, so you can use this list to contact other ROD teams.

To contact the EGI Operations team you can:

send a Helpdesk ticket and assign it to the EGI Operations support unit
send an email to “operations AT egi.eu”

You are welcome to send us questions in case of any doubts concerning ROD duties.

Providers: Duties

Introduction

A ROD team’s duties can be split into three main areas: handling alarms and tickets, handling downtimes, and communicating urgent issues to the EGI Operations and CSIRT teams.

Handling alarms and tickets

The main responsibility of ROD is to deal with alarms and tickets issued for sites in the region. This includes making sure that the tickets are created and handled properly.

The ROD on duty is required to:

check alarm notifications in the Dashboard at least twice a day;
close alarms which are in the OK state;
handle non-OK alarms less than 24 hours old (notify the site administrators according to your NGI’s procedures);
create tickets for alarms older than 24 hours that are not in an OK state;
escalate tickets to NGI Management/EGI Operations if necessary (in the Dashboard);
monitor and update any GGUS tickets up to the solved status (preferably via the Dashboard);
handle the final state of GGUS tickets not opened from the Dashboard by changing their status to verified.

Putting a site in downtime for urgent matters

ROD can place a site or a service endpoint (there can be multiple services running on a single host) in downtime in the GOCDB if it is either requested by the site, or if ROD sees an urgent need to do it.

Note: This is actually optional; an NGI may decide on a different policy if the site admins are not happy with ROD setting downtimes for them. However, it should be considered mandatory in case of urgent security incidents.

ROD may also suspend a site, under exceptional circumstances, without going through all the steps of the escalation procedure. For example, if a security hazard occurs, ROD must suspend the site on the spot. It is important to know that EGI Operations can also suspend a site in the case of an emergency, for example as a result of a security incident or lack of response.

In both scenarios, it is important that ROD communicates their actions to all involved parties.

Notifying EGI Operations and EGI CSIRT about urgent matters

ROD should create tickets to EGI Operations in the case of urgent matters. For security related issues, ROD should also notify the CSIRT duty contact.

ROD is also responsible for propagating actions from EGI Operations down to sites (this occurs rather infrequently, though).

Providers: Communication channels

ROD intra-team communication

At the end of a shift the current ROD team should prepare the hand over for internal ROD matters. Each ROD can decide independently on what the hand over should look like and how it should be passed on to the next team. The following list provides mere suggestions for what should be included:

a list of tickets which will continue into the next week. Each item should contain the name of the site in question, EGI Helpdesk ticket number, an optional ROD ticket ID if your NGI uses an internal ticket system, and the current status of the ticket;
any tickets opened that are not related to a particular alarm;
a summary of problems encountered with core grid services;
a report of any problems with operational tools that occurred during the shift;
anything else the new team should be aware of.

For internal communication ROD can use mailing list(s), instant messengers, etc. Each ROD team is free to choose how the internal communication is established.

Communication with EGI Operations and site administrators

ROD should provide an email contact to which all ticket information should be sent and register this address in the EGI Helpdesk. Another address (or possibly the same one) should be made available so that EGI Operations, site administrators, or other bodies can contact ROD directly.

ROD should communicate with EGI Operations through the mailing list “operations (AT) egi.eu”. Urgent matters should be communicated via a Helpdesk ticket assigned to the EGI Operations support unit to make tracking of the case possible.

There is also a hand over section in the ROD dashboard which allows the EGI Operations team and RODs to communicate with each other.

Providers: Dealing with security incidents

Some procedures to deal with security events

Sites facing or suspecting a security incident on their resources have to follow the Incident Handling Procedure.

New vulnerability issues in the middleware should be handled as defined in the related procedure, which includes a guide on what to do when you find a vulnerability.

Sites having critical vulnerabilities are handled according to SEC03 EGI-CSIRT Critical Vulnerability Handling Procedure, and if they do not respond properly, they may face suspension.

Providers: Managing downtimes

Downtimes

To properly manage downtimes it is important to highlight the differences between the possible types of downtimes. Downtimes can’t be added retroactively, and the date is always defined in UTC. If the error message received when adding a downtime is cryptic, it is usually due to a parsing error in the time/date.

Downtimes classification

scheduled: e.g. for software/hardware upgrades, planned and agreed in advance. This needs to be announced at least 24h in advance.
unscheduled: e.g. power outages, unplanned, usually triggered by an unexpected failure.
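
The rules above (no retroactive downtimes, dates in UTC, at least 24 hours' notice for a scheduled downtime) can be illustrated with a minimal Python sketch. The function name `classify_downtime` and its parameters are hypothetical and not part of the Configuration Database. Treating every declaration with less than 24 hours' notice as unscheduled is an assumption of this sketch, not a statement about how the database behaves.

```python
from datetime import datetime, timedelta, timezone

# From the text: scheduled downtimes must be announced at least 24h in advance.
SCHEDULED_NOTICE = timedelta(hours=24)


def classify_downtime(start: datetime, declared_at: datetime) -> str:
    """Return "scheduled" or "unscheduled" for a downtime declaration."""
    # Dates are always defined in UTC, so naive timestamps are rejected.
    if start.tzinfo is None or declared_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    declared_utc = declared_at.astimezone(timezone.utc)
    # Downtimes can't be added retroactively.
    if start_utc < declared_utc:
        raise ValueError("downtimes cannot be added retroactively")
    # Assumption: less than 24h notice means the downtime is unscheduled.
    if start_utc - declared_utc >= SCHEDULED_NOTICE:
        return "scheduled"
    return "unscheduled"
```

Rejecting naive timestamps early is one way to avoid the time/date parsing problems mentioned above, since every value is forced to carry an explicit offset before it is compared.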

Downtimes severities

WARNING: Resource will probably be working as normal, but may experience problems.
OUTAGE: Resource will be completely unavailable.

WARNING, formerly known as AT_RISK, implies a severity that does not have operational consequences. It only informs users that some small temporary failures may appear. All failures during that time will be taken into account in the reliability calculations. Examples include:

Admins not present on site (conference, vacations).
Reduced redundancy in network, power or cooling.
Failed disk in RAID sets.

OUTAGE implies that the site/node is completely unavailable and no tickets should be created. This does not affect site metrics.

More information about downtimes can be found in the Configuration Database user documentation.

Sites in downtime

When a ticket has been raised against a site that subsequently enters a downtime, the expiry date on the ticket can be extended.

When a ticket is open against a site that keeps adding or extending downtimes, the ticket must be closed, and the NGI asked to take action by either suspending or un-certifying the site until the problem is resolved. This usually happens when a middleware upgrade is due or a bug in the middleware is causing a site to fail. Sites then may choose to wait for the next middleware release rather than spend effort trying to resolve the issue locally.

Sites that are in downtime will still have monitoring switched on and therefore may appear to be failing tests. When opening tickets, ROD must take care not to open them against sites in downtime.

If a site is in downtime for more than a month, then it is advised that the site should go to the uncertified state.

Nodes in downtime

When a node of a site is in downtime, alarms are generated but the Operations Portal distinguishes these alarms, and marks the downtime accordingly in the dashboard.

ROD should not open tickets against nodes that are in downtime.

Instructions for accounting monitoring failures

The Accounting monitoring tests are not run against the site but query the central accounting repository.

If there is more than one failure for a given site, create a ticket for one of the alarms and mask all the others with this one.
Edit the description of the ticket to state clearly that even though the failure is reported for a given CE, this is not a CE failure but a failure on the Accounting service for the whole site.
Proceed with all sites in the same way. Please beware: Accounting tests are not helped by scheduling a downtime; the site admins need to get Accounting publishing working again.

Nodes not in production

When a node of a production site is declared as non-production in the Configuration Database or the node appears in BDII but is not declared in the Configuration Database, then the ROD should do the following:

Recommend that the sites take these nodes out of their site BDII.
If this is not possible, the site should set those nodes in downtime in the EGI Configuration Database.
If the node is a test node and is in BDII but not in the Configuration Database, then the sites should register it and turn monitoring off.

Providers: Handling alarms and tickets

Alarms

Alarms are automatically generated notifications created by the Service Monitoring and are handled from within the Operations Portal dashboard.

Handling alarms

When an alarm is generated, the site administrators have 24 hours to start acting on the issue. If ROD spots an alarm, it can notify the site’s administrators about the problem.

If the problem is fixed within 24 hours and the solution is tested by the Service Monitoring (the alarm’s color turns green in the Dashboard), then ROD has to make sure that the results are not flapping and can then close the alarm without any other action.
If the problem cannot be fixed within 24 hours, and the site administrators put the service into an unscheduled downtime, ROD should just wait until the problem is fixed or until the downtime is over. No other action is necessary.
If the problem cannot be fixed within 24 hours and the administrators don’t put the service into a downtime, then a ticket must be issued. The procedure is described below in the Tickets section.
If the service is in downtime and the problem is fixed (as verified by the Service Monitoring), then ROD can close the alarm.
If the downtime is over and the problem is still present (e.g. if the administrators forgot to extend the downtime), then a ticket must be issued.
If an alarm is raised for a service that has its monitoring status set to OFF in EGI Configuration Database (also visible in the Dashboard in the Nodes box or in the alarm row as Node status) then ROD should not open a ticket.
The alarm can be cleared even if it is marked red by pressing the lightning icon and giving an explanation.
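
The cases above can be condensed into a small decision function. This is a minimal Python sketch for illustration only: the `Alarm` fields and the returned action strings are hypothetical names, not part of the Operations Portal, and the order in which the conditions are checked is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta

# From the text: site administrators have 24 hours before a ticket is issued.
TICKET_THRESHOLD = timedelta(hours=24)


@dataclass
class Alarm:
    age: timedelta                # time since the alarm was raised
    is_ok: bool                   # Service Monitoring reports OK (green)
    in_downtime: bool             # the service is currently in a downtime
    monitoring_off: bool = False  # monitoring status is OFF in the database


def next_action(alarm: Alarm) -> str:
    """Suggest what ROD should do with an alarm (illustrative only)."""
    if alarm.is_ok:
        # ROD should first make sure the results are not flapping.
        return "close alarm"
    if alarm.monitoring_off:
        return "do not open a ticket"
    if alarm.in_downtime:
        return "wait for the fix or the end of the downtime"
    if alarm.age <= TICKET_THRESHOLD:
        return "notify site administrators"
    return "create ticket"
```

For example, `next_action(Alarm(age=timedelta(hours=30), is_ok=False, in_downtime=False))` returns `"create ticket"`, which also covers the case where a downtime has ended but the problem is still present.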

For handling tickets during public holidays, see below. There is also a video tutorial on handling alarms available.

Tickets

In contrast with alarms, which are mere notifications, tickets are created manually. They are used to report problems to the responsible support units. Additionally, they make it possible to track the actions taken in order to resolve the issue.

Creating tickets

Ticket creation occurs when the age of an alarm in an error state has passed 24 hours, whether or not a site has already taken some action on the alarm. A ticket has to be created from the Operations Portal. In order to actually create a ticket, click on the double arrow in the upper left corner next to the NGI name. That opens the drop down box with information on the site. Then, open the New NAGIOS alarms drop down box. Click the “T+” icon to create the ticket.

Refer to the Dashboard How-to if you need a more detailed guide.

If more than one alarm should be handled by the same ticket, proceed as follows:

Create a ticket for one of the alarms.
Open the Assigned Alarms drop-down box. Click on the mask icon next to the alarm identifier.
A window will open in which you can select the alarms to be masked.

If an alarm, which is masked by another alarm, remains in “critical” condition because of another (unrelated) problem, you can unmask it by clicking on the mask icon again and close the ticket for the solved alarms.

Fill in the relevant information in the ticket section. If there was information in the site notepad, ensure that the ticket reflects that information. Also ensure that the TO: select boxes, and FROM: and SUBJECT: fields are all correct. Generally, a ticket should go to the site, the NGI and ROD.
Press the Submit button and a pop up window will appear confirming that the ticket was correctly submitted. Your ticket has now been assigned a Helpdesk ID, but also an internal (hidden) Dashboard ID, which means that if you create a ticket through Dashboard, you have to close it through Dashboard as well. If you close a ticket opened through Dashboard in the Helpdesk, it will remain open in Dashboard!

Creating tickets without an alarm

It is also possible to create a ticket for a site without an alarm. This can happen if there is an issue with one of the tools that does not create an alarm in Dashboard. In this case, click on the “T+” icon in the upper right corner of the site box - the one with “Create a ticket (without an alarm)” tooltip, and fill in the appropriate fields as when creating a ticket for an alarm.

Ticket content templates

The email is addressed to the corresponding NGI, together with the site and ROD. To view the list of NGI email addresses, click the Regional List link in the Dashboard menu.

Generally, you should not remove any content from the template, but you are free to add any information you think the site might find helpful in any of the three editing fields (Header content, Main content, and Footer content).

Changing the state of and closing a ticket

When the state of an alarm for a site with an open ticket changes to OK, then the ticket associated with that alarm can be updated in the Dashboard. Do this by clicking Update for the ticket in the Tickets drop down. Now change Escalate to Problem solved and fill in any information about how the problem has been solved. Clicking Update will then close the ticket in both the Helpdesk and the Dashboard.
If the Nagios alarm is in an unstable state, and the site has not responded to the problem in 3 days then a 2nd email can be sent to the site by updating the Escalate field to 2nd step.
If a new failure is detected for the site, the existing ticket should not be modified (though the deadline can be extended) but a new ticket should be submitted for this new problem.
If the site’s problem cannot be fixed in 3 days from the 2nd step of the escalation procedure then escalate the ticket to Political procedure. This means that the NGI manager will contact both EGI Operations and the site to negotiate about suspending the site.
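
The escalation path described above can be sketched as a small state transition in Python. The `Escalate` enum and the `next_escalation` function are hypothetical names used for illustration. The label "1st step" for a freshly created ticket and the use of whole days counted by the caller are assumptions, while "2nd step", "Political procedure" and "Problem solved" are the values mentioned in the text.

```python
from enum import Enum

# From the text: each escalation step allows the site 3 days to react.
DAYS_PER_STEP = 3


class Escalate(Enum):
    FIRST_STEP = "1st step"  # assumed label for a newly created ticket
    SECOND_STEP = "2nd step"
    POLITICAL = "Political procedure"
    SOLVED = "Problem solved"


def next_escalation(current: Escalate, days_in_step: int, alarm_ok: bool) -> Escalate:
    """Return the value the Escalate field should take next (illustrative)."""
    if alarm_ok or current is Escalate.SOLVED:
        return Escalate.SOLVED
    if days_in_step < DAYS_PER_STEP or current is Escalate.POLITICAL:
        # Either the deadline has not passed yet or there is no further step.
        return current
    if current is Escalate.FIRST_STEP:
        return Escalate.SECOND_STEP
    return Escalate.POLITICAL
```

The caller is expected to pass only the days that count towards the deadline, so weekends and public holidays would be excluded before calling the function.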

Sites with multiple tickets open

When opening a ticket against a site with existing tickets, ROD should consider that these problems may be linked or dependent on pending solutions. If the problem is different but may be linked, the expiry dates for each ticket should be synchronized to the latest date.

Also consider masking new problems with an old ticket.
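
Synchronizing the expiry dates of linked tickets to the latest date amounts to taking a maximum. A minimal Python sketch follows; the function name and the ticket identifiers used as dictionary keys are hypothetical.

```python
from datetime import date
from typing import Dict


def synchronize_expiry(expiry_by_ticket: Dict[str, date]) -> Dict[str, date]:
    """Give every linked ticket the latest expiry date among them."""
    if not expiry_by_ticket:
        return {}
    latest = max(expiry_by_ticket.values())
    return {ticket: latest for ticket in expiry_by_ticket}
```

In practice the new dates would still have to be entered by hand in the Dashboard; the sketch only shows which date each ticket should end up with.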

Handling alarms and tickets during weekends and public holidays

Weekends are not considered working days, so ROD teams do not have any responsibilities during weekends. RODs should, however, ensure that tickets do not expire during weekends. The alarm age does not increase during the weekend.
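
To illustrate what it means for the alarm age not to increase during the weekend, the following Python sketch counts only the time that falls on Monday to Friday. The function name `alarm_age` is hypothetical, and the way the Dashboard actually computes the age is not described here, so this is an assumption about the intended behaviour.

```python
from datetime import datetime, time, timedelta


def alarm_age(raised: datetime, now: datetime) -> timedelta:
    """Time elapsed between raised and now, skipping Saturdays and Sundays."""
    age = timedelta(0)
    cursor = raised
    while cursor < now:
        # Walk day by day: each segment ends at the next midnight or at now.
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), time.min, tzinfo=cursor.tzinfo
        )
        segment_end = min(next_midnight, now)
        if cursor.weekday() < 5:  # Monday is 0, Friday is 4
            age += segment_end - cursor
        cursor = segment_end
    return age
```

An alarm raised on a Friday at 18:00 and checked on the following Monday at 06:00 has an age of 12 hours under this rule, not 60 hours. Both arguments must be either timezone-aware or naive, since Python cannot compare the two kinds.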

Currently there is no automatic mechanism for handling ticket expiration over public holiday periods, because they differ among countries. If some of the sites the ROD team is in charge of are located in another country, the ROD is encouraged to get them to announce their public holidays, so that ticket expiration can be set accordingly. (Correspondingly, ROD operators also have no duties when they are on public holidays.) The ROD can edit the ticket’s expiration day by clicking the “T+” (Edit Ticket) icon. The value is set to 3 days by default.

Please note that ROD is not requested to announce their national holidays to the EGI Operations team. However, on the last day before a public holiday, ROD is requested to check

if there are any tickets that will expire during the holiday and change their expiration date;
if there are any alarms that will pass the 72-hour period during the holidays and handle them properly in advance.
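
The first pre-holiday check can be sketched in Python as follows. The function names and ticket identifiers are hypothetical, and moving an expiry to the first weekday after the holiday is only one possible choice for the new expiration date, not a rule taken from the procedure.

```python
from datetime import date, timedelta
from typing import Dict, List


def tickets_expiring_during(
    expiry_by_ticket: Dict[str, date], holiday_start: date, holiday_end: date
) -> List[str]:
    """Return the tickets whose expiry falls inside the holiday period."""
    return sorted(
        ticket
        for ticket, expiry in expiry_by_ticket.items()
        if holiday_start <= expiry <= holiday_end
    )


def first_working_day_after(day: date) -> date:
    """First Monday-to-Friday date after day (other holidays are ignored)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 is Saturday, 6 is Sunday
        candidate += timedelta(days=1)
    return candidate
```

For a one-day holiday on Friday 29 March 2024, `first_working_day_after(date(2024, 3, 29))` gives Monday 1 April 2024; a country where that Monday is also a holiday would need to pass the later date instead.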

Workflow and escalation procedure

The workflow and escalation procedures are documented in more detail at PROC01 Infrastructure Oversight escalation.

Providers: FAQ

How to handle issues during weekends and public holidays?

Weekends and public holidays are not considered working days, so ROD teams do not have any responsibilities on these days. RODs should ensure that tickets do not expire on these days and that alarms do not age beyond 72 hours.

What to do with alarms when a node is not in production and is part of a production site?

It often happens that testing nodes on production sites are set as non-production. In such a case the Nagios monitoring system will still send information about all nodes, so ROD will see alarms for the non-production node on its dashboard. If it is necessary to monitor such a testing node, it is recommended to put the non-production node in downtime.

What to do when a site has multiple alarms/tickets?

When opening a ticket against a site with existing tickets, ROD should consider that these problems may be linked or dependent on pending solutions. In such a case ROD should use the grouping mechanism to gather and assign alarms to one ticket rather than open a ticket for each alarm.

If the problem is different but may be linked, the expiry dates for each ticket should be synchronized to the latest date.

How to handle issues for site/node in downtime?

Handling tickets for site/node in downtime

When a ticket has been raised against a site that subsequently enters downtime, the expiry date on the ticket can be extended.

Sites that are in downtime will still have monitoring switched on and therefore may appear to be failing tests, but no alarms will be raised against them on the Operations Portal. When opening tickets, ROD must take care not to open them against sites in downtime.

Handling alarms for site/node in downtime

It often happens that a failure generates a lot of alarms and the site manager then decides to put the site in downtime. Getting these alarms back to OK may take more than 72 hours, at which point the issue is escalated to Operations.

ROD should not create a ticket for sites/nodes in downtime and is not obliged to deal with such alarms, but it is recommended to close these alarms to avoid their being escalated to Operations. In such a case, as the reason for closing a non-OK alarm, ROD should give a link to the downtime in the EGI Configuration Database.

Site in downtime for more than a month

If a site is in DOWNTIME for more than a month then it is advised that the site should go to the suspended status.

What to do in case of an accounting issue?

In case of problems with accounting, it is not recommended to suggest a downtime at the second step of the escalation process for this test. The Accounting service is not critical for users, but problems with it still need to be followed up.

Watch out for flapping states

You may want to wait for a second test to be run before closing an alarm which is in an OK status. This ensures that the OK result for that test is stable. The waiting period is, of course, dependent on how long the test takes and how frequently it is checked.
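
One way to express the "wait for a second test" advice is to require a number of consecutive OK results before closing an alarm. The Python sketch below assumes that the test history is available as a list of status strings, oldest first. The function name, the status strings other than OK and the default of two results are illustrative assumptions.

```python
from typing import Sequence


def is_stable_ok(results: Sequence[str], required: int = 2) -> bool:
    """True if the last `required` test results are all OK."""
    if required < 1 or len(results) < required:
        return False
    return all(status == "OK" for status in results[-required:])
```

With this rule, a history of `["CRITICAL", "OK"]` is not yet considered stable, while `["CRITICAL", "OK", "OK"]` is.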

How to handle the eu.egi.lowAvailability alarm?

Go to procedure PROC04 Quality verification of monthly availability and reliability statistics.