<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Documentation – Regional Operator on Duty (ROD)</title><link>/providers/rod/</link><description>Recent content in Regional Operator on Duty (ROD) on Documentation</description><generator>Hugo -- gohugo.io</generator><atom:link href="/providers/rod/index.xml" rel="self" type="application/rss+xml"/><item><title>Providers: Overview</title><link>/providers/rod/overview/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/overview/</guid><description>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This page helps you get started with your duties as a Regional Operator on Duty (ROD).&lt;/p>
&lt;p>The section &lt;a href="#how-to-become-a-rod-member">How to become a ROD member&lt;/a> describes
the steps that need to be taken before you start working as a ROD. The section
&lt;a href="#rod-duties">ROD duties&lt;/a> documents all the tasks that make up the ROD duties.
The section &lt;a href="#important-to-read">Important to read&lt;/a> introduces the documents that
concern this activity and should be read at the beginning.
The section &lt;a href="#operational-tools">Operational Tools&lt;/a> describes the tools used by ROD
teams. Finally, the &lt;a href="#contact">Contact&lt;/a> section explains how to
contact others.&lt;/p>
&lt;h2 id="how-to-become-a-rod-member">How to become a ROD member&lt;/h2>
&lt;p>There are a few actions that need to be taken before you start your work:&lt;/p>
&lt;ol>
&lt;li>Get a valid grid certificate issued by a Certificate Authority (CA). This
step is important because most of the tools used during the shift require a
certificate. &lt;a href="https://www.eugridpma.org/members/worldmap/">Find&lt;/a> EUGRIDPMA
members.&lt;/li>
&lt;li>&lt;a href="https://voms2.hellasgrid.gr:8443/voms/dteam/">Register in the Dteam VO&lt;/a>. Dteam
membership allows you to test sites and debug problems.&lt;/li>
&lt;li>&lt;a href="../../../internal/helpdesk/account-and-privileges/#getting-supporter-privileges">Register in the GGUS tool as support staff&lt;/a>.
GGUS is the ticketing system used for operational purposes within EGI.
With the support staff role you will be able to reply to and update recorded
tickets.&lt;/li>
&lt;li>&lt;a href="https://goc.egi.eu/portal/index.php?Page_Type=Role_Requests">Register&lt;/a> in the
&lt;a href="../../../internal/configuration-database">Configuration Database&lt;/a>, a central
database which contains all the information about the EGI Infrastructure (sites
and people). To be a ROD member you have to be registered in this database.
Registration allows you to perform step 5.&lt;/li>
&lt;li>Request the &lt;strong>Regional Staff&lt;/strong> role in the Configuration Database. Thanks to
this role you will be recognized automatically in operations tools as a ROD
member. It gives you several privileges in the database as well as in other
tools.&lt;/li>
&lt;li>Contact your NGI manager. Your NGI manager needs to approve you
as &lt;strong>Regional Staff&lt;/strong> and add you to the ROD mailing list of your
NGI (this mailing list is the contact point for the whole ROD team within the
NGI).&lt;/li>
&lt;li>Get familiar with the &lt;a href="../../rod">ROD documentation&lt;/a>, a single place where
you will find all information relevant to your work as a ROD.&lt;/li>
&lt;/ol>
&lt;p>To see how to perform these actions, watch the video
&lt;a href="https://www.youtube.com/watch?v=p-SrqJMDlOo">How to become a ROD member&lt;/a>, which
walks through the 7 steps needed to become a ROD member.&lt;/p>
&lt;h2 id="rod-duties">ROD duties&lt;/h2>
&lt;p>The Regional Operations team is responsible for detecting problems, coordinating
the diagnosis, and monitoring the problems through to a resolution. It monitors
sites in its region, reacts to problems identified by the monitors, either
directly or indirectly, provides support to sites as needed, adds to the knowledge
base, and provides information flow to oversight bodies in cases of
non-reactive or non-responsive sites. ROD is a team responsible for solving
problems on the infrastructure according to agreed procedures. It ensures that
problems are properly recorded and progress according to specified timelines,
and that the necessary information is available to all parties. The team is
provided by each Operations Centre and requires procedural knowledge of the
process (rather than technical skills) for its work.&lt;/p>
&lt;p>All duties listed below are mandatory for a ROD team:&lt;/p>
&lt;ul>
&lt;li>Handling incidents. The main responsibility of ROD is to deal with incidents
at sites in the region. This includes making sure that the tickets are opened
and handled properly. The procedure for handling tickets is described in
&lt;a href="https://go.egi.eu/proc01">EGI Infrastructure Oversight escalation procedure&lt;/a>.&lt;/li>
&lt;li>Propagating actions from EGI Operations down to sites. ROD is responsible for
ensuring that decisions taken on the EGI Operations level are propagated to
sites.&lt;/li>
&lt;li>Putting a site in downtime or suspending it for urgent matters. In general, ROD can
place a site in downtime (in the
&lt;a href="../../../internal/configuration-database/downtimes">Configuration Database&lt;/a>)
if it is either requested by the site, or ROD sees an urgent need to put the
site into downtime. ROD may also suspend a site, under exceptional
circumstances, without going through all the steps of the escalation
procedure. For example, if a security hazard occurs, ROD must suspend the site
on the spot. It is important to know that EGI
Operations can also suspend a site in an emergency, e.g. a security
incident or lack of response.&lt;/li>
&lt;li>Notifying &lt;strong>EGI Operations&lt;/strong> about core or urgent matters. ROD should create
&lt;a href="../../../internal/helpdesk">Helpdesk&lt;/a> tickets to &lt;strong>EGI Operations&lt;/strong> in the
case of core or urgent matters.&lt;/li>
&lt;/ul>
&lt;h2 id="important-to-read">Important to read&lt;/h2>
&lt;p>Before you start your duties you should get familiar with the following documents:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://go.egi.eu/proc01">EGI Infrastructure Oversight escalation procedure&lt;/a>.
This document defines the escalation procedure for operational problems. It
describes the steps and timelines that the ROD team should follow.&lt;/li>
&lt;li>&lt;a href="https://documents.egi.eu/document/301">Dashboard How-Tos and Training Guides&lt;/a>.
A collection of How-Tos and guides for EGI Operations. It includes a Dashboard
How-To, training guides that can be used as presentations for training staff,
and quick sheets.&lt;/li>
&lt;li>&lt;a href="../faq">ROD FAQ&lt;/a>. Frequently Asked Questions related to ROD work.&lt;/li>
&lt;/ul>
&lt;p>It is also important to watch &lt;a href="../#manuals-and-procedures">video tutorials&lt;/a>
prepared for ROD teams. They will walk you through several topics which are
important for your work.&lt;/p>
&lt;h3 id="operational-tools">Operational Tools&lt;/h3>
&lt;p>ROD uses several operational tools to perform its duties
(&lt;a href="https://www.youtube.com/watch?v=bNm4oupAmqI">Operations tools video&lt;/a>):&lt;/p>
&lt;ul>
&lt;li>&lt;a href="../../../internal/operations-portal/">Operations Portal&lt;/a>. The Dashboard on
the Operations Portal is the main tool used by ROD teams. All actions
concerning incidents (alarms and tickets) should be performed using this tool.&lt;/li>
&lt;li>&lt;a href="../../../internal/monitoring/">Service Monitoring (ARGO)&lt;/a> is the official EGI
monitoring system based on Nagios. It checks the availability of the services
and creates alarms visible on the Operations Portal dashboard when a failure
occurs.&lt;/li>
&lt;li>&lt;a href="../../../internal/helpdesk/">Helpdesk&lt;/a> is the EGI central helpdesk system
designed for reporting and tracking problems.&lt;/li>
&lt;li>&lt;a href="../../../internal/configuration-database/">Configuration Database&lt;/a> is a
central database which contains all static information about the
infrastructure (sites and people).&lt;/li>
&lt;/ul>
&lt;h2 id="contact">Contact&lt;/h2>
&lt;p>Each ROD team is expected to provide its own mailing list as a contact point for the
team. The list of people responsible for ROD in a given NGI and contact points
can be found in the
&lt;a href="../../../internal/configuration-database">EGI Configuration Database&lt;/a>.&lt;/p>
&lt;p>All ROD mailing lists are subscribed to the &amp;ldquo;all-central-operator-on-duty AT
mailman.egi.eu&amp;rdquo; mailing list, so you can use this list to contact other ROD
teams.&lt;/p>
&lt;p>To contact the EGI Operations team you can:&lt;/p>
&lt;ul>
&lt;li>send a &lt;a href="../../../internal/helpdesk">Helpdesk&lt;/a> ticket and assign it to the &lt;strong>EGI
Operations&lt;/strong> support unit&lt;/li>
&lt;li>send an email to &amp;ldquo;operations AT egi.eu&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>You are welcome to send us questions in case of any doubts concerning ROD
duties.&lt;/p></description></item><item><title>Providers: Duties</title><link>/providers/rod/duties/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/duties/</guid><description>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>A ROD team&amp;rsquo;s duties can be split into three main areas: handling alarms and
tickets, handling downtimes, and communicating urgent issues to the EGI
Operations and CSIRT teams.&lt;/p>
&lt;h2 id="handling-alarms-and-tickets">Handling alarms and tickets&lt;/h2>
&lt;p>The main responsibility of ROD is to deal with alarms and tickets issued for
sites in the region. This includes making sure that the tickets are created and
handled properly.&lt;/p>
&lt;p>The ROD on duty is required to:&lt;/p>
&lt;ul>
&lt;li>check alarm notifications in the Dashboard at least twice a day;&lt;/li>
&lt;li>close alarms which are in the OK state;&lt;/li>
&lt;li>handle non-OK alarms less than 24 hours old (notify the site administrators
according to your NGI&amp;rsquo;s procedures);&lt;/li>
&lt;li>create tickets for alarms older than 24 hours that are not in an OK state;&lt;/li>
&lt;li>escalate tickets to NGI Management/EGI Operations if necessary (in the
Dashboard);&lt;/li>
&lt;li>monitor and update any GGUS tickets up to the solved status (preferably via
the Dashboard);&lt;/li>
&lt;li>handle the final state of GGUS tickets not opened from the Dashboard by
changing their status to verified.&lt;/li>
&lt;/ul>
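&lt;p>The checklist above can be read as a small decision rule. The following Python sketch is illustrative only: the function name, the status string and the returned action labels are hypothetical and not part of any EGI tool, and treating an alarm that is exactly 24 hours old as not yet requiring a ticket is an assumption.&lt;/p>

```python
from datetime import timedelta

# Threshold taken from the checklist: tickets are created for non-OK alarms
# older than 24 hours.
TICKET_THRESHOLD = timedelta(hours=24)


def triage_alarm(status: str, age: timedelta) -> str:
    """Return a hypothetical action label for one alarm."""
    if status == "OK":
        return "close"
    if age > TICKET_THRESHOLD:
        return "create_ticket"
    # Non-OK alarm that is not yet older than 24 hours: notify the site
    # administrators according to the NGI's procedures.
    return "notify_site"
```

&lt;p>For example, a non-OK alarm that is 25 hours old maps to the ticket action, while the same alarm at 5 hours only leads to a notification.&lt;/p>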
&lt;h2 id="putting-a-site-in-downtime-for-urgent-matters">Putting a site in downtime for urgent matters&lt;/h2>
&lt;p>ROD can place a site or a service endpoint (there can be multiple services
running on a single host) in downtime in the GOCDB if it is either requested by
the site, or if ROD sees an urgent need to do it.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: This is actually optional; an NGI may decide on a different policy
if the site admins are not happy with ROD setting downtimes for them. However,
it should be considered mandatory in case of urgent security incidents.&lt;/p>
&lt;/blockquote>
&lt;p>ROD may also suspend a site, under exceptional circumstances, without going
through all the steps of the escalation procedure. For example, if a security
hazard occurs, ROD must suspend the site on the spot. It is important to know
that EGI Operations can also suspend a site
in the case of an emergency, for example as a result of a security incident or
lack of response.&lt;/p>
&lt;p>In both scenarios, it is important that ROD communicates their actions to all
involved parties.&lt;/p>
&lt;h2 id="notifying-egi-operations-and-egi-csirt-about-urgent-matters">Notifying EGI Operations and EGI CSIRT about urgent matters&lt;/h2>
&lt;p>ROD should create tickets to EGI Operations in the case of urgent matters. For
security-related issues, ROD should also notify the
&lt;a href="https://confluence.egi.eu/display/EGIBG/CSIRT">CSIRT&lt;/a> duty contact.&lt;/p>
&lt;p>ROD is also responsible for propagating actions from EGI Operations down to
sites (this occurs rather infrequently, though).&lt;/p></description></item><item><title>Providers: Communication channels</title><link>/providers/rod/communication/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/communication/</guid><description>
&lt;h2 id="rod-intra-team-communication">ROD intra-team communication&lt;/h2>
&lt;p>At the end of a shift the current ROD team should prepare the handover for
internal ROD matters. Each ROD can decide independently what the handover
should look like and how it should be passed on to the next team. The following
list provides suggestions for what should be included:&lt;/p>
&lt;ul>
&lt;li>a list of tickets which will continue into the next week. Each item should
contain the name of the site in question, EGI Helpdesk ticket number, an
optional ROD ticket ID if your NGI uses an internal ticket system, and the
current status of the ticket;&lt;/li>
&lt;li>any tickets opened that are not related to a particular alarm;&lt;/li>
&lt;li>a summary of problems encountered with core grid services;&lt;/li>
&lt;li>a report of any problems with operational tools that occurred during the
shift;&lt;/li>
&lt;li>anything else the new team should be aware of.&lt;/li>
&lt;/ul>
&lt;p>For internal communication ROD can use mailing list(s), instant messengers, etc.
Each ROD team is free to choose how the internal communication is established.&lt;/p>
&lt;h2 id="communication-with-egi-operations-and-site-administrators">Communication with EGI Operations and site administrators&lt;/h2>
&lt;p>ROD should provide an email contact to which all ticket information should be
sent, and register this address in the EGI Helpdesk. Another address (or
possibly the same one) should be made available so that EGI
Operations, site administrators, or other bodies can contact ROD directly.&lt;/p>
&lt;p>ROD should communicate with EGI Operations through the mailing list &amp;ldquo;operations
(AT) egi.eu&amp;rdquo;. Urgent matters should be communicated via a Helpdesk ticket
assigned to the EGI Operations support unit to make tracking of the case
possible.&lt;/p>
&lt;p>There is also a handover section in the ROD dashboard that allows the EGI
Operations team and RODs to intercommunicate.&lt;/p></description></item><item><title>Providers: Dealing with security incidents</title><link>/providers/rod/security/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/security/</guid><description>
&lt;h2 id="some-procedures-to-deal-with-security-events">Some procedures to deal with security events&lt;/h2>
&lt;p>Sites facing or suspecting a security incident on their resources have to follow the
&lt;a href="https://confluence.egi.eu/display/EGIBG/CSIRT+Incident+reporting">Incident Handling Procedure&lt;/a>.&lt;/p>
&lt;p>New vulnerability issues in the middleware should be handled as defined in the
&lt;a href="https://documents.egi.eu/public/ShowDocument?docid=3145">related procedure&lt;/a>,
which comes with a guide on what to do when &lt;a href="https://go.egi.eu/svg">you find a vulnerability&lt;/a>.&lt;/p>
&lt;p>Sites having critical vulnerabilities are handled according to
&lt;a href="https://go.egi.eu/sec03">SEC03 EGI-CSIRT Critical Vulnerability Handling Procedure&lt;/a>,
and if they do not respond properly, they may face suspension.&lt;/p></description></item><item><title>Providers: Managing downtimes</title><link>/providers/rod/downtimes/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/downtimes/</guid><description>
&lt;h2 id="downtimes">Downtimes&lt;/h2>
&lt;p>To properly manage downtimes it is important to highlight the differences
between the possible types of downtimes. Downtimes can&amp;rsquo;t be added retroactively,
and the date is always defined in UTC. If the error message received when
adding a downtime is cryptic, it is usually due to a parsing error in the
time/date.&lt;/p>
&lt;h3 id="downtimes-classification">Downtimes classification&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>scheduled&lt;/strong>: e.g. for software/hardware upgrades, planned and agreed in
advance. This needs to be announced at least 24h in advance.&lt;/li>
&lt;li>&lt;strong>unscheduled&lt;/strong>: e.g. power outages, unplanned, usually triggered by an
unexpected failure.&lt;/li>
&lt;/ol>
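&lt;p>The timing rules stated above (dates in UTC, no retroactive downtimes, at least 24 hours of notice for a scheduled downtime) can be sketched as a small check. This is only one reading of those rules, not the Configuration Database&amp;rsquo;s actual logic: the function name, constant and return labels are hypothetical.&lt;/p>

```python
from datetime import datetime, timedelta, timezone

# Notice required for a scheduled downtime, as stated in the classification.
MIN_NOTICE = timedelta(hours=24)


def classify_downtime(declared_at: datetime, starts_at: datetime) -> str:
    """Classify a downtime from its declaration time and its start time."""
    if declared_at.tzinfo is None or starts_at.tzinfo is None:
        raise ValueError("use timezone-aware datetimes; downtimes are defined in UTC")
    # Normalise both timestamps to UTC before comparing them.
    declared_utc = declared_at.astimezone(timezone.utc)
    starts_utc = starts_at.astimezone(timezone.utc)
    if declared_utc > starts_utc:
        raise ValueError("downtimes cannot be added retroactively")
    if starts_utc - declared_utc >= MIN_NOTICE:
        return "scheduled"
    return "unscheduled"
```

&lt;p>Requiring timezone-aware values is a deliberate choice in this sketch: it makes an ambiguous local time fail loudly instead of being silently interpreted as UTC.&lt;/p>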
&lt;h3 id="downtimes-severities">Downtimes severities&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>WARNING&lt;/strong>: Resource will probably be working as normal, but may experience
problems.&lt;/li>
&lt;li>&lt;strong>OUTAGE&lt;/strong>: Resource will be completely unavailable.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>WARNING&lt;/strong>, formerly known as &lt;strong>AT_RISK&lt;/strong>, implies a type of severity that does
not have operational consequences. It only informs users that
small temporary failures may occur. All failures during that time will be taken
into account in the reliability calculations. Examples include:&lt;/p>
&lt;ul>
&lt;li>Admins not present on site (conference, vacations).&lt;/li>
&lt;li>Reduced redundancy in network, power or cooling.&lt;/li>
&lt;li>Failed disk in RAID sets.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>OUTAGE&lt;/strong> implies that the site/node is completely unavailable and no tickets
should be created. This does not affect site metrics.&lt;/p>
&lt;p>More information about downtimes can be found in the
&lt;a href="../../../internal/configuration-database/downtimes/">Configuration Database user documentation&lt;/a>.&lt;/p>
&lt;h2 id="sites-in-downtime">Sites in downtime&lt;/h2>
&lt;p>When a ticket has been raised against a site that subsequently enters a
downtime, the expiry date on the ticket can be extended.&lt;/p>
&lt;p>When a ticket is open against a site that keeps adding or extending
downtimes, the ticket must be closed and the NGI requested to take action, either
by suspending or un-certifying the site, until the problem is
resolved. This usually happens when a middleware upgrade is due or a bug in the
middleware is causing a site to fail. Sites may then choose to wait for the next
middleware release rather than spend effort trying to resolve the issue locally.&lt;/p>
&lt;p>Sites that are in downtime will still have monitoring switched on and therefore
may appear to be failing tests. When opening tickets, ROD must take care
not to open them against sites in downtime.&lt;/p>
&lt;blockquote>
&lt;p>If a site is in downtime for more than a month, then it is advised that the
site should go to the &lt;strong>uncertified&lt;/strong> state.&lt;/p>
&lt;/blockquote>
&lt;h3 id="nodes-in-downtime">Nodes in downtime&lt;/h3>
&lt;p>When a node of a site is in downtime, alarms are generated but the
&lt;a href="../../../internal/operations-portal">Operations Portal&lt;/a> distinguishes these
alarms, and marks the downtime accordingly in the dashboard.&lt;/p>
&lt;blockquote>
&lt;p>ROD should not open tickets against nodes that are in downtime.&lt;/p>
&lt;/blockquote>
&lt;h3 id="instructions-for-accounting-monitoring-failures">Instructions for accounting monitoring failures&lt;/h3>
&lt;p>The &lt;a href="../../../internal/accounting">Accounting&lt;/a> monitoring tests are not run
against the site but query the central accounting repository.&lt;/p>
&lt;ol>
&lt;li>If there is more than one failure for a given site, create a ticket for one
of the alarms and mask all the others with this one.&lt;/li>
&lt;li>Edit the description of the ticket to state clearly that even though the
failure is reported for a given CE, this is not a CE failure but a failure on
the Accounting service for the whole site.&lt;/li>
&lt;li>Proceed with all sites in the same way. Be aware that scheduling a downtime
does not help with Accounting tests; the site admins need to get Accounting
publishing working again.&lt;/li>
&lt;/ol>
&lt;h2 id="nodes-not-in-production">Nodes not in production&lt;/h2>
&lt;p>When a node of a production site is declared as non-production in the
Configuration Database or the node appears in BDII but is not declared in the
Configuration Database, then the ROD should do the following:&lt;/p>
&lt;ul>
&lt;li>Recommend that the site take these nodes out of its site BDII.&lt;/li>
&lt;li>If this is not possible, then the site should put those nodes in downtime
in the
&lt;a href="../../../internal/configuration-database/downtimes">EGI Configuration Database&lt;/a>.&lt;/li>
&lt;li>If the node is a test node and is in BDII but not in the Configuration
Database, then the site should register it and turn monitoring off.&lt;/li>
&lt;/ul></description></item><item><title>Providers: Handling alarms and tickets</title><link>/providers/rod/alarms-tickets/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/alarms-tickets/</guid><description>
&lt;h2 id="alarms">Alarms&lt;/h2>
&lt;p>Alarms are automatically generated notifications created by the
&lt;a href="../../../internal/">Service Monitoring&lt;/a> and are handled from within the
&lt;a href="../../../internal/operations-portal">Operations Portal dashboard&lt;/a>.&lt;/p>
&lt;h3 id="handling-alarms">Handling alarms&lt;/h3>
&lt;p>When an alarm is generated, the site administrators have 24 hours to start
acting on the issue. If ROD spots an alarm, they can notify the site&amp;rsquo;s
administrators about the problem.&lt;/p>
&lt;ul>
&lt;li>If the problem is fixed within 24 hours and the solution is tested by the
Service Monitoring (the alarm&amp;rsquo;s color turns green in the Dashboard), then ROD
has to make sure that the results are not flapping and can then close the alarm without
any other action.&lt;/li>
&lt;li>If the problem cannot be fixed within 24 hours, and the site administrators
put the service into an unscheduled downtime, ROD should just wait until the
problem is fixed or until the downtime is over. No other action is necessary.&lt;/li>
&lt;li>If the problem cannot be fixed within 24 hours and the administrators don&amp;rsquo;t
put the service into a downtime, then a ticket must be issued. The procedure
is described below in the Tickets section.&lt;/li>
&lt;li>If the service is in downtime and the problem is fixed (as verified by the
Service Monitoring), then ROD can close the alarm.&lt;/li>
&lt;li>If the downtime is over and the problem is still present (e.g. if the
administrators forgot to extend the downtime), then a ticket must be issued.&lt;/li>
&lt;li>If an alarm is raised for a service that has its monitoring status set to OFF
in EGI Configuration Database (also visible in the Dashboard in the Nodes box
or in the alarm row as Node status) then ROD should not open a ticket.&lt;/li>
&lt;li>The alarm can be cleared even if it is marked red by pressing the lightning
icon and giving an explanation.&lt;/li>
&lt;/ul>
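&lt;p>The cases in the list above can be summarised as a decision function. The sketch below is a simplified reading of that list, not the behaviour of the Operations Portal: the &lt;code>Alarm&lt;/code> class, its field names and the returned action labels are all hypothetical.&lt;/p>

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Alarm:
    ok: bool                     # True when Service Monitoring reports OK
    age: timedelta               # how long the alarm has existed
    in_downtime: bool            # the service is currently in a downtime
    monitoring_on: bool = True   # monitoring status in the Configuration Database


def next_action(alarm: Alarm) -> str:
    """Return a hypothetical action label following the list above."""
    if alarm.ok:
        # Fixed and verified: make sure the result is stable, then close.
        return "check_flapping_then_close"
    if not alarm.monitoring_on:
        # Monitoring status OFF: ROD should not open a ticket.
        return "no_ticket"
    if alarm.in_downtime:
        # Wait until the problem is fixed or the downtime is over.
        return "wait"
    if alarm.age > timedelta(hours=24):
        # Not fixed in time and no (or no longer any) downtime.
        return "create_ticket"
    return "notify_site"
```

&lt;p>The order of the checks matters in this sketch: an alarm whose downtime has ended simply has &lt;code>in_downtime&lt;/code> set to false, so it falls through to the 24-hour rule.&lt;/p>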
&lt;p>For handling tickets during public holidays, see below. There is also a video
tutorial on handling alarms available.&lt;/p>
&lt;h2 id="tickets">Tickets&lt;/h2>
&lt;p>In contrast with alarms, which are mere notifications, tickets are created
manually. They are used to report problems to the responsible support units.
Additionally, they make it possible to track the actions taken to resolve the
issue.&lt;/p>
&lt;h3 id="creating-tickets">Creating tickets&lt;/h3>
&lt;p>A ticket is created when the age of an alarm in an error state has passed 24
hours, whether or not the site has already taken some action on the alarm. A ticket
has to be created from the Operations Portal. To create a
ticket, click on the double arrow in the upper left corner next to the NGI name.
This opens the drop-down box with information on the site. Then, open the New
NAGIOS alarms drop-down box. Click the &amp;ldquo;T+&amp;rdquo; icon to create the ticket.&lt;/p>
&lt;p>Refer to the
&lt;a href="https://documents.egi.eu/public/ShowDocument?docid=301">Dashboard How-to&lt;/a> if
you need a more detailed guide.&lt;/p>
&lt;p>If more than one alarm should be handled by the same ticket, proceed as follows:&lt;/p>
&lt;ol>
&lt;li>Create a ticket for one of the alarms.&lt;/li>
&lt;li>Open the Assigned Alarms drop-down box. Click on the mask icon next to the
alarm identifier.&lt;/li>
&lt;li>A window will open in which you can select the alarms to be masked.&lt;/li>
&lt;/ol>
&lt;p>If an alarm, which is masked by another alarm, remains in &amp;ldquo;critical&amp;rdquo; condition
because of another (unrelated) problem, you can unmask it by clicking on the
mask icon again and close the ticket for the solved alarms.&lt;/p>
&lt;ol>
&lt;li>Fill in the relevant information in the ticket section. If there was
information in the site notepad, ensure that the ticket information reflects
that information. Also ensure that the TO: select boxes, and FROM: and
SUBJECT: fields are all correct. Generally, a ticket should go to the
site, the NGI and ROD.&lt;/li>
&lt;li>Press the Submit button and a pop-up window will appear confirming that the
ticket was correctly submitted. Your ticket has now been assigned a Helpdesk
ID, but also an internal (hidden) Dashboard ID, which means that if you
create a ticket through Dashboard, you have to close it through Dashboard as
well. If you close a ticket opened through Dashboard in the Helpdesk, it will
remain open in Dashboard!&lt;/li>
&lt;/ol>
&lt;h3 id="creating-tickets-without-an-alarm">Creating tickets without an alarm&lt;/h3>
&lt;p>It is also possible to create a ticket for a site without an alarm. This can
happen if there is an issue with one of the tools that does not create an alarm
in Dashboard. In this case, click on the &amp;ldquo;T+&amp;rdquo; icon in the upper right corner of
the site box - the one with &amp;ldquo;Create a ticket (without an alarm)&amp;rdquo; tooltip, and
fill in the appropriate fields as when creating a ticket for an alarm.&lt;/p>
&lt;h3 id="ticket-content-templates">Ticket content templates&lt;/h3>
&lt;p>The email is addressed to the corresponding NGI, together with the site and ROD.
To view the list of NGI email addresses, click the Regional List link in the
Dashboard menu.&lt;/p>
&lt;p>Generally, you should not remove any content from the template, but you are free
to add any information you think the site might find helpful in any of the three
editing fields (Header content, Main content, and Footer content).&lt;/p>
&lt;h3 id="changing-the-state-of-and-closing-a-ticket">Changing the state of and closing a ticket&lt;/h3>
&lt;ol>
&lt;li>When the state of an alarm for a site with an open ticket changes to OK, then
the ticket associated with that alarm can be updated in the Dashboard. Do
this by clicking Update for the ticket in the Tickets drop-down. Now change
Escalate to Problem solved and fill in any information about how the problem
has been solved. Clicking Update will then close the ticket in both the
Helpdesk and the Dashboard.&lt;/li>
&lt;li>If the Nagios alarm is in an unstable state and the site has not responded
to the problem in 3 days, then a 2nd email can be sent to the site by updating
the Escalate field to 2nd step.&lt;/li>
&lt;li>If a new failure is detected for the site, the existing ticket should not be
modified (though the deadline can be extended) but a new ticket should be
submitted for this new problem.&lt;/li>
&lt;li>If the site&amp;rsquo;s problem cannot be fixed within 3 days of the 2nd step of the
escalation procedure, then escalate the ticket to Political procedure. This
means that the NGI manager will contact both EGI Operations and the site to
negotiate the suspension of the site.&lt;/li>
&lt;/ol>
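&lt;p>The escalation timing described above can be sketched as a simple state progression. The step names &amp;ldquo;2nd step&amp;rdquo; and &amp;ldquo;Political procedure&amp;rdquo; come from the text; &amp;ldquo;1st step&amp;rdquo;, the function name and the day counter are hypothetical, and the authoritative rules remain those of the escalation procedure itself.&lt;/p>

```python
# "2nd step" and "Political procedure" are named in the text above;
# "1st step" is a hypothetical label for the initial state of a ticket.
ESCALATION_STEPS = ["1st step", "2nd step", "Political procedure"]
DAYS_PER_STEP = 3  # days allowed at each step before escalating


def next_escalation_step(current: str, days_without_fix: int) -> str:
    """Return the step a ticket should be in after days_without_fix days."""
    position = ESCALATION_STEPS.index(current)
    is_last = position == len(ESCALATION_STEPS) - 1
    if is_last or DAYS_PER_STEP > days_without_fix:
        # Either nothing further to escalate to, or the deadline has not passed.
        return current
    return ESCALATION_STEPS[position + 1]
```

&lt;p>The counter is meant to restart each time a ticket moves to a new step, and the sketch does not account for weekends or public holidays.&lt;/p>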
&lt;h3 id="sites-with-multiple-tickets-open">Sites with multiple tickets open&lt;/h3>
&lt;p>When opening a ticket against a site with existing tickets, ROD should consider
that these problems may be linked or dependent on pending solutions. If the
problems are different but may be linked, the expiry dates for each ticket should be
synchronized to the latest date.&lt;/p>
&lt;p>Also consider masking new problems with an old ticket.&lt;/p>
&lt;h2 id="handling-alarms-and-tickets-during-weekends-and-public-holidays">Handling alarms and tickets during weekends and public holidays&lt;/h2>
&lt;p>Because weekends are not considered working days, ROD teams do not have any
responsibilities during weekends. RODs should, however, ensure that tickets do
not expire during weekends. The alarm age does not increase during the
weekend.&lt;/p>
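&lt;p>To illustrate what an alarm age that does not increase during the weekend means, the following Python sketch computes the elapsed time between two timestamps while skipping Saturdays and Sundays. It is not the Operations Portal&amp;rsquo;s implementation: the function name is hypothetical, and deciding weekend boundaries in the time zone of the supplied timestamps is an assumption.&lt;/p>

```python
from datetime import datetime, timedelta, timezone


def alarm_age_excluding_weekends(raised: datetime, now: datetime) -> timedelta:
    """Elapsed time from raised to now, not counting Saturdays and Sundays."""
    age = timedelta(0)
    cursor = raised
    while now > cursor:
        # Midnight at the start of the next calendar day.
        next_midnight = (cursor + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        segment_end = min(next_midnight, now)
        if cursor.weekday() not in (5, 6):  # 5 = Saturday, 6 = Sunday
            age += segment_end - cursor
        cursor = segment_end
    return age


# Alarm raised on Friday 5 January 2024 at 18:00 UTC, checked on Monday at 10:00 UTC.
friday = datetime(2024, 1, 5, 18, 0, tzinfo=timezone.utc)
monday = datetime(2024, 1, 8, 10, 0, tzinfo=timezone.utc)
print(alarm_age_excluding_weekends(friday, monday))  # 16:00:00
```

&lt;p>In this example the alarm counts 6 hours on Friday and 10 hours on Monday, so it is still below the 24-hour threshold on Monday morning. Public holidays are not handled here because they differ among countries.&lt;/p>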
&lt;p>Currently there is no automatic mechanism for handling ticket expiration over
public holiday periods, because they differ among countries. If some of the
sites the ROD team is in charge of are located in another country, the ROD is
encouraged to get them to announce their public holidays, so that ticket
expiration can be set accordingly. (Correspondingly, ROD operators also have no
duties when they are on public holidays.) The ROD can edit the ticket&amp;rsquo;s
expiration day by clicking the &amp;ldquo;T+&amp;rdquo; (Edit Ticket) icon. The value is set to 3
days by default.&lt;/p>
&lt;p>Please note that ROD is not requested to announce their national holidays to the
EGI Operations team. However, on the last day before a public holiday, ROD is
requested to check:&lt;/p>
&lt;ul>
&lt;li>whether any tickets will expire during the holiday, and change
their expiration date;&lt;/li>
&lt;li>whether any alarms will pass the 72-hour period during the holidays,
and handle them properly in advance.&lt;/li>
&lt;/ul>
&lt;h2 id="workflow-and-escalation-procedure">Workflow and escalation procedure&lt;/h2>
&lt;p>The workflow and escalation procedures are documented in more detail at
&lt;a href="https://confluence.egi.eu/x/SiAmBg">PROC01 Infrastructure Oversight escalation&lt;/a>.&lt;/p></description></item><item><title>Providers: FAQ</title><link>/providers/rod/faq/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/providers/rod/faq/</guid><description>
&lt;h2 id="how-to-handle-issues-during-weekends-and-public-holidays">How to handle issues during weekends and public holidays?&lt;/h2>
&lt;p>Because weekends and public holidays are not considered working
days, ROD teams do not have any responsibilities during these
days. RODs should ensure that during these days tickets do not expire and alarms
do not age above 72 hours.&lt;/p>
&lt;h2 id="what-to-do-with-alarms-when-node-is-not-in-production-and-is-part-of-production-site">What to do with alarms when node is not in production and is part of production site?&lt;/h2>
&lt;p>It often happens that test nodes on production sites are set as
non-production. In such a case the Nagios monitoring system will send information
about all nodes. As a result, ROD will see alarms for the non-production node
on their dashboard. If it is necessary to monitor such a test node, it is
recommended to put the non-production node in downtime.&lt;/p>
&lt;h2 id="what-to-do-when-a-sites-have-multiple-alarmsticket">What to do when a site has multiple alarms/tickets?&lt;/h2>
&lt;p>When opening a ticket against a site with existing tickets, ROD should consider
that these problems may be linked or dependent on pending solutions. In such a
case ROD should use the grouping mechanism to gather and assign alarms to one ticket
rather than open a ticket for each alarm.&lt;/p>
&lt;p>If the problems are different but may be linked, the expiry dates for each ticket
should be synchronized to the latest date.&lt;/p>
&lt;h2 id="how-to-handle-issues-for-sitenode-in-downtime">How to handle issues for site/node in downtime?&lt;/h2>
&lt;h3 id="handling-tickets-for-sitenode-in-downtime">Handling tickets for site/node in downtime&lt;/h3>
&lt;p>When a ticket has been raised against a site that subsequently enters
downtime, the expiry date on the ticket can be extended.&lt;/p>
&lt;p>Sites that are in downtime will still have monitoring switched on and therefore
may appear to be failing tests, but no alarms will be raised against them on the
Operations Portal. When opening tickets, ROD must take care not to open them
against sites in downtime.&lt;/p>
&lt;h3 id="handling-alarms-for-sitenode-in-downtime">Handling alarms for site/node in downtime&lt;/h3>
&lt;p>It often happens that a failure generates a lot of alarms and the
site manager then decides to put the site in downtime. Getting these alarms back to OK may take
more than 72 hours, at which point the issue is escalated to Operations.&lt;/p>
&lt;p>ROD should not create a ticket for sites/nodes in downtime and is not obliged
to deal with such alarms, but it is recommended to close these alarms to prevent them from
being escalated to Operations. In such a case, as the reason for closing a non-OK alarm,
ROD should provide a link to the downtime in the
&lt;a href="../../../internal/configuration-database/downtimes">EGI Configuration Database&lt;/a>.&lt;/p>
&lt;h3 id="site-in-downtime-for-more-than-a-month">Site in downtime for more than a month&lt;/h3>
&lt;p>If a site is in DOWNTIME for more than a month then it is advised that the site
should go to the suspended status.&lt;/p>
&lt;h2 id="what-to-do-in-case-of-accounting-issue">What to do in case of accounting issue?&lt;/h2>
&lt;p>In case of problems with accounting, it is not recommended to suggest downtime at
the second step of the escalation process for this test. The Accounting service is
not critical functionality for users, but it still needs to be followed
up.&lt;/p>
&lt;h2 id="watch-out-for-flapping-states">Watch out for flapping states&lt;/h2>
&lt;p>You may want to wait for a second test to be run before closing an alarm which
is in an OK status. This ensures that the OK result for that test is stable.
The waiting period is, of course, dependent on how long the test takes and how
frequently it is checked.&lt;/p>
&lt;h2 id="how-to-handle-the-euegilowavailability-alarm">How to handle the eu.egi.lowAvailability alarm?&lt;/h2>
&lt;p>Go to procedure
&lt;a href="https://go.egi.eu/proc04">PROC04 Quality verification of monthly availability and reliability statistics&lt;/a>.&lt;/p></description></item></channel></rss>