|Title||MAN05 top and site BDII High Availability|
|Policy Group||Operations Management Board (OMB)|
|Procedure Statement||Deploying top or site BDII service in High Availability|
The objective of this document is to provide guidelines to improve the availability of the information system, addressing three main areas:
BDII_RAM_DISK=yes in your YAIM configuration, it’s advisable to have 4 GB of RAM.
The data management tools (lcg_utils) contact the information system for every operation (lcg-cp, …). So, if you have your client properly configured with redundancy for the information system, the lcg_utils tools will use that mechanism in a transparent way. Be aware that lcg-infosites doesn’t work with multiple BDIIs. Only
Site administrators should configure their services with this failover mechanism, where the first top BDII in the list should be the default top BDII provided by their NGI.
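To illustrate the failover idea, the following Python sketch walks an ordered list of top BDII endpoints and returns the first one that answers. It is only a minimal sketch under stated assumptions: the function names (tcp_probe, first_available) and the example hostnames are hypothetical, and a plain TCP connection test stands in for whatever check a real client performs.

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_probe(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to 'host:port' can be opened.

    Assumption: reachability of the LDAP port is used as a stand-in
    for "this BDII is usable"; a real client may apply stricter tests.
    """
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def first_available(
    endpoints: Sequence[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first endpoint, in list order, for which probe() succeeds.

    The list order encodes the preference: the default top BDII of the
    NGI comes first, the fallback instances follow.
    """
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


# Hypothetical endpoint list: NGI default first, then a fallback.
TOP_BDIIS = ["topbdii.ngi.example:2170", "topbdii.other.example:2170"]
```

Because the probe is passed in as a parameter, the selection logic can be exercised without network access, for example first_available(TOP_BDIIS, probe=lambda e: e.startswith("topbdii.other")).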
We will provide a short introduction to some of these DNS mechanisms but for further information on specific implementations, please contact your DNS administrator.
Load balancing is a technique to distribute workload evenly across two or more resources. A load balancing method, which does not necessarily require a dedicated software or hardware node, is called round robin DNS.
We can assume that all transactions (queries to a top or site BDII) generate the same resource load. For effective load balancing, all top or site BDII instances should have the same hardware configuration. Otherwise, a load balancing arbiter is needed.
Simple round robin DNS load balancing is easy to deploy. Assuming that there is a primary DNS server (dns.domain.tld) where the DNS load balancing will be implemented, one simply has to add multiple A records mapping the same hostname to multiple IP addresses under the core.domain.tld DNS zone. It is equally applicable to
# In dns.domain.tld: Add multiple A records mapping the same hostname to multiple IP addresses
Zone core.domain.tld
topbdii.core.domain.tld IN A x.x.x.x
topbdii.core.domain.tld IN A y.y.y.y
topbdii.core.domain.tld IN A z.z.z.z
All three records are always returned in the answer, but their order rotates with each DNS query.
This does NOT provide fault tolerance against problems in the top or site BDIIs themselves.
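The rotation behaviour can be pictured with a small Python simulation. This is only a sketch: the class name RoundRobinZone is hypothetical, the addresses are documentation placeholders, and a strict cyclic rotation is assumed, whereas the exact ordering policy depends on the DNS server implementation.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of a name with several A records served round robin."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def query(self):
        """Return all records, then rotate so the next answer starts one further."""
        answer = list(self._records)
        self._records.rotate(-1)  # move the first record to the end
        return answer


# Placeholder addresses standing in for x.x.x.x, y.y.y.y and z.z.z.z.
zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
answers = [zone.query() for _ in range(4)]

# Clients typically use the first address of the answer, so successive
# queries are spread over the three instances.
first_choices = [answer[0] for answer in answers]
```

Note that every answer still contains all three addresses: if one instance is down, it keeps being handed out until its record is removed from the zone, which is why round robin alone gives no fault tolerance.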
o=infosys root for the UpdateStats object. This entry contains a number of metrics relating to the latest update, such as the time to update the database and the total number of entries. An example of such an entry is shown below.
$ ldapsearch -x -h <TopBDII/siteBDII> -p 2170 -b "o=infosys"
(...)
dn: Hostname=localhost,o=infosys
objectClass: UpdateStats
Hostname: lxbra2510.cern.ch
FailedDeletes: 0
ModifiedEntries: 4950
DeletedEntries: 1318
UpdateTime: 150
FailedAdds: 603
FailedModifies: 0
TotalEntries: 52702
QueryTime: 8
NewEntries: 603
DBUpdateTime: 11
ReadTime: 0
PluginsTime: 4
ProvidersTime: 113
$ ldapsearch -x -h <TopBDII/siteBDII> -p 2170 -b "o=infosys" +
(...)
# localhost, infosys
dn: Hostname=localhost,o=infosys
structuralObjectClass: UpdateStats
entryUUID: 09bf40e0-7b23-4992-af55-fd74f036a454
creatorsName: o=infosys
createTimestamp: 20110612223435Z
entryCSN: 20110615120723.216201Z#000000#000#000000
modifiersName: o=infosys
modifyTimestamp: 20110615120723Z
entryDN: Hostname=localhost,o=infosys
subschemaSubentry: cn=Subschema
hasSubordinates: FALSE
|ModifiedEntries||The number of objects to modify|
|DeletedEntries||The number of objects to delete|
|UpdateTime||The total update time in seconds|
|FailedAdds||The number of add statements which failed|
|FailedModifies||The number of modify statements which failed|
|TotalEntries||The total number of entries in the database|
|QueryTime||The time taken to query the database|
|NewEntries||The number of new objects|
|DBUpdateTime||The time taken to update the database in seconds|
|ReadTime||The time taken to read the LDIF sources in seconds|
|PluginsTime||The time taken to run the plugins in seconds|
|ProvidersTime||The time taken to run the information providers in seconds|
The BDII metrics above can be used to judge the reliability and availability of a top or site BDII instance.
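As an illustration of such a decision, the Python sketch below parses the numeric attributes of an UpdateStats entry from ldapsearch output and applies two simple checks. The function names and, in particular, the threshold values are assumptions chosen for the example; they are not recommended limits and should be tuned per site.

```python
def parse_update_stats(ldif_text):
    """Collect the integer-valued attributes of an UpdateStats entry.

    Non-numeric lines (dn, objectClass, Hostname, comments) are skipped.
    """
    stats = {}
    for line in ldif_text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def looks_healthy(stats, max_update_time=300, min_total_entries=1000):
    """Hypothetical health rule: the update is fast enough and the database is not empty.

    Both thresholds are illustrative assumptions, not documented limits.
    A missing metric is treated as a failure.
    """
    update_time = stats.get("UpdateTime")
    total_entries = stats.get("TotalEntries")
    if update_time is None or total_entries is None:
        return False
    return update_time <= max_update_time and total_entries >= min_total_entries


# Excerpt of the example entry shown earlier in this section.
SAMPLE = """\
dn: Hostname=localhost,o=infosys
objectClass: UpdateStats
Hostname: lxbra2510.cern.ch
UpdateTime: 150
FailedAdds: 603
TotalEntries: 52702
"""

sample_stats = parse_update_stats(SAMPLE)
sample_ok = looks_healthy(sample_stats)
```

With the example entry, UpdateTime is 150 seconds and TotalEntries is 52702, so the instance passes both illustrative checks.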
More information is available in gLite-BDII_top Monitoring.
In IGI, the DNS update of the number of instances participating in the DNS round robin mechanism depends on the results provided by a Nagios instance.
When Nagios needs to check the status of a service it will execute a plugin and pass information about what needs to be checked. The plugin verifies the operational state of the service and reports the results back to the Nagios daemon.
Nagios will process the results of the service check and take appropriate action as necessary (e.g. send notifications, run event handlers, etc).
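A Nagios plugin reports its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a one-line status message on standard output. The Python sketch below shows how a BDII check could map the UpdateTime metric onto that convention. The function name and the thresholds are assumptions made for this example; it is not the plugin used in IGI.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_bdii(update_time, warn_after=300, crit_after=600):
    """Map an UpdateTime value (seconds) to a Nagios (code, message) pair.

    update_time is None when the LDAP query returned no UpdateStats entry.
    The thresholds are illustrative assumptions, not documented limits.
    """
    if update_time is None:
        return CRITICAL, "BDII CRITICAL - no UpdateStats entry returned"
    if update_time >= crit_after:
        return CRITICAL, f"BDII CRITICAL - UpdateTime {update_time}s"
    if update_time >= warn_after:
        return WARNING, f"BDII WARNING - UpdateTime {update_time}s"
    return OK, f"BDII OK - UpdateTime {update_time}s"


def main(update_time):
    """Print the status line and return the exit code.

    A real plugin would obtain update_time from the BDII and finish
    with sys.exit(code).
    """
    code, message = evaluate_bdii(update_time)
    print(message)
    return code
```

Nagios reads the exit code to set the service state and can then trigger the configured event handler when the state changes.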
Each instance is checked every 5 minutes. If a failure occurs, Nagios runs the
event handler to restart the BDII service AND remove the instance from the DNS
round robin set using
In IBERGRID, an application (developed by LIP) verifies the health of each top BDII. The application can connect to the DNS servers and remove the “A” records of top BDIIs that become unavailable (unresponsive to tests).
The monitoring application (nsupdater) is a simple program that performs tests and, based on their results, acts upon DNS entries.
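The internals of nsupdater are not described here, so the following Python sketch only illustrates the general idea of turning test results into DNS changes. It builds the text of a dynamic update in the command syntax of the standard nsupdate utility. The function name, the TTL value, the placeholder addresses and the rule of never emptying the record set are all assumptions made for this example.

```python
def build_nsupdate_script(server, zone, hostname, published, healthy, ttl=60):
    """Return nsupdate commands that align the published A records with the healthy set.

    published: set of addresses currently in DNS for hostname
    healthy:   set of addresses whose BDII passed the tests
    Returns an empty string when nothing has to change.
    """
    # Assumed safety rule: if every instance fails, leave DNS untouched
    # rather than publish a name with no addresses at all.
    if not healthy:
        return ""

    commands = []
    for address in sorted(published - healthy):
        commands.append(f"update delete {hostname} A {address}")
    for address in sorted(healthy - published):
        commands.append(f"update add {hostname} {ttl} A {address}")

    if not commands:
        return ""
    lines = [f"server {server}", f"zone {zone}"] + commands + ["send"]
    return "\n".join(lines) + "\n"


# Placeholder addresses; 192.0.2.2 is assumed to have failed its test.
script = build_nsupdate_script(
    server="dns.domain.tld",
    zone="core.domain.tld",
    hostname="topbdii.core.domain.tld",
    published={"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    healthy={"192.0.2.1", "192.0.2.3"},
)
```

The resulting text could be fed to nsupdate on standard input, provided the DNS server accepts authenticated dynamic updates for the zone. A recovered instance reappears in the healthy set and is added back by the same function.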