Overview of Monitoring Options in BonFIRE

The monitoring solution offered by BonFIRE is based on the open-source monitoring software Zabbix, with the exception of timestamping (described below). Zabbix comprises two major components: the Zabbix server and the Zabbix agent. In BonFIRE the server is referred to as the ‘aggregator’ and is deployed as a separate resource, whilst an agent resides in each deployed resource image (one per image). The aggregator collects the monitoring information reported by the agents in a database. The database can be stored either inside the aggregator image or on an external storage resource attached to the aggregator as an additional disk.
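
As an illustration of how an agent reports to the aggregator, the snippet below is a minimal zabbix_agentd.conf sketch; the IP address and hostname are placeholders, and on BonFIRE resource images this configuration is typically prepared for you, so the snippet is purely illustrative.

# Minimal zabbix_agentd.conf sketch (placeholder values only).
# 'Server' must point at the aggregator so it can collect this agent's data;
# 'Hostname' is the name under which this VM's metrics are stored.
Server=172.18.240.10          # aggregator address on the BonFIRE WAN (placeholder)
ListenPort=10050              # default Zabbix agent port
Hostname=my-experiment-vm     # placeholder VM name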

The BonFIRE monitoring system lets experimenters store monitoring data in a flexible way. Storage is offered on demand, so the experimenter can request the size they need. Furthermore, depending on the experiment setup, monitoring data is not only available during the experiment but can also be kept after the experiment is deleted or expires.

The BonFIRE API offers an abstraction to access the data stored in the ‘aggregator’.
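
For illustration only, such a request might look like the sketch below; the '/monitoring' path is a placeholder rather than a confirmed route, and the authoritative routes are those documented in the BonFIRE API reference.

# Hypothetical sketch only: '/monitoring' is a placeholder path, not a
# confirmed BonFIRE API route; supply your BonFIRE credentials with -u.
curl -u <username> https://api.bonfire-project.eu/experiments/<experiment-id>/monitoring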

VM Monitoring

Many metrics are monitored at the VM level. More than 100 metrics are monitored by default and can be activated or deactivated on request. Per VM these include, but are not limited to: total, used and free memory; total, used and free swap memory; total, used and free storage; CPU load; CPU utilization (%); CPU count; network metrics (e.g. incoming and outgoing traffic on interfaces); OS-related metrics; process-related metrics (e.g. number of running processes); and service-related metrics (e.g. whether an FTP, email or SSH server is running). More information and guidelines can be found on the page describing how to set up monitoring.
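
Many of these metrics correspond to standard Zabbix item keys, so an individual value can also be queried directly from a VM's agent with the standard zabbix_get tool; the hostname below is a placeholder.

# Query a VM's Zabbix agent directly (the hostname is a placeholder).
zabbix_get -s my-experiment-vm -p 10050 -k system.cpu.load[all,avg1]   # 1-minute CPU load
zabbix_get -s my-experiment-vm -p 10050 -k vm.memory.size[available]   # available memory in bytes
zabbix_get -s my-experiment-vm -p 10050 -k proc.num[]                  # number of running processes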

Application Monitoring

Application monitoring addresses metrics which provide information about the state of the running application, its performance and other application-specific information. These metrics are not provided by Zabbix by default; the experimenter needs to configure them explicitly at both the agent and the server.
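
On the agent side this is typically done with a Zabbix UserParameter entry in zabbix_agentd.conf, as in the sketch below; the key name and command are placeholders, and a matching item with the same key must also be created on the aggregator before the values are collected.

# Illustrative application metric (key name and command are placeholders).
# An item with key 'myapp.queue_length' must also be defined on the aggregator.
UserParameter=myapp.queue_length,/usr/local/bin/myapp-stats --queue-length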

Infrastructure Monitoring

BonFIRE gives experimenters the ability to obtain monitoring data about the physical machines that run their VMs. We refer to this service as infrastructure monitoring. The most frequently requested infrastructure metrics are: CPU load, total and free memory, free swap memory, the number of VMs on a physical node, disk I/O, disk reads/writes, incoming and outgoing traffic on interfaces, energy usage, estimated CO2 emissions, etc.

Other Observability Options

The BonFIRE OpenNebula (ONE) testbed sites (EPCC, Inria, HLRS, PSNC) provide experimenters with the deployment log files of their VMs and the system status of the worker nodes, so that experimenters can check the status of their VMs in real time and get an overview of the current workload of the infrastructure.

ONE VM Logs

The deployment log file of each VM is exposed to experimenters. There are two ways to retrieve the log files:

  • Using the BonFIRE Portal
  • Using the URL at each testbed site:
Testbed  VM logs
HLRS     http://nebulosus.rus.uni-stuttgart.de/logs/{vm-id}.log
Inria    http://frontend.bonfire.grid5000.fr/logs/{vm-id}.log
EPCC     http://bonfire.epcc.ed.ac.uk/logs/{vm-id}/vm.log
PSNC     http://bonfire.psnc.pl/logs/{vm-id}/vm.log
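
For example, the deployment log of a VM running at HLRS can be downloaded by substituting the VM's numeric ID for {vm-id} (4711 below is a placeholder):

# Fetch the deployment log of VM 4711 (placeholder id) at HLRS
curl -o vm.log http://nebulosus.rus.uni-stuttgart.de/logs/4711.log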

ONE Status Pages

An overview of the current workload of an infrastructure can help in planning and scheduling future experiments. The worker node status is provided in two formats:

  • .txt format: a human-readable version
  • .xml format: a machine-readable version
Testbed  Worker node status (txt)                               Worker node status (xml)
HLRS     http://nebulosus.rus.uni-stuttgart.de/one-status.txt   http://nebulosus.rus.uni-stuttgart.de/one-status.xml
EPCC     http://bonfire.epcc.ed.ac.uk/one-status.txt            http://bonfire.epcc.ed.ac.uk/one-status.xml
Inria    http://frontend.bonfire.grid5000.fr/one-status.txt     http://frontend.bonfire.grid5000.fr/one-status.xml
PSNC     http://bonfire.psnc.pl/one-status.txt                  http://bonfire.psnc.pl/one-status.xml
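
For example, the current worker node status at EPCC can be retrieved as follows:

# Human-readable worker node status at EPCC
curl http://bonfire.epcc.ed.ac.uk/one-status.txt
# Machine-readable version, e.g. for automated planning of deployments
curl -o one-status.xml http://bonfire.epcc.ed.ac.uk/one-status.xml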

Virtual Wall Status page

An overview of the current workload of the Virtual Wall at iMinds can be found at: http://ssh.be-ibbt.bonfire-project.eu/production/availability

Timestamping

BonFIRE exposes to experimenters the times at which certain events occur in the BonFIRE stack.

Experiment States

The following experiment states have been defined:

[Figure: timeline of the experiment states (CESGA_Timestamps.jpg)]

Please note the following:

  • Requested, Accepted, Queued, Deploying and Deployed states only exist for managed experiments, i.e. those submitted via the Experiment Manager

  • “Ready” can be calculated by establishing the time at which the last VM in an experiment enters the RUNNING state

    Attention is drawn to the case of elasticity-generated VMs; experimenters should decide whether and how these affect the calculation of “Ready”.

Experiment Manager

Starting from BonFIRE Release 3, each component within an experiment is expected to log its own state changes. These logs should be accessible from the computers hosting each component. An exception to this is the Experiment Manager, which logs the state changes and times during deployment to a log file for each managed experiment. On successfully deploying the components within a managed experiment, the Experiment Manager automatically requests that the experiment state be set to RUNNING. The log file for a typical deployment is shown below, demonstrating the QUEUED, DEPLOYABLE, DEPLOYING and DEPLOYED states.

GET api.bonfire-project.eu/managed_experiments/680/log
"2012-08-14 09:46:18:605 UTC: STATUS QUEUED",
"2012-08-14 09:46:18:607 UTC: STATUS DEPLOYABLE",
"2012-08-14 09:46:18:723 UTC: STATUS DEPLOYING",
"2012-08-14 09:46:18:723 UTC: Creating an experiment called My Experiment ",
"2012-08-14 09:46:18:882 UTC: Created broker experiment",
"2012-08-14 09:46:19:134 UTC: Broker experiment URI is: https://api.bonfire-project.eu/experiments/12079",
"2012-08-14 09:46:19:134 UTC: Processing resource BonFIRE WAN",
"2012-08-14 09:46:19:134 UTC: Starting to deploying network BonFIRE WAN to fr-inria.",
"2012-08-14 09:46:19:191 UTC: Looking up network BonFIRE WAN",
"2012-08-14 09:46:21:532 UTC: Deployed resource uri is http://localhost:8000/locations/fr-inria/networks/53",
"2012-08-14 09:46:21:532 UTC: Processing resource BonFIRE WAN",
"2012-08-14 09:46:21:532 UTC: Starting to deploying network BonFIRE WAN to uk-epcc.",
"2012-08-14 09:46:21:732 UTC: Looking up network BonFIRE WAN",
"2012-08-14 09:46:23:495 UTC: Deployed resource uri is http://localhost:8000/locations/uk-epcc/networks/2",
"2012-08-14 09:46:23:495 UTC: Processing resource BonFIRE Debian Squeeze v3",
"2012-08-14 09:46:23:496 UTC: Starting to deploy storage BonFIRE Debian Squeeze v3 to fr-inria.",
"2012-08-14 09:46:23:579 UTC: Looking up storage BonFIRE Debian Squeeze v3.",
"2012-08-14 09:46:26:058 UTC: Deployed resource uri is http://localhost:8000/locations/fr-inria/storages/1",
"2012-08-14 09:46:26:058 UTC: Processing resource BonFIRE Debian Squeeze v3",
"2012-08-14 09:46:26:059 UTC: Starting to deploy storage BonFIRE Debian Squeeze v3 to uk-epcc.",
"2012-08-14 09:46:26:153 UTC: Looking up storage BonFIRE Debian Squeeze v3.",
"2012-08-14 09:46:28:365 UTC: Deployed resource uri is http://localhost:8000/locations/uk-epcc/storages/118",
"2012-08-14 09:46:28:365 UTC: Processing resource Server",
"2012-08-14 09:46:28:365 UTC: Starting to deploying compute Server to uk-epcc.",
"2012-08-14 09:46:28:475 UTC: Creating compute Server.",
"2012-08-14 09:46:31:752 UTC: Deployed resource uri is http://localhost:8000/locations/uk-epcc/computes/12727",
"2012-08-14 09:46:31:752 UTC: Processing resource Client",
"2012-08-14 09:46:31:752 UTC: Starting to deploying compute Client to fr-inria.",
"2012-08-14 09:46:31:839 UTC: Context ServerIP => 172.18.6.149",
"2012-08-14 09:46:31:840 UTC: Creating compute Client.",
"2012-08-14 09:46:34:269 UTC: Deployed resource uri is http://localhost:8000/locations/fr-inria/computes/24414",
"2012-08-14 09:46:34:269 UTC: All of the resources have been deployed.",
"2012-08-14 09:46:34:460 UTC: STATUS DEPLOYED",
"2012-08-14 09:46:34:621 UTC: A request has been sent to set the experiment status to RUNNING"

Resource states

BonFIRE resource states are exposed to the users. These data are available from the VM logs and from the Experiment Message Queue. Additionally, the Resource Manager keeps a log for each experiment. The following example illustrates the request and the format of the response:

GET api.bonfire-project.eu/experiments/60955/events
<?xml version="1.0" encoding="UTF-8"?>
<events>
  <event>
    <experiment_id>60955</experiment_id>
    <kind>compute</kind>
    <status>created</status>
    <path>/locations/uk-epcc/computes/34758</path>
    <timestamp>2014-03-20 16:42:34 UTC</timestamp>
  </event>
  <event>
    <experiment_id>60955</experiment_id>
    <kind>compute</kind>
    <status>destroyed</status>
    <path>/locations/uk-epcc/computes/34758</path>
    <timestamp>2014-03-20 17:02:27 UTC</timestamp>
  </event>
  <event>
    <experiment_id>60955</experiment_id>
    <kind>experiment</kind>
    <status>terminated</status>
    <path>/experiments/60955</path>
    <timestamp>2014-03-20 17:02:45 UTC</timestamp>
  </event>
</events>
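
Since the response is plain XML, the events can easily be processed with standard tools; for instance, the following sketch (assuming your BonFIRE credentials are passed with -u) lists the timestamps of all 'created' events using xmllint:

# Fetch the event log of experiment 60955 and list when resources were created
curl -s -u <username> https://api.bonfire-project.eu/experiments/60955/events -o events.xml
xmllint --xpath '//event[status="created"]/timestamp/text()' events.xml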

These events are also available from the Monitoring tab of the experiment on the Portal, as shown in the following figure.

[Figure: experiment event log in the Portal's Monitoring tab (EventLog.jpg)]

It is worth noting that these logs are available even after the experiment has been deleted.