Monitoring Core

BonFIRE provides its users with experiment monitoring facilities that support three types of metrics: VM metrics, application metrics and infrastructure metrics. Furthermore, monitoring data can, if required, be stored permanently, so that it remains available to the user both during the experiment lifetime and after its expiration. In this case the monitoring data is kept in an external storage whose size is chosen by the user as required.

BonFIRE provides this functionality through the Zabbix open source monitoring software. Zabbix adopts a client/server approach in which the monitoring aggregator plays the role of the server and the monitoring agents are the clients. Experimenters are free to deploy aggregators and agents in whatever way they wish, but BonFIRE provides explicit support for the pattern where a single monitoring aggregator is deployed for each experiment. This aggregator collects data from several monitoring agents deployed throughout the experiment and possibly also from infrastructure-level aggregators deployed at each testbed.

The monitoring agents report monitoring data to the aggregator. To overcome possible accessibility problems caused by NAT, BonFIRE uses active Zabbix agents: the agent initiates the communication with the aggregator and pushes the monitoring data to it.

The Zabbix software provides a web interface through which the user can view the latest values of the configured metrics and plot the collected values in a graph. This web interface is exposed via the BonFIRE Portal, which automatically maps the user through to the correct Zabbix aggregator for the experiment being viewed.

Zabbix also provides an API that enables users to fetch monitoring data using JSON-RPC. This API is exposed via the Resource Manager which automatically maps requests to the correct Zabbix aggregator for the experiment.
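For illustration, the sketch below queries an experiment aggregator's Zabbix JSON-RPC endpoint directly with cURL rather than through the Resource Manager. The endpoint path and the credentials are assumptions (the usual Zabbix frontend location, and the Admin user with the experiment aggregator password described later in this document); the aggregator IP is just an example value.

# Minimal sketch, assuming the standard Zabbix frontend path on the aggregator
# and the Admin user with the experiment aggregator password (see below)
AGGREGATOR_IP=172.18.3.156                     # example value
API="http://${AGGREGATOR_IP}/zabbix/api_jsonrpc.php"

# Authenticate (Zabbix 1.8 uses "user.authenticate"; newer servers accept "user.login")
TOKEN=$(curl -s -H 'Content-Type: application/json-rpc' "$API" -d '{
  "jsonrpc": "2.0", "method": "user.authenticate",
  "params": {"user": "Admin", "password": "xxxxxx"}, "id": 1}' |
  sed 's/.*"result":"\([^"]*\)".*/\1/')

# Fetch some collected history values (value type 3 = numeric unsigned)
curl -s -H 'Content-Type: application/json-rpc' "$API" -d '{
  "jsonrpc": "2.0", "method": "history.get",
  "params": {"history": 3, "output": "extend", "limit": 10},
  "auth": "'"$TOKEN"'", "id": 2}'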

BonFIRE offers deployable image packages for monitoring. These are virtual machine images with the monitoring software already installed and configured.

The following sections describe the dependencies, installation and configuration steps for the aggregator and agent image packages.

BonFIRE Monitoring Images

The monitoring images provided by BonFIRE are Debian images. The following sections describe the steps necessary to deploy these images in the BonFIRE infrastructure using OCCI requests. As mentioned before, these images have the Zabbix software already installed and configured. The BonFIRE monitoring aggregator image has both the Zabbix server and the Zabbix agent installed, while the other images have only the Zabbix agent installed.

BonFIRE Aggregator Image

The aggregator has been made available in the form of a dedicated virtual machine image containing an installation of the Zabbix monitoring software. This image is deployed like any other virtual machine image – no further configuration by the experimenter is required. The only requirement for the VM running the aggregator is that it must have an IP address that is reachable from the other VMs in the experiment and by the Resource Manager and Portal. This is necessary to enable the monitoring agents deployed on the individual machines to contact the aggregator and to enable the Resource Manager and Portal to expose the Zabbix API and web interface respectively.

Requirements

The aggregator image has a predefined image_id at each BonFIRE testbed. This id may change during the lifetime of the infrastructure, but since the image name is always the same, experimenters can easily identify it. The same applies to the network resources.
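As a hedged illustration (the exact collection path is an assumption, modelled on the storage hrefs used in the OCCI requests below), the storages offered at a site can be listed through the BonFIRE API and the aggregator image identified by its name:

# List the storage resources at a site and look up the aggregator image by name
# (collection path assumed from the hrefs used elsewhere in this document;
# curl -kn reads the BonFIRE credentials from ~/.netrc)
curl -kn https://api.bonfire-project.eu/locations/site_name/storages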

Installation

The following OCCI request, when sent to the BonFIRE Broker, triggers the creation of the aggregator compute resource at the specified site.

<compute xmlns='http://api.bonfire-project.eu/doc/schemas/occi'>
  <name>Monitoring Aggregator</name>
  <description>My experiment monitoring aggregator</description>
  <instance_type>small</instance_type>
  <disk>
    <storage href='/locations/site_name/storages/aggregator_image_id'/>
    <type>OS</type>
    <target>hda</target>
  </disk>
  <nic>
    <network href='/locations/site_name/networks/network_id'/>
  </nic>
  <context>
    <usage>zabbix-agent;zabbix-aggr-extend;infra-monitoring-init;log-MQevents-in-zabbix</usage>
  </context>
  <link href='/locations/site_name' rel='location'/>
</compute>

Users may decide which monitoring services are required while creating the aggregator compute resource. The user's setup is forwarded through the contextualisation element of the OCCI request, which passes simple key-value pairs carrying initialisation configuration values and post-install scripts that run after the VM has been deployed. In the context element of the OCCI request above, the usage variable lists the post-install scripts to be executed after the image boots. This is discussed in more detail in the contextualisation usage section below.
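As a hedged sketch, the document above could be submitted to the Broker with cURL; the computes collection path and the content type are assumptions modelled on the service-listing example at the end of this document, and the request is assumed to have been saved as aggregator.xml:

# Submit the aggregator compute document above to the BonFIRE Broker
# (path and content type are assumptions; curl -kn reads credentials from ~/.netrc)
curl -kn -X POST \
     -H 'Content-Type: application/vnd.bonfire+xml' \
     --data @aggregator.xml \
     https://api.bonfire-project.eu/locations/site_name/computes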

BonFIRE Base Image

The BonFIRE Base Image contains the Zabbix agent, which can be configured through contextualisation information. The Zabbix agent software is also preinstalled in the other images provided by BonFIRE. The agent needs to be configured with the IP address of the aggregator; this configuration is realised through the contextualisation mechanisms of OCCI, as discussed below. After startup, the agent registers itself with the aggregator, from which point on the agent machine is fully integrated into the experiment's monitoring system.

Requirements

As described for the aggregator image, the user needs to identify the image_id allocated to the base image before performing the installation.

Installation

The following OCCI request, when sent to the BonFIRE Broker, triggers the creation of the agent compute resource at the specified site.

<compute xmlns='http://api.bonfire-project.eu/doc/schemas/occi'>
  <name>Monitoring Agent</name>
  <description>My experiment monitoring agent</description>
  <instance_type>small</instance_type>
  <disk>
    <storage href='/locations/site_name/storages/image_id'/>
    <type>OS</type>
    <target>hda</target>
  </disk>
  <nic>
    <network href='/locations/site_name/networks/network_id'/>
  </nic>
  <context>
    <aggregator_ip>172.18.3.156</aggregator_ip>
    <usage>zabbix-agent</usage>
    <metrics><![CDATA[<metric> users,wc -l /etc/passwd|cut -d" " -f1, rate=20, valuetype=3, history=10 </metric>]]></metrics>
  </context>
  <link href='/locations/site_name' rel='location'/>
</compute>
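For reference, the custom metric defined in the context element above corresponds to Zabbix's standard custom-metric mechanism. The sketch below shows the equivalent manual agent configuration; the exact steps performed by the zabbix-agent contextualisation script, and the agent configuration path, are assumptions.

# Equivalent manual configuration of the custom "users" metric on the agent
# (the configuration path is the usual Debian default and is an assumption here)
echo 'UserParameter=users,wc -l /etc/passwd|cut -d" " -f1' >> /etc/zabbix/zabbix_agentd.conf
/etc/init.d/zabbix-agent restart
# The matching item (key "users", value type 3 = numeric, 20 s rate, 10 days of
# history) must also exist on the aggregator; configureServer.py creates it
# through the Zabbix API when the metric is passed via contextualisation.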

Contextualisation Usage for Requesting Monitoring Services

BonFIRE provides its users with multiple monitoring services; these are:

  • Compute resource (VM) monitoring. The deployed compute resources are monitored. Various metrics are measured, such as CPU, memory, storage and network. The aggregator image includes in its Zabbix server a template called BonFIRE Template. This template defines more than 100 metrics; only 34 of them are active and the rest are provided disabled, but users can enable them afterwards if needed. This service requires an aggregator compute resource.
  • Monitoring the user's applications or services running on these VMs. The experimenter can further configure the agent by defining personalised metrics (also called custom, user or application metrics) in order to measure the performance and behaviour of the applications under test. These metrics are measured and their data is sent to the aggregator. This can be done through the standard mechanisms of the Zabbix software or via the contextualisation section of the BonFIRE OCCI request that creates the VM.
  • Storing the monitoring data. The user has multiple options on where to store the monitoring data of the experiment: either inside or outside the aggregator image. In the second option, the database of the aggregator is stored in an external storage (permanently available if required) that is mounted as an additional disk on the aggregator VM. This option offers more flexibility: the experimenter can set the storage size for the monitoring data on demand, and the data remains available after the experiment's expiration or deletion. As a third option, the experimenter can re-use an external storage resource that was already used in a previous experiment. All these options are available through the BonFIRE Portal. By default the aggregator is created with an external, permanent storage of 1 GB.
  • Infrastructure monitoring. The user can get partial information about the physical machines hosting his VMs.
  • Logging experiment events into Zabbix. The user can log all events related to the experiment, such as creating, stopping, suspending and destroying any resource (compute, network, storage), into the Zabbix database.

The contextualisation usage of the OCCI request to create an aggregator

BonFIRE provides its users with multiple tools: the BonFIRE Portal, Experiment Descriptors, OCCI/API via HTTP (cURL), Command Line Interface Tools, and Restfully. Using any of these the user can create experiments and resources. In order to use the monitoring services, an aggregator is required. If the user wants the aggregator compute resource itself to be monitored, the monitoring agent inside the aggregator must be running; this is done by setting the usage variable of the context element of the OCCI request to zabbix-agent.

If an external data storage for the monitoring data is to be used, it should be created before the aggregator compute resource. If the data is needed only during the experiment lifetime, the storage can be created as part of the experiment; however, a storage created as part of the experiment is not available after the experiment expires or is deleted, so if the user wants the data to be permanently available, the external storage must be created outside the experiment. This storage is mounted on the aggregator compute resource as an external disk. In BonFIRE this kind of storage is called a datablock. In this case an additional disk should be added to the OCCI request as shown below:

<disk>
  <storage href='/locations/site_name/storages/datablock_id'/>
  <type>DATABLOCK</type>
  <target>hdb</target>
</disk>
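A hedged sketch of creating such a datablock outside the experiment is shown below; the storage element names, the size unit (MB) and the storages collection path are assumptions modelled on the other OCCI documents in this section.

# Create a 1 GB datablock at the site, outside any experiment, so that it
# survives the experiment's expiration (element names and size unit are assumptions)
cat > datablock.xml <<'EOF'
<storage xmlns='http://api.bonfire-project.eu/doc/schemas/occi'>
  <name>Monitoring datablock</name>
  <description>External storage for monitoring data</description>
  <type>DATABLOCK</type>
  <size>1024</size>
  <link href='/locations/site_name' rel='location'/>
</storage>
EOF
curl -kn -X POST -H 'Content-Type: application/vnd.bonfire+xml' \
     --data @datablock.xml \
     https://api.bonfire-project.eu/locations/site_name/storages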

The Zabbix database is moved into this storage and the MySQL server is configured accordingly. This is done by the contextualisation script called zabbix-aggr-extend, which is executed after the aggregator compute resource boots. The script is passed through the contextualisation, as seen in the OCCI request above.

BonFIRE users can get monitoring information about the physical machines that host their VMs. This service is activated if the usage variable of the context element contains the infra-monitoring-init post-install script.

A BonFIRE experimenter might need to track all the events of the experiment, such as creating, deleting, stopping or suspending resources, and have this information stored in the aggregator. This is done by adding log-MQevents-in-zabbix to the usage variable.

All these services are optional for BonFIRE users. While creating an experiment with an aggregator, the user can request any or all of them. The Portal presents these optional services as check boxes that the user can check or uncheck. If the user uses a BonFIRE Experiment Descriptor or any other tool to create experiments, he has to be aware that the usage variable must be set correctly. To this end, the settings of the usage variable for the various combinations of monitoring services are listed below.

  • If the aggregator does not monitor itself, and neither an external storage nor infrastructure monitoring is needed: the usage variable is omitted entirely.

  • If the aggregator monitors itself only:
    <usage>zabbix-agent</usage>
    
  • If an external storage is used:
    <usage>zabbix-aggr-extend</usage>
    
  • If infrastructure monitoring is requested:
    <usage>infra-monitoring-init</usage>
    
  • If the aggregator monitors itself and an external storage is used:
    <usage>zabbix-agent;zabbix-aggr-extend</usage>
    
  • If an external storage is used and infrastructure monitoring is requested:
    <usage>zabbix-aggr-extend;infra-monitoring-init</usage>
    
  • If the aggregator monitors itself, an external storage is used, and infrastructure monitoring is requested:
    <usage>zabbix-agent;zabbix-aggr-extend;infra-monitoring-init</usage>
    
  • If logging of experiment events is requested:
    <usage>zabbix-agent;log-MQevents-in-zabbix</usage>

    (in this case the aggregator must also monitor itself, so that the new logging metric that exposes the experiment events through Zabbix is among its monitored metrics)

  • If the aggregator monitors itself and logging of experiment events is requested:
    <usage>zabbix-agent;log-MQevents-in-zabbix</usage>

  • If an external storage is used and logging of experiment events is requested:
    <usage>zabbix-agent;zabbix-aggr-extend;log-MQevents-in-zabbix</usage>

    (as before, the aggregator must also monitor itself, so that the new logging metric that exposes the experiment events through Zabbix is among its monitored metrics)

  • If infrastructure monitoring and logging of experiment events are requested:
    <usage>zabbix-agent;infra-monitoring-init;log-MQevents-in-zabbix</usage>

    (as before, the aggregator must also monitor itself, so that the new logging metric that exposes the experiment events through Zabbix is among its monitored metrics)

  • If the aggregator monitors itself, an external storage is used, and logging of experiment events is requested:
    <usage>zabbix-agent;zabbix-aggr-extend;log-MQevents-in-zabbix</usage>

  • If the aggregator monitors itself, and both infrastructure monitoring and logging of experiment events are requested:
    <usage>zabbix-agent;infra-monitoring-init;log-MQevents-in-zabbix</usage>

  • If an external storage is used, and both infrastructure monitoring and logging of experiment events are requested:
    <usage>zabbix-agent;zabbix-aggr-extend;infra-monitoring-init;log-MQevents-in-zabbix</usage>

  • If the aggregator monitors itself, an external storage is used, and both infrastructure monitoring and logging of experiment events are requested:
    <usage>zabbix-agent;zabbix-aggr-extend;infra-monitoring-init;log-MQevents-in-zabbix</usage>
    

The contextualisation usage of the OCCI request to create a compute resource

Upon creation of a compute resource (VM), the IP of the aggregator should be specified through the context element of the OCCI request. The usage variable is used to pass the zabbix-agent post-install script. This script is in charge of configuring the preinstalled Zabbix agent on the VM with the aggregator_ip, and also of configuring the Zabbix agent and Zabbix aggregator to include the user metrics passed through the metrics variable of the context element.

The context element is set automatically if the experimenter uses the BonFIRE Portal to create the resources (the experiment). If the Portal is not used, this configuration has to be done manually.

How to deploy Monitoring Contextualisation Scripts

In order to support monitoring, the monitoring and related contextualisation scripts must be available at the testbed site. At OpenNebula sites, these are delivered in the so-called "ISO image", which also includes:

context.sh: a file that contains configuration variables, filled by OpenNebula with the parameters specified in the VM description file

init.sh: a script called by the VM at start-up that configures specific services for this VM instance

The monitoring scripts are available in the SVN at https://scm.gforge.inria.fr/svn/bonfire-dev/vm-images/branches/production/, as follows:

At …/context/lib/
SubscribeMQ.py
configureServer.py
get-infra-monitoring-data.cfg
get-infra-monitoring-data.py
infra_monitoring
log-MQevents-in-zabbix.py
update-zabbix-pw
zabbix-add-metric
zabbix_api.py

At …/context/common/
infra-monitoring-init
zabbix-agent
log-MQevents-in-zabbix
zabbix-aggr-extend

The role of the init.sh script for monitoring

  • Copy the contextualisation variables into /etc/default/bonfire

  • Run all scripts located at context/distributions/debian/; among these scripts is 03_zabbix_pw, which is in charge of updating the Zabbix database password with the experiment password created by the BonFIRE API while the experiment is being created

  • Copy monitoring files to the intended locations inside the VM as follows:

    cp lib/configureServer.py /usr/local/lib/
    cp lib/zabbix_api.py /usr/local/lib
    cp lib/zabbix-add-metric /usr/local/bin
    chmod 755 /usr/local/bin/zabbix-add-metric
  • Parse the usage variable and, if it is set, execute the corresponding scripts (a sketch of this step follows this list). In a normal VM the usage variable, if set, contains only zabbix-agent, but in the aggregator VM it can hold up to four values (script names), as discussed above: zabbix-agent;zabbix-aggr-extend;infra-monitoring-init;log-MQevents-in-zabbix. Some of these scripts copy further scripts they need from the mounted disk.
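The real init.sh is not reproduced in this section; the following is only a minimal sketch of the usage-parsing step, in which the mount point of the contextualisation disk and the exact loop are assumptions (the scripts live under context/common/ as listed above).

# Minimal sketch of the usage-parsing step (the real init.sh is not shown here;
# the mount point of the contextualisation disk is an assumption)
. /etc/default/bonfire                         # provides USAGE, AGGREGATOR_IP, ...
for script in $(echo "$USAGE" | tr ';' ' '); do
    if [ -x "/mnt/context/common/$script" ]; then
        "/mnt/context/common/$script"          # e.g. zabbix-agent, zabbix-aggr-extend
    fi
done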

More clarifications about the monitoring scripts

Monitoring is supported at three levels: VM, application/service and infrastructure.

VM-level and application/service-level monitoring is supported by the following Python and shell scripts: configureServer.py, zabbix_api.py, zabbix-add-metric, zabbix-agent, and zabbix-aggr-extend.

The zabbix-agent script is used to run a Zabbix agent on a VM and also to reconfigure the agent if the user adds new metrics (application/service-related metrics) to be monitored. These additional metrics are defined by the user while creating the VM and are received by the VM through the contextualisation. The Zabbix server must be reconfigured accordingly, by creating and configuring these metrics on the Zabbix server (the BonFIRE aggregator). The zabbix-agent script calls the configureServer.py script and passes it the names and settings of the metrics as input.

The configureServer.py script is responsible for reconfiguring the Zabbix server database. It uses the Zabbix API (zabbix_api.py) for remote configuration. The zabbix-agent script should be run while booting the VM image.

To add further metrics to a running VM, the zabbix-add-metric script is used. It reconfigures the Zabbix agent on the VM and configures the Zabbix server by calling the configureServer.py script.

These scripts should be located as follows:

/usr/local/lib/configureServer.py
/usr/local/lib/zabbix_api.py
/usr/local/bin/zabbix-add-metric

Since all monitoring scripts are received through the contextualisation, the following lines should be added to the init.sh file:

cp …/configureServer.py /usr/local/lib/
cp …/zabbix_api.py /usr/local/lib
cp …/zabbix-add-metric /usr/local/bin
chmod 755 /usr/local/bin/zabbix-add-metric

Some contextualisation information is needed (e.g. aggregator_ip, bonfire_uri, bonfire_experiment_id, bonfire_experiment_routing_key, bonfire_experiment_aggregator_password); this information can be obtained from the /etc/default/bonfire file, which is a copy of the context.sh script.
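Since /etc/default/bonfire uses shell KEY="value" syntax (see the example later in this section), a script can simply source it; a minimal sketch:

# Read contextualisation values inside a VM (variable names as in the example
# /etc/default/bonfire shown later in this section)
. /etc/default/bonfire
echo "Aggregator IP:  $AGGREGATOR_IP"
echo "Experiment ID:  $BONFIRE_EXPERIMENT_ID"
echo "BonFIRE API:    $BONFIRE_URI"
echo "Routing key:    $BONFIRE_EXPERIMENT_ROUTING_KEY"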

In order to use an external data storage for the database of the Zabbix server (the BonFIRE aggregator), the zabbix-aggr-extend script is used in the BonFIRE aggregator VM. This script should be run while booting the image.

Infrastructure-Level Monitoring

The infra-monitoring-init script is responsible for starting the physical infrastructure monitoring. The script is run while booting the VM image. It copies the infrastructure monitoring scripts (get-infra-monitoring-data.py, get-infra-monitoring-data.cfg, and SubscribeMQ.py), which are received through the contextualisation, from /mnt/* to their new locations under /usr/local/*.

It also copies the infra_monitoring script to /etc/init.d/, so that it is executed every time the image is restarted.
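A hedged sketch of this step on Debian is shown below; whether infra-monitoring-init registers the script in exactly this way, and the mount path it copies from, are assumptions.

# Install the infra_monitoring init script so it runs on every (re)boot
# (exact commands used by infra-monitoring-init, and the source path, are assumptions)
cp /mnt/context/lib/infra_monitoring /etc/init.d/infra_monitoring
chmod 755 /etc/init.d/infra_monitoring
update-rc.d infra_monitoring defaults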

The log-MQevents-in-zabbix script is in charge of logging the experiment events into Zabbix. It copies log-MQevents-in-zabbix.py into /usr/local/lib/. This Python script subscribes to the message queue and gets notified when new events occur; it also contacts the BonFIRE API to get the events that occurred before the aggregator VM booted. The script creates a new log file called mqevents under /var/log/ and stores the experiment events in it. This file is read by a specific log metric that should already be included in the BonFIRE Template in the aggregator's Zabbix database. This metric should be defined in the BonFIRE Template as follows:

Description: Message queue events
Type: Zabbix agent (active)
Key: log[/var/log/mqevents]
Type of information: Log
Status: Active
Log time format: pppppp:yyyyMMdd:hhmmss
Applications: Log files (or a new application, e.g. BonFIRE Items)
For more information on log items in Zabbix, see https://www.zabbix.com/documentation/2.0/manual/config/items/itemtypes/log_items

If both the infrastructure monitoring and the experiment event logging services are requested (by setting the usage variable to infra-monitoring-init;log-MQevents-in-zabbix), SubscribeMQ.py and log-MQevents-in-zabbix.py check for each other; if both are running, log-MQevents-in-zabbix.py stops and exits, and SubscribeMQ.py handles both the infrastructure monitoring and the experiment event logging. This avoids two subscriptions to the message queue for the same work.

Dependencies

Zabbix version 1.8.16 is used. The latest version of MySQL is used as the database. The monitoring scripts are written in Python, therefore Python (latest version) is required. The infrastructure monitoring scripts require SQLite (latest version).
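The BonFIRE images ship these dependencies preinstalled; as a hedged sketch, the rough equivalent on a plain Debian system would be the following (the package names are assumptions):

# Rough equivalent of the preinstalled dependencies on a plain Debian system
# (package names are assumptions; the BonFIRE images already include them)
apt-get install -y zabbix-server-mysql zabbix-agent mysql-server python sqlite3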

Libraries used by the monitoring scripts in the aggregator

  • python-argparse (latest version)
  • xml.dom.minidom (latest version)
  • zabbix_api (Zabbix 1.8 version)

Further libraries (latest versions) are either preinstalled or installed by the initialisation script of the infrastructure monitoring service, which is copied through the contextualisation while booting the VMs:

apt-get install -y python-amqplib
apt-get install -y python-pycurl

Contextualisation (context.sh file)

The monitoring software depends on the context.sh file, which contains the contextualisation (configuration) variables filled by OpenNebula with the parameters specified in the VM description file. These variables are set either at the user level (the context element of the OCCI request: usage, metrics, aggregator_ip, etc.), by the BonFIRE API (bonfire_credentials, bonfire_uri, bonfire_experiment_id, bonfire_experiment_routing_key, bonfire_experiment_aggregator_password, etc.) or at the testbed site (bonfire_resource_id, bonfire_resource_name). This file is included in the contextualisation image that is mounted in the created VM while booting. This image also includes the init.sh script, which is called by the VM at start-up. The script configures specific services for this VM instance; one of its tasks is to copy the content of the context.sh file into /etc/default/bonfire. This file will include at least the following variables:

# Context variables generated by OpenNebula
AGGREGATOR_IP="172.18.253.188"
AUTHORIZED_KEYS="ssh-rsa xxxxxxx user public key xxxxxxxxxxxxxx"
BONFIRE_CREDENTIALS="user_name:xxxxxxxxxxxx"
BONFIRE_EXPERIMENT_AGGREGATOR_PASSWORD="u637xp"
BONFIRE_EXPERIMENT_EXPIRATION_DATE="1382666283"
BONFIRE_EXPERIMENT_ID="43211"
BONFIRE_EXPERIMENT_ROUTING_KEY="3557e68e22bd4e8e32dc"
BONFIRE_PROVIDER="uk-epcc"
BONFIRE_RESOURCE_ID="24347"
BONFIRE_RESOURCE_NAME="probeMachine"
BONFIRE_URI="https://api.bonfire-project.eu"
DNS_SERVERS="172.18.3.1"
ETH0_GATEWAY="172.18.240.1"
ETH0_IP="172.18.240.106"
ETH0_MASK="255.255.248.0"
FILES="/srv/cloud/context /srv/cloud/context/lib /srv/cloud/context/distributions /srv/cloud/context/sites /srv/cloud/context/common /srv/cloud/context/init.sh"
HOSTNAME="probeMachine-24347"
LOG="http://bonfire.epcc.ed.ac.uk/logs/24347/vm.log"
NTP_SERVERS="129.215.175.126"
TARGET="sdb"
USAGE="zabbix-agent"

The user can access the monitoring data through the aggregator GUI, or query the database through the Zabbix API, using the username Admin and the experiment password. There are three ways to find the experiment password:

  • in the experiment XML description:

    <aggregator_password>xxxxxx</aggregator_password>

  • in the XML description of any of the experiment VMs:

    <bonfire_experiment_aggregator_password>xxxxxx</bonfire_experiment_aggregator_password>

  • in the context of any of the experiment VMs, located inside the VM at /etc/default/bonfire: BONFIRE_EXPERIMENT_AGGREGATOR_PASSWORD="xxxxxx".

This password is created by the BonFIRE Broker, one per experiment, while the experiment is being created. The password is passed through the contextualisation (/etc/default/bonfire) to all VMs. While the aggregator VM boots, the update-zabbix-pw shell script, which is one of the contextualisation scripts mounted while booting the image, is executed by the context/distributions/debian/03_zabbix_pw script, which in turn is executed by the init.sh file. The 03_zabbix_pw script checks whether it is running inside an aggregator image; if so, it does its job, otherwise it exits silently.

BonFIRE API

The monitoring scripts contact the BonFIRE API through its URI in order to get the IPs of the site aggregators that are responsible for monitoring the physical infrastructure; these are needed by the infrastructure monitoring scripts. In addition, the scripts contact the BonFIRE API to get the list of experiment events that occurred before the aggregator image booted, since the listener inside the aggregator normally subscribes to the experiment message queue only from boot time onwards.

The URI of the BonFIRE integration API is https://api.integration.bonfire.grid5000.fr

The URI of the BonFIRE production API is https://api.bonfire-project.eu

The infrastructure monitoring scripts get the URI from the contextualisation file at /etc/default/bonfire.

Message Queue Notifications

The scripts subscribe to the experiment message queue to get notified about the experiment events; the subscription delivers events per experiment ID. The infrastructure monitoring scripts consume only notifications about the status of compute resources (VMs), in order to decide whether or not to monitor the physical machines hosting these VMs.

For the infrastructure monitoring software the events should have the following message format:

{
        "type":"compute",
        "status":"created",
        "path":"/locations/uk-epcc/computes/1263",
        "timestamp":1374243305
}

Only events of type compute are considered. If the status is created, the hosting physical machine is considered for monitoring; if the status is destroyed, the hosting physical machine is no longer monitored unless other computes (VMs) belonging to the same experiment are still running on it. Events of other types (e.g. storage) or with different message formats are ignored.

For the service that logs experiment events into Zabbix, all events related to the experiment are logged.

The following message queue information is used:

  • In the BonFIRE integration infrastructure:

    host="mq.integration.bonfire.grid5000.fr:5672",
    userid="eventReader",
    password="reader1348",
    exchange="experiments",
    routing_key=settings['BONFIRE_EXPERIMENT_ROUTING_KEY'] (from the contextualisation file /etc/default/bonfire)
  • In the BonFIRE production infrastructure:

    host="mq.bonfire-project.eu:5672",
    userid="eventReader",
    password="reader1348",
    exchange="experiments",
    routing_key=settings['BONFIRE_EXPERIMENT_ROUTING_KEY'] (from the contextualisation file /etc/default/bonfire)

Infrastructure Monitoring Support at Testbed Level

The infrastructure monitoring service provides partial monitoring information about the underlying infrastructure to BonFIRE users.

A BonFIRE user can get monitoring information about the physical machines hosting his VMs deployed at any site.

Only a small set of metrics is measured and provided to BonFIRE users. As named in Zabbix, these are:

  • Eth0 outgoing traffic
  • Eth0 incoming traffic
  • Running VMs
  • Processor load
  • Free swap space
  • Total memory
  • Free memory
  • Disk sda Write Bytes/sec
  • Disk sda Write: Ops/second
  • Disk sda IO ms time spent performing IO
  • Disk sda IO currently executing
  • Disk sda Read: Milliseconds spent reading
  • Disk sda Read: Ops/second
  • Disk sda Write: Milliseconds spent writing
  • Disk sda Read Bytes/sec
  • Ping to the server (TCP)

In the earlier releases of BonFIRE, the infrastructure monitoring software used a static list of metrics with the same names at all BonFIRE sites providing the infrastructure monitoring service.

To improve this service and allow more flexibility and optimisation, starting from release R4 the software was changed to support a dynamic list of metrics, as offered by each site.

For instance, the Inria site provides 27 metrics related to those listed above, while EPCC provides 18 metrics, and HLRS only the 16 metrics listed above.

What should be done by each site willing to provide the infrastructure monitoring service?

  1. The physical machines (PMs) offered in BonFIRE to host VMs, in both the integration and production infrastructures, should be monitored by a local Zabbix server, referred to in BonFIRE as the site aggregator or infrastructure aggregator. It is in charge of monitoring the PMs, which must run Zabbix agents.
  2. A Zabbix guest user should be created in the site aggregator. It has to be named bonfire with the password bonfire. This user should have read-only permissions.
  3. An extra template including at least the metrics listed above should be created and called Bonfire_Template. To do so in the Zabbix GUI, go to the Configuration tab, then Templates; at the top right of the page you can either create the Bonfire_Template or import it if you have obtained it from the admin of another BonFIRE site that provides the service.
  4. The IP of the site aggregator should be reachable by any BonFIRE VM.
  5. This IP should be published along with the other services that the site offers. Follow the instructions at http://tracker.bonfire-project.eu/issues/379

For instance, the IP (172.18.6.3) of the site aggregator at EPCC is listed as shown below when looking up the EPCC services:

$ curl -kn https://api.bonfire-project.eu//locations/uk-epcc/services
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://api.bonfire-project.eu/doc/schemas/occi" href="/locations/uk-epcc/services">
  <items offset="0" total="1">
  <service href="/locations/uk-epcc/services/2">
    <name>aggregator</name>
    <ip>172.18.6.3</ip>
    <link rel="parent" href="/locations/uk-epcc" type="application/vnd.bonfire+xml"/>
  </service>
  </items>
  <link href="/locations/uk-epcc" rel="parent" type="application/vnd.bonfire+xml"/>
</collection>