Experiment Manager

The Experiment Manager provides a high-level interface to the BonFIRE system that supports the specification of a whole experiment in a single document and controls its subsequent execution. The Experiment Manager interprets the experiment descriptor (in either OVF or the BonFIRE proprietary JSON format) and executes it through multiple calls to the Resource Manager. Both the OVF and JSON experiment descriptor formats are mapped onto an internal common data model that allows the Experiment Manager implementation to be agnostic to the experiment descriptor format.

The Experiment Manager provides a simple RESTful HTTP interface that allows users to create a managed experiment by uploading an experiment descriptor file. The experiment descriptor is parsed and validated immediately, and the user is notified of the success or failure of this stage. The experiment is then deployed in the background through successive calls to the Resource Manager, and the user can check the status by doing an HTTP GET on the managed experiment resource. Through the use of GET, the user can also download the experiment log file, which lists messages on the progress of the experiment. The Experiment Manager keeps track of a managed experiment resource, which has a status and a link to the URL of the experiment on the Resource Manager. The managed experiment can also be deleted from the Experiment Manager; this also deletes the experiment on the Resource Manager.

APIs provided

The Experiment Manager provides a REST API.

GET /managed_experiments

Gets the collection of managed experiments that can be seen by the user.

POST /managed_experiments

Creates a new managed experiment. The experiment details are defined in the payload in either OVF or JSON format. If successful, the call returns the URL of the newly created resource.

GET /managed_experiments/{id}

Gets the details of the specified managed experiment.

DELETE /managed_experiments/{id}

Deletes the specified managed experiment.

GET /managed_experiments/{id}/log

Returns an XML formatted log file for the managed experiment.
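
As an illustration, the following Python sketch walks through a typical client session against this API: create a managed experiment, poll its status, download the log and finally delete it. Only the endpoint paths are taken from the list above; the host name, credentials, content type and response fields are assumptions made for the example.

    # Minimal sketch of a client session; hypothetical host and credentials.
    import time
    import requests

    BASE = "https://api.bonfire.example.org/managed_experiments"  # hypothetical
    AUTH = ("username", "password")                               # assumption

    # Create a managed experiment by uploading a JSON experiment descriptor.
    with open("experiment.json") as f:
        resp = requests.post(BASE, data=f.read(), auth=AUTH,
                             headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    experiment_url = resp.headers["Location"]  # assumed to carry the new URL

    # Poll the managed experiment resource until deployment finishes or fails.
    while True:
        status = requests.get(experiment_url, auth=AUTH).json()["status"]
        if status in ("DEPLOYED", "FAILED"):
            break
        time.sleep(10)

    # Download the XML log, then delete the managed experiment (which also
    # deletes the experiment on the Resource Manager).
    log = requests.get(experiment_url + "/log", auth=AUTH).text
    requests.delete(experiment_url, auth=AUTH)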

APIs used

The Experiment Manager uses the API of the Resource Manager.

The Experiment Manager can be configured to work with or without the reservation API of the Resource Manager.

The Experiment Manager uses Restfully to communicate with the Resource Manager.

Message queue use

The Experiment Manager does not read from any of the message queues.

The Experiment Manager does not write to any of the message queues.

User Agent in outgoing HTTP messages

The Experiment Manager tags outgoing HTTP messages with the bonfire-em identifier, using a complete User-Agent header of bonfire-em/<version>.

Unfortunately these messages are not logged in the usual way at the Resource Manager, because communication between the Experiment Manager and the Resource Manager occurs within the same Tomcat container.

Implementation details

The Experiment Manager consists of: a parser for each type of experiment descriptor; a data model to store the data contained in the experiment descriptors; a queue of experiments awaiting deployment; a scheduler to decide where resources should be deployed; a planner to plan the correct order in which to create the resources; and an orchestrator to enact the resource deployment through calls to the Resource Manager. This architecture is shown in Experiment Manager Internal Architecture.

[Figure: Experiment Manager Internal Architecture (../_images/EM-arch.png)]

The Experiment Manager supports the experiment descriptor in multiple formats. Each experiment descriptor has its own parser, which parses the data into a common data model. The rest of the Experiment Manager code reads from this data model and does not need to know about the experiment descriptor format. After the data model is populated, it is validated to ensure that the data is consistent. The experiment is then added to an internal queue to be processed. At this point an HTTP response is returned to the user detailing any parsing or validation errors, or stating that the experiment descriptor has been accepted.
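
A sketch of the kind of common data model the parsers might populate is shown below. The class and field names are assumptions derived from the features described in this section; the actual Experiment Manager classes are Java and may differ.

    # Hypothetical common data model: each descriptor parser fills these
    # structures, so the rest of the EM never sees the original format.
    from dataclasses import dataclass, field

    @dataclass
    class ComputeTemplate:
        name: str
        location: str
        instance_count: int = 1                         # expanded by the scheduler
        context: dict = field(default_factory=dict)     # contextualization pairs
        depends_on: list = field(default_factory=list)  # explicit ordering

    @dataclass
    class ExperimentModel:
        name: str
        duration_minutes: int
        computes: list = field(default_factory=list)
        storages: list = field(default_factory=list)
        networks: list = field(default_factory=list)

        def validate(self):
            # Consistency check run after parsing, e.g. that every explicit
            # dependency refers to a compute that actually exists.
            names = {c.name for c in self.computes}
            for c in self.computes:
                missing = set(c.depends_on) - names
                if missing:
                    raise ValueError(f"unknown dependencies: {missing}")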

In the background, the scheduler picks up managed experiments that are waiting to be processed and builds up a runtime model in memory from the data in the data model. Each resource template in the data model may be translated into several resource instances in the runtime model. For example, a compute template that specifies that three instances should be created will be translated into three compute instances by the scheduler. In future, the scheduler will include functionality to decide where resources are to be deployed, if the user has not specified this.
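
That expansion step might look like the following sketch; the names and the location identifier are illustrative:

    # One compute template requesting three instances becomes three concrete
    # compute instances in the runtime model.
    def expand(template):
        return [
            {"name": f"{template['name']}-{i}", "location": template["location"]}
            for i in range(template["instance_count"])
        ]

    instances = expand({"name": "server", "location": "uk-epcc",
                        "instance_count": 3})
    # -> server-0, server-1, server-2, all at the same location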

The planner takes the runtime model from the scheduler and outputs a plan that specifies the order in which resources must be created. Internally the planner builds a dependency graph. Some dependencies are explicitly specified by the user, e.g. the user can specify the order in which computes should be created. Others are implicit, e.g. networks and storages must be created before the computes that use them, while certain computes must be created before others if the IP address of one must be known to the others. These dependencies are all captured in the graph, which is then resolved into a list of resources in the order in which they should be deployed.
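
This is a classic topological-sort problem. The sketch below uses Python's standard graphlib to illustrate the idea; the resources and edges are examples, and the real planner lives in the EM's Java code base.

    # Dependency resolution as a topological sort: each resource maps to the
    # set of resources that must be created before it.
    from graphlib import TopologicalSorter

    deps = {
        "network-a": set(),
        "storage-a": set(),
        "compute-1": {"network-a", "storage-a"},  # implicit: uses them
        "compute-2": {"compute-1"},               # needs compute-1's IP address
    }

    plan = list(TopologicalSorter(deps).static_order())
    # e.g. ['network-a', 'storage-a', 'compute-1', 'compute-2']

A cycle in the graph (a CycleError here) would correspond to an unsatisfiable ordering in the experiment descriptor.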

The orchestrator takes this list of resources and instantiates each of them in turn through calls to the Resource Manager. Any errors during deployment result in cancellation of the entire experiment.

Resources are hence created sequentially within a managed experiment. In order to resolve IP address dependencies, the Experiment Manager waits after creating each compute resource until the IP address for that compute resource is known. For OpenNebula sites this wait is minimal, as the OCCI response contains the IP address of the compute resource. However, for other sites such as HP the IP address is not known when the resource is created, and the Experiment Manager polls the resource until the IP address is known. This can lead to very slow deployment times for computes at HP. Note that the Experiment Manager waits for the compute resource's IP address even when no IP dependency requires it. There is room for significant improvement here to speed up the deployment of resources at HP.
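
The polling behaviour might look like the sketch below; the resource URL, response shape and intervals are assumptions:

    # Poll a compute resource until the Resource Manager reports an IP address,
    # as the EM must do for sites whose create response lacks one.
    import time
    import requests

    def wait_for_ip(compute_url, auth, poll_interval=15, timeout=1800):
        deadline = time.time() + timeout
        while time.time() < deadline:
            compute = requests.get(compute_url, auth=auth).json()
            nics = compute.get("nic") or [{}]  # hypothetical response field
            ip = nics[0].get("ip")
            if ip:
                return ip
            time.sleep(poll_interval)
        raise TimeoutError(f"no IP address reported for {compute_url}")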

The code is multi-threaded so that multiple experiments can be deployed at the same time.
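
In other words, parallelism is across experiments, while resources within each experiment are still created strictly in plan order. A minimal sketch of that arrangement, with a stand-in for the Resource Manager call:

    # Deploy several experiments concurrently; each deploy() call walks its
    # own plan sequentially.
    from concurrent.futures import ThreadPoolExecutor

    def create_resource(resource):
        print("creating", resource)  # stand-in for a Resource Manager call

    def deploy(experiment):
        for resource in experiment["plan"]:  # sequential within one experiment
            create_resource(resource)

    experiments = [{"plan": ["network-a", "compute-1"]},
                   {"plan": ["storage-b", "compute-2"]}]

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(deploy, experiments))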

The Experiment Manager contains code to integrate with the BonFIRE reservation system. Whether the Experiment Manager integrates with the reservation system is controlled by a configuration option.

The Experiment Manager is a web app that is deployed into the same Tomcat container as the Resource Manager.

Workflow

If the Experiment Manager is not using the Resource Manager's reservation functionality, then the workflow is as follows (a sketch of the resulting state lifecycle is given after the list):

  • An experiment descriptor (ED) is read in and parsed.
  • A managed experiment is created in the EM’s own database, and resource descriptions are created in the EM database for each resource in the ED.
  • Managed experiment state is READY.
  • A deployment plan is created to ensure that resources are created in the correct order.
  • Managed experiment state is DEPLOYABLE.
  • A runtime experiment is created in the EM’s database, and this is queued for deployment.
  • On start of deployment, managed experiment state is DEPLOYING.
  • When the runtime experiment is deployed, an experiment is created in the Resource Manager (RM) and resources are created for that experiment.
  • If any resource request fails then no further resources are deployed for that managed experiment and the managed experiment’s status is set to FAILED.
  • On success, managed experiment state is DEPLOYED.
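
Taken together with the reservation workflows below, the status values form a small state machine. The following table is an illustration derived from the workflows in this section, not the EM's actual code:

    # Managed experiment lifecycle as described in this section; RESERVED
    # appears only when the reservation system is in use.
    VALID_TRANSITIONS = {
        "READY":      {"DEPLOYABLE"},
        "DEPLOYABLE": {"DEPLOYING", "RESERVED"},
        "RESERVED":   {"DEPLOYING", "FAILED"},
        "DEPLOYING":  {"DEPLOYED", "FAILED"},
    }

    def advance(current, target):
        if target not in VALID_TRANSITIONS.get(current, set()):
            raise ValueError(f"illegal transition: {current} -> {target}")
        return target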

If the Experiment Manager supports the reservation system but the user does not supply a reservation then the Experiment Manager must create a reservation, from which all resources in the experiment are allocated. The user may specify a start time. If this is not supplied, the reservation system determines when the experiment may start. The workflow is as follows:

  • An experiment descriptor (ED) is read in and parsed.
  • A managed experiment is created in the EM’s own database, and resource descriptions are created in the EM database for each resource in the ED.
  • Managed experiment state is READY.
  • A deployment plan is created to ensure that resources are created in the correct order.
  • Managed experiment state is DEPLOYABLE.
  • At this point the EM must build a set of instance requests from the compute instances in the EM’s database.
  • The EM must then request a new reservation instance, passing in the set of instance requests.
    • If the user has specified a start time, then this is passed in the reservation request.
  • After successfully submitting the reservation request, managed experiment state is RESERVED.
  • The reservation id must then be stored in the EM’s database as a property of the managed experiment.
  • A dedicated thread in the EM must then routinely query the EM database for experiments which have a status of RESERVED.
  • If the EM detects that the reservation failed or was terminated, the managed experiment state is set to FAILED and no further deployment of the experiment is attempted.
  • If the reservation is RUNNING, deployment of the managed experiment resources can start.
  • Managed experiment state is DEPLOYING.
  • When the managed experiment is deployed, the EM must find a reserved resource for each compute instance (see the sketch after this list):
    • Look at the compute instance’s location and number of computes.
    • Look up the reservation resources and find a matching instance.
    • Store the reservation ids for the instance in the EM database, linked to the compute instance record.
    • For each compute in the compute instance definition, pick an unused reservation id from the database.
    • Use the reservation id in the compute resource creation request.
    • Flag the reservation id as already used in the EM database.
  • Assuming no errors during deployment, managed experiment state is DEPLOYED.
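
The matching step described above might look like the following sketch; the data structures are illustrative stand-ins for records in the EM database:

    # For each compute in a compute instance definition, pick an unused
    # reserved resource at the right location, use its id in the creation
    # request, and flag it as used.
    def assign_reservation_ids(compute_instance, reservation_resources):
        assigned = []
        for _ in range(compute_instance["count"]):
            candidate = next(
                (r for r in reservation_resources
                 if r["location"] == compute_instance["location"]
                 and not r["used"]),
                None,
            )
            if candidate is None:
                raise LookupError("no unused reserved resource at this location")
            candidate["used"] = True          # flagged as used in the EM database
            assigned.append(candidate["id"])  # passed in the create request
        return assigned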

If the Experiment Manager supports the reservation system and the user supplies an existing reservation, then the workflow is as below. In this case the EM does not create the reservation itself; it is up to the user to check that the reservation resource ids supplied in compute instances are correct and match the contents of the supplied reservation. Also note that compute instances where min>1 cannot easily be supported, as there is no way for the user to specify reservation resource ids for these computes.

  • An experiment descriptor (ED) is read in and parsed.
  • A managed experiment is created in the EM’s own database, and resource descriptions are created in the EM database for each resource in the ED.
  • The supplied reservation id is stored in the EM database for the managed experiment.
  • Managed experiment state is READY.
  • A deployment plan is created to ensure that resources are created in the correct order.
  • Managed experiment state is DEPLOYABLE.
  • Look up the reservation id to make sure it exists and that the user hasn’t mistyped it.
  • Managed experiment state is RESERVED.
  • A dedicated thread in the EM must then routinely query the EM database for experiments which have a status of RESERVED.
  • For each experiment with a RESERVED state, the thread must query the associated reservation instance for its state.
  • If the reservation fails, the managed experiment state is set to FAILED.
  • If the reservation is RUNNING, deployment of the managed experiment resources starts.
  • Managed experiment state is DEPLOYING.
  • For each compute instance, read the reservation resource id from the compute instance record and pass it in the resource creation request.
  • It is assumed that the user has checked that the supplied reservation resource ids are correct and match the contents of the supplied reservation.

Data Model

The Experiment Manager’s data model is shown in Experiment Manager data model.

[Figure: Experiment Manager data model (../_images/EM-dataModel.png)]

The data model supports the following features of an initial experiment deployment:

  • Creation of compute, storage and network resources at specified locations;
  • Creation of multiple instances of a VM specification;
  • References to existing resources using name or URI;
  • Specifying ordering of resource creation;
  • Passing key/value pairs to the VM’s contextualization;
  • Specifying application specific monitoring metrics;
  • Specifying IP address dependencies;
  • Specifying experiment name and duration;
  • Automatic creation of an aggregator from a keyword;
  • Specifying user-defined monitoring metrics;
  • Creation of persistent storages for monitoring experiments;
  • Integration with a reservation system;
  • Creation and deployment of site link resources at AutoBAHN; and
  • Creation and deployment of router resources at FEDERICA

All this functionality is supported by the JSON-based experiment descriptor. The OVF experiment descriptor does not support the AutoBAHN and FEDERICA resources. Users have primarily focused on the JSON-based descriptor and hence priority has been given to that format.

Known issues

The code that interacts with the reservation system is known to have some issues. These are:

  • best_effort_allocation is not supported: passing an empty tag causes serious problems with Restfully, resulting in a reservation which Restfully cannot read or list without throwing errors
  • resource_set is not supported: it is not yet implemented in the reservation API
  • automatic deletion of a reservation created by the EM has not been implemented
    • if an experiment deployment fails, the reservation will still remain
  • deleting a managed experiment when it is already queued with a reservation has not been fully tested
  • needs lots of general robustness testing
  • has only been tested with the reservation manager controlling resources at INRIA and EPCC
  • A full system test should submit an experiment with a future start time, wait for it, then check that everything has been deployed correctly. I am not aware of anything currently in the system test framework that supports this kind of test with a long wait and polling.
  • experiment duration must still be specified in minutes

Additionally, there are some issues with the interaction between Restfully and the reservation API:

  • If there are no reservations in the system, it is not possible to use Restfully to create any new ones.
  • Restfully appears to be unable to handle an empty tag, so best_effort_allocation cannot be supported. Attempting to pass an empty best_effort_allocation tag results in the creation of a reservation which Restfully cannot list or delete.
  • The EM needs to be able to look up reservation details, via a Restfully find, to check the status of reservations for queued experiments. I have noticed that a Restfully listing returns all the reservation ids, regardless of whether the user created them. If there are any reservations which the user did not create, the find fails: it iterates over all the reservations it has found and hits an authorization error on the first reservation that the user did not create.
