EDM Prov

The Experiment Data Manager (EDM) for Provenance (Prov) is a software component developed by IT Innovation to capture provenance data for experimenters and to provide mechanisms for reasoning about this information (in an ontological sense). It is responsible for collecting information for the experimenters about the experiments and resources they create and use in BonFIRE. It also lets experimenters add provenance information about their own experiment components, such as services that they may have running and any results their experiments generate.

See below for more information about the reasons for capturing provenance, the ontologies used and software components.

This documentation is for version 0.9-SNAPSHOT of the EDM Prov, which is provided for demonstration purposes in BonFIRE. It supports provenance of the three main resources in BonFIRE (compute, storage and network) as well as COCOMA. It does, however, depend on the Management Message Queue (MMQ), and is therefore not available to experimenters on production, as this would expose connection details allowing them to access information about other experimenters’ resources.

Please note that, at the time of writing, the code has not been released, so this documentation does not refer to source code.

Reasons for Capturing Provenance

“Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing.” [W3C PROV]

The reasons for capturing provenance for the experimenters are:

  • Shows the experimenters how the results were obtained (validation)

  • Helps determine the cause of the results or failures
    • Manual management and analysis of provenance is infeasible for large scale experimentation, which is likely for Cloud and services applications
  • Privacy; capturing the flow of personal data in the system
    • Who has processed it, etc.
  • Trust and reliability of results
    • If the results cannot be substantiated/defended, then the experiments themselves cannot be trusted
  • Repeatability of experiments
    • This may not always be possible for experiments that involve factors that will change, such as human users interacting with a system.

The usefulness of provenance goes beyond merely logging events in a flat file, as is typical of computational systems. A key aim is to provide experimenters with an intelligent facility that can reason about the data to help them determine the causes of certain outcomes in their experiments.

W3C PROV Ontology

W3C PROV is a recent specification that describes provenance, providing a core data model and ontology that is adopted in the EDM Prov. The structures in W3C PROV are illustrated below in Figure 1, comprising:

  • Entities: What we can describe provenance of, which could be physical, digital, conceptual or other kinds of “thing”. For example, an experiment, a VM, software/service, data, or a user.
  • Activities: Acts upon or with entities over a period of time. Can generate new entities or use existing entities.
  • Agents: Are assigned some responsibility for an activity that takes place on an entity. For example, a compute resource (entity) is created by BonFIRE (agent acting on behalf of the experimenter). Provenance can also be collected for agents when they are modelled as entities.
../_images/w3c_prov_diagram.png

Figure 1: W3C PROV diagram

For more information about W3C PROV, see: http://www.w3.org/2001/sw/wiki/PROV

The W3C PROV ontology can be downloaded from: http://www.w3.org/ns/prov.owl

BonFIRE Ontology

To support provenance in BonFIRE, the W3C PROV ontology has been extended to model domain-specific Entities and Agents. As seen in Figure 2, there are several extensions of Entity. Some are in the W3C PROV ontology, namely Bundle, Collection and Plan. The rest are extensions made for BonFIRE, including Experiment and Resource, the latter of which is further extended for compute, storage and network resources, etc. In Figure 3, we see the extensions made to Agent, where the BonFIRE ontology has added Experimenter, BonfireAgent and CocomaAgent.

The complete BonFIRE provenance ontology can be viewed by downloading bonfire-prov.owl

../_images/bonfire_prov_entity.png

Figure 2: BonFIRE provenance diagram (Entity extensions)


../_images/bonfire_prov_agent.png

Figure 3: BonFIRE provenance diagram (Agent extensions)

Getting Provenance from BonFIRE

For BonFIRE, there are two sources of provenance events: the Management Message Queue (MMQ) and the Experiment Message Queue (EMQ). On the MMQ, we collect provenance data regarding experiments and resources (compute, storage and network). On the EMQ, we collect provenance of COCOMA. The EMQ is not used for the former because its messages do not carry sufficient provenance information. The messages on the MQs must be translated into RDF triples (provenance individuals); one message typically generates several triples. Some examples are given below in the respective sections.

Creating Provenance Individuals

The table below gives an overview of how information from messages on the MQs is mapped to RDF individuals, the OWL classes used and the fields required from the messages. Where an individual or OWL class is prefixed with : (colon), it refers to the BonFIRE Provenance ontology. Where it is prefixed with prov:, it refers to the W3C PROV ontology.


Source          Individual created (example)                           OWL classes                Fields required
EMQ/MMQ         :Experiment/experiments/2764                           :Experiment, prov:Entity   experimentId
EMQ             :Compute/locations/fr-inria/computes/43360             :Compute, prov:Entity      type, path
MMQ             :Compute/locations/es-wellness/computes/321            :Compute, prov:Entity      objectId, objectType
MMQ             :Experimenter_sw                                       :Experimenter, prov:Agent  userId
EMQ/MMQ COCOMA  :Activity_res-mng.experiment.create_1375798876_13f6bc  prov:Activity              timestamp, routingKey
MMQ             :Location_occi-bonfire.wtelecom.es                     prov:Location              host
COCOMA          :CocomaAgent_f1bf6775bf32a45f023c                      :CocomaAgent               source, routingKey
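
The naming patterns visible in the table can be sketched in Java as follows. This is only an illustration read off the example individuals, not the EDM Prov implementation (whose code has not been released): the class IndividualNames and its methods are hypothetical, and deriving the class name by capitalising objectType is an assumption.

```java
// Hypothetical helper: the names below are illustrative, not EDM Prov API.
final class IndividualNames {

    private IndividualNames() {}

    // Experiments, e.g. ":Experiment/experiments/2764".
    static String experiment(String experimentId) {
        return ":Experiment" + experimentId;
    }

    // Resources, e.g. objectType "compute" with objectId
    // "/locations/es-wellness/computes/321". Capitalising the type to get
    // the class name is an assumption based on the table's examples.
    static String resource(String objectType, String objectId) {
        String className = Character.toUpperCase(objectType.charAt(0))
                + objectType.substring(1);
        return ":" + className + objectId;
    }

    // Experimenters, e.g. ":Experimenter_sw" for userId "sw".
    static String experimenter(String userId) {
        return ":Experimenter_" + userId;
    }

    // Locations, e.g. ":Location_occi-bonfire.wtelecom.es" for that host.
    static String location(String host) {
        return ":Location_" + host;
    }
}
```

The Activity and CocomaAgent individuals are left out on purpose, because their trailing identifiers (such as 13f6bc) cannot be derived from the fields listed in the table alone.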

Properties are created by reading eventType (MMQ) or status (EMQ/COCOMA). If they require a time, this is taken from timestamp. Properties currently created are:

prov:Activity   prov:startedAtTime      xsd:dateTime
prov:Activity   prov:endedAtTime        xsd:dateTime
prov:Entity     prov:wasGeneratedBy     prov:Activity
prov:Entity     prov:generatedAtTime    xsd:dateTime
prov:Entity     prov:wasInvalidatedBy   prov:Activity
prov:Entity     prov:invalidatedAtTime  xsd:dateTime
prov:Entity     prov:wasInfluencedBy    prov:Activity
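
As a rough sketch of how an event could be turned into such properties, the Java below maps an event type to property triples and converts the timestamp into a dateTime string. The mapping itself (create to generation, delete to invalidation, anything else to influence) is an assumption made for illustration, as is the interpretation of the timestamp as seconds since the Unix epoch; the class EventProperties is hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the event-to-property mapping is an assumption.
final class EventProperties {

    private EventProperties() {}

    // Assumes the message timestamp is in seconds since the Unix epoch.
    static String toDateTime(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).toString();
    }

    // Returns {subject, predicate, object} triples for one event.
    static List<String[]> forEvent(String eventType, String entity,
                                   String activity, long timestamp) {
        List<String[]> triples = new ArrayList<String[]>();
        String time = toDateTime(timestamp);
        if ("create".equals(eventType)) {
            triples.add(new String[] {entity, "prov:wasGeneratedBy", activity});
            triples.add(new String[] {entity, "prov:generatedAtTime", time});
        } else if ("delete".equals(eventType)) {
            triples.add(new String[] {entity, "prov:wasInvalidatedBy", activity});
            triples.add(new String[] {entity, "prov:invalidatedAtTime", time});
        } else {
            // Assumed fallback, e.g. for update events.
            triples.add(new String[] {entity, "prov:wasInfluencedBy", activity});
        }
        return triples;
    }
}
```

Under these assumptions, a timestamp of 1375800882 becomes 2013-08-06T14:54:42Z.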

MMQ

For MMQ messages, first all the provenance elements are identified as described in the table above. Then, the relationships between individuals are identified and modelled as properties.

This is an example message from the MMQ:

{
    "timestamp":1375800882,
    "eventType":"create",
    "objectType":"compute",
    "objectId":"/locations/es-wellness/computes/321",
    "source":"res-mng",
    "groupId":"sw",
    "userId":"sw",
    "objectData":
    {
        "name":"COCOMA",
        "cpu":"2",
        "vcpu":null,
        "memory":"2048",
        "cluster":null,
        "host":null,
        "osimage":"/locations/es-wellness/storages/vappTemplate-b4fa90c3-4c65-479a-b19d-31e903df1bfa",
        "disks":[
        {
            "id":"0",
            "storage":"/locations/es-wellness/storages/vappTemplate-b4fa90c3-4c65-479a-b19d-31e903df1bfa",
            "type":"OS",
            "target":"",
            "size":"2097"
        }],
        "networks":[],
        "experimentId":"/experiments/2764"
    }
}

Messages for certain types of events (UPDATE and DELETE) don’t contain an experimentId. Therefore, some management of the messages is required to determine which experiment a message is for. Provided that the EDM Prov has captured a CREATE message for the resources created in an experiment, their objectId can be used to match messages that do not have the experiment ID. If the objectId matches an entity of a given experiment, we infer that the message is for that experiment.
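
The matching described above can be sketched as a simple lookup table from objectId to experimentId. This is a minimal illustration, not the actual EDM Prov code; the class ExperimentResolver and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of matching UPDATE/DELETE messages to experiments.
final class ExperimentResolver {

    // objectId -> experimentId, filled from CREATE messages.
    private final Map<String, String> experimentByObjectId =
            new HashMap<String, String>();

    // Remember which experiment a resource belongs to (CREATE message).
    void recordCreate(String objectId, String experimentId) {
        experimentByObjectId.put(objectId, experimentId);
    }

    // Resolve the experiment for a message. If the message carries no
    // experimentId, fall back to the one recorded for its objectId.
    // Returns null when no CREATE message was seen for the object.
    String resolve(String objectId, String experimentIdOrNull) {
        if (experimentIdOrNull != null) {
            return experimentIdOrNull;
        }
        return experimentByObjectId.get(objectId);
    }
}
```

A null result corresponds to the case where no CREATE message was captured, in which case the message cannot be attributed to an experiment.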

EMQ

EMQ messages don’t have an experimentId at all. Because one always subscribes to a routing key unique to an experiment, all the messages belong to that experiment only.

This is a regular EMQ message generated in BonFIRE about the status of a resource:

{"type":"compute","status":"created","path":"/locations/fr-inria/computes/43360","timestamp":1374593584}
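
Extracting the fields from a flat message like the one above can be sketched as follows. To stay free of external dependencies, this sketch uses regular expressions, which is only adequate for simple, flat messages; real code would more likely use a JSON parser. The class EmqEvent is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for a regular (non-COCOMA) EMQ message.
final class EmqEvent {
    final String type;
    final String status;
    final String path;
    final long timestamp;

    private EmqEvent(String type, String status, String path, long timestamp) {
        this.type = type;
        this.status = status;
        this.path = path;
        this.timestamp = timestamp;
    }

    // Regex-based extraction: for demonstration on flat messages only.
    static EmqEvent parse(String json) {
        return new EmqEvent(
                stringField(json, "type"),
                stringField(json, "status"),
                stringField(json, "path"),
                Long.parseLong(match(json, "\"timestamp\"\\s*:\\s*(\\d+)")));
    }

    private static String stringField(String json, String name) {
        return match(json,
                "\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"");
    }

    private static String match(String json, String regex) {
        Matcher m = Pattern.compile(regex).matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("No match for: " + regex);
        }
        return m.group(1);
    }
}
```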

Apart from resource (experiment, compute, storage and network) related messages, there are also COCOMA messages on the EMQ. They can be treated similarly to normal BonFIRE messages but their agent is :CocomaAgent. This helps the experimenter afterwards to find COCOMA influences in the lifecycle of an entity or experiment.

A COCOMA message looks like this:

{"timestamp": 1375800981.019438, "message": {"Action": "Emulation Created", "EmulationName": "MEM_EMU"}, "source": "Emulation Manager"}

Overview of Software Components

The EDM Prov software is developed in Java 1.6, and comprises several components making up the core framework: EDM Prov Manager, EDM Prov Store and EDM Prov MQ. Additionally, there is a BonFIRE Crawler component for collecting provenance information in BonFIRE, relying on the EMQ and MMQ (as discussed above) for provenance information pertaining to experiments and resources (compute, storage and network), as well as BonFIRE components like COCOMA. A high-level diagram of the EDM Prov components is given below in Figure 4. The EDM Prov components are coloured in blue and the BonFIRE components in gray; the components in orange are those that experiments may use to provide provenance data to the EDM Prov via the EDM Prov MQ.

../_images/edm-prov-bonfire-arch-high-level.png

Figure 4: EDM Prov components

EDM Prov Manager

This component has a central and controlling role in the EDM. It is responsible for processing provenance data, dealing with any ontologies that may be used to describe provenance data, and reasoning/storage strategies.

All provenance data is processed asynchronously via an EDM Prov Message Queue (see below) in the form of JSON formatted RDF triples. This allows a generic way in which provenance data can be added, which is important for two particular reasons:

  1. to support any provenance data and ontologies unknown to the system, as experimenters may use different domain-specific ontologies to describe their experiments and data, and
  2. to offer a uniform mechanism for client tools written in different programming languages, running on different operating systems, to connect through to generate provenance data.

The manager does expose the means to add triples via an API call when closely integrated within Java applications. However, internally, these triples are then forwarded to the internal EDM Prov MQ, so as to avoid any blocking calls to the EDM.

The manager also manages ontologies, which can be added at runtime for an experiment. Effectively, in the core EDM Prov, PROV-O is the core ontology and we add the BonFIRE Provenance ontology to it at runtime, in the same way that experimenters may add ontologies specific to their own experiments at runtime. As we’ll return to below, the additional ontologies can also contain reasoning rules.

The EDM Prov Manager also controls the reasoning strategy used. Reasoning serves two important functions when dealing with the provenance data:

  1. It helps ensure that the ontology is complete and consistent, adding missing relationships where necessary.
  2. It can execute reasoning rules that can, for example, help establish what has influenced the experiment results.

If provenance data were merely persisted without reasoning, its consistency would not be guaranteed, because most free graph or triple stores do not enforce it. One may end up with a fragmented graph, for example, which may be a significant issue when trying to query the data. The reasoning rules are incorporated in the ontologies that experimenters may use. In the BonFIRE provenance ontology, we include one SWRL rule that establishes that VMs running on the same physical host may have influenced each other.

Entity(?x), Entity(?y), Location(?l), atLocation(?x, ?l), atLocation(?y, ?l) -> influenced(?x, ?y), influenced(?y, ?x)
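
To make the effect of this rule concrete, the sketch below derives the same kind of influenced pairs in plain Java, without a reasoner. It is an illustration only: the class CoLocationRule is hypothetical, each entity is assumed to have exactly one location, and pairs of an entity with itself are skipped, although the rule as written would also match them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java illustration of the co-location rule.
final class CoLocationRule {

    private CoLocationRule() {}

    // atLocation maps an entity name to its location name. The result
    // holds ordered {x, y} pairs, so both influenced(x, y) and
    // influenced(y, x) appear, as in the rule's consequent.
    static List<String[]> influencedPairs(Map<String, String> atLocation) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (Map.Entry<String, String> x : atLocation.entrySet()) {
            for (Map.Entry<String, String> y : atLocation.entrySet()) {
                boolean sameEntity = x.getKey().equals(y.getKey());
                boolean sameLocation = x.getValue().equals(y.getValue());
                if (!sameEntity && sameLocation) {
                    pairs.add(new String[] {x.getKey(), y.getKey()});
                }
            }
        }
        return pairs;
    }
}
```

For two VMs on one host and a third on another, only the first two are reported as having possibly influenced each other.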

EDM Prov Store

This component is responsible for persisting the provenance data via the EDM Prov Manager. The store implements a generic interface, allowing different storage implementations. The current version of the EDM Prov supports neo4j, which is an open source NoSQL graph database. The EDM Prov has been tested with neo4j version 1.9.2, which can be downloaded from:

http://www.neo4j.org/

To install neo4j, please follow the instructions on their website.

To configure the EDM Prov for your neo4j installation, please modify the config file in the EDM Prov Service, as detailed further below.

EDM Prov MQ

As depicted in Figure 4 and briefly discussed above, provenance data is consumed via an EDM Prov Message Queue (MQ). This allows a generic message exchange with a very popular message broker, RabbitMQ, which supports a long list of development technologies.

The common JSON format has been adopted for describing provenance data on the EDM Prov MQ, each message describing an RDF triple and the ID of the experiment the provenance data is for. For example, the following triple has been derived from a message about an experiment with ID /experiments/297 having been created, specifying that the individual :Experiment/experiments/297 (in the ontology) is of type :Experiment.

{
    "experimentId":"/experiments/297",
    "subject":":Experiment/experiments/297",
    "predicate":"rdf:type",
    "object":":Experiment"
}
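
A client that wants to publish its own provenance data needs to produce messages of this shape. The sketch below builds such a message as a strict JSON string (keys quoted) using only the standard library; publishing it to RabbitMQ is left out. The class TripleMessage is hypothetical, and the hand-written escaping covers only backslashes and double quotes, so a JSON library would be preferable in real code.

```java
// Hypothetical sketch: serialises one triple into the message shape above.
final class TripleMessage {

    private TripleMessage() {}

    // Minimal escaping (backslash and double quote only); control
    // characters are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String toJson(String experimentId, String subject,
                         String predicate, String object) {
        return "{"
                + "\"experimentId\":\"" + escape(experimentId) + "\","
                + "\"subject\":\"" + escape(subject) + "\","
                + "\"predicate\":\"" + escape(predicate) + "\","
                + "\"object\":\"" + escape(object) + "\""
                + "}";
    }
}
```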

The EDM Prov has been tested with RabbitMQ version 2.8.4. See further below for configuration options for setting up the EDM Prov MQ.

BonFIRE Crawler

This component is responsible for retrieving provenance information from BonFIRE about experiments and resources that are created, used, deleted, etc., in an experiment (from the MMQ). It is also responsible for retrieving provenance information from COCOMA on the EMQ. The BonFIRE Crawler is written in Java, like the EDM Prov, and converts the messages from the EMQ and MMQ into correctly formatted triples that are published to the EDM Prov MQ. See the sections below for further information about the API and configuration options.

EDM Prov Deployment in BonFIRE

There are two web services for the software components introduced above:

  • EDM Prov Service
  • BonFIRE Crawler

The first encapsulates the EDM Prov Manager, EDM Prov Store and the EDM Prov MQ. It is implemented as a SOAP service in Java, deployable on a service container such as Apache Tomcat. Similarly, the BonFIRE Crawler is available as a SOAP service. Both have been developed and tested with Apache Tomcat version 7. To download and install, please see the instructions on the Apache website:

http://tomcat.apache.org/download-70.cgi

These web services are deployed in BonFIRE in an EDM Prov VM. They can be interacted with via web service clients, which can either be generated automatically from the respective service WSDL files or taken from the Java clients provided with the EDM Prov and BonFIRE Crawler software.

When the EDM Prov VM is deployed, it will automatically contextualise and initialise according to the experiment it is deployed within. That is, it will read the experiment ID and routing key to subscribe to the EMQ and will use the experiment ID to filter the messages of the MMQ. If the same EDM Prov VM is used to monitor multiple experiments, it is possible to make a call to the BonFIRE Crawler to add listeners for any given experiment. The respective APIs are described below.

Interfaces (API)

The interfaces for each of the software components are described below. There are two jar level components, and two SOAP services. Please note that the EDM Prov Service has a more limited interface than the EDM Prov Manager and Store with respect to querying data.

EDM Prov Manager (jar)

/**
 * Adds a triple to the internal ontology. In most cases, this method will
 * be called repeatedly until all the triples have been added.
 *
 * @param triple A simple object containing subject, property and object of
 * a RDF triple.
 * @param experimentID The ID of the experiment the triple is for.
 * @return Indicates whether the method was successful or not. This is
 * necessary to prevent a broken ontology because of missing triples and
 * offers the possibility to emulate transaction functionality.
 * @throws Exception For any errors with adding the triple.
 */
boolean addTriple(Triple triple, String experimentID) throws Exception;

/**
 * Adds a list of triples to the internal ontology. In most cases, this
 * method will be called repeatedly until all the triples have been added.
 *
 * @param triples A list of simple triple objects containing subject,
 * property and object of a RDF triple.
 * @param experimentID The ID of the experiment the triple is for.
 * @return Indicates whether the method was successful or not. This is
 * necessary to prevent a broken ontology because of missing triples and
 * offers the possibility to emulate transaction functionality.
 * @throws Exception For any errors with adding the triples.
 */
boolean addTriples(List<Triple> triples, String experimentID) throws Exception;

/**
 * Adds details of an ontology that is used, including its URL, name and
 * prefix used.
 *
 * @param ontologyDetails Object with the details of the ontology.
 * @param experimentID The ID of the experiment the ontology is for.
 * @throws Exception For any errors with adding the ontology.
 */
void addOntology(OntologyDetails ontologyDetails, String experimentID) throws Exception;

/**
 * Trigger reasoning.
 *
 * @param experimentID The ID of the experiment to reason over.
 * @throws UnsupportedOperationException Thrown if the reasoning strategy
 * used does not support ad-hoc reasoning to be triggered.
 * @throws Exception Thrown for any errors when reasoning.
 */
void doReasoning(String experimentID) throws UnsupportedOperationException, Exception;

/**
 * Get an instance of the provenance store to interact directly with the
 * store.
 *
 * @return IProvenanceStore instance.
 */
IProvenanceStore getProvenanceStore();

/**
 * Get a Querier for provenance information.
 *
 * @return Querier instance.
 */
IProvenanceQuerier getProvenanceQuerier();

/**
 * Stops the ProvenanceManager listening to the EDM Prov MQ.
 */
void stop();

/**
 * Adds an ontology listener.
 *
 * @param ontologyListener OntologyListener interface.
 */
void addOntologyListener(IOntologyListener ontologyListener);
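
The boolean return value of addTriple can be used to emulate a simple transaction on the caller's side, as the Javadoc above suggests. The sketch below shows one way to do this. Since the EDM Prov code has not been released, Triple and TripleSink are hypothetical stand-ins: only the documented addTriple signature is reproduced, and the fields of Triple are assumed.

```java
import java.util.List;

// Hypothetical sketch; Triple and TripleSink are stand-ins, not EDM Prov types.
final class TransactionSketch {

    private TransactionSketch() {}

    // Assumed shape of the "simple object" holding an RDF triple.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Mirrors only the documented addTriple signature.
    interface TripleSink {
        boolean addTriple(Triple triple, String experimentID) throws Exception;
    }

    // Adds triples in order and stops at the first failure (a false
    // return value or an exception). Returns how many were added.
    static int addAll(TripleSink sink, List<Triple> triples, String experimentID) {
        int added = 0;
        for (Triple triple : triples) {
            boolean ok;
            try {
                ok = sink.addTriple(triple, experimentID);
            } catch (Exception e) {
                ok = false;
            }
            if (!ok) {
                break;
            }
            added++;
        }
        return added;
    }
}
```

A caller can compare the returned count with the size of the list to decide whether the remaining triples need to be retried.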

EDM Prov Store (jar)

/**
 * Store an ontology (owl, rdf, turtle,...) as per the details (URL, etc)
 * given in the OntologyDetails object. The ontology URL could point to a
 * file on disk or to an online resource.
 *
 * @param ontology Object with the details of the ontology.
 * @throws Exception In case of any errors when storing the ontology.
 */
void storeOntology(OntologyDetails ontology) throws Exception;

/**
 * Imports an ontology from in-memory object and persists in the provenance
 * store.
 *
 * @param ontology In-memory ontology object.
 * @throws Exception In case of loading fails, an exception is thrown.
 */
void storeOntology(OWLOntology ontology) throws Exception;

/**
 * Store a triple.
 *
 * @param triple A triple.
 * @throws Exception In case of any errors when storing.
 */
void storeTriple(Triple triple) throws Exception;

/**
 * Store a set of triples.
 *
 * @param triples A list of triples.
 * @throws Exception In case of any errors when storing.
 */
void storeTriples(List<Triple> triples) throws Exception;

/**
 * Get a Querier for provenance information.
 *
 * @return Querier instance.
 */
IProvenanceQuerier getProvenanceQuerier();

/**
 * Exports the contents of the database into the chosen format at the given
 * location.
 *
 * @param format The format in which the ontology shall be stored.
 * @param destination The destination of the exported data.
 * @throws Exception In case of any errors when exporting the data.
 */
void export(DataExportType format, File destination) throws Exception;

/**
 * Clears all the data in the store.
 *
 * @throws Exception In case of any errors when clearing the store.
 */
void clearStore() throws Exception;

/**
 * Shuts down the store properly closing the connection first.
 */
void shutdown();

EDM Prov Service (SOAP API)

Responsible for ontology management, processing, reasoning and persisting provenance data.

/**
 * Adds a triple to the internal ontology. In most cases, this method will
 * be called repeatedly until all the triples have been added.
 *
 * @param triple A simple object containing subject, property and object of
 * a RDF triple.
 * @param experimentID The ID of the experiment the triple is for.
 * @return Indicates whether the method was successful or not. This is
 * necessary to prevent a broken ontology because of missing triples and
 * offers the possibility to emulate transaction functionality.
 * @throws Exception For any errors with adding the triple.
 */
boolean addTriple(Triple triple, String experimentID) throws Exception;

/**
 * Adds a list of triples to the internal ontology. In most cases, this
 * method will be called repeatedly until all the triples have been added.
 *
 * @param triples A list of simple triple objects containing subject,
 * property and object of a RDF triple.
 * @param experimentID The ID of the experiment the triple is for.
 * @return Indicates whether the method was successful or not. This is
 * necessary to prevent a broken ontology because of missing triples and
 * offers the possibility to emulate transaction functionality.
 * @throws Exception For any errors with adding the triples.
 */
boolean addTriples(List<Triple> triples, String experimentID) throws Exception;

/**
 * Adds details of an ontology that is used, including its URL, name and
 * prefix used.
 *
 * @param ontologyDetails Object with the details of the ontology.
 * @param experimentID The ID of the experiment the ontology is for.
 * @throws Exception For any errors with adding the ontology.
 */
void addOntology(OntologyDetails ontologyDetails, String experimentID) throws Exception;

/**
 * Trigger reasoning for a given experiment.
 *
 * @param experimentID The ID of the experiment to reason over.
 * @throws UnsupportedOperationException Thrown if the reasoning strategy
 * used does not support ad-hoc reasoning to be triggered.
 * @throws Exception Thrown for any errors when reasoning.
 */
void doReasoning(String experimentID) throws UnsupportedOperationException, Exception;

BonFIRE Crawler Service (SOAP API)

Responsible for collecting information from BonFIRE and converting this into provenance data for the EDM Prov Service.

/**
 * Adds an EMQ listener for a specific experiment
 *
 * @param experimentID The ID of the experiment
 * @param experimentRoutingKey the experiment's routing key
 */
void addEMQListener(String experimentID, String experimentRoutingKey);

/**
 * Adds a MMQ listener which filters the MMQ for a specific experiment.
 *
 * @param experimentID The ID of the experiment. If it is null, events of
 * all experiments will be caught.
 */
void addMMQListener(String experimentID);

/**
 * Stops the listener to the EMQ that is assumed to have been created for the given
 * experiment ID already.
 * @param experimentID The ID of the experiment.
 */
void stopEMQListner(String experimentID);

/**
 * Stops the listener to the MMQ that is assumed to have been created for the given
 * experiment ID already.
 * @param experimentID The ID of the experiment.
 */
void stopMMQListner(String experimentID);

/**
 * Reads a log file of EMQ events to be turned into provenance triples
 * pushed on the EDM Prov MQ.
 *
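 * @param filePath The path to the log file of events to be read.
 * @param experimentId The ID of the experiment.
 * @param experimentRoutingKey The experiment's routing key.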
 */
void readEMQEventsFromFile(String filePath, String experimentId, String experimentRoutingKey);

/**
 * Reads a log file of MMQ events to be turned into provenance triples
 * pushed on the EDM Prov MQ.
 *
 * @param filePath The path to the log file of events to be read.
 * @param experimentID The ID of the experiment.
 */
void readMMQEventsFromFile(String filePath, String experimentID);

Configuration

The EDM Prov Service and BonFIRE Crawler Service can be configured via properties files in the WEB-INF/classes folders of each respective web service when deployed in Tomcat.

EDM Prov (Manager, Store and MQ)

$TOMCAT_HOME$/webapps/edmProvService-0.9-SNAPSHOT/WEB-INF/classes/edmProvService.properties

Important configuration parameters to note are:

edmprov.store.dbPath: the absolute path to the data directory in your neo4j installation directory, e.g., C:/software/neo4j-community-1.9.2/data.

edmprov.manager.reasoningStrategy: this will determine a pre-configured (and currently non-changeable) reasoning strategy to be used. The options are:

NONE: no reasoning will take place, and no ontology will be populated in memory.

AD_HOC: reasoning will take place if the doReasoning(..) method on the Manager is called. An ontology is populated in memory continuously.

# EDM Prov manager & store config #
# ------------------------------- #
edmprov.store.dbPath = path/to/neo4j/data
edmprov.store.dbname = prov.db
edmprov.store.cleanOnStartup = false
# Options: NONE, AD_HOC
edmprov.manager.reasoningStrategy = AD_HOC
edmprov.manager.logTriples = true
edmprov.manager.logDir = edmLogs

# Ontology #
# -------- #
edmprov.ont.provURL = http://www.w3.org/ns/prov
# the BonFIRE provenance ontology
edmprov.ont.bfProvURL = ../resources/bonfire-prov.owl
edmprov.ont.bfProvPrefix = http://www.semanticweb.org/sw/ontologies/2013/4/bonfire-prov
# options for dumping the ontology when reasoning & storing it (only works with the AD_HOC reasoning strategy)
edmprov.ont.dumpOntology = true
edmprov.ont.ontologyDumpDir = ontologies

# Internal RabbitMQ config #
# ------------------------ #
edmprov.mq.exchange = edmprov
edmprov.mq.queue = edmprov
edmprov.mq.virtualHost = /
edmprov.mq.user = guest
edmprov.mq.password = guest
edmprov.mq.host = 127.0.0.1
edmprov.mq.port = 5672
edmprov.mq.routingKey = #
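
A properties file in this format can be read with java.util.Properties. The sketch below parses a few of the parameters above and validates the reasoning strategy against the two documented options. The class EdmProvConfig is hypothetical and is not how the EDM Prov Service itself loads its configuration; treating every parameter as required is also an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of reading a subset of the configuration.
final class EdmProvConfig {

    // The two documented options for edmprov.manager.reasoningStrategy.
    enum ReasoningStrategy { NONE, AD_HOC }

    final String dbPath;
    final ReasoningStrategy reasoningStrategy;
    final String mqHost;
    final int mqPort;
    final String mqRoutingKey;

    private EdmProvConfig(Properties p) {
        dbPath = required(p, "edmprov.store.dbPath");
        reasoningStrategy = ReasoningStrategy.valueOf(
                required(p, "edmprov.manager.reasoningStrategy"));
        mqHost = required(p, "edmprov.mq.host");
        mqPort = Integer.parseInt(required(p, "edmprov.mq.port"));
        mqRoutingKey = required(p, "edmprov.mq.routingKey");
    }

    // Parses the text of a properties file.
    static EdmProvConfig parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new EdmProvConfig(p);
    }

    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }
}
```

Note that a # only starts a comment at the beginning of a line, so the routing key value # is read as an ordinary value.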

BonFIRE Crawler

$TOMCAT_HOME$/webapps/bonFIRECrawlerService-0.9-SNAPSHOT/WEB-INF/classes/bfCrawler.properties

# Internal RabbitMQ config #
# ------------------------ #
edmprov.mq.exchange = edmprov
edmprov.mq.queue = edmprov
edmprov.mq.virtualHost = /
edmprov.mq.user = guest
edmprov.mq.password = guest
edmprov.mq.host = 127.0.0.1
edmprov.mq.port = 5672
edmprov.mq.routingKey = #

# External RabbitMQ config #
# ------------------------ #
edmprov.mmq.vhost = bonfire
edmprov.mmq.exchange = resourceUsage
edmprov.mmq.user = provenance
edmprov.mmq.password = provenance3367
edmprov.mmq.host = mq.integration.bonfire.grid5000.fr
edmprov.mmq.port = 5672
edmprov.mmq.routingKey = #

edmprov.emq.vhost = bonfire
edmprov.emq.exchange = experiments
edmprov.emq.user = eventReader
edmprov.emq.password = reader1348
edmprov.emq.host = mq.integration.bonfire.grid5000.fr
edmprov.emq.port = 5672

# Logfiles for testing #
# -------------------- #
edmprov.test.logMQs = true
edmprov.test.outputLogDir = mqLogs