..
|Copyright (c) 2019 Atos Spain SA. All rights reserved.
|
|This file is part of Croupier.
|
|Croupier is free software: you can redistribute it and/or modify it
|under the terms of the Apache License, Version 2.0 (the License).
|
|THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT ANY WARRANTY OF ANY KIND, EXPRESS OR
|IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
|FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
|DEALINGS IN THE SOFTWARE.
|
|See README file for full disclaimer information and LICENSE file for full
|license information in the project root.
|
|@author: Javier Carnero
| Atos Research & Innovation, Atos Spain S.A.
| e-mail: javier.carnero@atos.net
|
|plugin.rst
========================
Croupier Cloudify plugin
========================
.. _requirements:
Requirements
-------------------

- Python version: 2.7.x
.. _compatibility:
Compatibility
-------------
- Slurm based HPC, accessed via SSH with user & key/password.
- Moab/Torque based HPC, accessed via SSH with user & key/password.
- Tested with the Cloudify Openstack plugin.

**Tip**

Example blueprints can be found in the Croupier resources repository.
.. _configuration:
Configuration
------------------------
The Croupier plugin requires credentials, endpoint and other setup
information in order to authenticate and interact with the computing
infrastructures. This configuration is defined in the
*credentials* and *config* properties.
.. _credentials:
.. code:: yaml

   credentials:
     host: "[HPC-HOST]"
     user: "[HPC-SSH-USER]"
     private_key: |
       -----BEGIN RSA PRIVATE KEY-----
       ......
       -----END RSA PRIVATE KEY-----
     private_key_password: "[PRIVATE-KEY-PASSWORD]"
     password: "[HPC-SSH-PASS]"
     login_shell: {true|false}
     tunnel:
       host: ...
       ...
1. HPC and SSH credentials. At least ``private_key`` or ``password``
   must be provided.

   a. *tunnel*: Follows the same structure as its parent (*credentials*),
      to connect to the infrastructure through a tunneled SSH connection,
      as sketched below.
   b. *login_shell*: Some systems require connecting through a login
      shell. Default ``false``.
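A minimal sketch of such a tunneled connection; the host and user
values are placeholders, not real endpoints:

.. code:: yaml

   credentials:
     host: "[HPC-HOST]"             # final destination, reached through the tunnel
     user: "[HPC-SSH-USER]"
     password: "[HPC-SSH-PASS]"
     login_shell: true              # this system requires a login shell
     tunnel:
       host: "[TUNNEL-HOST]"        # intermediate host with direct SSH access
       user: "[TUNNEL-SSH-USER]"
       password: "[TUNNEL-SSH-PASS]"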
.. _config:
.. code:: yaml

   config:
     country_tz: "Europe/Madrid"
     infrastructure_interface: {SLURM|TORQUE|SHELL}
1. *country_tz*: Country time zone configured in the HPC.
2. *infrastructure_interface*: Infrastructure interface used by the HPC.
**Warning**

Only Slurm and Torque are currently accepted as infrastructure
interfaces for HPC. For cloud providers, SHELL is used as the
interface, as in the sketch below.
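As an illustration, a cloud VM accessed through a plain shell over SSH
would be configured as follows (a sketch; the time zone is an example
value):

.. code:: yaml

   config:
     country_tz: "Europe/Madrid"
     infrastructure_interface: "SHELL"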
.. _types:
Types
-----
This section describes the node type definitions. Nodes describe
resources in your HPC infrastructures. For more information, see the
node type reference in the Cloudify documentation.
.. _croupier_nodes_interface:
croupier.nodes.InfrastructureInterface
--------------------------------------
**Derived From:**
``cloudify.nodes.Compute``

Use this type to describe the interface of a computing infrastructure
(HPC or VM).
**Properties:**
- ``config``: type of interface and system time zone, as described in config_.
- ``credentials``: Access credentials, as described in credentials_.
- ``base_dir``: Root directory under which the working directory will
  be created. Default ``$HOME``.
- ``workdir_prefix``: Prefix name of the working directory that will be
created by this interface.
- ``job_prefix``: Job name prefix for the jobs created by this
interface. Default ``cfyhpc``.
- ``monitor_period``: Seconds between job status checks. This is
  necessary because infrastructure interfaces can be overloaded if
  queried too many times in a short period. Default ``60``.
- ``skip_cleanup``: Set to ``True`` to skip cleaning up all files when
  destroying the deployment. Default ``False``.
- ``simulate``: If ``True``, performs a dry run in which jobs are not
  really executed and are simulated to finish immediately. Useful for
  testing. Default ``False``.
- ``external_monitor_entrypoint``: Entrypoint of the external monitor
  that Cloudify will use instead of the internal one. See the sketch
  after the example below.
- ``external_monitor_type``: Type of the monitoring system when using
  an external one. Default ``PROMETHEUS``.
- ``external_monitor_port``: Port of the monitor when using an external
  monitoring system. Default ``:9090``.
- ``external_monitor_orchestrator_port``: Port of the external monitor
  to connect with Croupier. Default ``:8079``.
**Example**
This example demonstrates how to describe a SLURM interface on an HPC.
.. code:: yaml

   hpc_interface:
     type: croupier.nodes.InfrastructureInterface
     properties:
       credentials:
         host: "[HPC-HOST]"
         user: "[HPC-SSH-USER]"
         password: "[HPC-SSH-PASS]"
       config:
         country_tz: "Europe/Madrid"
         infrastructure_interface: "SLURM"
       job_prefix: crp
       workdir_prefix: test
   ...
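Where an external monitoring system is deployed, the same node type can
be pointed at it through the ``external_monitor_*`` properties. A
minimal sketch, assuming a hypothetical Prometheus endpoint
(``monitor.example.com``) and the default ports listed above:

.. code:: yaml

   hpc_interface:
     type: croupier.nodes.InfrastructureInterface
     properties:
       credentials:
         host: "[HPC-HOST]"
         user: "[HPC-SSH-USER]"
         password: "[HPC-SSH-PASS]"
       config:
         country_tz: "Europe/Madrid"
         infrastructure_interface: "SLURM"
       external_monitor_entrypoint: "monitor.example.com" # hypothetical endpoint
       external_monitor_type: "PROMETHEUS"
       external_monitor_port: ":9090"
       external_monitor_orchestrator_port: ":8079"
   ...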
**Mapped Operations:**
- ``cloudify.interfaces.lifecycle.configure`` Checks that there is a
connection between Cloudify and the infrastructure interface,
and creates a new working directory.
- ``cloudify.interfaces.lifecycle.delete`` Cleans up all data generated
  by the execution.
- ``cloudify.interfaces.monitoring.start`` If the external monitor
orchestrator is available, sends a notification to start monitoring
the infrastructure.
- ``cloudify.interfaces.monitoring.stop`` If the external monitor
orchestrator is available, sends a notification to end monitoring the
infrastructure.
.. _croupier_nodes_job:
croupier.nodes.Job
------------------
Use this type to describe a job
(a task that will execute on the infrastructure).
**Properties:**
- ``job_options``: Job parameters and needed resources.
- ``pre``: List of commands to be executed before running the job.
Optional.
- ``post``: List of commands to be executed after running the job.
Optional.
- ``partition``: Partition in which the job will be executed. If not
provided, the HPC default will be used.
- ``commands``: List of commands to be executed. Mandatory if `script`
property is not present.
- ``script``: Script to be executed. Mandatory if `commands`
property is not present.
- ``arguments``: List of arguments to be passed to the execution
  command. Variables must be escaped, e.g. ``"\\$USER"``.
- ``nodes``: Number of nodes to use in the job. Default ``1``.
- ``tasks``: Number of tasks of the job. Default ``1``.
- ``tasks_per_node``: Number of tasks per node. Default ``1``.
- ``max_time``: Set a limit on the total run time of the job
allocation. Mandatory if no script is provided, or if the script does
not define such property.
- ``scale``: Execute in parallel the job N times according to this
property. Only for HPC. Default ``1`` (no scale).
- ``scale_max_in_parallel``: Maximum number of scaled job instances
that can be run in parallel. Only works with scale > ``1``.
Default same as scale.
- ``memory``: Specify the real memory required per node. Different
units can be specified using the suffix [``K|M|G|T``]. Default
value ``""`` lets the infrastructure interface assign the default memory
to the job.
- ``stdout_file``: Define the file where to gather the standard
  output of the job. Default value ``""`` sets the ``.out``
  filename.
- ``stderr_file``: Define the file where to gather the standard
  error output. Default value ``""`` sets the ``.err``
  filename.
- ``mail-user``: Email to receive notification of job state changes.
Default value ``""`` does not send any mail.
- ``mail-type``: Type of event to be notified by mail, can define
several events separated by comma. Valid values
``NONE, BEGIN, END, FAIL, TIME_LIMIT, REQUEUE, ALL``. Default
value ``""`` does not send any mail.
- ``reservation``: Allocate resources for the job from the named
reservation. Default value ``""`` does not allocate from any named
reservation.
- ``qos``: Request a quality of service for the job. Default value
  ``""`` lets the infrastructure interface assign the default user ``qos``.
- ``deployment``: Scripts to perform deployment operations. Optional.

  - ``bootstrap``: Path, relative to the blueprint, of the script that
    will be executed in the HPC during the install workflow to
    bootstrap the job (data movements, binary downloads, etc.).
  - ``revert``: Path, relative to the blueprint, of the script that
    will be executed in the HPC during the uninstall workflow,
    reverting the bootstrap or performing other clean-up operations.
  - ``inputs``: List of inputs that will be passed to the scripts when
    executed in the HPC.
- ``publish``: A list of outputs to be published after job execution
  (see the sketch after the examples below). Each list item is a
  dictionary containing:

  - ``type``: Type of the external repository to be published to. Only
    ``CKAN`` is supported for now. The rest of the parameters depend on
    the type.
  - For ``type: CKAN``:

    - ``entrypoint``: CKAN entrypoint.
    - ``api_key``: Individual user CKAN API key.
    - ``dataset``: Id of the dataset in which the file will be
      published.
    - ``file_path``: Local path of the output file in the computation
      node.
    - ``name``: Name used to publish the file in the repository.
    - ``description``: Text describing the data file.
- ``skip_cleanup``: Set to ``True`` to skip cleaning up the
  orchestrator's auxiliary files. Default ``False``.
**Note**

The variable ``$CURRENT_WORKDIR`` is available in all operations and
scripts. It points to the working directory of the execution in the
HPC under the *HOME* directory: ``/home/user/$CURRENT_WORKDIR/``.
**Note**

The variables ``$SCALE_INDEX``, ``$SCALE_COUNT`` and ``$SCALE_MAX``
are available in all commands and inside the scripts where a
``# DYNAMIC VARIABLES`` line exists (they will be dynamically loaded
after this line). They hold, for each job instance, the index, the
total number of instances, and the maximum in parallel, respectively.
**Example**
This example demonstrates how to describe a job.
.. code:: yaml

   hpc_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         partition: { get_input: partition_name }
         commands: ["touch job-$SCALE_INDEX.test"]
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:01:00"
         scale: 4
       skip_cleanup: True
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...
This example demonstrates how to describe a script job.
.. code:: yaml

   hpc_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         script: "touch.script"
         arguments:
           - "job-\\$SCALE_INDEX.test"
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:01:00"
         partition: { get_input: partition_name }
         scale: 4
       deployment:
         bootstrap: "scripts/create_script.sh"
         revert: "scripts/delete_script.sh"
         inputs:
           - "script-"
       skip_cleanup: True
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...
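The following sketch publishes a job output to CKAN after execution;
the entrypoint, API key, dataset id, command and file names are
placeholders, not values from a real deployment:

.. code:: yaml

   hpc_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         commands: ["./produce_output.sh"] # hypothetical command
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:05:00"
       publish:
         - type: "CKAN"
           entrypoint: "[CKAN-ENTRYPOINT]"
           api_key: "[CKAN-API-KEY]"
           dataset: "[CKAN-DATASET-ID]"
           file_path: "output.csv"
           name: "job-output"
           description: "Output file generated by the job"
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...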
**Mapped Operations:**

- ``cloudify.interfaces.lifecycle.start`` Sends and executes the
  bootstrap script.
- ``cloudify.interfaces.lifecycle.stop`` Sends and executes the revert
  script.
- ``croupier.interfaces.lifecycle.queue`` Queues the job in the HPC.
- ``croupier.interfaces.lifecycle.publish`` Publishes outputs outside
  the HPC.
- ``croupier.interfaces.lifecycle.cleanup`` Performs clean-up
  operations after the job is finished.
- ``croupier.interfaces.lifecycle.cancel`` Cancels a queued job.
.. _croupier_nodes_singularityjob:
croupier.nodes.SingularityJob
-----------------------------
**Derived From:** croupier_nodes_job_

Use this type to describe a job executed from a Singularity container.
**Properties:**
- ``job_options``: Job parameters and needed resources.
- ``pre``: List of commands to be executed before running the
  Singularity container. Optional.
- ``post``: List of commands to be executed after running the
  Singularity container. Optional.
- ``image``: Singularity image file.
- ``home``: Home volume that will be bound to the image instance.
  Optional.
- ``volumes``: List of volumes that will be bound to the image
  instance.
- ``partition``: Partition in which the job will be executed. If not
provided, the HPC default will be used.
- ``nodes``: Number of nodes to use in the job. Default ``1``.
- ``tasks``: Number of tasks of the job. Default ``1``.
- ``tasks_per_node``: Number of tasks per node. Default ``1``.
- ``max_time``: Set a limit on the total run time of the job
allocation. Mandatory if no script is provided.
- ``scale``: Execute in parallel the job N times according to this
property. Default ``1`` (no scale).
- ``scale_max_in_parallel``: Maximum number of scaled job instances
that can be run in parallel. Only works with scale > ``1``.
Default same as scale.
- ``memory``: Specify the real memory required per node. Different
units can be specified using the suffix [``K|M|G|T``]. Default
value ``""`` lets the infrastructure interface assign the default memory
to the job.
- ``stdout_file``: Define the file where to gather the standard
  output of the job. Default value ``""`` sets the ``.out``
  filename.
- ``stderr_file``: Define the file where to gather the standard
  error output. Default value ``""`` sets the ``.err``
  filename.
- ``mail-user``: Email to receive notification of job state changes.
Default value ``""`` does not send any mail.
- ``mail-type``: Type of event to be notified by mail, can define
several events separated by comma. Valid values
``NONE, BEGIN, END, FAIL, TIME_LIMIT, REQUEUE, ALL``. Default
value ``""`` does not send any mail.
- ``reservation``: Allocate resources for the job from the named
reservation. Default value ``""`` does not allocate from any named
reservation.
- ``qos``: Request a quality of service for the job. Default value
  ``""`` lets the infrastructure interface assign the default user ``qos``.
- ``deployment``: Optional scripts to perform deployment operations
  (bootstrap and revert).

  - ``bootstrap``: Path, relative to the blueprint, of the script that
    will be executed in the HPC during the install workflow to
    bootstrap the job (image download, data movements, etc.).
  - ``revert``: Path, relative to the blueprint, of the script that
    will be executed in the HPC during the uninstall workflow,
    reverting the bootstrap or performing other clean-up operations
    (like removing the image).
  - ``inputs``: List of inputs that will be passed to the scripts when
    executed in the HPC.
- ``skip_cleanup``: Set to ``True`` to skip cleaning up the
  orchestrator's auxiliary files. Default ``False``.
**Note**

The variable ``$CURRENT_WORKDIR`` is available in all operations and
scripts. It points to the working directory of the execution in the
HPC under the *HOME* directory: ``/home/user/$CURRENT_WORKDIR/``.

**Note**

The variables ``$SCALE_INDEX``, ``$SCALE_COUNT`` and ``$SCALE_MAX``
are available when scaling, holding for each job instance the index,
the total number of instances, and the maximum in parallel,
respectively.
**Example**

This example demonstrates how to describe a new job executed in a
Singularity container.
.. code:: yaml

   singularity_job:
     type: croupier.nodes.SingularityJob
     properties:
       job_options:
         pre:
           - { get_input: mpi_load_command }
           - { get_input: singularity_load_command }
         partition: { get_input: partition_name }
         image:
           {
             concat:
               [
                 { get_input: singularity_image_storage },
                 "/",
                 { get_input: singularity_image_filename },
               ],
           }
         volumes:
           - { get_input: scratch_voulume_mount_point }
           - { get_input: singularity_mount_point }
         commands: ["touch singularity.test"]
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:01:00"
       deployment:
         bootstrap: "scripts/singularity_bootstrap_example.sh"
         revert: "scripts/singularity_revert_example.sh"
         inputs:
           - { get_input: singularity_image_storage }
           - { get_input: singularity_image_filename }
           - { get_input: singularity_image_uri }
           - { get_input: singularity_load_command }
       skip_cleanup: True
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...
**Mapped Operations:**

- ``cloudify.interfaces.lifecycle.start`` Sends and executes the
  bootstrap script.
- ``cloudify.interfaces.lifecycle.stop`` Sends and executes the revert
  script.
- ``croupier.interfaces.lifecycle.queue`` Queues the job in the HPC.
- ``croupier.interfaces.lifecycle.publish`` Publishes outputs outside
  the HPC.
- ``croupier.interfaces.lifecycle.cleanup`` Performs clean-up
  operations after the job is finished.
- ``croupier.interfaces.lifecycle.cancel`` Cancels a queued job.
.. _relationships:
Relationships
=============
See the relationships section in the Cloudify documentation.

The following plugin relationship operations are defined in the HPC
plugin (a usage sketch follows the list):
- ``task_managed_by_interface`` Sets a croupier_nodes_job_ to be
  executed through a croupier_nodes_interface_.
- ``job_depends_on`` Sets a croupier_nodes_job_ as dependent on the
  target (another croupier_nodes_job_), so the target job needs to
  finish before the source can start.
- ``interface_contained_in`` Sets a croupier_nodes_interface_ to be
  contained in the specific target (a computing node).
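A minimal sketch of two jobs wired together (node names are
illustrative): ``second_job`` depends on ``first_job``, so it is not
queued until ``first_job`` has finished, and both are managed by the
same interface:

.. code:: yaml

   first_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         commands: ["touch first.test"]
         max_time: "00:01:00"
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface

   second_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         commands: ["touch second.test"]
         max_time: "00:01:00"
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
       - type: job_depends_on
         target: first_job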
Tests
=====
To run the tests, the Cloudify CLI has to be installed locally. Example
blueprints can be found in the *tests/blueprint* folder and have the
``simulate`` option active by default. The blueprint to be tested can
be changed in *workflows_tests.py* in the *tests* folder.

To run the tests against a real HPC / monitor system, copy the file
*blueprint-inputs.yaml* to *local-blueprint-inputs.yaml* and edit it
with your credentials. Then edit the blueprint, commenting out the
simulate option and adjusting other parameters as you wish (e.g. change
the name ft2_node to your own HPC name). To use the Openstack
integration, your private key must be put in the *inputs/keys* folder.
**Note**

*dev-requirements.txt* needs to be installed (*windev-requirements.txt*
for Windows):
.. code:: bash

   pip install -r dev-requirements.txt
To run the tests, run tox in the root folder:
.. code:: bash

   tox -e flake8,unit,integration