..
   Copyright (c) 2019 Atos Spain SA. All rights reserved.

   This file is part of Croupier.

   Croupier is free software: you can redistribute it and/or modify it
   under the terms of the Apache License, Version 2.0 (the License).

   THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT ANY WARRANTY OF ANY KIND, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.

   See the README file for full disclaimer information and the LICENSE file for
   full license information in the project root.

   @author: Javier Carnero
            Atos Research & Innovation, Atos Spain S.A.
            e-mail: javier.carnero@atos.net

========================
Croupier Cloudify plugin
========================

.. _requirements:

Requirements
------------

- Python version: 2.7.x

.. _compatibility:

Compatibility
-------------

- `Slurm `__ based HPC by SSH user & key/password.
- `Moab/Torque `__ based HPC by SSH user & key/password.
- Tested with the `Openstack plugin `__.

**Tip** Example blueprints can be found at the
`Croupier resources repository `__.

.. _configuration:

Configuration
-------------

The Croupier plugin requires credentials, endpoint and other setup information
in order to authenticate and interact with the computing infrastructures.
These configuration properties are defined in the *credentials* and *config*
properties.

.. _credentials:

.. code:: yaml

   credentials:
     host: "[HPC-HOST]"
     user: "[HPC-SSH-USER]"
     private_key: |
       -----BEGIN RSA PRIVATE KEY-----
       ......
       -----END RSA PRIVATE KEY-----
     private_key_password: "[PRIVATE-KEY-PASSWORD]"
     password: "[HPC-SSH-PASS]"
     login_shell: {true|false}
     tunnel:
       host: ...
       ...

1. HPC and SSH credentials. At least ``private_key`` or ``password`` must be
   provided.

   a. *tunnel*: Follows the same structure as its parent (*credentials*), to
      connect to the infrastructure through a tunneled SSH connection.
   b. *login_shell*: Some systems may require connecting to them using a login
      shell. Default ``false``.

.. _config:

.. code:: yaml

   config:
     country_tz: "Europe/Madrid"
     infrastructure_interface: {SLURM|TORQUE|SHELL}

1. *country_tz*: Country time zone configured in the HPC.
2. *infrastructure_interface*: Infrastructure interface used by the HPC.

**Warning** Only Slurm and Torque are currently accepted as infrastructure
interfaces for HPC. For cloud providers, SHELL is used as interface.

.. _types:

Types
-----

This section describes the `node type `__ definitions. Nodes describe resources
in your HPC infrastructures. For more information, see `node type `__.

.. _croupier_nodes_interface:

croupier.nodes.InfrastructureInterface
--------------------------------------

**Derived From:** `cloudify.nodes.Compute `__

Use this type to describe the interface of a computing infrastructure (HPC or
VM).

**Properties:**

- ``config``: Type of interface and system time zone, as described in config_.
- ``credentials``: Access credentials, as described in credentials_.
- ``base_dir``: Root directory of the working directory. Default ``$HOME``.
- ``workdir_prefix``: Prefix name of the working directory that will be created
  by this interface.
- ``job_prefix``: Job name prefix for the jobs created by this interface.
  Default ``cfyhpc``.
- ``monitor_period``: Seconds between job status checks. This is necessary
  because infrastructure interfaces can be overloaded if queried too many
  times in a short period of time. Default ``60``.
- ``skip_cleanup``: Set to ``True`` to not clean up all files when destroying
  the deployment. Default ``False``.
- ``simulate``: If ``True``, it performs a dry run in which jobs are not really
  executed and are simulated to finish immediately. Useful for testing.
  Default ``False``.
- ``external_monitor_entrypoint``: Entrypoint of the external monitor that
  Cloudify will use instead of the internal one.
- ``external_monitor_type``: Type of the monitoring system when using an
  external one. Default ``{uri-prometheus}[PROMETHEUS]``.
- ``external_monitor_port``: Port of the monitor when using an external
  monitoring system. Default ``:9090``.
- ``external_monitor_orchestrator_port``: Port of the external monitor to
  connect with Croupier. Default ``:8079``.

**Example**

This example demonstrates how to describe a SLURM interface on an HPC.

.. code:: yaml

   hpc_interface:
     type: croupier.nodes.InfrastructureInterface
     properties:
       credentials:
         host: "[HPC-HOST]"
         user: "[HPC-SSH-USER]"
         password: "[HPC-SSH-PASS]"
       config:
         country_tz: "Europe/Madrid"
         infrastructure_interface: "SLURM"
       job_prefix: crp
       workdir_prefix: test
   ...

**Mapped Operations:**

- ``cloudify.interfaces.lifecycle.configure`` Checks that there is a connection
  between Cloudify and the infrastructure interface, and creates a new working
  directory.
- ``cloudify.interfaces.lifecycle.delete`` Cleans up all data generated by the
  execution.
- ``cloudify.interfaces.monitoring.start`` If the external monitor orchestrator
  is available, sends a notification to start monitoring the infrastructure.
- ``cloudify.interfaces.monitoring.stop`` If the external monitor orchestrator
  is available, sends a notification to end monitoring the infrastructure.

.. _croupier_nodes_job:

croupier.nodes.Job
------------------

Use this type to describe a job (a task that will execute on the
infrastructure).

**Properties:**

- ``job_options``: Job parameters and needed resources.

   - ``pre``: List of commands to be executed before running the job. Optional.
   - ``post``: List of commands to be executed after running the job. Optional.
   - ``partition``: Partition in which the job will be executed. If not
     provided, the HPC default will be used.
   - ``commands``: List of commands to be executed. Mandatory if the ``script``
     property is not present.
   - ``script``: Script to be executed. Mandatory if the ``commands`` property
     is not present.
   - ``arguments``: List of arguments to be passed to the execution command.
     Variables must be escaped like ``"\\$USER"``.
   - ``nodes``: Nodes to use in the job. Default ``1``.
   - ``tasks``: Number of tasks of the job. Default ``1``.
   - ``tasks_per_node``: Number of tasks per node. Default ``1``.
   - ``max_time``: Set a limit on the total run time of the job allocation.
     Mandatory if no script is provided, or if the script does not define such
     a property.
   - ``scale``: Execute the job N times in parallel according to this property.
     Only for HPC. Default ``1`` (no scale).
   - ``scale_max_in_parallel``: Maximum number of scaled job instances that can
     be run in parallel. Only works with scale > ``1``. Default same as scale.
   - ``memory``: Specify the real memory required per node. Different units can
     be specified using the suffix [``K|M|G|T``]. Default value ``""`` lets the
     infrastructure interface assign the default memory to the job.
   - ``stdout_file``: Define the file where to gather the standard output of
     the job. Default value ``""`` sets the ``.out`` filename.
   - ``stderr_file``: Define the file where to gather the standard error
     output. Default value ``""`` sets the ``.err`` filename.
   - ``mail-user``: Email to receive notifications of job state changes.
     Default value ``""`` does not send any mail.
   - ``mail-type``: Type of events to be notified by mail; several events can
     be defined, separated by commas. Valid values
     ``NONE, BEGIN, END, FAIL, TIME_LIMIT, REQUEUE, ALL``. Default value ``""``
     does not send any mail.
   - ``reservation``: Allocate resources for the job from the named
     reservation. Default value ``""`` does not allocate from any named
     reservation.
   - ``qos``: Request a quality of service for the job. Default value ``""``
     lets the infrastructure interface assign the default user ``qos``.

- ``deployment``: Scripts to perform deployment operations. Optional.

   - ``bootstrap``: Path, relative to the blueprint, of the script that will be
     executed in the HPC during the install workflow to bootstrap the job
     (data movements, binary download, etc.).
   - ``revert``: Path, relative to the blueprint, of the script that will be
     executed in the HPC during the uninstall workflow, reverting the bootstrap
     or performing other clean-up operations.
   - ``inputs``: List of inputs that will be passed to the scripts when
     executed in the HPC.

- ``publish``: A list of outputs to be published after job execution (see the
  sketch after the examples below). Each list item is a dictionary containing:

   - ``type``: Type of the external repository to publish to. Only ``CKAN`` is
     supported for now. The rest of the parameters depend on the type.
   - ``type: CKAN``

      - ``entrypoint``: CKAN entrypoint.
      - ``api_key``: Individual user CKAN API key.
      - ``dataset``: Id of the dataset in which the file will be published.
      - ``file_path``: Local path of the output file in the computation node.
      - ``name``: Name used to publish the file in the repository.
      - ``description``: Text describing the data file.

- ``skip_cleanup``: Set to ``True`` to not clean up orchestrator auxiliary
  files. Default ``False``.

**Note** The variable ``$CURRENT_WORKDIR`` is available in all operations and
scripts. It points to the working directory of the execution in the HPC under
the *HOME* directory: ``/home/user/$CURRENT_WORKDIR/``.

**Note** The variables ``$SCALE_INDEX``, ``$SCALE_COUNT`` and ``$SCALE_MAX``
are available in all commands and inside the scripts where ``# DYNAMIC VARIABLES``
exists (they will be dynamically loaded after this line). They hold, for each
job instance, the index, the total number of instances, and the maximum in
parallel, respectively.

**Example**

This example demonstrates how to describe a job.

.. code:: yaml

   hpc_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         partition: { get_input: partition_name }
         commands: ["touch job-$SCALE_INDEX.test"]
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:01:00"
         scale: 4
       skip_cleanup: True
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...

This example demonstrates how to describe a script job.

.. code:: yaml

   hpc_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         script: "touch.script"
         arguments:
           - "job-\\$SCALE_INDEX.test"
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:01:00"
         partition: { get_input: partition_name }
         scale: 4
       deployment:
         bootstrap: "scripts/create_script.sh"
         revert: "scripts/delete_script.sh"
         inputs:
           - "script-"
       skip_cleanup: True
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...

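For the ``publish`` property, the following is a minimal sketch of how a CKAN
publication might be attached to a job, assuming the keys map directly onto
the properties listed above; the node name, command, output file name and
bracketed CKAN values are placeholders, not taken from an existing blueprint.

.. code:: yaml

   hpc_publish_job:
     type: croupier.nodes.Job
     properties:
       job_options:
         # Placeholder command that produces the file to be published
         commands: ["./run_simulation.sh > results.csv"]
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:10:00"
       publish:
         - type: "CKAN"
           entrypoint: "[CKAN-ENTRYPOINT]"  # CKAN entrypoint (placeholder)
           api_key: "[CKAN-API-KEY]"        # individual user API key (placeholder)
           dataset: "[CKAN-DATASET-ID]"     # id of the target dataset (placeholder)
           file_path: "results.csv"         # local path of the output file
           name: "results.csv"
           description: "Output file published after the job finishes"
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...
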
**Mapped Operations:**

- ``cloudify.interfaces.lifecycle.start`` Sends and executes the bootstrap
  script.
- ``cloudify.interfaces.lifecycle.stop`` Sends and executes the revert script.
- ``croupier.interfaces.lifecycle.queue`` Queues the job in the HPC.
- ``croupier.interfaces.lifecycle.publish`` Publishes outputs outside the HPC.
- ``croupier.interfaces.lifecycle.cleanup`` Performs clean-up operations after
  the job is finished.
- ``croupier.interfaces.lifecycle.cancel`` Cancels a queued job.

.. _croupier_nodes_singularityjob:

croupier.nodes.SingularityJob
-----------------------------

**Derived From:** croupier_nodes_job_

Use this type to describe a job executed from a `Singularity `__ container.

**Properties:**

- ``job_options``: Job parameters and needed resources.

   - ``pre``: List of commands to be executed before running the Singularity
     container. Optional.
   - ``post``: List of commands to be executed after running the Singularity
     container. Optional.
   - ``image``: `Singularity `__ image file.
   - ``home``: Home volume that will be bound to the image instance. Optional.
   - ``volumes``: List of volumes that will be bound to the image instance.
   - ``partition``: Partition in which the job will be executed. If not
     provided, the HPC default will be used.
   - ``nodes``: Necessary nodes of the job. Default ``1``.
   - ``tasks``: Number of tasks of the job. Default ``1``.
   - ``tasks_per_node``: Number of tasks per node. Default ``1``.
   - ``max_time``: Set a limit on the total run time of the job allocation.
     Mandatory if no script is provided.
   - ``scale``: Execute the job N times in parallel according to this property.
     Default ``1`` (no scale).
   - ``scale_max_in_parallel``: Maximum number of scaled job instances that can
     be run in parallel. Only works with scale > ``1``. Default same as scale.
   - ``memory``: Specify the real memory required per node. Different units can
     be specified using the suffix [``K|M|G|T``]. Default value ``""`` lets the
     infrastructure interface assign the default memory to the job.
   - ``stdout_file``: Define the file where to gather the standard output of
     the job. Default value ``""`` sets the ``.out`` filename.
   - ``stderr_file``: Define the file where to gather the standard error
     output. Default value ``""`` sets the ``.err`` filename.
   - ``mail-user``: Email to receive notifications of job state changes.
     Default value ``""`` does not send any mail.
   - ``mail-type``: Type of events to be notified by mail; several events can
     be defined, separated by commas. Valid values
     ``NONE, BEGIN, END, FAIL, TIME_LIMIT, REQUEUE, ALL``. Default value ``""``
     does not send any mail.
   - ``reservation``: Allocate resources for the job from the named
     reservation. Default value ``""`` does not allocate from any named
     reservation.
   - ``qos``: Request a quality of service for the job. Default value ``""``
     lets the infrastructure interface assign the default user ``qos``.

- ``deployment``: Optional scripts to perform deployment operations (bootstrap
  and revert).

   - ``bootstrap``: Path, relative to the blueprint, of the script that will be
     executed in the HPC during the install workflow to bootstrap the job
     (image download, data movements, etc.).
   - ``revert``: Path, relative to the blueprint, of the script that will be
     executed in the HPC during the uninstall workflow, reverting the bootstrap
     or performing other clean-up operations (like removing the image).
   - ``inputs``: List of inputs that will be passed to the scripts when
     executed in the HPC.

- ``skip_cleanup``: Set to ``True`` to not clean up orchestrator auxiliary
  files. Default ``False``.

**Note** The variable ``$CURRENT_WORKDIR`` is available in all operations and
scripts. It points to the working directory of the execution in the HPC under
the *HOME* directory: ``/home/user/$CURRENT_WORKDIR/``.
**Note** The variables ``$SCALE_INDEX``, ``$SCALE_COUNT`` and ``$SCALE_MAX``
are available when scaling, holding for each job instance the index, the total
number of instances, and the maximum in parallel, respectively.

**Example**

This example demonstrates how to describe a new job executed in a
`Singularity `__ container.

.. code:: yaml

   singularity_job:
     type: croupier.nodes.SingularityJob
     properties:
       job_options:
         pre:
           - { get_input: mpi_load_command }
           - { get_input: singularity_load_command }
         partition: { get_input: partition_name }
         image:
           {
             concat:
               [
                 { get_input: singularity_image_storage },
                 "/",
                 { get_input: singularity_image_filename },
               ],
           }
         volumes:
           - { get_input: scratch_voulume_mount_point }
           - { get_input: singularity_mount_point }
         commands: ["touch singularity.test"]
         nodes: 1
         tasks: 1
         tasks_per_node: 1
         max_time: "00:01:00"
       deployment:
         bootstrap: "scripts/singularity_bootstrap_example.sh"
         revert: "scripts/singularity_revert_example.sh"
         inputs:
           - { get_input: singularity_image_storage }
           - { get_input: singularity_image_filename }
           - { get_input: singularity_image_uri }
           - { get_input: singularity_load_command }
       skip_cleanup: True
     relationships:
       - type: task_managed_by_interface
         target: hpc_interface
   ...

**Mapped Operations:**

- ``cloudify.interfaces.lifecycle.start`` Sends and executes the bootstrap
  script.
- ``cloudify.interfaces.lifecycle.stop`` Sends and executes the revert script.
- ``croupier.interfaces.lifecycle.queue`` Queues the job in the HPC.
- ``croupier.interfaces.lifecycle.publish`` Publishes outputs outside the HPC.
- ``croupier.interfaces.lifecycle.cleanup`` Performs clean-up operations after
  the job is finished.
- ``croupier.interfaces.lifecycle.cancel`` Cancels a queued job.

.. _relationships:

Relationships
=============

See the `relationships `__ section.

The following plugin relationship operations are defined in the HPC plugin:

- ``task_managed_by_interface`` Sets a croupier_nodes_job_ to be executed by
  the interface croupier_nodes_interface_.
- ``job_depends_on`` Makes a croupier_nodes_job_ depend on the target (another
  croupier_nodes_job_), so the target job needs to finish before the source
  can start.
- ``interface_contained_in`` Sets a croupier_nodes_interface_ to be contained
  in the specific target (a computing node).

Tests
=====

To run the tests, the Cloudify CLI has to be installed locally. Example
blueprints can be found in the *tests/blueprint* folder and have the
``simulate`` option active by default. The blueprint to be tested can be
changed in *workflows_tests.py* in the *tests* folder.

To run the tests against a real HPC / monitoring system, copy the file
*blueprint-inputs.yaml* to *local-blueprint-inputs.yaml* and edit it with your
credentials. Then edit the blueprint, commenting out the ``simulate`` option
and adjusting other parameters as you wish (e.g. change the name *ft2_node* to
your own HPC name). To use the OpenStack integration, your private key must be
placed in the *inputs/keys* folder.

**Note** *dev-requirements.txt* needs to be installed
(*windev-requirements.txt* for Windows):

.. code:: bash

   pip install -r dev-requirements.txt

To run the tests, run tox in the root folder:

.. code:: bash

   tox -e flake8,unit,integration
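
As a hedged illustration of the preparation step above, a
*local-blueprint-inputs.yaml* file might look like the following sketch. The
input names are taken from the examples in this document; the exact set
depends on the blueprint under test, and every bracketed value is a
placeholder for your own credentials and site settings.

.. code:: yaml

   # Assumed blueprint inputs; replace the placeholders with your own values.
   partition_name: "[PARTITION-NAME]"
   mpi_load_command: "[MPI-LOAD-COMMAND]"
   singularity_load_command: "[SINGULARITY-LOAD-COMMAND]"
   singularity_image_storage: "[IMAGE-STORAGE-PATH]"
   singularity_image_filename: "[IMAGE-FILENAME]"
   singularity_image_uri: "[IMAGE-URI]"
   singularity_mount_point: "[SINGULARITY-MOUNT-POINT]"
   scratch_voulume_mount_point: "[SCRATCH-MOUNT-POINT]"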