Modelling¶
Applications run by Cloudify/Croupier must be defined in one or more TOSCA files: the blueprint. A blueprint is typically composed of a header, an inputs section, a node_templates section, and an outputs section. Optionally, it can also have a node_types section.
Tip
Example blueprints can be found at the Croupier resources repository.
Header¶
The header includes the TOSCA version used and other imports. In Croupier, the Cloudify DSL 1.3 TOSCA version, the built-in types and the Croupier plugin are mandatory:
tosca_definitions_version: cloudify_dsl_1_3
imports:
    # to speed things up, it is possible to download this file,
    - http://raw.githubusercontent.com/ari-apc-lab/croupier/master/resources/types/cfy_types.yaml
    # Croupier plugin
    - http://raw.githubusercontent.com/ari-apc-lab/croupier/master/plugin.yaml
    # Openstack plugin (Optional)
    - http://www.getcloudify.org/spec/openstack-plugin/2.14.7/plugin.yaml
    # The blueprint can be composed of multiple files; in this case we split out the inputs section (Optional)
    - inputs-def.yaml
Other TOSCA files can also be imported in the imports list to compose a blueprint made of more than one file. See Advanced: Node Types for more info.
Inputs¶
This section defines all the inputs that the blueprint needs. They can then be passed as an argument list in the CLI or, preferably, through an inputs file. An input can define a default value. (See the CLI docs and the files inputs-def.yaml and local-blueprint-inputs-example.yaml in the examples.)
inputs:
    hpc_base_dir:
        description: HPC working directory
        default: $HOME
    partition_name:
        default: thinnodes
In the example above, two inputs are defined:
hpc_base_dir as the base working directory, $HOME by default.
partition_name as the partition to be used in an HPC, thinnodes by default.
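As a minimal sketch of how these two inputs could be supplied through an inputs file, the fragment below overrides one default and restates the other. The directory value is purely illustrative, and the file would normally also carry any other inputs the blueprint declares.

```yaml
# Hypothetical inputs file; keys must match the input names declared in the blueprint.
hpc_base_dir: "$HOME/croupier_runs"   # illustrative path, overrides the $HOME default
partition_name: thinnodes             # same as the default, so it could be omitted
```

Inputs that declare a default, like these two, can be left out of the file entirely.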
Node Templates¶
The node_templates section is where your application is actually defined, by establishing nodes and the relationships between them.
To begin with, every node is identified by its name (hpc_interface in the example below), and a type is assigned to it.
Infrastructure Interface example.
node_templates:
    hpc_interface:
        type: croupier.nodes.InfrastructureInterface
        properties:
            config: { get_input: hpc_interface_config }
            credentials: { get_input: hpc_interface_credentials }
            external_monitor_entrypoint: { get_input: monitor_entrypoint }
            job_prefix: { get_input: job_prefix }
            base_dir: { get_input: "hpc_base_dir" }
            monitor_period: 15
            workdir_prefix: "single"
The example above represents an infrastructure interface, with type croupier.nodes.InfrastructureInterface. Every computing infrastructure must have an infrastructure interface defined (Slurm or Torque are supported for HPC, plain SHELL for Cloud VMs). The interface is then configured using the inputs (through the get_input function). Detailed information about how to configure the HPCs is in the Plugin specification section.
The following code uses hpc_interface to describe four jobs that should run on the HPC that the node represents. In this example all four are of type croupier.nodes.Job, which describes jobs that run directly on the HPC; a job of type croupier.nodes.SingularityJob would instead run using a Singularity container. Navigate to Croupier plugin types to know more about each parameter.
Four jobs example.
    first_job:
        type: croupier.nodes.Job
        properties:
            job_options:
                partition: { get_input: partition_name }
                commands: ["touch fourth_example_1.test"]
                nodes: 1
                tasks: 1
                tasks_per_node: 1
                max_time: "00:01:00"
            skip_cleanup: True
        relationships:
            - type: task_managed_by_interface
              target: hpc_interface

    second_parallel_job:
        type: croupier.nodes.Job
        properties:
            job_options:
                partition: { get_input: partition_name }
                commands: ["touch fourth_example_2.test"]
                nodes: 1
                tasks: 1
                tasks_per_node: 1
                max_time: "00:01:00"
            skip_cleanup: True
        relationships:
            - type: task_managed_by_interface
              target: hpc_interface
            - type: job_depends_on
              target: first_job

    third_parallel_job:
        type: croupier.nodes.Job
        properties:
            job_options:
                script: "touch.script"
                arguments:
                    - "fourth_example_3.test"
                nodes: 1
                tasks: 1
                tasks_per_node: 1
                max_time: "00:01:00"
                partition: { get_input: partition_name }
            deployment:
                bootstrap: "scripts/create_script.sh"
                revert: "scripts/delete_script.sh"
                inputs:
                    - "script_"
            skip_cleanup: True
        relationships:
            - type: task_managed_by_interface
              target: hpc_interface
            - type: job_depends_on
              target: first_job

    fourth_job:
        type: croupier.nodes.Job
        properties:
            job_options:
                script: "touch.script"
                arguments:
                    - "fourth_example_4.test"
                nodes: 1
                tasks: 1
                tasks_per_node: 1
                max_time: "00:01:00"
                partition: { get_input: partition_name }
            deployment:
                bootstrap: "scripts/create_script.sh"
                revert: "scripts/delete_script.sh"
                inputs:
                    - "script_"
            skip_cleanup: True
        relationships:
            - type: task_managed_by_interface
              target: hpc_interface
            - type: job_depends_on
              target: second_parallel_job
            - type: job_depends_on
              target: third_parallel_job
Finally, jobs have two main types of relationships: task_managed_by_interface, to establish which infrastructure interface will run the job, and job_depends_on, to describe the dependencies between jobs. In the example above, fourth_job depends on third_parallel_job and second_parallel_job, so it will not execute until the other two have finished. In the same way, third_parallel_job and second_parallel_job depend on first_job, so they will run in parallel once the first job has finished. All jobs are managed by hpc_interface, so they will run on the HPC using the credentials provided. A third relationship, interface_contained_in, is used to link the infrastructure interface to other Cloudify plugins, such as Openstack. See relationships for more information.
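As an illustration of the third relationship, the fragment below sketches an infrastructure interface contained in a virtual machine. The node names vm_interface and my_vm are hypothetical, the type cloudify.openstack.nodes.Server is assumed to come from the optional Openstack plugin imported in the header, and the properties of both nodes are left out; the relationships page and the plugin documentation remain the reference for what is actually required.

```yaml
# Sketch only: node names are hypothetical and node properties are omitted.
node_templates:
    my_vm:
        # Assumed type, provided by the (optional) Openstack plugin
        type: cloudify.openstack.nodes.Server

    vm_interface:
        type: croupier.nodes.InfrastructureInterface
        relationships:
            # Links the infrastructure interface to the VM that contains it
            - type: interface_contained_in
              target: my_vm
```

Jobs would then point at vm_interface with task_managed_by_interface, exactly as they point at hpc_interface in the HPC example.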
Outputs¶
The last section, outputs, helps to publish attributes of each node that can be retrieved after the install workflow of the blueprint has finished (see Execution).
Each output has a name, a description, and a value.
outputs:
    first_job_name:
        description: first job name
        value: { get_attribute: [first_job, job_name] }
    second_job_name:
        description: second job name
        value: { get_attribute: [second_parallel_job, job_name] }
    third_job_name:
        description: third job name
        value: { get_attribute: [third_parallel_job, job_name] }
    fourth_job_name:
        description: fourth job name
        value: { get_attribute: [fourth_job, job_name] }
Advanced: Node Types¶
Just as node templates are defined in node_templates, new node types can be defined in a node_types section and then used as the type of a node template. These types are usually defined in a separate file and imported into the blueprint through the imports list in the header section, although they can also live in the same file.
Framework example.
node_types:
    croupier.nodes.fenics_iter:
        derived_from: croupier.nodes.Job
        properties:
            iter_number:
                description: Iteration index (two digits string)
            job_options:
                default:
                    pre:
                        - 'module load gcc/5.3.0'
                        - 'module load impi'
                        - 'module load petsc'
                        - 'module load parmetis'
                        - 'module load zlib'
                    script: "$HOME/wing_minimal/fenics-hpc_hpfem/unicorn-minimal/nautilus/fenics_iter.script"
                    arguments:
                        - { get_property: [SELF, iter_number] }

    croupier.nodes.fenics_post:
        derived_from: croupier.nodes.Job
        properties:
            iter_number:
                description: Iteration index (two digits string)
            file:
                description: Input file for dolfin-post postprocessing
            job_options:
                default:
                    pre:
                        - 'module load gcc/5.3.0'
                        - 'module load impi'
                        - 'module load petsc'
                        - 'module load parmetis'
                        - 'module load zlib'
                    script: "$HOME/wing_minimal/fenics-hpc_hpfem/unicorn-minimal/nautilus/post.script"
                    arguments:
                        - { get_property: [SELF, iter_number] }
Above is a dummy example of two new types for the FEniCS framework, both derived from croupier.nodes.Job.
The first type, croupier.nodes.fenics_iter, defines an iteration of the FEniCS framework. A new property, iter_number, is defined with a description and no default value (so it is mandatory). In addition, the default value of the job_options property is overridden with a concrete list of modules, a script and arguments.
The second type, croupier.nodes.fenics_post, describes a simulated postprocessing operation of FEniCS. It defines the iter_number property again, plus another one, file. Finally, the default value of job_options is overridden with a list of modules, a script and arguments.
Note
The arguments reference the built-in function get_property. This allows the orchestrator to compose the arguments based on other properties. To see all the functions available, check the Cloudify intrinsic functions.
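To show how such a type might be consumed, here is a minimal sketch of a node template that uses croupier.nodes.fenics_iter. The node name iter_00 and the value "00" are hypothetical; hpc_interface is assumed to be the infrastructure interface defined earlier in the blueprint.

```yaml
# Sketch only: the node name and the iteration value are illustrative.
node_templates:
    iter_00:
        type: croupier.nodes.fenics_iter
        properties:
            # Mandatory, because the type declares no default for it
            iter_number: "00"
        relationships:
            - type: task_managed_by_interface
              target: hpc_interface
```

Because job_options has a default in the type definition, the template only needs to provide iter_number, which get_property then passes to the script as its argument.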
Execution¶
Applications are executed through the CLI (see the CLI docs), either on your local machine or on a host of your own.
Steps¶
Upload the blueprint
Before doing anything else, the blueprint we want to execute needs to be uploaded to the orchestrator with an assigned name.
cfy blueprints upload -b [BLUEPRINT-NAME] [BLUEPRINT-FILE].yaml
Create a deployment
Once the blueprint is uploaded, we create a deployment, which is a blueprint with an inputs file attached. This is useful for running the same blueprint, which represents the application, with different configurations (deployments). A name has to be assigned to it as well.
cfy deployments create -b [BLUEPRINT-NAME] -i [INPUTS-FILE].yaml --skip-plugins-validation [DEPLOYMENT-NAME]
Note
--skip-plugins-validation is mandatory because we want the orchestrator to download the plugin from a source location (GitHub in our case). This is for testing purposes, and will be removed in future releases.
Install a deployment
The install workflow puts everything in place to run the application. Usual tasks in this workflow are data movements, binary downloads, HPC configuration, etc.
cfy executions start -d [DEPLOYMENT-NAME] install
Run the application
Finally, we start the run_jobs workflow, which begins sending jobs to the different infrastructures. The execution can be followed in the output.
cfy executions start -d [DEPLOYMENT-NAME] run_jobs
Note
The CLI has a timeout of 900 seconds, which is normally not enough time for an application to finish. However, if the CLI times out, the execution will keep running on the MSOOrchestrator. To follow the execution, just follow the instructions in the output.
Revert previous Steps¶
To uninstall the application, recreate the deployment with new inputs, or remove the blueprint (and possibly upload an updated one), follow the steps below, which revert the ones above.
Uninstall a deployment
In contrast to the install workflow, here the orchestrator typically performs the revert operation of install, deleting execution files or moving data to an external location.
cfy executions start -d [DEPLOYMENT-NAME] uninstall -p ignore_failure=true
Note
The ignore_failure parameter is optional; it performs the uninstall even if an error occurs.
Remove a deployment
cfy deployments delete [DEPLOYMENT-NAME]
Remove a blueprint
cfy blueprints delete [BLUEPRINT-NAME]
Troubleshooting¶
If an error occurs, the revert steps above can be followed to undo the last steps made. However, sometimes an execution gets stuck, or you simply want to cancel a running execution, or clear a blueprint or deployment that cannot be uninstalled for whatever reason. The following commands help you resolve these kinds of situations.
See executions list and status
cfy executions list
Check one execution status
cfy executions get [EXECUTION-ID]
Cancel a running (started) execution
cfy executions cancel [EXECUTION-ID]
Hard remove a deployment with all its executions and living nodes
cfy deployments delete [DEPLOYMENT-NAME] -f