Croupier Cloudify plugin

Requirements

  • Python version
    • 2.7.x

Compatibility

Configuration

The Croupier plugin requires credentials, endpoint, and other setup information to authenticate against and interact with the computing infrastructures.

This configuration is defined in the credentials and config properties.

credentials:
  host: "[HPC-HOST]"
  user: "[HPC-SSH-USER]"
  private_key: |
    -----BEGIN RSA PRIVATE KEY-----
    ......
    -----END RSA PRIVATE KEY-----
  private_key_password: "[PRIVATE-KEY-PASSWORD]"
  password: "[HPC-SSH-PASS]"
  login_shell: {true|false}
  tunnel:
      host: ...
      ...
  1. HPC and SSH credentials. At least private_key or password must be provided.

    a. tunnel: Follows the same structure as its parent (credentials), to connect to the infrastructure through a tunneled SSH connection (a sketch follows this list).

    b. login_shell: Some systems require connecting through a login shell. Default false.
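
As an illustration, a minimal sketch of credentials tunneled through a gateway host; the [GATEWAY-*] placeholders are assumptions, not fixed names:

credentials:
  host: "[HPC-HOST]"
  user: "[HPC-SSH-USER]"
  password: "[HPC-SSH-PASS]"
  login_shell: false
  tunnel:
    # same structure as its parent credentials
    host: "[GATEWAY-HOST]"
    user: "[GATEWAY-SSH-USER]"
    password: "[GATEWAY-SSH-PASS]"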

config:
  country_tz: "Europe/Madrid"
  infrastructure_interface: {SLURM|TORQUE|SHELL}
  1. country_tz: Country Time Zone configured in the HPC.
  2. infrastructure_interface: Infrastructure Interface used by the HPC.

Warning

Only Slurm and Torque are currently accepted as infrastructure interfaces for HPC. For cloud providers, SHELL is used as the interface.
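
For instance, a hedged config sketch for a cloud VM would keep the same structure and swap the interface value (the time zone shown is just the one from the example above):

config:
  country_tz: "Europe/Madrid"
  infrastructure_interface: "SHELL"  # commands run in a plain shell, no batch system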

Types

This section describes the node type definitions. Nodes describe resources in your HPC infrastructures. For more information, see node type.

croupier.nodes.InfrastructureInterface

Derived From: cloudify.nodes.Compute

Use this type to describe the interface of a computing infrastructure (HPC or VM).

Properties:

  • config: type of interface and system time zone, as described in config.
  • credentials: Access credentials, as described in credentials.
  • base_dir: Root directory of the working directory. Default $HOME.
  • workdir_prefix: Prefix name of the working directory that will be created by this interface.
  • job_prefix: Job name prefix for the jobs created by this interface. Default cfyhpc.
  • monitor_period: Seconds between job status checks. This is necessary because infrastructure interfaces can be overloaded if queried too many times in a short period. Default 60.
  • skip_cleanup: Set to true to keep all files when destroying the deployment. Default False.
  • simulate: If true, performs a dry run in which jobs are not actually executed and are simulated to finish immediately. Useful for testing. Default False.
  • external_monitor_entrypoint: Entrypoint of the external monitor that Cloudify will use instead of the internal one (a sketch follows the example below).
  • external_monitor_type: Type of the monitoring system when using an external one. Default PROMETHEUS.
  • external_monitor_port: Port of the monitor when using an external monitoring system. Default :9090.
  • external_monitor_orchestrator_port: Port of the external monitor to connect with Croupier. Default :8079.

Example

This example demonstrates how to describe a SLURM interface on an HPC.

hpc_interface:
  type: croupier.nodes.InfrastructureInterface
  properties:
    credentials:
      host: "[HPC-HOST]"
      user: "[HPC-SSH-USER]"
      password: "[HPC-SSH-PASS]"
    config:
      country_tz: "Europe/Madrid"
      infrastructure_interface: "SLURM"
    job_prefix: crp
    workdir_prefix: test
 ...
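
As a hedged variant, the same interface wired to an external monitoring system; the [MONITOR-HOST] entrypoint is a placeholder, and the ports simply restate the documented defaults:

hpc_interface:
  type: croupier.nodes.InfrastructureInterface
  properties:
    credentials:
      host: "[HPC-HOST]"
      user: "[HPC-SSH-USER]"
      password: "[HPC-SSH-PASS]"
    config:
      country_tz: "Europe/Madrid"
      infrastructure_interface: "SLURM"
    external_monitor_entrypoint: "[MONITOR-HOST]"  # placeholder monitor endpoint
    external_monitor_type: "PROMETHEUS"            # documented default
    external_monitor_port: ":9090"
    external_monitor_orchestrator_port: ":8079"
 ...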

Mapped Operations:

  • cloudify.interfaces.lifecycle.configure Checks that there is a connection between Cloudify and the infrastructure interface, and creates a new working directory.
  • cloudify.interfaces.lifecycle.delete Cleans up all data generated by the execution.
  • cloudify.interfaces.monitoring.start If the external monitor orchestrator is available, sends a notification to start monitoring the infrastructure.
  • cloudify.interfaces.monitoring.stop If the external monitor orchestrator is available, sends a notification to end monitoring the infrastructure.

croupier.nodes.Job

Use this type to describe a job (a task that will execute on the infrastructure).

Properties:

  • job_options: Job parameters and needed resources.

    • pre: List of commands to be executed before running the job. Optional.
    • post: List of commands to be executed after running the job. Optional.
    • partition: Partition in which the job will be executed. If not provided, the HPC default will be used.
    • commands: List of commands to be executed. Mandatory if script property is not present.
    • script: Script to be executed. Mandatory if commands property is not present.
    • arguments: List of arguments to be passed to the execution command. Variables must be escaped, e.g. “\$USER”.
    • nodes: Nodes to use in job. Default 1.
    • tasks: Number of tasks of the job. Default 1.
    • tasks_per_node: Number of tasks per node. Default 1.
    • max_time: Set a limit on the total run time of the job allocation. Mandatory if no script is provided, or if the script does not define it.
    • scale: Execute the job N times in parallel, according to this property. Only for HPC. Default 1 (no scale).
    • scale_max_in_parallel: Maximum number of scaled job instances that can be run in parallel. Only works with scale > 1. Default same as scale.
    • memory: Specify the real memory required per node. Different units can be specified using the suffix [K|M|G|T]. Default value "" lets the infrastructure interface assign the default memory to the job.
    • stdout_file: Define the file where to gather the standard output of the job. Default value "" sets the <job-name>.out filename.
    • stderr_file: Define the file where to gather the standard error output. Default value "" sets the <job-name>.err filename.
    • mail-user: Email to receive notification of job state changes. Default value "" does not send any mail.
    • mail-type: Type of events to be notified by mail; several events can be specified, separated by commas. Valid values: NONE, BEGIN, END, FAIL, TIME_LIMIT, REQUEUE, ALL. Default value "" does not send any mail (a combined sketch appears after the examples below).
    • reservation: Allocate resources for the job from the named reservation. Default value "" does not allocate from any named reservation.
    • qos: Request a quality of service for the job. Default value "" lets the infrastructure interface assign the default user QOS.
  • deployment: Scripts to perform deployment operations. Optional.

    • bootstrap: Path, relative to the blueprint, of the script that will be executed in the HPC during the install workflow to bootstrap the job (data movements, binary downloads, etc.).
    • revert: Path, relative to the blueprint, of the script that will be executed in the HPC during the uninstall workflow, reverting the bootstrap or performing other clean-up operations.
    • inputs: List of inputs that will be passed to the scripts when executed in the HPC.
  • publish: A list of outputs to be published after job execution (a sketch appears after the notes below). Each list item is a dictionary containing:

    • type: Type of the external repository to publish to. Only CKAN is supported for now. The rest of the parameters depend on the type.
    • For type: CKAN:
      • entrypoint: CKAN entrypoint.
      • api_key: Individual user CKAN API key.
      • dataset: Id of the dataset in which the file will be published.
      • file_path: Local path of the output file in the computation node.
      • name: Name used to publish the file in the repository.
      • description: Text describing the data file.
  • skip_cleanup: Set to true to keep the orchestrator's auxiliary files. Default False.

    Note

    The variable $CURRENT_WORKDIR is available in all operations and scripts. It points to the working directory of the execution in the HPC, relative to the HOME directory: /home/user/$CURRENT_WORKDIR/.

    Note

    The variables $SCALE_INDEX, $SCALE_COUNT and $SCALE_MAX are available in all commands and inside scripts where a # DYNAMIC VARIABLES line exists (they are dynamically loaded after that line). For each job instance they hold, respectively, the index, the total number of instances, and the maximum number running in parallel.
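
    A hedged sketch of the publish property for a CKAN repository; the entrypoint, API key, dataset id, and file names are placeholders:

    publish:
      - type: "CKAN"
        entrypoint: "[CKAN-ENTRYPOINT]"            # placeholder CKAN URL
        api_key: "[CKAN-API-KEY]"                  # placeholder user key
        dataset: "[DATASET-ID]"
        file_path: "$CURRENT_WORKDIR/output.csv"   # illustrative output file
        name: "output"
        description: "Output file of the job"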

Example

This example demonstrates how to describe a job.

hpc_job:
  type: croupier.nodes.Job
  properties:
    job_options:
      partition: { get_input: partition_name }
      commands: ["touch job-$SCALE_INDEX.test"]
      nodes: 1
      tasks: 1
      tasks_per_node: 1
      max_time: "00:01:00"
      scale: 4
    skip_cleanup: True
  relationships:
  - type: task_managed_by_interface
    target: hpc_interface
 ...

This example demonstrates how to describe a script job.

hpc_job:
  type: croupier.nodes.Job
  properties:
    job_options:
      script: "touch.script"
      arguments:
          - "job-\\$SCALE_INDEX.test"
      nodes: 1
      tasks: 1
      tasks_per_node: 1
      max_time: "00:01:00"
      partition: { get_input: partition_name }
      scale: 4
    deployment:
      bootstrap: "scripts/create_script.sh"
      revert: "scripts/delete_script.sh"
      inputs:
        - "script-"
    skip_cleanup: True
  relationships:
    - type: task_managed_by_interface
      target: hpc_interface
 ...
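
This further sketch combines notification and resource options from job_options; the command, e-mail address, memory, and QOS values are illustrative:

hpc_job:
  type: croupier.nodes.Job
  properties:
    job_options:
      commands: ["./run.sh"]         # illustrative command
      max_time: "01:00:00"
      memory: "4G"                   # real memory per node
      qos: "[QOS-NAME]"              # placeholder QOS
      mail-user: "user@example.com"  # placeholder address
      mail-type: "END,FAIL"          # notify on completion and failure
  relationships:
    - type: task_managed_by_interface
      target: hpc_interface
 ...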

Mapped Operations:

  • cloudify.interfaces.lifecycle.start Sends and executes the bootstrap script.
  • cloudify.interfaces.lifecycle.stop Sends and executes the revert script.
  • croupier.interfaces.lifecycle.queue Queues the job in the HPC.
  • croupier.interfaces.lifecycle.publish Publishes outputs outside the HPC.
  • croupier.interfaces.lifecycle.cleanup Cleans up after the job is finished.
  • croupier.interfaces.lifecycle.cancel Cancels a queued job.

croupier.nodes.SingularityJob

Derived From: croupier.nodes.Job

Use this type to describe a job executed from a Singularity container.

Properties:

  • job_options: Job parameters and needed resources.

    • pre: List of commands to be executed before running the Singularity container. Optional.
    • post: List of commands to be executed after running the Singularity container. Optional.
    • image: Singularity image file.
    • home: Home volume that will be bound to the image instance. Optional.
    • volumes: List of volumes that will be bound to the image instance.
    • partition: Partition in which the job will be executed. If not provided, the HPC default will be used.
    • nodes: Nodes to use in the job. Default 1.
    • tasks: Number of tasks of the job. Default 1.
    • tasks_per_node: Number of tasks per node. Default 1.
    • max_time: Set a limit on the total run time of the job allocation. Mandatory if no script is provided.
    • scale: Execute the job N times in parallel, according to this property. Default 1 (no scale).
    • scale_max_in_parallel: Maximum number of scaled job instances that can be run in parallel. Only works with scale > 1. Default same as scale.
    • memory: Specify the real memory required per node. Different units can be specified using the suffix [K|M|G|T]. Default value "" lets the infrastructure interface assign the default memory to the job.
    • stdout_file: Define the file where to gather the standard output of the job. Default value "" sets the <job-name>.out filename.
    • stderr_file: Define the file where to gather the standard error output. Default value "" sets the <job-name>.err filename.
    • mail-user: Email to receive notification of job state changes. Default value "" does not send any mail.
    • mail-type: Type of events to be notified by mail; several events can be specified, separated by commas. Valid values: NONE, BEGIN, END, FAIL, TIME_LIMIT, REQUEUE, ALL. Default value "" does not send any mail.
    • reservation: Allocate resources for the job from the named reservation. Default value "" does not allocate from any named reservation.
    • qos: Request a quality of service for the job. Default value "" lets the infrastructure interface assign the default user QOS.
  • deployment: Optional scripts to perform deployment operations (bootstrap and revert).

    • bootstrap: Path, relative to the blueprint, of the script that will be executed in the HPC during the install workflow to bootstrap the job (image download, data movements, etc.).
    • revert: Path, relative to the blueprint, of the script that will be executed in the HPC during the uninstall workflow, reverting the bootstrap or performing other clean-up operations (like removing the image).
    • inputs: List of inputs that will be passed to the scripts when executed in the HPC.
  • skip_cleanup: Set to true to keep the orchestrator's auxiliary files. Default False.

    Note

    The variable $CURRENT_WORKDIR is available in all operations and scripts. It points to the working directory of the execution in the HPC, relative to the HOME directory: /home/user/$CURRENT_WORKDIR/.

    Note

    The variables $SCALE_INDEX, $SCALE_COUNT and $SCALE_MAX are available when scaling, holding for each job instance the index, the total number of instances, and the maximum in parallel, respectively.

Example

This example demonstrates how to describe a new job executed in a Singularity container.

singularity_job:
  type: croupier.nodes.SingularityJob
  properties:
    job_options:
      pre:
      - { get_input: mpi_load_command }
      - { get_input: singularity_load_command }
      partition: { get_input: partition_name }
      image: {
          concat:
              [
                  { get_input: singularity_image_storage },
                  "/",
                  { get_input: singularity_image_filename },
              ],
      }
      volumes:
      - { get_input: scratch_voulume_mount_point }
      - { get_input: singularity_mount_point }
      commands: ["touch singularity.test"]
      nodes: 1
      tasks: 1
      tasks_per_node: 1
      max_time: "00:01:00"
    deployment:
        bootstrap: "scripts/singularity_bootstrap_example.sh"
        revert: "scripts/singularity_revert_example.sh"
        inputs:
        - { get_input: singularity_image_storage }
        - { get_input: singularity_image_filename }
        - { get_input: singularity_image_uri }
        - { get_input: singularity_load_command }
    skip_cleanup: True
  relationships:
      - type: task_managed_by_interface
        target: hpc_interface
 ...

Mapped Operations:

  • cloudify.interfaces.lifecycle.start Sends and executes the bootstrap script.
  • cloudify.interfaces.lifecycle.stop Sends and executes the revert script.
  • croupier.interfaces.lifecycle.queue Queues the job in the HPC.
  • croupier.interfaces.lifecycle.publish Publishes outputs outside the HPC.
  • croupier.interfaces.lifecycle.cleanup Cleans up after the job is finished.
  • croupier.interfaces.lifecycle.cancel Cancels a queued job.

Relationships

See the relationships section.

The following plugin relationship is used throughout the examples above:

  • task_managed_by_interface: Binds a job to the croupier.nodes.InfrastructureInterface node that manages its execution.

Tests

To run the tests, the Cloudify CLI has to be installed locally. Example blueprints can be found in the tests/blueprint folder and have the simulate option active by default. The blueprint to be tested can be changed in workflows_tests.py in the tests folder.

To run the tests against a real HPC / monitor system, copy the file blueprint-inputs.yaml to local-blueprint-inputs.yaml and edit it with your credentials. Then edit the blueprint, commenting out the simulate option, and adjust other parameters as you wish (e.g., change the name ft2_node to your own HPC name). To use the OpenStack integration, your private key must be placed in the folder inputs/keys.

Note

dev-requirements.txt needs to be installed (windev-requirements.txt for Windows):

pip install -r dev-requirements.txt

To run the tests, run tox in the root folder:

tox -e flake8,unit,integration