Torchrun
This provider runs distributed PyTorch training across multiple nodes.
You can specify the Python file with the training script, the number of nodes, the Python version,
a requirements.txt file, environment variables, arguments, the folders to save as output artifacts
(on the master node), dependencies on other workflows (if any), and the resources the workflow needs
on each node (e.g. CPU, GPU, memory, etc.).
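The training script itself is outside the scope of this provider, but for context, here is a minimal sketch of what such a train.py might contain. It assumes the standard torchrun behavior of setting RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each process; the model, data, and paths are placeholders:

# Hypothetical minimal DDP training script, for illustration only.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun provides RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT via the env
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; DDP synchronizes gradients across all nodes
    model = DDP(torch.nn.Linear(10, 1).to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 10, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Save on the master node only; "model" matches the artifacts folder
    # declared in the example below
    if dist.get_rank() == 0:
        os.makedirs("model", exist_ok=True)
        torch.save(model.module.state_dict(), "model/model.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()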
Example usage
Basic example
workflows:
  - name: train
    provider: torchrun
    file: "train.py"
    requirements: "requirements.txt"
    artifacts: ["model"]
    nodes: 4
    resources:
      gpu:
        name: "K80"
        count: 4
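Once defined, the workflow can be launched by name from the CLI:

dstack run train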
Alternatively, you can use this provider from the CLI (without defining your workflow in the .dstack/workflows.yaml file):
dstack run torchrun train.py --nnodes 4 \
-r requirements.txt -a model \
--gpu 4 --gpu-name K80
Workflows file reference
The following arguments are required:
file - (Required) The Python file with the training script, e.g. "train.py"

The following arguments are optional:

nodes - (Optional) The number of nodes. By default, it's 1.
args - (Optional) The list of arguments for the Python program
before_run - (Optional) The list of shell commands to run before running the Python file
requirements - (Optional) The path to the requirements.txt file
version - (Optional) The major version of Python. By default, it's 3.10.
environment - (Optional) The list of environment variables
artifacts - (Optional) The list of folders that must be saved as output artifacts
resources - (Optional) The hardware resources required by the workflow
working_dir - (Optional) The path to the working directory
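To illustrate how these fit together, here is a sketch of a workflow using most of the optional arguments. The script, commands, and values are made up for the example, and the exact YAML shape of the environment entries (NAME=value strings) is an assumption:

workflows:
  - name: train
    provider: torchrun
    file: "train.py"
    args: ["--epochs", "100"]
    before_run:
      - pip install -U pip
    requirements: "requirements.txt"
    version: "3.10"
    environment:
      - PYTHONUNBUFFERED=1
    working_dir: "."
    artifacts: ["model"]
    nodes: 2
    resources:
      gpu:
        count: 1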
resources
The hardware resources required by the workflow
cpu - (Optional) The number of CPU cores
memory - (Optional) The size of RAM memory, e.g. "16GB"
gpu - (Optional) The number of GPUs, their model name, and memory
shm_size - (Optional) The size of shared memory, e.g. "8GB"
interruptible - (Optional) true if the workflow can run on interruptible instances. By default, it's false.
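Putting these keys together, a resources section might look like this (values are illustrative):

resources:
  cpu: 16
  memory: "64GB"
  shm_size: "8GB"
  interruptible: true
  gpu:
    count: 4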
gpu
The number of GPUs, their name and memory
count - (Optional) The number of GPUs
memory - (Optional) The size of GPU memory, e.g. "16GB"
name - (Optional) The name of the GPU model (e.g. "K80", "V100", etc.)
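For example, requesting four V100s with 16GB of GPU memory could look like this (values are illustrative):

resources:
  gpu:
    count: 4
    name: "V100"
    memory: "16GB"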
CLI reference
usage: dstack run torchrun [-d] [-h] [-r REQUIREMENTS] [-e ENV] [-a ARTIFACT]
[--working-dir WORKING_DIR] [-i] [--cpu CPU]
[--memory MEMORY] [--gpu GPU_COUNT]
[--gpu-name GPU_NAME] [--gpu-memory GPU_MEMORY]
[--shm-size SHM_SIZE] [--nnodes [NNODES]]
FILE [ARGS ...]
The following arguments are required:
FILE - (Required) The Python file with the training script
The following arguments are optional:
-d, --detach - (Optional) Do not poll for status updates and logs
--nnodes [NNODES] - (Optional) The number of nodes. By default, it's 1.
--working-dir WORKING_DIR - (Optional) The path to the working directory
-r REQUIREMENTS, --requirements REQUIREMENTS - (Optional) The path to the requirements.txt file
-e ENV, --env ENV - (Optional) The list of environment variables
-a ARTIFACT, --artifact ARTIFACT - (Optional) A folder that must be saved as an output artifact
--cpu CPU - (Optional) The number of CPU cores
--memory MEMORY - (Optional) The size of RAM memory, e.g. "16GB"
--gpu GPU_COUNT - (Optional) The number of GPUs
--gpu-name GPU_NAME - (Optional) The name of the GPU model (e.g. "K80", "V100", etc.)
--gpu-memory GPU_MEMORY - (Optional) The size of GPU memory, e.g. "16GB"
--shm-size SHM_SIZE - (Optional) The size of shared memory, e.g. "8GB"
-i, --interruptible - (Optional) Run the workflow on interruptible instances
ARGS - (Optional) The list of arguments for the Python program
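Combining several of these options, a fuller invocation might look like this (the values are illustrative):

dstack run torchrun train.py --nnodes 4 \
  -r requirements.txt -e PYTHONUNBUFFERED=1 -a model \
  --cpu 16 --memory 64GB --gpu 4 --gpu-name V100 \
  --shm-size 8GB -i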