dstack to manage clusters of on-prem servers for AI workloads with ease

Community Article · Published October 10, 2024


If you don't know what dstack is yet, please refer to this post and the official documentation to get a basic understanding of dstack. In simple terms, dstack is a compute resource management toolkit with a primary focus on AI development, training, and deployment.

In the beginning, dstack offered a great way to manage and control multiple machines across various cloud services, including GCP (Google Cloud Platform), AWS (Amazon Web Services), Microsoft Azure, OCI (Oracle Cloud Infrastructure), Lambda Labs, RunPod, Vast.ai, DataCrunch, and CUDO, with support for CPU, NVIDIA GPU, AMD GPU, and TPU. This makes a lot of sense because you can find the resources (machines) that best suit your requirements (spec, cost, etc.), and then control machines from different sources in a uniform way.

Since the release of version 0.18.7, dstack has evolved to manage not only cloud resources but also on-prem resources via the ssh-fleet feature. The best part of this feature is that you don't need to know anything about Kubernetes or Slurm; it works with minimal dependencies on top of (almost) plain Docker. Here are some of the advantages of ssh-fleet:

  • easy setup (no Kubernetes, no Slurm)
    • Setting up Kubernetes or Slurm requires a lot of prior knowledge, plus a huge amount of engineering effort to actually install, run, and manage them. With dstack's ssh-fleet, there is almost nothing new to learn beyond what you already do, such as installing CUDA and Docker.
  • gather scattered local machines into clusters
    • Not all organizations have dedicated on-prem computing infrastructure; many labs manage their own computing resources per project. These days, however, we are dealing with ever larger machine learning models such as large language models (LLMs), and training them often requires multi-node collaboration. With dstack's ssh-fleet, you can simply manage multiple machines as a cluster, then assign jobs with single-node or multi-node setups.
  • centralized management of cloud and on-prem resources
    • Machine learning is all about running lots of experiments to find the best model for your problem, which means multiple experiments should run in parallel; otherwise you spend too much time waiting. With dstack, you can assign extra experiments to cloud resources while keeping your on-prem resources busy.

Now, let's go through a basic tutorial on how to set up your own ssh-fleet with dstack.

Pre-requisites for ssh-fleet

On the remote server side

  1. Install Docker
  • Docker provides the containerization technology that dstack relies on to encapsulate your applications and their dependencies. It ensures consistency and reproducibility across different environments.
  • Follow the official Docker installation instructions for your Linux distribution. This typically involves adding Docker's repository and then using your package manager (apt, yum, etc.) to install the docker-ce package. After successful installation, you can verify that Docker is up and running with the following command, which should print a Hello from Docker! message in the terminal:
$ sudo docker run hello-world
  2. Install CUDA Toolkit >= 12.1
  • If you plan to use NVIDIA GPUs for your AI workloads, the CUDA Toolkit is essential. It provides the necessary libraries and tools for your applications to utilize the GPU's processing power. dstack requires CUDA 12.1 or higher for compatibility and to leverage the latest features.
  • Download the CUDA Toolkit installer from NVIDIA's website and follow the installation instructions. Make sure to choose the correct version for your Linux distribution and system architecture.
  3. Install the NVIDIA Container Toolkit
  • The NVIDIA Container Toolkit allows Docker containers to access and utilize NVIDIA GPUs. This is crucial for running GPU-accelerated AI workloads within dstack.
  • Again, refer to NVIDIA's official documentation. You'll typically need to add NVIDIA's container toolkit repository and then install the nvidia-container-toolkit package using your package manager (see the verification sketch after the AMD note below).

If you are an AMD GPU user, instead of steps 2 and 3, install the AMD-specific drivers by following the release notes.
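
As a quick sanity check for steps 2 and 3 (the NVIDIA path), you can verify GPU visibility on the host and inside a container. This is a minimal sketch assuming a systemd-based Linux and a CUDA 12.1 base image; adapt it to your distribution:

# check that the NVIDIA driver works on the host
$ nvidia-smi
# configure Docker's NVIDIA runtime and restart the daemon
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
# verify that containers can see the GPUs
$ sudo docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi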

  4. sudo visudo for username ALL=(ALL) NOPASSWD: ALL
  • This configuration allows the dstack server to execute commands on the remote server without requiring a password. This is necessary for dstack to automatically manage containers and resources on your behalf. It is worth noting that this grants significant privileges to the specified user. Ensure this user is dedicated to dstack operations and apply appropriate security measures.
  • Open the /etc/sudoers file using sudo visudo with the command below. Add the line username ALL=(ALL) NOPASSWD: ALL, replacing username with the actual username you'll use to connect to the remote server. After this, that user can run any command via sudo without being prompted for a password:
$ sudo visudo
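
Inside the editor, the appended line should look like this (username is your actual user):

username ALL=(ALL) NOPASSWD: ALL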

On the local side

  1. generate id_rsa
  • SSH keys provide a secure way to authenticate with your remote servers without needing to enter a password each time. dstack uses these keys to establish secure connections to your on-prem machines for automated cluster management.
  • Use the ssh-keygen command on your local machine to generate an SSH key pair as below. This will create a private key (id_rsa) and a public key (id_rsa.pub):
$ ssh-keygen -t rsa 
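If you prefer a non-interactive run, you can pass the key path and an empty passphrase explicitly; this is just a convenience sketch, so keep a passphrase if your security policy requires one:
$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""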
  2. ssh-copy-id
  • This step allows your local machine to automatically authenticate with the remote server using the SSH key pair, simplifying the connection process and enabling dstack to manage the remote server without manual intervention.
  • Run ssh-copy-id username@remote_host on your local machine as below, replacing username and remote_host with the appropriate values. This command copies your public key to the remote server's authorized_keys file. After this, you can SSH into the remote server without being prompted for a password:
$ ssh-copy-id username@remote_host
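
You can confirm that key-based login works before moving on; the command below should drop you into a remote shell with no password prompt:

$ ssh username@remote_host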

Install dstack and register ssh fleets on the local side

  1. install dstack

Use pip to install dstack and all its optional dependencies. You don't need to specify [all] if you only want to use dstack for managing on-prem clusters, but [all] is helpful when you want to manage both on-prem and cloud resources simultaneously:

$ pip install "dstack[all]"
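If you only need the on-prem ssh-fleet functionality, the base package alone is enough:
$ pip install dstack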
  2. run dstack server

Start the dstack server on your local machine. The dstack server is the core component that manages your resources, schedules jobs, and handles communication between your local machine and your compute resources (both cloud and on-prem).

$ dstack server
  3. write fleet.dstack.yml

Define a YAML file for your ssh-fleet like the one below. There are a number of configuration options (see dstack's official API docs), but the essentials are shown below. Follow the Pre-requisites for ssh-fleet section of this blog post for every server that you want in the ssh-fleet cluster. For instance, the YAML file below shows that I have registered 4 servers (2 with 3xRTX6000 Ada, 2 with 2xA6000). Also note that it points to the RSA private key we generated in the On the local side section above:

type: fleet
# The name is optional, if not specified, generated randomly
name: my-ssh-fleet

# Ensure instances are interconnected
placement: cluster

# The user, private SSH key, and hostnames of the on-prem servers
ssh_config:
  user: username
  identity_file: ~/.ssh/id_rsa
  hosts:
    - xxx.xxx.171.224
    - xxx.xxx.171.225
    - xxx.xxx.164.172
    - xxx.xxx.165.51

Note that placement: cluster ensures the instances (servers) are interconnected, i.e., that they share the same network. If the listed instances do not share the same network, ssh-fleet provisioning will fail. If they do, and placement: cluster is set, you can run multi-node jobs such as distributed AI model training.

  4. apply fleet.dstack.yml

Tell dstack to read the fleet.dstack.yml file and create the ssh-fleet based on your configuration. dstack will attempt to connect to each of the specified hosts using the provided SSH credentials.

$ dstack apply -f fleet.dstack.yml

List the available fleets in your dstack setup. You should see your my-ssh-fleet listed with details about the connected instances (servers), their resources (CPU, memory, GPU, disk), and their current status.

$ dstack fleet
 FLEET         INSTANCE  BACKEND       RESOURCES                                            PRICE  STATUS  CREATED
 my-ssh-fleet  1         ssh (remote)  32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk)  $0.0   idle    2 weeks ago
               2         ssh (remote)  32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk)  $0.0   idle    2 weeks ago
               3         ssh (remote)  64xCPU, 693GB, 2xA6000 (48GB), 1683.6GB (disk)       $0.0   idle    2 weeks ago
               4         ssh (remote)  64xCPU, 693GB, 2xA6000 (48GB), 1683.6GB (disk)       $0.0   idle    2 weeks ago

Also, in the terminal where you run the dstack server, you should see logs similar to the ones below, indicating that dstack has successfully found and established connections with the listed servers:

[08:24:07] INFO     dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-0...
           INFO     dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.171.224
[08:24:13] INFO     dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-1...
           INFO     dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.171.225
[08:24:17] INFO     dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-2...
[08:24:18] INFO     dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.164.172
[08:24:23] INFO     dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-3...
           INFO     dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.165.51
[08:24:41] INFO     dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-0 (xxx.xxx.171.224) was successfully added
[08:24:42] INFO     dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-3 (xxx.xxx.165.51) was successfully added
[08:24:45] INFO     dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-1 (xxx.xxx.171.225) was successfully added
[08:24:57] INFO     dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-2 (xxx.xxx.164.172) was successfully added
  5. write task.dstack.yml

To test it out, I wrote a simple YAML file for a dstack task, shown below, which defines an LLM fine-tuning job with Hugging Face's Alignment Handbook framework. Note that I requested 2 nodes, each with 3 x RTX6000 Ada GPUs:

type: task

nodes: 2
python: "3.11"
nvcc: true

env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
  - ACCELERATE_LOG_LEVEL=info

commands:
  - cd alignment-handbook
  - python -m pip install .
  - python -m pip install flash-attn --no-build-isolation

  - pip install wandb
  - pip install huggingface-hub==0.24.7

  - accelerate launch
    --config_file recipes/accelerate_configs/multi_gpu.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    scripts/run_sft.py
    recipes/custom.yaml

ports:
  - 50002

resources:
  gpu:
    name: rtx6000ada
    memory: 48GB
    count: 3
  shm_size: 24GB

dstack lets users define three different types of runs. A dev environment lets you provision a remote machine with your code, dependencies, and resources, and access it with your desktop IDE. A task allows you to schedule a job or run a web app; it lets you configure dependencies, resources, ports, and more, and tasks can be distributed and run on clusters. A service allows you to deploy a web app or a model as a scalable endpoint; it lets you configure dependencies, resources, authorization, auto-scaling rules, etc.
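
For reference, a minimal dev environment configuration might look like the sketch below; the Python version, IDE, and GPU size here are illustrative placeholders, so adjust them to your setup:

type: dev-environment

python: "3.11"
ide: vscode

resources:
  gpu: 48GB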

service is not supported in on-prem environments since it requires a gateway in the current dstack version (0.18.17), but this requirement is expected to be lifted in a future release.

  6. apply task.dstack.yml

Apply the previously written task.dstack.yml with the dstack apply -f command as below. It will show the registered target servers available to provision the job. When you enter y at the prompt, the fine-tuning job will be launched:

$ dstack apply -f task.dstack.yml

 Configuration          task.dstack.yml
 Project                main
 User                   admin
 Pool                   default-pool
 Min resources          2..xCPU, 8GB.., 2xGPU (48GB), 100GB.. (disk)
 Max price              -
 Max duration           72h
 Spot policy            on-demand
 Retry policy           no
 Creation policy        reuse-or-create
 Termination policy     destroy-after-idle
 Termination idle time  5m

 #  BACKEND  REGION  INSTANCE  RESOURCES                                            SPOT  PRICE
 1  ssh      remote  instance  32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk)  no    $0     idle
 2  ssh      remote  instance  32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk)  no    $0     idle 

(BONUS) Register other cloud services at the same time

Now, we have registered on-prem servers as a cluster with dstack's ssh-fleet. However, you may want to benefit from cloud services at the same time. For instance, this could be particularly useful if you have multiple fine-tuning experiments to run: you can assign some experiments to the on-prem cluster while assigning others to a cloud service. This significantly reduces the overall time spent while making the most of your budget.

To do this, simply follow dstack official document on server/config.yml to add your favorite cloud services. For instance, GCP backend could be added with application default credentials with gcloud CLI toolkit. Or, you can add GCP backend with fine-grained control with service account credentials.
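
As an illustration, a server/config.yml with a GCP backend using application default credentials might look like the sketch below (my-gcp-project is a placeholder for your actual project id):

projects:
  - name: main
    backends:
      - type: gcp
        project_id: my-gcp-project
        creds:
          type: default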

Once both the on-prem cluster and a cloud service are registered as backends, the dstack apply command tries to find appropriate instances from the cloud service by default. Append the --backend remote option when you want to provision jobs on the on-prem cluster.

# to target cloud service
$ dstack apply -f task.dstack.yml

# to target on-prem cluster
$ dstack apply -f task.dstack.yml --backend remote

Concluding thoughts

dstack's ssh-fleet feature offers a streamlined approach to managing on-prem clusters for AI workloads. By simplifying setup and centralizing control, dstack empowers AI practitioners to efficiently leverage their on-prem resources, whether it's for large-scale model training or running multiple experiments. The ability to seamlessly integrate with cloud services further enhances flexibility and scalability, making dstack a valuable tool in modern AI development.

As dstack continues to evolve, we can anticipate even more powerful features and broader support for various hardware and software configurations. This continuous development promises to further solidify dstack's position as a versatile and indispensable tool for managing AI infrastructure, both on-prem and in the cloud.