NVIDIA MPS

NVIDIA's Multi-Process Service (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.

Follow these steps to configure MPS on Vista for optimized multi-process workflows:

  1. Configure Environment Variables

    Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's /tmp directory. The /tmp directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:

    # Set MPS environment variables
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

    To retain these logs for later analysis, specify directories in the $SCRATCH, $WORK, or $HOME file systems instead of /tmp, as in the sketch below.
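
    For example, a minimal sketch that keeps the pipes node-local but writes logs to a per-job directory in $SCRATCH (the mps-logs path is illustrative, not a required name):

    # Keep pipes on the node-local /tmp; retain logs in $SCRATCH
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID
    mkdir -p $CUDA_MPS_LOG_DIRECTORY   # on multi-node jobs, consider per-node subdirectories to avoid log collisions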

  2. Launch MPS Control Daemon

    Use ibrun to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node:

    # Launch MPS daemon on all nodes
    export TACC_TASKS_PER_NODE=1     # Force one task per node
    ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
    unset TACC_TASKS_PER_NODE        # Reset to default task distribution
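
    Optionally, you can verify that a control daemon is up on every node before proceeding. A quick sanity check (pgrep matches the daemon's process name, which the kernel truncates to 15 characters):

    # Optional: confirm the MPS control daemon is running on each node
    export TACC_TASKS_PER_NODE=1
    ibrun -np $SLURM_NNODES bash -c 'echo "$(hostname): $(pgrep -c nvidia-cuda-mps) MPS process(es)"'
    unset TACC_TASKS_PER_NODE
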
  3. Submit Your GPU Job

    After enabling MPS, run your CUDA application as usual. For example:

    ibrun ./your_cuda_executable

  4. Optional: Quit MPS daemon on all nodes

    export TACC_TASKS_PER_NODE=1     # Force 1 task/node
    ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
    unset TACC_TASKS_PER_NODE

Sample Job Script

A job script incorporating the elements above might look like this:

#!/bin/bash
#SBATCH -J mps_gpu_job           # Job name
#SBATCH -o mps_job.%j.out        # Output file (%j = job ID)
#SBATCH -t 01:00:00              # Wall time (1 hour)
#SBATCH -N 2                     # Number of nodes
#SBATCH -n 8                     # Total tasks (4 per node)
#SBATCH -p gh                    # GPU partition (modify as needed)
#SBATCH -A your_project          # Project allocation

# 1. Configure environment
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# 2. Launch MPS daemon on all nodes
echo "Starting MPS daemon..."
export TACC_TASKS_PER_NODE=1     # Force 1 task/node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE
sleep 5                          # Wait for daemons to initialize

# 3. Run your CUDA application
echo "Launching application..."
ibrun ./your_cuda_executable     # Replace with your executable
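
If you also want the script to shut down MPS after the application finishes (optional step 4 above), append the same teardown commands to the end of the script:

# 4. (Optional) Quit MPS daemon on all nodes
export TACC_TASKS_PER_NODE=1     # Force 1 task/node
ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
unset TACC_TASKS_PER_NODE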

Notes on Performance

MPS is particularly effective for workloads characterized by:

  • Fine-grained GPU operations (many small kernel launches)
  • Concurrent processes sharing the same GPU
  • Underutilized GPU resources in single-process workflows

You can verify performance gains for your use case with the following command, which monitors the node your job is running on (e.g., c608-052):

login1$ ssh c608-052
c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
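
If you prefer an end-to-end comparison over instantaneous utilization, a minimal sketch is to time two concurrent copies of your application with and without MPS. This assumes a single-node job and reuses the ./your_cuda_executable placeholder from above; comment out the two MPS lines to get the no-MPS baseline:

# A/B test within a single-node job: time two concurrent runs
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d                 # comment out for the no-MPS baseline
time ( ./your_cuda_executable & ./your_cuda_executable & wait )
echo quit | nvidia-cuda-mps-control        # comment out for the no-MPS baseline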

The side-by-side plots in Figure 1 illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that do not load the GPU as heavily as Amber.

Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.