NVIDIA MPS
NVIDIA's Multi-Process Service (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS is most beneficial when a single process cannot fully saturate the GPU's compute capacity.
Follow these steps to configure MPS on Vista for optimized multi-process workflows:
1. Configure Environment Variables

Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's /tmp directory. The /tmp directory is ephemeral and is cleared when a job ends or a node reboots. Add these lines to your job script or shell session:

# Set MPS environment variables
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

To retain these logs for later analysis, specify directories in the $SCRATCH, $WORK, or $HOME file systems instead of /tmp.
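As a sketch of that persistent-log alternative, the snippet below keeps the pipes in node-local /tmp but redirects the logs to a per-job directory under $SCRATCH. The mps-logs path is a hypothetical name, and the fallback defaults are only there so the snippet runs outside a Slurm job; any writable path in $SCRATCH, $WORK, or $HOME works:

```shell
# Pipes can stay in node-local /tmp; only the logs need a persistent home.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
# Hypothetical per-job log directory (defaults are a safety net for testing
# outside a job, where $SCRATCH and $SLURM_JOB_ID may be unset).
export CUDA_MPS_LOG_DIRECTORY=${SCRATCH:-/tmp}/mps-logs/${SLURM_JOB_ID:-manual}
mkdir -p "$CUDA_MPS_LOG_DIRECTORY"   # create it before the daemon starts
```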
2. Launch MPS Control Daemon

Use ibrun to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node:

# Launch MPS daemon on all nodes
export TACC_TASKS_PER_NODE=1    # Force one task per node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE       # Reset to default task distribution
3. Submit Your GPU Job

After enabling MPS, run your CUDA application as usual. For example:

ibrun ./your_cuda_executable
4. Optional: Quit MPS Daemon on All Nodes

export TACC_TASKS_PER_NODE=1    # Force one task per node
ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
unset TACC_TASKS_PER_NODE
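To confirm the daemon is up on a node, you can query it interactively. This is a sketch, not part of the required workflow: get_server_list is a standard nvidia-cuda-mps-control command, and the PATH check is only there so the snippet degrades gracefully on nodes without MPS installed:

```shell
# Ask the control daemon which MPS server processes are running.
# An empty list is normal before the first CUDA client connects.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    echo get_server_list | nvidia-cuda-mps-control
else
    echo "nvidia-cuda-mps-control not found on this node"
fi
```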
Sample Job Script
A job script incorporating the above elements might look like this:
#!/bin/bash
#SBATCH -J mps_gpu_job # Job name
#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
#SBATCH -t 01:00:00 # Wall time (1 hour)
#SBATCH -N 2 # Number of nodes
#SBATCH -n 8 # Total tasks (4 per node)
#SBATCH -p gh # GPU partition (modify as needed)
#SBATCH -A your_project # Project allocation
# 1. Configure environment
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# 2. Launch MPS daemon on all nodes
echo "Starting MPS daemon..."
export TACC_TASKS_PER_NODE=1 # Force 1 task/node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE
sleep 5 # Wait for daemons to initialize
# 3. Run your CUDA application
echo "Launching application..."
ibrun ./your_cuda_executable # Replace with your executable
Notes on Performance
MPS is particularly effective for workloads characterized by:
- Fine-grained GPU operations (many small kernel launches)
- Concurrent processes sharing the same GPU
- Underutilized GPU resources in single-process workflows
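As a minimal sketch of the concurrent-process pattern MPS targets, the loop below starts several copies of one executable on the same node. BENCH and NPROCS are placeholders for your own binary and process count; within a batch job you would normally launch the copies with ibrun instead of shell backgrounding:

```shell
# Hypothetical concurrent launch: with MPS enabled, kernels from all copies
# are scheduled onto the GPU cooperatively rather than serialized per process.
BENCH=${BENCH:-./your_cuda_executable}   # placeholder executable
NPROCS=${NPROCS:-4}                      # placeholder process count
for i in $(seq 1 "$NPROCS"); do
    "$BENCH" >/dev/null 2>&1 &           # run each copy in the background
done
wait                                     # block until all copies finish
echo "all $NPROCS processes finished"
```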
To verify performance gains for your use case, use the following command to monitor the node your job is running on (e.g., c608-052):
login1$ ssh c608-052
c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
The side-by-side plots in Figure 1 illustrate the performance gain obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the CPU-side setup cost (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.
