Vista User Guide
Last update: October 30, 2024
Notices
- Important: Please note TACC's new SU charge policy. (09/20/2024)
- Subscribe to Vista User News. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)
Introduction
TACC's new AI-centric system, Vista, is in full production for the open science community. Vista serves as a bridge from Frontera to Horizon, the primary system of the U.S. NSF Leadership-Class Computing Facility (LCCF), and marks a departure from x86-based architecture to one with CPUs based on the Advanced RISC Machines (ARM) architecture. Vista expands the Frontera project's support of Machine Learning and GPU-enabled applications with a system based on the NVIDIA Grace Hopper architecture, and provides a path to more power-efficient computing with NVIDIA's Grace Grace ARM CPUs.
The Grace Hopper Superchip introduces a novel architecture that combines the GPU and CPU in one module. This technology removes the bottleneck of the PCIe bus by connecting the CPU and GPU directly with NVLINK and exposing the CPU and GPU memory space as separate NUMA nodes. This allows the programmer to easily access CPU or GPU memory from either device. This greatly reduces the programming complexity of GPU programs while providing increased bandwidth and reduced latency between CPU and GPU.
The Grace Superchip connects two 72-core Grace CPUs using the same NVLINK technology used in the Grace Hopper Superchip to provide 144 ARM cores in two NUMA nodes. Using LPDDR memory, each Superchip offers over 850 GiB/s of memory bandwidth and up to 7 TFlops of double precision performance.
Vista is funded by the National Science Foundation (NSF) via a supplement to the Computing for the Endless Frontier award, Award Abstract #1818253. Please reference TACC when providing any citations.
System Architecture
Vista Topology
Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in a two-level fat-tree topology, as illustrated in Figure 1 below.
The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores and provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch; in total, each rack has sixty-four 200 Gb/s uplinks to its NDR rack shelf switch.
The Grace-Hopper (GH) subsystem consists of nodes based on the GH200 Grace-Hopper Superchip. Each GH node contains an NVIDIA H100 GPU with 96 GiB of HBM3 memory and a Grace CPU with 120 GiB of LPDDR5X memory and 72 cores. A GH node provides 34 TFlops of FP64 performance and 1979 TFlops of FP16 performance for ML workflows on the H100 chip. The GH subsystem is housed in 19 racks, each containing 32 Grace-Hopper (GH) nodes. These nodes connect via an NVIDIA InfiniBand 400 Gb/s fabric to an NVIDIA Quantum-2 MQM9790 NDR switch, which provides sixty-four 400 Gb/s InfiniBand ports. Each rack has thirty-two 400 Gb/s uplinks to its NDR rack shelf switch, so the GH nodes have twice the per-node network bandwidth of the GG nodes.
Each top rack shelf switch in all racks connects to sixteen core switches via dual-400G cables. In total, Vista contains 256 GG nodes and 600 GH nodes. Both sets of nodes are connected with NDR fabric to two local file systems, $HOME and $SCRATCH. These are NFS-based flash file systems from VAST Data. The $HOME file system is designed as a small permanent storage area; it is quota'd and backed up daily. The $SCRATCH file system is designed for short-term use from many nodes; it is not quota'd but may be purged as needed. These file systems are connected to the management switch, which in turn is fully connected to the core network switches. The $WORK file system is a global Lustre file system connected to all of the TACC HPC resources; it is connected to Vista via LNet routers.
Grace Grace Compute Nodes
Vista hosts 256 "Grace Grace" (GG) nodes with 144 cores each. Each GG node provides a performance increase of 1.5-2x over Stampede3's CLX nodes due to increased core count and increased memory bandwidth. Each GG node provides over 7 TFlops of double precision performance and 850 GiB/s of memory bandwidth.
Table 1. GG Specifications
Specification | Value |
---|---|
CPU: | NVIDIA Grace CPU Superchip |
Total cores per node: | 144 cores on two sockets (2 x 72 cores) |
Hardware threads per core: | 1 |
Hardware threads per node: | 2x72 = 144 |
Clock rate: | 3.4 GHz |
Memory: | 237 GB LPDDR |
Cache: | 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket. Each socket can cache up to 186 MB (sum of L2 and L3 capacity). |
Local storage: | 286 GB /tmp partition |
DRAM: | LPDDR5 |
Grace Hopper Compute Nodes
Vista hosts 600 Grace Hopper (GH) nodes. Each GH node has one H100 GPU with 96 GB of HBM3 memory and one Grace CPU with 116 GB of LPDDR memory. The GH node provides 34 TFlops of FP64 performance and 1979 TFlops of FP16 performance for ML workflows on the H100 chip.
Table 2. GH Specifications
Specification | Value |
---|---|
GPU: | NVIDIA H100 GPU |
GPU Memory: | 96 GB HBM3 |
CPU: | NVIDIA Grace CPU |
Total cores per node: | 72 cores on one socket |
Hardware threads per core: | 1 |
Hardware threads per node: | 1x72 = 72 |
Clock rate: | 3.1 GHz |
Memory: | 116 GB LPDDR5 |
Cache: | 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket. Each socket can cache up to 186 MB (sum of L2 and L3 capacity). |
Local storage: | 286 GB /tmp partition |
DRAM: | LPDDR5 |
Login Nodes
The Vista login nodes are NVIDIA Grace Grace (GG) nodes, each with 144 cores on two sockets (72 cores/socket) with 237 GB of LPDDR.
Network
The interconnect is based on Mellanox NDR technology with full NDR (400 Gb/s) connectivity between the switches and the GH GPU nodes and with NDR200 (200 Gb/s) connectivity to the GG compute nodes. A fat tree topology connects the compute nodes and the GPU nodes within separate trees. Both sets of nodes are connected with NDR to the $HOME and $SCRATCH file systems.
File Systems
Vista uses a shared VAST file system for the $HOME and $SCRATCH directories.
Important
Vista's $HOME and $SCRATCH file systems are NOT Lustre file systems and do not support setting a stripe count or stripe size.
As with Stampede3, the $WORK file system is also mounted. Unlike $HOME and $SCRATCH, the $WORK file system is a Lustre file system and supports Lustre's lfs commands. All three file systems, $HOME, $SCRATCH, and $WORK, are available from all Vista nodes. The /tmp partition is also available to users but is local to each node. The $WORK file system is available on most other TACC HPC systems as well.
Table 3. File Systems
File System | Type | Quota | Key Features |
---|---|---|---|
$HOME | VAST | 23 GB, 500,000 files | Not intended for parallel or high-intensity file operations. Backed up regularly. |
$WORK | Lustre | 1 TB, 3,000,000 files across all TACC systems | Not intended for parallel or high-intensity file operations. Not backed up. See the Stockyard system description for more information. |
$SCRATCH | VAST | No quota. Overall capacity ~10 PB. | Not backed up. Files are subject to purge if access time* is more than 10 days old. See TACC's Scratch File System Purge Policy below. |
Scratch File System Purge Policy
Warning
The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed* in ten days are subject to purge. Deliberately modifying file access time (using any method, tool, or program) for the purpose of circumventing purge policies is prohibited.
*The operating system updates a file's access time when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing on a compute node does update the access time. This approach helps us distinguish between routine management tasks (e.g. tar, scp) and production use. Use the command ls -ul to view access times.
Running Jobs
Slurm Partitions (Queues)
Vista's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.
Important
Queue limits are subject to change without notice.
TACC Staff will occasionally adjust the QOS settings in order to ensure fair scheduling for the entire user community.
Use TACC's qlimits utility to see the latest queue configurations.
Table 4. Production Queues
Queue Name | Node Type | Max Nodes per Job (assoc'd cores) | Max Duration | Max Jobs in Queue | Charge Rate (per node-hour) |
---|---|---|---|---|---|
gg | Grace/Grace | 32 nodes (4608 cores) | 48 hrs | 20 | 0.33 SU |
gh | Grace/Hopper | 64 nodes (4608 cores/64 GPUs) | 48 hrs | 20 | 1 SU |
gh-dev | Grace/Hopper | 8 nodes (576 cores) | 2 hrs | 8 | 1 SU |
Job Accounting
New TACC SU Charging Policy
Important
Beginning October 1st, 2024, TACC will be implementing a new, minimum SU-charge policy for all jobs run on our systems:
All running jobs will be charged a minimum of 15 minutes of queue time regardless of actual runtime. All other queue factors will remain the same.
For example: a 2-node job in Frontera's rtx queue (charge rate 3 SUs per node-hour) that runs for one minute would be charged as follows:
2 nodes * 0.25 hrs * 3 SUs = 1.5 SUs
These changes are necessary to ensure equal access to the queues for all users as TACC's user base expands. Larger jobs may be the most affected and we encourage users to do thorough testing at smaller node counts before increasing the size of their jobs in order to reduce the impact of this change.
Like all TACC systems, Vista's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). For any given job, the total cost in SUs is the use of one compute node for one hour of wall clock time plus any charges or discounts for the use of specialized queues, e.g. Stampede3's pvc queue, Lonestar6's gpu-a100 queue, and Frontera's flex queue. The queue charge rates are determined by the supply and demand for that particular queue or type of node used and are subject to change.
Vista SUs billed = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)
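For example, using the charge rates in Table 4, a 4-node job in the gg queue that runs for 10 hours is charged 4 nodes x 10 hours x 0.33 SU = 13.2 SUs.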
The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. If your job finishes early and exits properly, Slurm will release the nodes back into the pool of available nodes. Your job will only be charged for as long as you are using the nodes.
Note
TACC does not implement node-sharing on any compute resource. Each Vista node can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all the node's cores whether or not all cores are used.
Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.
Tip
To display a summary of your TACC project balances and disk quotas at any time, execute:
login1$ /usr/local/etc/taccinfo   # Generally more current than balances displayed on the portals.
Submitting Batch Jobs with sbatch
Use Slurm's sbatch
command to submit a batch job to one of the Vista queues:
login1$ sbatch myjobscript
where myjobscript is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.
In your job script you (1) use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs); and then (2) use shell commands to specify what work you're going to do once your job begins. There are many possibilities: you might elect to launch a single application, you might want to accomplish several steps in a workflow, or you may even choose to launch more than one application at the same time. The details will vary, but your own job script will probably include at least one launch line that is a variation of one of the examples described here.
Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your applications(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm --export option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.
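For illustration, here is a minimal sketch of a Vista batch script for an MPI application in the gg queue; the job name, executable, and allocation name are placeholders, and the resource requests mirror the ibrun example later in this section.
#!/bin/bash
#SBATCH -J myjob              # job name (placeholder)
#SBATCH -o myjob.o%j          # output and error file name (%j expands to the job ID)
#SBATCH -p gg                 # queue (partition)
#SBATCH -N 4                  # total number of nodes requested
#SBATCH -n 576                # total number of MPI tasks (144 per GG node)
#SBATCH -t 02:00:00           # run time (hh:mm:ss)
#SBATCH -A myproject          # allocation name (placeholder; only needed with multiple projects)

module list                   # record the modules inherited from your login environment
ibrun ./myprogram input1      # launch the MPI executable built with those modules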
Table 8 below describes some of the most common sbatch command options. Slurm directives begin with #SBATCH; most have a short form (e.g. -N) and a long form (e.g. --nodes). You can pass options to sbatch using either the command line or the job script; most users find that the job script is the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases #!/bin/bash or #!/bin/csh is the right choice. Avoid #!/bin/sh (its startup behavior can lead to subtle problems on Vista), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Vista.
By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the -o option. To save stdout (standard out) and stderr (standard error) to separate files, specify both -o and -e options.
Tip
The maximum runtime for any individual job is 48 hours. However, if you have good checkpointing implemented, you can easily chain jobs such that the outputs of one job are the inputs of the next, effectively running indefinitely for as long as needed. See Slurm's -d option and the sketch below.
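For example, a minimal sketch of chaining two jobs with a dependency (the job script names and job ID shown are illustrative):
login1$ sbatch job1.slurm
Submitted batch job 858811
login1$ sbatch -d afterok:858811 job2.slurm   # job2 starts only after job 858811 finishes successfully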
Table 8. Common sbatch Options

Option | Argument | Comments |
---|---|---|
-A | projectid | Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects. |
-a or --array | =tasklist | Vista supports Slurm job arrays. See the Slurm documentation on job arrays for more information. |
-d= | afterok:jobid | Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes. |
--export= | N/A | Avoid this option on Vista. Using it is rarely necessary and can interfere with the way the system propagates your environment. |
--gres | | TACC does not support this option. |
--gpus-per-task | | TACC does not support this option. |
-p | queue_name | Submits to the queue (partition) designated by queue_name. |
-J | job_name | Job name. |
-N | total_nodes | Required. Define the resources you need by specifying either: (1) -N and -n; or (2) -N and --ntasks-per-node. |
-n | total_tasks | Total number of MPI tasks in this job. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as -N. |
--ntasks-per-node or --tasks-per-node | tasks_per_node | Number of MPI tasks per node. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1. |
-t | hh:mm:ss | Required. Wall clock time for job. |
--mail-type= | begin, end, fail, or all | Specify when user notifications are to be sent (one option per line). |
--mail-user= | email_address | Specify the email address to use for notifications. Use with the --mail-type= flag above. |
-o | output_file | Direct job standard output to output_file (without the -e option, error output goes to this file as well). |
-e | error_file | Direct job error output to error_file. |
--mem | N/A | Not available. If you attempt to use this option, the scheduler will not accept your job. |
Launching Applications
The primary purpose of your job script is to launch your research application. How you do so depends on several factors, especially (1) the type of application (e.g. MPI, OpenMP, serial), and (2) what you're trying to accomplish (e.g. launch a single instance, complete several steps in a workflow, run several applications simultaneously within the same job). While there are many possibilities, your own job script will probably include a launch line that is a variation of one of the examples described in this section:
Note that the following examples demonstrate launching within a Slurm job script or an idev session. Do not launch jobs on the login nodes.
One Serial Application
To launch a serial application, simply call the executable. Specify the path to the executable in either the $PATH environment variable or in the call to the executable itself:
myprogram # executable in a directory listed in $PATH
$WORK/apps/myprov/myprogram # explicit full path to executable
./myprogram # executable in current directory
./myprogram -m -k 6 input1 # executable with notional input options
One Multi-Threaded Application
Launch a threaded application the same way. Be sure to specify the number of threads. Note that the default OpenMP thread count is 1.
export OMP_NUM_THREADS=144 # 144 total OpenMP threads (1 per GG core)
./myprogram
One MPI Application
To launch an MPI application, use the TACC-specific MPI launcher ibrun, which is a Vista-aware replacement for generic MPI launchers like mpirun and mpiexec. In most cases the only arguments you need are the name of your executable followed by any arguments your executable needs. When you call ibrun without other arguments, your Slurm #SBATCH directives will determine the number of ranks (MPI tasks) and number of nodes on which your program runs.
#SBATCH -N 4
#SBATCH -n 576
ibrun ./myprogram # ibrun uses the #SBATCH directives to properly allocate nodes and tasks
To use ibrun interactively, say within an idev session, you can specify:
login1$ idev -N 2 -n 80 -p gg
c123-456$ ibrun ./myprogram # ibrun uses idev's arguments to properly allocate nodes and tasks
One Hybrid (MPI+Threads) Application
When launching a single application you generally don't need to worry about affinity: both OpenMPI and MVAPICH2 will distribute and pin tasks and threads in a sensible way.
export OMP_NUM_THREADS=8 # 8 OpenMP threads per MPI rank
ibrun ./myprogram # use ibrun instead of mpirun or mpiexec
As a practical guideline, the product of $OMP_NUM_THREADS and the maximum number of MPI processes per node should not be greater than the total number of cores available per node (GG nodes have 144 cores, GH nodes have 72 cores).
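For example, a sketch of a hybrid layout on two GG nodes, assuming a notional ./myprogram: 18 MPI ranks per node with 8 threads per rank uses all 144 cores of each node.
#SBATCH -N 2                       # 2 GG nodes
#SBATCH --ntasks-per-node=18       # 18 MPI ranks per node

export OMP_NUM_THREADS=8           # 18 ranks x 8 threads = 144 cores per node
ibrun ./myprogram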
MPI Applications - Consecutive
To run one MPI application after another (or any sequence of commands one at a time), simply list them in your job script in the order in which you'd like them to execute. When one application/command completes, the next one will begin.
module load git
module list
./preprocess.sh
ibrun ./myprogram input1 # runs after preprocess.sh completes
ibrun ./myprogram input2 # runs after previous MPI app completes
MPI Application - Concurrent
Coming soon.
More than One OpenMP Application Running Concurrently
You can also run more than one OpenMP application simultaneously on a single node, but you will need to distribute and pin tasks appropriately. In the example below, numactl -C specifies virtual CPUs (hardware threads). According to the numbering scheme for GG cores, CPU (hardware thread) numbers 0-143 are spread across the 144 cores, 1 thread per core.
export OMP_NUM_THREADS=2
numactl -C 0-1 ./myprogram inputfile1 & # HW threads (hence cores) 0-1. Note ampersand.
numactl -C 2-3 ./myprogram inputfile2 & # HW threads (hence cores) 2-3. Note ampersand.
wait
Interactive Sessions
Interactive Sessions with idev and srun
TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. To launch a thirty-minute session on a single node in the development queue, simply execute:
login1$ idev
You'll then see output that includes the following excerpts:
...
-----------------------------------------------------------------
Welcome to the Vista Supercomputer
-----------------------------------------------------------------
...
-> After your `idev` job begins to run, a command prompt will appear,
-> and you can begin your interactive development session.
-> We will report the job status every 4 seconds: (PD=pending, R=running).
->job status: PD
->job status: PD
...
c449-001$
The job status messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c449-001). If this is the first time you launch idev, the prompts may invite you to choose a default project and a default number of tasks per node for future idev sessions.
For command line options and other information, execute idev --help. It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:
login1$ idev -p gg -N 2 -n 8 -m 150 # gg queue, 2 nodes, 8 total tasks, 150 minutes
For more information see the idev documentation.
Interactive Sessions using ssh
If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: login to a compute node that you own, execute top, then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.
There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:
login1$ squeue -u bjones
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
858811 gh-dev idv46796 bjones R 0:39 1 c448-004
login1$ ssh c448-004
...
c448-004$
Slurm Environment Variables
Be sure to distinguish between internal Slurm replacement symbols (e.g. %j described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute env | grep SLURM from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like %j only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an #SBATCH directive.
Warning
For example, the following directive will not work the way you might think:
#SBATCH -o myMPI.o${SLURM_JOB_ID} # incorrect
Tip
Instead, use the following directive:
#SBATCH -o myMPI.o%j # "%j" expands to your job's numerical job ID
Similarly, you cannot use paths like $WORK or $SCRATCH in an #SBATCH directive.
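For example, a minimal sketch of the correct pattern: use %j in the #SBATCH directive, then use Slurm environment variables and paths like $SCRATCH only in the shell portion of the script (the run directory below is a placeholder).
#SBATCH -o myMPI.o%j                                  # correct: %j is a Slurm replacement symbol

cd $SCRATCH/my_run_dir                                # correct: environment variables work in shell commands
echo "Job $SLURM_JOB_ID is running on $SLURM_JOB_NODELIST"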
For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself (man slurm) and its individual commands (e.g. man sbatch); as well as numerous other online resources.
Machine Learning
Vista is well equipped to provide researchers with the latest in Machine Learning frameworks, for example, PyTorch. The installation process will be a little different depending on whether you are using single or multiple nodes. Below we detail how to use PyTorch on our systems for both scenarios.
Running PyTorch (Single Node)
Using the System PyTorch
Follow these steps to use Vista's system PyTorch with a single GPU node.
- Request a single compute node in Vista's gh-dev queue using the idev utility:

  login1.vista(76)$ idev -p gh-dev -N 1 -n 1 -t 1:00:00

- Load modules:

  c123-456$ module load gcc cuda
  c123-456$ module load python3

- Launch the Python interpreter and check that you can import PyTorch and that it can utilize the GPU:

  import torch
  torch.cuda.is_available()
Installing PyTorch
Depending on your particular application, you may also need to install your own local copy of PyTorch. We recommend using a Python virtual environment to manage machine learning packages. Below we detail how to install PyTorch on our systems with a virtual environment:
- Request a single compute node in Vista's gh-dev queue using the idev utility:

  login1.vista(76)$ idev -p gh-dev -N 1 -n 1 -t 1:00:00

- Create a Python virtual environment:

  c123-456$ module load gcc cuda
  c123-456$ module load python3
  c123-456$ python3 -m venv /path/to/virtual-env-single-node   # (e.g., $SCRATCH/python-envs/test)

- Activate the Python virtual environment:

  c123-456$ source /path/to/virtual-env-single-node/bin/activate

- Install PyTorch:

  c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Testing PyTorch Installation
To test your installation of PyTorch we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training. See PyTorch's documentation: Distributed Data Parallel in PyTorch. These tutorials include several scripts set up to run single-node training and multi-node training.
- Download the benchmark:

  c123-456$ cd $SCRATCH   # or the directory on scratch where you want this repo to reside
  c123-456$ git clone https://github.com/pytorch/examples.git

- Run the benchmark on one node (1 GPU):

  c123-456$ python3 examples/distributed/ddp-tutorial-series/single_gpu.py 50 10
Running PyTorch (Multi-node)
To run multi-node jobs with Grace Hopper nodes on Vista you will need to use MPI-enabled Python. Follow these instructions to install and test these environments with MPI-enabled Python.
Using System PyTorch
Follow these steps to use Vista's system PyTorch with multiple GPU nodes.
- Request two compute nodes in Vista's gh-dev queue using the idev utility:

  login1.vista(76)$ idev -p gh-dev -N 2 -n 2 -t 1:00:00

- Load modules:

  c123-456$ module load gcc cuda
  c123-456$ module load python3_mpi

- Launch the Python interpreter and check that you can import PyTorch and that it can utilize the GPU:

  import torch
  torch.cuda.is_available()
Installing PyTorch
To run multi-node jobs with Grace Hopper nodes on Vista you will need to use MPI-enabled Python. Below we detail how to install PyTorch with MPI-enabled Python using a virtual environment:
- Request two nodes in the gh-dev queue using the idev utility:

  idev -N 2 -n 2 -p gh-dev -t 01:00:00

- Create a Python virtual environment:

  c123-456$ module load gcc cuda
  c123-456$ module load python3_mpi
  c123-456$ python3 -m venv /path/to/virtual-env-multi-node   # (e.g., $SCRATCH/python-envs/test)

- Activate the Python virtual environment:

  c123-456$ source /path/to/virtual-env-multi-node/bin/activate

- Now install PyTorch:

  c123-456$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
Testing PyTorch Installation
To test your installation of multi-node PyTorch, we supply a simple script ("test.py") below. Launch it with ibrun, passing the hostname of the master node as the first argument:
c123-456$ ibrun -np 2 python3 test.py c123-456   # here c123-456 is the master node's hostname
Python script ("test.py")
import os
import argparse
from mpi4py import MPI
import torch
import torch.distributed as dist

# use mpi4py to get the world size and the task's rank
WORLD_SIZE = MPI.COMM_WORLD.Get_size()
WORLD_RANK = MPI.COMM_WORLD.Get_rank()

# use the convention that gets the local rank based on how many
# GPUs there are on the node.
GPU_ID = WORLD_RANK % torch.cuda.device_count()
name = MPI.Get_processor_name()

def run(backend):
    tensor = torch.randn(10000, 10000)
    # Need to put tensor on a GPU device for nccl backend
    if backend == 'nccl':
        device = torch.device("cuda:{}".format(GPU_ID))
        tensor = tensor.to(device)
        print("Starting process on " + name + ":" + torch.cuda.get_device_name(GPU_ID))
    if WORLD_RANK == 0:
        for rank_recv in range(1, WORLD_SIZE):
            dist.send(tensor=tensor, dst=rank_recv)
            print('worker_{} sent data to Rank {}\n'.format(0, rank_recv))
    else:
        dist.recv(tensor=tensor, src=0)
        print('worker_{} has received data from rank {}\n'.format(WORLD_RANK, 0))

def init_processes(backend, master_address):
    print("World Rank: %s, World Size: %s, GPU_ID: %s" % (WORLD_RANK, WORLD_SIZE, GPU_ID))
    os.environ["MASTER_ADDR"] = master_address
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group(backend, rank=WORLD_RANK, world_size=WORLD_SIZE)
    run(backend)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("master_node", type=str)
    parser.add_argument("--backend", type=str, default="nccl", choices=['nccl', 'gloo'])
    args = parser.parse_args()
    backend = args.backend
    if torch.cuda.device_count() == 0:
        print("No gpu detected...switching to gloo for backend")
        backend = "gloo"
    init_processes(backend=backend, master_address=args.master_node)
    dist.destroy_process_group()
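In a batch job you need not hard-code the master node's hostname; a sketch (using the same test.py) that derives it from Slurm's node list:
export MASTER_NODE=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)   # first node in the allocation
ibrun -np 2 python3 test.py $MASTER_NODE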
Building Software
The phrase "building software" is a common way to describe the process of producing a machine-readable executable file from source files written in C, Fortran, CUDA, or some other programming language. In its simplest form, building software involves a simple, one-line call or short shell script that invokes a compiler. More typically, the process leverages the power of makefiles, so you can change a line or two in the source code, then rebuild in a systematic way only the components affected by the change. Increasingly, however, the build process is a sophisticated multi-step automated workflow managed by a special framework like autotools or cmake, intended to achieve a repeatable, maintainable, portable mechanism for installing software across a wide range of target platforms.
Important
TACC maintains a database of currently installed software packages and libraries across all HPC resources. Navigate to TACC's Software List to see where, or if, a particular package is already installed on a particular resource.
This section of the user guide does nothing more than introduce the big ideas with simple one-line examples. You will undoubtedly want to explore these concepts more deeply using online resources. You will quickly outgrow the examples here. We recommend that you master the basics of CMake and/or makefiles as quickly as possible: even the simplest computational research project will benefit enormously from the power and flexibility of a CMakefile or makefile-based build process.
NVIDIA Compilers
NVIDIA is the recommended and default compiler suite on Vista.
Here are simple examples that use the NVIDIA compiler to build an executable from source code:
$ nvc mycode.c # C source file; executable a.out
$ nvc main.c calc.c analyze.c # multiple source files
$ nvc mycode.c -o myexe # C source file; executable myexe
$ nvc++ mycode.cpp -o myexe # C++ source file
$ nvfortran mycode.f90 -o myexe # Fortran90 source file
Compiling a code that uses OpenMP would look like this:
$ nvc -mp mycode.c -o myexe # OpenMP
See the published NVIDIA documentation, available online at https://docs.nvidia.com/hpc-sdk//index.html.
GNU Compilers
The GNU foundation maintains a number of high quality compilers, including a compiler for C (gcc), C++ (g++), and Fortran (gfortran). The gcc compiler is the foundation underneath all three, and the term "gcc" often means the suite of these three GNU compilers.
Load a GCC module to access a recent version of the GNU compiler suite. Avoid using the GNU compilers that are available without a gcc module; those will be older versions based on the "system GCC" that comes as part of the Linux distribution.
Here are simple examples that use the GNU compilers to produce an executable from source code:
$ gcc mycode.c # C source file; executable a.out
$ gcc mycode.c -o myexe # C source file; executable myexe
$ g++ mycode.cpp -o myexe # C++ source file
$ gfortran mycode.f90 -o myexe # Fortran90 source file
$ gcc -fopenmp mycode.c -o myexe # OpenMP
Note that some compiler options are the same for both NVIDIA and GNU (e.g. -o), while others are different (e.g. -mp vs -fopenmp). Many options are available in one compiler suite but not the other. See the online GNU documentation for information on optimization flags and other GNU compiler options.
Compiling and Linking
Building an executable requires two separate steps: (1) compiling (generating a binary object file associated with each source file); and (2) linking (combining those object files into a single executable file that also specifies the libraries that executable needs). The examples in the previous section accomplish these two steps in a single call to the compiler. When building more sophisticated applications or libraries, however, it is often necessary or helpful to accomplish these two steps separately.
Use the -c ("compile") flag to produce object files from source files:
$ nvc -c main.c calc.c results.c
Barring errors, this command will produce object files main.o, calc.o, and results.o. Syntax for the NVIDIA and GNU compilers is similar. You can now link the object files to produce an executable file:
$ nvc main.o calc.o results.o -o myexe
The compiler calls a linker utility (usually /bin/ld) to accomplish this task. Again, syntax for other compilers is similar.
Include and Library Paths
Software often depends on pre-compiled binaries called libraries. When this is true, compiling usually requires using the -I option to specify paths to so-called header or include files that define interfaces to the procedures and data in those libraries. Similarly, linking often requires using the -L option to specify paths to the libraries themselves. Typical compile and link lines might look like this:
$ nvc -c main.c -I${WORK}/mylib/inc -I${TACC_HDF5_INC} # compile
$ nvc main.o -o myexe -L${WORK}/mylib/lib -L${TACC_HDF5_LIB} -lmylib -lhdf5 # link
On Vista, both the HDF5 and PHDF5 modules define the environment variables $TACC_HDF5_INC and $TACC_HDF5_LIB. Other module files define similar environment variables; see Using Modules to Manage Your Environment for more information.
The details of the linking process vary, and order sometimes matters. Much depends on the type of library: static (.a suffix; the library's binary code becomes part of the executable image at link time) versus dynamically-linked shared (.so suffix; the library's binary code is not part of the executable; it's located and loaded into memory at run time). However, the $LD_LIBRARY_PATH environment variable specifies the search path for dynamic libraries. For software installed at the system level, TACC's modules generally modify $LD_LIBRARY_PATH automatically. To see whether and how an executable named myexe resolves dependencies on dynamically linked libraries, execute ldd myexe.
MPI Programs
OpenMPI (module ompi) and MVAPICH (module mvapich) are the two MPI libraries available on Vista. After loading an ompi or mvapich module, compile and/or link using an MPI wrapper (mpicc, mpicxx, mpif90) in place of the compiler:
$ mpicc mycode.c -o myexe # C source, full build
$ mpicc -c mycode.c # C source, compile without linking
$ mpicxx mycode.cpp -o myexe # C++ source, full build
$ mpif90 mycode.f90 -o myexe # Fortran source, full build
These wrappers call the compiler with the options, include paths, and libraries necessary to produce an MPI executable for the MPI module you have loaded. To see the effect of a given wrapper, call it with the -show option:
$ mpicc -show # Show compile line generated by call to mpicc; similarly for other wrappers
Third-Party Software
See Building Third-Party Software in the Software at TACC guide.
Building for Performance
Compiler Options
When building software on Vista, we recommend using the most recent NVIDIA compiler and OpenMPI library available on Vista. The most recent versions may be newer than the defaults. Execute module spider nvidia and module spider ompi to see what's installed. When loading these modules you may need to specify version numbers explicitly (e.g. module load nvidia/24.5 and module load ompi/5.0).
Architecture-Specific Flags
The Grace architecture is based on an Arm design that uses Neoverse V2 cores. The Neoverse V2 cores support Arm's Scalable Vector Extension v2 (SVE2) and Advanced SIMD (NEON) technologies. Each core has four 128-bit functional units that support 8 64-bit FMA operations. To compile for this specific architecture, include the -tp neoverse-v2 compile option.
Normally, we do not recommend using the -fast option. In this case, however, since there is only one chip architecture on Vista and -fast does not enforce -static, it is safe to use the -fast option with the NVIDIA compilers. It will enable optimizations for the Neoverse V2 architecture.
You can also use the environment variable $TACC_VEC_FLAGS. With the NVIDIA compilers, this environment variable sets the following flags:
-Mvect=simd -fast -Mipa=fast,inline
If you use GNU compilers, you can optimize for the Grace architecture using the -mcpu=neoverse-v2 option. You can also use $TACC_VEC_FLAGS as with the NVIDIA compilers; for GNU it enables the following flags:
-O3 -mcpu=neoverse-v2
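For example, architecture-specific compile lines might look like the following, reusing the notional mycode.c from the earlier examples:
$ nvc -fast -tp neoverse-v2 mycode.c -o myexe     # NVIDIA compiler
$ gcc -O3 -mcpu=neoverse-v2 mycode.c -o myexe     # GNU compiler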
NVIDIA Performance Libraries (NVPL)
The NVIDIA Performance Libraries (NVPL) are a collection of high-performance mathematical libraries optimized for the NVIDIA Grace Armv9.0 architecture. These CPU-only libraries provide standard C and Fortran mathematical APIs, allowing HPC applications to achieve maximum performance on the Grace platform. The collection includes:
NVIDIA Documentation
- NVPL BLAS Documentation
- NVPL FFT Documentation
- NVPL LAPACK Documentation
- NVPL RAND Documentation
- NVPL ScaLAPACK Documentation
- NVPL Sparse Documentation
- NVPL Tensor Documentation
Consult the documents above for the details of each library and its API when building and linking codes. The libraries work with both the NVHPC and GCC compilers, as well as their corresponding MPI wrappers. All libraries support the OpenMP runtime libraries. Refer to the individual libraries' documentation for details and for API extensions supporting nested parallelism.
Compiler Examples
Example: A compile/link process on Vista may look like the following. This links the code against the NVPL FFT library using the GNU g++ compiler. The features in NVPL FFT are still evolving; please follow the latest NVPL FFT documentation.
$ module load nvpl
$ g++ mycode.cpp -I${TACC_NVPL_DIR}/include \
$ -L${TACC_NVPL_DIR}/lib \
-lnvpl_fftw \
-o myprogram
Example: This links the code against the NVPL OpenMP-threaded BLAS, LAPACK, and ScaLAPACK libraries with the 32-bit integer (LP64) interface, using the NVHPC mpif90 wrapper. The cluster (BLACS) support in the current NVPL release from NVHPC SDK 24.5 includes openmpi3, openmpi4, openmpi5, and mpich; choose the one that matches the MPI version used by mpif90.
$ module load nvpl
$ mpif90 -mp -I${TACC_NVPL_DIR}/include \
-L${TACC_NVPL_DIR}/lib \
-lnvpl_blas_lp64_gomp \
-lnvpl_lapack_lp64_gomp \
-lnvpl_blacs_lp64_openmpi5 \
-lnvpl_scalapack_lp64 \
mycode.f90
When linking with the NVHPC compiler, the convenience flags -Mnvpl and -Mscalapack are provided. As the behavior of these flags may change during active development, please refer to the latest NVHPC compiler guide for more details.
Using NVPL as BLAS/LAPACK with Third-Party Software
When your third-party software requires BLAS or LAPACK, we recommend that you use NVPL to supply this functionality. Replace generic instructions that include link options like -lblas or -llapack with the NVPL approach described above. Generally there is no need to download and install alternatives like OpenBLAS. However, since NVPL is a relatively new math library suite targeting aarch64, its interoperability with other software requiring a specific 32- or 64-bit integer interface or particular OpenMP runtime support is not yet fully tested. If you run into issues with NVPL and need alternative BLAS or LAPACK libraries, OpenBLAS-based versions are available as part of the NVHPC compiler libraries.
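For example, a sketch of replacing generic BLAS/LAPACK link options with the threaded LP64 NVPL libraries shown above (the source file name is a placeholder):
$ module load nvpl
$ gfortran -fopenmp myapp.f90 -L${TACC_NVPL_DIR}/lib -lnvpl_lapack_lp64_gomp -lnvpl_blas_lp64_gomp -o myapp   # instead of -llapack -lblas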
Controlling Threading in NVPL
All NVPL libraries support both the GCC and NVHPC OpenMP runtime libraries. See the individual libraries' documentation for details and for API extensions supporting nested parallelism. NVPL libraries do not explicitly link any particular OpenMP runtime; they rely on runtime loading of the OpenMP library as determined by the application and environment. Applications linked to NVPL should always use at runtime the same OpenMP distribution the application was compiled with. Mixing OpenMP distributions between compile time and runtime may result in anomalous performance. Please note that the default library linked with the -Mnvpl flag is single-threaded as of NVHPC 24.5; the -mp flag is needed to link the threaded version.
NVIDIA HPC modules provide a libgomp.so symlink to libnvomp.so. This symlink will be on LD_LIBRARY_PATH if NVHPC environment modules are loaded. Use ldd to ensure that applications built with GCC do not accidentally load the libgomp.so symlink from the HPC SDK via LD_LIBRARY_PATH. Use libnvomp.so if and only if the application was built with the NVHPC compilers.
$OMP_NUM_THREADS defaults to 1 on TACC systems. If you use the default value you will get no thread-based parallelism from NVPL. Set the environment variable $OMP_NUM_THREADS to control the number of threads for optimal performance.
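For example, to give a threaded, NVPL-linked program one thread per core on a GH node (the executable name is a placeholder):
export OMP_NUM_THREADS=72    # one thread per each of the 72 cores on a GH node
./myprogram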
Using NVPL with MATLAB, Python, and R
TACC's MATLAB, Python, and R modules need BLAS, LAPACK, and other math libraries for performance. How to use NVPL with them is under investigation; we will update this section as more information becomes available.
Help Desk
Important
Submit a help desk ticket at any time via the TACC User Portal. Be sure to include "Vista" in the Resource field.
TACC Consulting operates from 8am to 5pm CST, Monday through Friday, except for holidays. Help the consulting staff help you by following these best practices when submitting tickets.
- Do your homework before submitting a help desk ticket. What do the user guide and other documentation say? Search the internet for key phrases in your error logs; that's probably what the consultants answering your ticket are going to do. What have you changed since the last time your job succeeded?
- Describe your issue as precisely and completely as you can: what you did, what happened, verbatim error messages, other meaningful output.
Tip
When appropriate, include as much meta-information about your job and workflow as possible including:
- directory containing your build and/or job script
- all modules loaded
- relevant job IDs
- any recent changes in your workflow that could affect or explain the behavior you're observing.
- Subscribe to Vista User News. This is the best way to keep abreast of maintenance schedules, system outages, and other general interest items.
- Have realistic expectations. Consultants can address system issues and answer questions about Vista. But they can't teach parallel programming in a ticket, and may know nothing about the package you downloaded. They may offer general advice that will help you build, debug, optimize, or modify your code, but you shouldn't expect them to do these things for you.
- Be patient. It may take a business day for a consultant to get back to you, especially if your issue is complex. It might take an exchange or two before you and the consultant are on the same page. If the admins disable your account, it's not punitive. When the file system is in danger of crashing, or a login node hangs, they don't have time to notify you before taking action.