Vista User Guide
Last update: October 30, 2024
Notices
- Important: Please note TACC's new SU charge policy. (09/20/2024)
- Subscribe to Vista User News. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)
Introduction
TACC's new AI-centric system, Vista, is in full production for the open science community. Vista serves as a bridge from Frontera to Horizon, the primary system of the U.S. NSF Leadership-Class Computing Facility (LCCF), and marks a departure from x86-based architecture to one with CPUs based on the Advanced RISC Machines (ARM) architecture. Vista expands the Frontera project's support of Machine Learning and GPU-enabled applications with a system based on the NVIDIA Grace Hopper architecture, and provides a path to more power-efficient computing with NVIDIA's Grace Grace ARM CPUs.
The Grace Hopper Superchip introduces a novel architecture that combines the GPU and CPU in one module. This technology removes the bottleneck of the PCIe bus by connecting the CPU and GPU directly with NVLINK and exposing the CPU and GPU memory space as separate NUMA nodes. This allows the programmer to easily access CPU or GPU memory from either device. This greatly reduces the programming complexity of GPU programs while providing increased bandwidth and reduced latency between CPU and GPU.
The Grace Superchip connects two 72-core Grace CPUs using the same NVLINK technology used in the Grace Hopper Superchip to provide 144 ARM cores in two NUMA nodes. Using LPDDR memory, each Superchip offers over 850 GiB/s of memory bandwidth and up to 7 TFlops of double precision performance.
Vista is funded by the National Science Foundation (NSF) via a supplement to the Computing for the Endless Frontier award, Award Abstract #1818253. Please reference TACC when providing any citations.
System Architecture
Vista Topology
Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in a two-level fat-tree topology, as illustrated in Figure 1 below.
The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores and provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch; in total, each rack has sixty-four 200 Gb/s uplinks to its NDR rack shelf switch.
The Grace-Hopper (GH) subsystem consists of nodes based on the GH200 Grace-Hopper Superchip. Each GH node contains an NVIDIA H100 GPU with 96 GiB of HBM3 memory and a Grace CPU with 120 GiB of LPDDR5X memory and 72 cores. A GH node provides 34 TFlops of FP64 performance and 1979 TFlops of FP16 performance for ML workflows on the H100 chip. The GH subsystem is housed in 19 racks, each containing 32 Grace-Hopper (GH) nodes. These nodes connect via an NVIDIA InfiniBand 400 Gb/s fabric to an NVIDIA Quantum-2 MQM9790 NDR switch, which provides sixty-four 400 Gb/s InfiniBand ports. Each rack has thirty-two 400 Gb/s uplinks to its NDR rack shelf switch, so the GH nodes have twice the per-node network bandwidth of the GG nodes.
Each top rack shelf switch in all racks connects to sixteen core switches via dual-400G cables. In total, Vista contains 256 GG nodes and 600 GH nodes. Both sets of nodes are connected with NDR fabric to two local file systems, $HOME and $SCRATCH. These are NFS-based flash file systems from VAST Data. The $HOME file system is designed as a small permanent storage area; it is quota'd and backed up daily. The $SCRATCH file system is designed for short-term use from many nodes; it is not quota'd but may be purged as needed. These file systems are connected to the management switch, which in turn is fully connected to the core network switches. The $WORK file system is a global Lustre file system connected to all of the TACC HPC resources; it is connected to Vista via LNet routers.
Grace Grace Compute Nodes
Vista hosts 256 "Grace Grace" (GG) nodes with 144 cores each. Each GG node provides a performance increase of 1.5-2x over Stampede3's CLX nodes due to increased core count and increased memory bandwidth. Each GG node provides over 7 TFlops of double precision performance and 850 GiB/s of memory bandwidth.
Table 1. GG Specifications
Specification | Value |
---|---|
CPU: | NVIDIA Grace CPU Superchip |
Total cores per node: | 144 cores on two sockets (2 x 72 cores) |
Hardware threads per core: | 1 |
Hardware threads per node: | 2x72 = 144 |
Clock rate: | 3.4 GHz |
Memory: | 237 GB LPDDR |
Cache: | 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket. Each socket can cache up to 186 MB (sum of L2 and L3 capacity). |
Local storage: | 286 GB /tmp partition |
DRAM: | LPDDR5 |
Grace Hopper Compute Nodes
Vista hosts 600 Grace Hopper (GH) nodes. Each GH node has one H100 GPU with 96 GB of HBM3 memory and one Grace CPU with 116 GB of LPDDR memory. The GH node provides 34 TFlops of FP64 performance and 1979 TFlops of FP16 performance for ML workflows on the H100 chip.
Table 2. GH Specifications
Specification | Value |
---|---|
GPU: | NVIDIA H100 GPU |
GPU Memory: | 96 GB HBM3 |
CPU: | NVIDIA Grace CPU |
Total cores per node: | 72 cores on one socket |
Hardware threads per core: | 1 |
Hardware threads per node: | 1x72 = 72 |
Clock rate: | 3.1 GHz |
Memory: | 116 GB LPDDR5 |
Cache: | 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket. Each socket can cache up to 186 MB (sum of L2 and L3 capacity). |
Local storage: | 286 GB /tmp partition |
DRAM: | LPDDR5 |
Login Nodes
The Vista login nodes are NVIDIA Grace Grace (GG) nodes, each with 144 cores on two sockets (72 cores/socket) with 237 GB of LPDDR.
Network
The interconnect is based on Mellanox NDR technology with full NDR (400 Gb/s) connectivity between the switches and the GH GPU nodes and with NDR200 (200 Gb/s) connectivity to the GG compute nodes. A fat tree topology connects the compute nodes and the GPU nodes within separate trees. Both sets of nodes are connected with NDR to the $HOME and $SCRATCH file systems.
File Systems
Vista uses a shared VAST file system for the $HOME and $SCRATCH directories.
Important
Vista's $HOME and $SCRATCH file systems are NOT Lustre file systems and do not support setting a stripe count or stripe size.
As with Stampede3, the $WORK file system is also mounted. Unlike $HOME and $SCRATCH, the $WORK file system is a Lustre file system and supports Lustre's lfs commands. All three file systems, $HOME, $SCRATCH, and $WORK, are available from all Vista nodes. The /tmp partition is also available to users but is local to each node. The $WORK file system is available on most other TACC HPC systems as well.
Table 3. File Systems
File System | Type | Quota | Key Features |
---|---|---|---|
$HOME | VAST | 23 GB, 500,000 files | Not intended for parallel or high-intensity file operations. Backed up regularly. |
$WORK | Lustre | 1 TB, 3,000,000 files across all TACC systems | Not intended for parallel or high-intensity file operations. Not backed up. See the Stockyard system description for more information. |
$SCRATCH | VAST | No quota. Overall capacity ~10 PB. | Not backed up. Files are subject to purge if access time* is more than 10 days old. See TACC's Scratch File System Purge Policy below. |
Scratch File System Purge Policy
Warning
The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed* in ten days are subject to purge. Deliberately modifying file access time (using any method, tool, or program) for the purpose of circumventing purge policies is prohibited.
*The operating system updates a file's access time when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing on a compute node does update the access time. This approach helps us distinguish between routine management tasks (e.g. tar, scp) and production use. Use the command ls -ul to view access times.
Running Jobs
Slurm Partitions (Queues)
Vista's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.
Important
Queue limits are subject to change without notice.
TACC Staff will occasionally adjust the QOS settings in order to ensure fair scheduling for the entire user community.
Use TACC's qlimits utility to see the latest queue configurations.
Table 4. Production Queues
Queue Name | Node Type | Max Nodes per Job (assoc'd cores) | Max Duration | Max Jobs in Queue | Charge Rate (per node-hour) |
---|---|---|---|---|---|
gg | Grace/Grace | 32 nodes (4608 cores) | 48 hrs | 20 | 0.33 SU |
gh | Grace/Hopper | 64 nodes (4608 cores/64 GPUs) | 48 hrs | 20 | 1 SU |
gh-dev | Grace/Hopper | 8 nodes (576 cores) | 2 hrs | 8 | 1 SU |
Job Accounting
New TACC SU Charging Policy
Important
Beginning October 1st, 2024, TACC will be implementing a new, minimum SU-charge policy for all jobs run on our systems:
All running jobs will be charged a minimum of 15 minutes of queue time regardless of actual runtime. All other queue factors will remain the same.
For example: a 2-node job in Frontera's rtx queue (charge rate 3 SUs per node-hour) that runs for one minute would be charged as follows:
2 nodes * 0.25 hrs * 3 SUs = 1.5 SUs
These changes are necessary to ensure equal access to the queues for all users as TACC's user base expands. Larger jobs may be the most affected and we encourage users to do thorough testing at smaller node counts before increasing the size of their jobs in order to reduce the impact of this change.
Like all TACC systems, Vista's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). For any given job, the total cost in SUs is the use of one compute node for one hour of wall clock time plus any charges or discounts for the use of specialized queues, e.g. Stampede3's pvc queue, Lonestar6's gpu-a100 queue, and Frontera's flex queue. The queue charge rates are determined by the supply and demand for that particular queue or type of node used and are subject to change.
Vista SUs billed = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)
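For example, using the charge rates in Table 4, a 4-node job in the gg queue that runs for 10 hours is charged 4 nodes x 10 hours x 0.33 SU = 13.2 SUs.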
The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. If your job finishes early and exits properly, Slurm will release the nodes back into the pool of available nodes. Your job will only be charged for as long as you are using the nodes.
Note
TACC does not implement node-sharing on any compute resource. Each Vista node can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all the node's cores whether or not all cores are used.
Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.
Tip
To display a summary of your TACC project balances and disk quotas at any time, execute:
login1$ /usr/local/etc/taccinfo   # Generally more current than balances displayed on the portals.
Submitting Batch Jobs with sbatch
Use Slurm's sbatch
command to submit a batch job to one of the Vista queues:
login1$ sbatch myjobscript
where myjobscript is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.
In your job script you (1) use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs); and then (2) use shell commands to specify what work you're going to do once your job begins. There are many possibilities: you might elect to launch a single application, you might want to accomplish several steps in a workflow, or you may even choose to launch more than one application at the same time. The details will vary, but your own job script will probably include at least one launch line that is a variation of one of the examples described here.
Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your applications(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm --export option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.
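For illustration, here is a minimal sketch of a Vista batch script for an MPI application in the gg queue; the job name, executable, and allocation name are placeholders, and the resource requests mirror the ibrun example later in this section.
#!/bin/bash
#SBATCH -J myjob              # job name (placeholder)
#SBATCH -o myjob.o%j          # output and error file name (%j expands to the job ID)
#SBATCH -p gg                 # queue (partition)
#SBATCH -N 4                  # total number of nodes requested
#SBATCH -n 576                # total number of MPI tasks (144 per GG node)
#SBATCH -t 02:00:00           # run time (hh:mm:ss)
#SBATCH -A myproject          # allocation name (placeholder; only needed with multiple projects)

module list                   # record the modules inherited from your login environment
ibrun ./myprogram input1      # launch the MPI executable built with those modules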
Table 8 below describes some of the most common sbatch command options. Slurm directives begin with #SBATCH; most have a short form (e.g. -N) and a long form (e.g. --nodes). You can pass options to sbatch using either the command line or the job script; most users find that the job script is the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases #!/bin/bash or #!/bin/csh is the right choice. Avoid #!/bin/sh (its startup behavior can lead to subtle problems on Vista), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Vista.
By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the -o option. To save stdout (standard out) and stderr (standard error) to separate files, specify both -o and -e options.
Tip
The maximum runtime for any individual job is 48 hours. However, if you have good checkpointing implemented, you can easily chain jobs such that the outputs of one job are the inputs of the next, effectively running indefinitely for as long as needed. See Slurm's -d option and the sketch below.
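For example, a minimal sketch of chaining two jobs with a dependency (the job script names and job ID shown are illustrative):
login1$ sbatch job1.slurm
Submitted batch job 858811
login1$ sbatch -d afterok:858811 job2.slurm   # job2 starts only after job 858811 finishes successfully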
Table 8. Common sbatch Options

Option | Argument | Comments |
---|---|---|
-A | projectid | Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects. |
-a or --array | =tasklist | Vista supports Slurm job arrays. See the Slurm documentation on job arrays for more information. |
-d= | afterok:jobid | Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes. |
--export= | N/A | Avoid this option on Vista. Using it is rarely necessary and can interfere with the way the system propagates your environment. |
--gres | | TACC does not support this option. |
--gpus-per-task | | TACC does not support this option. |
-p | queue_name | Submits to the queue (partition) designated by queue_name. |
-J | job_name | Job name. |
-N | total_nodes | Required. Define the resources you need by specifying either: (1) -N and -n; or (2) -N and --ntasks-per-node. |
-n | total_tasks | Total number of MPI tasks in this job. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as -N. |
--ntasks-per-node or --tasks-per-node | tasks_per_node | Number of MPI tasks per node. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1. |
-t | hh:mm:ss | Required. Wall clock time for job. |
--mail-type= | begin, end, fail, or all | Specify when user notifications are to be sent (one option per line). |
--mail-user= | email_address | Specify the email address to use for notifications. Use with the --mail-type= flag above. |
-o | output_file | Direct job standard output to output_file (without the -e option, error output goes to this file as well). |
-e | error_file | Direct job error output to error_file. |
--mem | N/A | Not available. If you attempt to use this option, the scheduler will not accept your job. |
Launching Applications
The primary purpose of your job script is to launch your research application. How you do so depends on several factors, especially (1) the type of application (e.g. MPI, OpenMP, serial), and (2) what you're trying to accomplish (e.g. launch a single instance, complete several steps in a workflow, run several applications simultaneously within the same job). While there are many possibilities, your own job script will probably include a launch line that is a variation of one of the examples described in this section:
Note that the following examples demonstrate launching within a Slurm job script or an idev session. Do not launch jobs on the login nodes.
One Serial Application
To launch a serial application, simply call the executable. Specify the path to the executable in either the $PATH environment variable or in the call to the executable itself:
myprogram # executable in a directory listed in $PATH
$WORK/apps/myprov/myprogram # explicit full path to executable
./myprogram # executable in current directory
./myprogram -m -k 6 input1 # executable with notional input options
One Multi-Threaded Application
Launch a threaded application the same way. Be sure to specify the number of threads. Note that the default OpenMP thread count is 1.
export OMP_NUM_THREADS=144 # 144 total OpenMP threads (1 per GG core)
./myprogram
One MPI Application
To launch an MPI application, use the TACC-specific MPI launcher ibrun, which is a Vista-aware replacement for generic MPI launchers like mpirun and mpiexec. In most cases the only arguments you need are the name of your executable followed by any arguments your executable needs. When you call ibrun without other arguments, your Slurm #SBATCH directives will determine the number of ranks (MPI tasks) and number of nodes on which your program runs.
#SBATCH -N 4
#SBATCH -n 576
ibrun ./myprogram # ibrun uses the #SBATCH directives to properly allocate nodes and tasks
To use ibrun interactively, say within an idev session, you can specify:
login1$ idev -N 2 -n 80 -p gg
c123-456$ ibrun ./myprogram # ibrun uses idev's arguments to properly allocate nodes and tasks
One Hybrid (MPI+Threads) Application
When launching a single application you generally don't need to worry about affinity: both OpenMPI and MVAPICH2 will distribute and pin tasks and threads in a sensible way.
export OMP_NUM_THREADS=8 # 8 OpenMP threads per MPI rank
ibrun ./myprogram # use ibrun instead of mpirun or mpiexec
As a practical guideline, the product of $OMP_NUM_THREADS and the maximum number of MPI processes per node should not be greater than the total number of cores available per node (GG nodes have 144 cores, GH nodes have 72 cores).
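For example, a sketch of a hybrid layout on two GG nodes, assuming a notional ./myprogram: 18 MPI ranks per node with 8 threads per rank uses all 144 cores of each node.
#SBATCH -N 2                       # 2 GG nodes
#SBATCH --ntasks-per-node=18       # 18 MPI ranks per node

export OMP_NUM_THREADS=8           # 18 ranks x 8 threads = 144 cores per node
ibrun ./myprogram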
MPI Applications - Consecutive
To run one MPI application after another (or any sequence of commands one at a time), simply list them in your job script in the order in which you'd like them to execute. When one application/command completes, the next one will begin.
module load git
module list
./preprocess.sh
ibrun ./myprogram input1 # runs after preprocess.sh completes
ibrun ./myprogram input2 # runs after previous MPI app completes
MPI Application - Concurrent
Coming soon.
More than One OpenMP Application Running Concurrently
You can also run more than one OpenMP application simultaneously on a single node, but you will need to distribute and pin tasks appropriately. In the example below, numactl -C specifies virtual CPUs (hardware threads). According to the numbering scheme for GG cores, CPU (hardware thread) numbers 0-143 are spread across the 144 cores, 1 thread per core.
export OMP_NUM_THREADS=2
numactl -C 0-1 ./myprogram inputfile1 & # HW threads (hence cores) 0-1. Note ampersand.
numactl -C 2-3 ./myprogram inputfile2 & # HW threads (hence cores) 2-3. Note ampersand.
wait
Interactive Sessions
Interactive Sessions with idev and srun
TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. To launch a thirty-minute session on a single node in the development queue, simply execute:
login1$ idev
You'll then see output that includes the following excerpts:
...
-----------------------------------------------------------------
Welcome to the Vista Supercomputer
-----------------------------------------------------------------
...
-> After your `idev` job begins to run, a command prompt will appear,
-> and you can begin your interactive development session.
-> We will report the job status every 4 seconds: (PD=pending, R=running).
->job status: PD
->job status: PD
...
c449-001$
The job status messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c449-001). If this is the first time you launch idev, the prompts may invite you to choose a default project and a default number of tasks per node for future idev sessions.
For command line options and other information, execute idev --help. It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:
login1$ idev -p gg -N 2 -n 8 -m 150 # gg queue, 2 nodes, 8 total tasks, 150 minutes
For more information see the idev documentation.
Interactive Sessions using ssh
If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: login to a compute node that you own, execute top, then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.
There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:
login1$ squeue -u bjones
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
858811 gh-dev idv46796 bjones R 0:39 1 c448-004
login1$ ssh c448-004
...
c448-004$
Slurm Environment Variables
Be sure to distinguish between internal Slurm replacement symbols (e.g. %j described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute env | grep SLURM from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like %j only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an #SBATCH directive.
Warning
For example, the following directive will not work the way you might think:
#SBATCH -o myMPI.o${SLURM_JOB_ID} # incorrect
Tip
Instead, use the following directive:
#SBATCH -o myMPI.o%j # "%j" expands to your job's numerical job ID
Similarly, you cannot use paths like $WORK or $SCRATCH in an #SBATCH directive.
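For example, a minimal sketch of the correct pattern: use %j in the #SBATCH directive, then use Slurm environment variables and paths like $SCRATCH only in the shell portion of the script (the run directory below is a placeholder).
#SBATCH -o myMPI.o%j                                  # correct: %j is a Slurm replacement symbol

cd $SCRATCH/my_run_dir                                # correct: environment variables work in shell commands
echo "Job $SLURM_JOB_ID is running on $SLURM_JOB_NODELIST"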
For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself (man slurm) and its individual commands (e.g. man sbatch); as well as numerous other online resources.
Machine Learning
Vista is well equipped to provide researchers with the latest in Machine Learning frameworks, for example, PyTorch. The installation process will be a little different depending on whether you are using single or multiple nodes. Below we detail how to use PyTorch on our systems for both scenarios.
Running PyTorch (Single Node)
Using the System PyTorch
Follow these steps to use Vista's system PyTorch with a single GPU node.
- Request a single compute node in Vista's gh-dev queue using the idev utility:

  login1.vista(76)$ idev -p gh-dev -N 1 -n 1 -t 1:00:00

- Load modules:

  c123-456$ module load gcc cuda
  c123-456$ module load python3

- Launch the Python interpreter and check that you can import PyTorch and that it can utilize the GPU:

  import torch
  torch.cuda.is_available()
Installing PyTorch
Depending on your particular application, you may also need to install your own local copy of PyTorch. We recommend using a Python virtual environment to manage machine learning packages. Below we detail how to install PyTorch on our systems with a virtual environment:
- Request a single compute node in Vista's gh-dev queue using the idev utility:

  login1.vista(76)$ idev -p gh-dev -N 1 -n 1 -t 1:00:00

- Create a Python virtual environment:

  c123-456$ module load gcc cuda
  c123-456$ module load python3
  c123-456$ python3 -m venv /path/to/virtual-env-single-node   # (e.g., $SCRATCH/python-envs/test)

- Activate the Python virtual environment:

  c123-456$ source /path/to/virtual-env-single-node/bin/activate

- Install PyTorch:

  c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Testing PyTorch Installation
To test your installation of PyTorch we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training. See PyTorch's documentation: Distributed Data Parallel in PyTorch. These tutorials include several scripts set up to run single-node training and multi-node training.
- Download the benchmark:

  c123-456$ cd $SCRATCH   # or the directory on scratch where you want this repo to reside
  c123-456$ git clone https://github.com/pytorch/examples.git

- Run the benchmark on one node (1 GPU):

  c123-456$ python3 examples/distributed/ddp-tutorial-series/single_gpu.py 50 10
Running PyTorch (Multi-node)
To run multi-node jobs with Grace Hopper nodes on Vista you will need to use MPI-enabled Python. Follow these instructions to install and test these environments with MPI-enabled Python.
Using System PyTorch
Follow these steps to use Vista's system PyTorch with multiple GPU nodes.
- Request two compute nodes in Vista's gh-dev queue using the idev utility:

  login1.vista(76)$ idev -p gh-dev -N 2 -n 2 -t 1:00:00

- Load modules:

  c123-456$ module load gcc cuda
  c123-456$ module load python3_mpi

- Launch the Python interpreter and check that you can import PyTorch and that it can utilize the GPU:

  import torch
  torch.cuda.is_available()
Installing PyTorch
To run multi-node jobs with Grace Hopper nodes on Vista you will need to use MPI-enabled Python. Below we detail how to install PyTorch with MPI-enabled Python using a virtual environment:
- Request two nodes in the gh-dev queue using the idev utility:

  idev -N 2 -n 2 -p gh-dev -t 01:00:00

- Create a Python virtual environment:

  c123-456$ module load gcc cuda
  c123-456$ module load python3_mpi
  c123-456$ python3 -m venv /path/to/virtual-env-multi-node   # (e.g., $SCRATCH/python-envs/test)

- Activate the Python virtual environment:

  c123-456$ source /path/to/virtual-env-multi-node/bin/activate

- Now install PyTorch:

  c123-456$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
Testing PyTorch Installation
To test your installation of multi-node PyTorch, we supply a simple script ("test.py") below. Launch it with ibrun, passing the hostname of the master node as the first argument:
c123-456$ ibrun -np 2 python3 test.py c123-456   # here c123-456 is the master node's hostname
Python script ("test.py")
import os
import argparse
from mpi4py import MPI
import torch
import torch.distributed as dist

# use mpi4py to get the world size and the task's rank
WORLD_SIZE = MPI.COMM_WORLD.Get_size()
WORLD_RANK = MPI.COMM_WORLD.Get_rank()

# use the convention that gets the local rank based on how many
# GPUs there are on the node.
GPU_ID = WORLD_RANK % torch.cuda.device_count()
name = MPI.Get_processor_name()

def run(backend):
    tensor = torch.randn(10000, 10000)
    # Need to put tensor on a GPU device for nccl backend
    if backend == 'nccl':
        device = torch.device("cuda:{}".format(GPU_ID))
        tensor = tensor.to(device)
        print("Starting process on " + name + ":" + torch.cuda.get_device_name(GPU_ID))
    if WORLD_RANK == 0:
        for rank_recv in range(1, WORLD_SIZE):
            dist.send(tensor=tensor, dst=rank_recv)
            print('worker_{} sent data to Rank {}\n'.format(0, rank_recv))
    else:
        dist.recv(tensor=tensor, src=0)
        print('worker_{} has received data from rank {}\n'.format(WORLD_RANK, 0))

def init_processes(backend, master_address):
    print("World Rank: %s, World Size: %s, GPU_ID: %s" % (WORLD_RANK, WORLD_SIZE, GPU_ID))
    os.environ["MASTER_ADDR"] = master_address
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group(backend, rank=WORLD_RANK, world_size=WORLD_SIZE)
    run(backend)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("master_node", type=str)
    parser.add_argument("--backend", type=str, default="nccl", choices=['nccl', 'gloo'])
    args = parser.parse_args()
    backend = args.backend
    if torch.cuda.device_count() == 0:
        print("No gpu detected...switching to gloo for backend")
        backend = "gloo"
    init_processes(backend=backend, master_address=args.master_node)
    dist.destroy_process_group()
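In a batch job you need not hard-code the master node's hostname; a sketch (using the same test.py) that derives it from Slurm's node list:
export MASTER_NODE=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)   # first node in the allocation
ibrun -np 2 python3 test.py $MASTER_NODE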
Building Software
The phrase "building software" is a common way to describe the process of producing a machine-readable executable file from source files written in C, Fortran, CUDA, or some other programming language. In its simplest form, building software involves a simple, one-line call or short shell script that invokes a compiler. More typically, the process leverages the power of makefiles, so you can change a line or two in the source code, then rebuild in a systematic way only the components affected by the change. Increasingly, however, the build process is a sophisticated multi-step automated workflow managed by a special framework like autotools or cmake, intended to achieve a repeatable, maintainable, portable mechanism for installing software across a wide range of target platforms.
Important
TACC maintains a database of currently installed software packages and libraries across all HPC resources. Navigate to TACC's Software List to see where, or if, a particular package is already installed on a particular resource.
This section of the user guide does nothing more than introduce the big ideas with simple one-line examples. You will undoubtedly want to explore these concepts more deeply using online resources. You will quickly outgrow the examples here. We recommend that you master the basics of CMake and/or makefiles as quickly as possible: even the simplest computational research project will benefit enormously from the power and flexibility of a CMakefile or makefile-based build process.
NVIDIA Compilers
NVIDIA is the recommended and default compiler suite on Vista.
Here are simple examples that use the NVIDIA compiler to build an executable from source code:
$ nvc mycode.c # C source file; executable a.out
$ nvc main.c calc.c analyze.c # multiple source files
$ nvc mycode.c -o myexe # C source file; executable myexe
$ nvc++ mycode.cpp -o myexe # C++ source file
$ nvfortran mycode.f90 -o myexe # Fortran90 source file
Compiling a code that uses OpenMP would look like this:
$ nvc -mp mycode.c -o myexe # OpenMP
See the published NVIDIA documentation, available online at https://docs.nvidia.com/hpc-sdk//index.html.
GNU Compilers
The GNU foundation maintains a number of high quality compilers, including a compiler for C (gcc), C++ (g++), and Fortran (gfortran). The gcc compiler is the foundation underneath all three, and the term "gcc" often means the suite of these three GNU compilers.
Load a GCC module to access a recent version of the GNU compiler suite. Avoid using the GNU compilers that are available without a gcc module; those will be older versions based on the "system GCC" that comes as part of the Linux distribution.
Here are simple examples that use the GNU compilers to produce an executable from source code:
$ gcc mycode.c # C source file; executable a.out
$ gcc mycode.c -o myexe # C source file; executable myexe
$ g++ mycode.cpp -o myexe # C++ source file
$ gfortran mycode.f90 -o myexe # Fortran90 source file
$ gcc -fopenmp mycode.c -o myexe # OpenMP
Note that some compiler options are the same for both NVIDIA and GNU (e.g. -o), while others are different (e.g. -mp vs -fopenmp). Many options are available in one compiler suite but not the other. See the online GNU documentation for information on optimization flags and other GNU compiler options.
Compiling and Linking
Building an executable requires two separate steps: (1) compiling (generating a binary object file associated with each source file); and (2) linking (combining those object files into a single executable file that also specifies the libraries that executable needs). The examples in the previous section accomplish these two steps in a single call to the compiler. When building more sophisticated applications or libraries, however, it is often necessary or helpful to accomplish these two steps separately.
Use the -c ("compile") flag to produce object files from source files:
$ nvc -c main.c calc.c results.c
Barring errors, this command will produce object files main.o, calc.o, and results.o. Syntax for the NVIDIA and GNU compilers is similar. You can now link the object files to produce an executable file:
$ nvc main.o calc.o results.o -o myexe
The compiler calls a linker utility (usually /bin/ld) to accomplish this task. Again, syntax for other compilers is similar.
Include and Library Paths
Software often depends on pre-compiled binaries called libraries. When this is true, compiling usually requires using the -I option to specify paths to so-called header or include files that define interfaces to the procedures and data in those libraries. Similarly, linking often requires using the -L option to specify paths to the libraries themselves. Typical compile and link lines might look like this:
$ nvc -c main.c -I${WORK}/mylib/inc -I${TACC_HDF5_INC} # compile
$ nvc main.o -o myexe -L${WORK}/mylib/lib -L${TACC_HDF5_LIB} -lmylib -lhdf5 # link
On Vista, both the HDF5 and PHDF5 modules define the environment variables $TACC_HDF5_INC and $TACC_HDF5_LIB. Other module files define similar environment variables; see Using Modules to Manage Your Environment for more information.
The details of the linking process vary, and order sometimes matters. Much depends on the type of library: static (.a suffix; the library's binary code becomes part of the executable image at link time) versus dynamically-linked shared (.so suffix; the library's binary code is not part of the executable; it's located and loaded into memory at run time). However, the $LD_LIBRARY_PATH environment variable specifies the search path for dynamic libraries. For software installed at the system level, TACC's modules generally modify $LD_LIBRARY_PATH automatically. To see whether and how an executable named myexe resolves dependencies on dynamically linked libraries, execute ldd myexe.
MPI Programs
OpenMPI (module ompi) and MVAPICH (module mvapich) are the two MPI libraries available on Vista. After loading an ompi or mvapich module, compile and/or link using an MPI wrapper (mpicc, mpicxx, mpif90) in place of the compiler:
$ mpicc mycode.c -o myexe # C source, full build
$ mpicc -c mycode.c # C source, compile without linking
$ mpicxx mycode.cpp -o myexe # C++ source, full build
$ mpif90 mycode.f90 -o myexe # Fortran source, full build
These wrappers call the compiler with the options, include paths, and libraries necessary to produce an MPI executable for the MPI module you have loaded. To see the effect of a given wrapper, call it with the -show option:
$ mpicc -show # Show compile line generated by call to mpicc; similarly for other wrappers
Third-Party Software
See Building Third-Party Software in the Software at TACC guide.
Building for Performance
Compiler Options
When building software on Vista, we recommend using the most recent NVIDIA compiler and OpenMPI library available on Vista. The most recent versions may be newer than the defaults. Execute module spider nvidia and module spider ompi to see what's installed. When loading these modules you may need to specify version numbers explicitly (e.g. module load nvidia/24.5 and module load ompi/5.0).
Architecture-Specific Flags
The Grace architecture is based on an Arm design that uses Neoverse V2 cores. The Neoverse V2 cores support Arm's Scalable Vector Extension v2 (SVE2) and Advanced SIMD (NEON) technologies. Each core has four 128-bit functional units that support 8 64-bit FMA operations. To compile for this specific architecture, include the -tp neoverse-v2 compile option.
Normally, we do not recommend using the -fast option. In this case, however, since there is only one chip architecture on Vista and -fast does not enforce -static, it is safe to use the -fast option with the NVIDIA compilers. It will enable optimizations for the Neoverse V2 architecture.
You can also use the environment variable $TACC_VEC_FLAGS. With the NVIDIA compilers, this environment variable sets the following flags:
-Mvect=simd -fast -Mipa=fast,inline
If you use GNU compilers, you can optimize for the Grace architecture using the -mcpu=neoverse-v2 option. You can also use $TACC_VEC_FLAGS as with the NVIDIA compilers; for GNU it enables the following flags:
-O3 -mcpu=neoverse-v2
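For example, architecture-specific compile lines might look like the following, reusing the notional mycode.c from the earlier examples:
$ nvc -fast -tp neoverse-v2 mycode.c -o myexe     # NVIDIA compiler
$ gcc -O3 -mcpu=neoverse-v2 mycode.c -o myexe     # GNU compiler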
NVIDIA Performance Libraries (NVPL)
The NVIDIA Performance Libraries (NVPL) are a collection of high-performance mathematical libraries optimized for the NVIDIA Grace Armv9.0 architecture. These CPU-only libraries provide standard C and Fortran mathematical APIs, allowing HPC applications to achieve maximum performance on the Grace platform. The collection includes:
NVIDIA Documentation
- NVPL BLAS Documentation
- NVPL FFT Documentation
- NVPL LAPACK Documentation
- NVPL RAND Documentation
- NVPL ScaLAPACK Documentation
- NVPL Sparse Documentation
- NVPL Tensor Documentation
Consult the documents above for the details of each library and its API when building and linking codes. The libraries work with both the NVHPC and GCC compilers, as well as their corresponding MPI wrappers. All libraries support the OpenMP runtime libraries. Refer to the individual libraries' documentation for details and for API extensions supporting nested parallelism.
Compiler Examples
Example: A compile/link process on Vista may look like the following. This links the code against the NVPL FFT library using the GNU g++ compiler. The features in NVPL FFT are still evolving; please follow the latest NVPL FFT documentation.
$ module load nvpl
$ g++ mycode.cpp -I${TACC_NVPL_DIR}/include \
$ -L${TACC_NVPL_DIR}/lib \
-lnvpl_fftw \
-o myprogram
Example: This links the code against the NVPL OpenMP-threaded BLAS, LAPACK, and ScaLAPACK libraries with the 32-bit integer (LP64) interface, using the NVHPC mpif90 wrapper. The cluster (BLACS) support in the current NVPL release from NVHPC SDK 24.5 includes openmpi3, openmpi4, openmpi5, and mpich; choose the one that matches the MPI version used by mpif90.
$ module load nvpl
$ mpif90 -mp -I${TACC_NVPL_DIR}/include \
-L${TACC_NVPL_DIR}/lib \
-lnvpl_blas_lp64_gomp \
-lnvpl_lapack_lp64_gomp \
-lnvpl_blacs_lp64_openmpi5 \
-lnvpl_scalapack_lp64 \
mycode.f90
When linking with the NVHPC compiler, the convenience flags -Mnvpl and -Mscalapack are provided. As the behavior of these flags may change during active development, please refer to the latest NVHPC compiler guide for more details.
Using NVPL as BLAS/LAPACK with Third-Party Software
When your third-party software requires BLAS or LAPACK, we recommend that you use NVPL to supply this functionality. Replace generic instructions that include link options like -lblas or -llapack with the NVPL approach described above. Generally there is no need to download and install alternatives like OpenBLAS. However, since NVPL is a relatively new math library suite targeting aarch64, its interoperability with other software requiring a specific 32- or 64-bit integer interface or particular OpenMP runtime support is not yet fully tested. If you run into issues with NVPL and need alternative BLAS or LAPACK libraries, OpenBLAS-based versions are available as part of the NVHPC compiler libraries.
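For example, a sketch of replacing generic BLAS/LAPACK link options with the threaded LP64 NVPL libraries shown above (the source file name is a placeholder):
$ module load nvpl
$ gfortran -fopenmp myapp.f90 -L${TACC_NVPL_DIR}/lib -lnvpl_lapack_lp64_gomp -lnvpl_blas_lp64_gomp -o myapp   # instead of -llapack -lblas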
Controlling Threading in NVPL
All NVPL libraries support both the GCC and NVHPC OpenMP runtime libraries. See the individual libraries' documentation for details and for API extensions supporting nested parallelism. NVPL libraries do not explicitly link any particular OpenMP runtime; they rely on runtime loading of the OpenMP library as determined by the application and environment. Applications linked to NVPL should always use at runtime the same OpenMP distribution the application was compiled with. Mixing OpenMP distributions between compile time and runtime may result in anomalous performance. Please note that the default library linked with the -Mnvpl flag is single-threaded as of NVHPC 24.5; the -mp flag is needed to link the threaded version.
NVIDIA HPC modules provide a libgomp.so symlink to libnvomp.so. This symlink will be on LD_LIBRARY_PATH if NVHPC environment modules are loaded. Use ldd to ensure that applications built with GCC do not accidentally load the libgomp.so symlink from the HPC SDK via LD_LIBRARY_PATH. Use libnvomp.so if and only if the application was built with the NVHPC compilers.
$OMP_NUM_THREADS defaults to 1 on TACC systems. If you use the default value you will get no thread-based parallelism from NVPL. Set the environment variable $OMP_NUM_THREADS to control the number of threads for optimal performance.
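For example, to give a threaded, NVPL-linked program one thread per core on a GH node (the executable name is a placeholder):
export OMP_NUM_THREADS=72    # one thread per each of the 72 cores on a GH node
./myprogram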
Using NVPL with MATLAB, Python, and R
TACC's MATLAB, Python, and R modules need BLAS, LAPACK, and other math libraries for performance. How to use NVPL with them is under investigation; we will update this section as more information becomes available.
Help Desk
Important
Submit a help desk ticket at any time via the TACC User Portal. Be sure to include "Vista" in the Resource field.
TACC Consulting operates from 8am to 5pm CST, Monday through Friday, except for holidays. Help the consulting staff help you by following these best practices when submitting tickets.
- Do your homework before submitting a help desk ticket. What do the user guide and other documentation say? Search the internet for key phrases in your error logs; that's probably what the consultants answering your ticket are going to do. What have you changed since the last time your job succeeded?
- Describe your issue as precisely and completely as you can: what you did, what happened, verbatim error messages, other meaningful output.
Tip
When appropriate, include as much meta-information about your job and workflow as possible including:
- directory containing your build and/or job script
- all modules loaded
- relevant job IDs
- any recent changes in your workflow that could affect or explain the behavior you're observing.
- Subscribe to Vista User News. This is the best way to keep abreast of maintenance schedules, system outages, and other general interest items.
- Have realistic expectations. Consultants can address system issues and answer questions about Vista. But they can't teach parallel programming in a ticket, and may know nothing about the package you downloaded. They may offer general advice that will help you build, debug, optimize, or modify your code, but you shouldn't expect them to do these things for you.
- Be patient. It may take a business day for a consultant to get back to you, especially if your issue is complex. It might take an exchange or two before you and the consultant are on the same page. If the admins disable your account, it's not punitive. When the file system is in danger of crashing, or a login node hangs, they don't have time to notify you before taking action.