Vista User Guide
Last update: September 9, 2024
Notices
- Subscribe to Vista User News. Stay up-to-date on Vista's status, scheduled maintenances and other notifications.
Introduction
Vista is funded by the National Science Foundation (NSF) via a supplement to the Computing for the Endless Frontier award, Award Abstract #1818253. Vista expands the Frontera project's support of Machine Learning and GPU-enabled applications with a system based on NVIDIA Grace Hopper architecture and provides a path to more power efficient computing with NVIDIA's Grace Grace ARM CPUs.
The Grace Hopper Superchip introduces a novel architecture that combines the GPU and CPU in one module. This technology removes the bottleneck of the PCIe bus by connecting the CPU and GPU directly with NVLINK and exposing the CPU and GPU memory space as separate NUMA nodes. This allows the programmer to easily access CPU or GPU memory from either device. This greatly reduces the programming complexity of GPU programs while providing increased bandwidth and reduced latency between CPU and GPU.
The Grace Superchip connects two 72 core Grace CPUs using the same NVLINK technology used in the Grace Hopper Superchip to provide 144 ARM cores in 2 NUMA nodes. Using LPDDR memory, each Superchip offers over 850 GiB/s of memory bandwidth and up to 7 TFlops of double precision performance.
Allocations
Coming soon.
System Architecture
Grace Grace Compute Nodes
Vista hosts 256 "Grace Grace" (GG) nodes with 144 cores each. Each GG node provides a performance increase of 1.5 - 2x over Stampede3's CLX nodes due to increased core count and increased memory bandwidth. Each GG node provides over 7 TFlops of double precision performance and 850 GiB/s of memory bandwidth.
Table 1. GG Specifications
Specification | Value |
---|---|
CPU: | NVIDIA Grace CPU Superchip |
Total cores per node: | 144 cores on two sockets (2 x 72 cores) |
Hardware threads per core: | 1 |
Hardware threads per node: | 2x72 = 144 |
Clock rate: | 3.4 GHz |
Memory: | 237 GB LPDDR |
Cache: | 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket. Each socket can cache up to 186 MB (sum of L2 and L3 capacity). |
Local storage: | 286 GB /tmp partition |
Grace Hopper Compute Nodes
Vista hosts 600 Grace Hopper (GH) nodes. Each GH node has one H100 GPU with 96 GB of HBM3 memory and one Grace CPU with 116 GB of LPDDR memory. The GH node provides 34 TFlops of FP64 performance and 1979 TFlops of FP16 performance for ML workflows on the H100 chip.
Table 2. GH Specifications
Specification | Value |
---|---|
GPU: | NVIDIA H100 GPU |
GPU Memory: | 96 GB HBM3 |
CPU: | NVIDIA Grace CPU |
Total cores per node: | 72 cores on one socket |
Hardware threads per core: | 1 |
Hardware threads per node: | 1x72 = 72 |
Clock rate: | 3.1 GHz |
Memory: | 116 GB LPDDR |
Cache: | 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket. Each socket can cache up to 186 MB (sum of L2 and L3 capacity). |
Local storage: | 286 GB /tmp partition |
Login Nodes
The Vista login nodes are NVIDIA Grace Grace (GG) nodes, each with 144 cores on two sockets (72 cores/socket) with 237 GB of LPDDR.
Network
The interconnect is based on Mellanox NDR technology with full NDR (400 Gb/s) connectivity between the switches and the GH GPU nodes, and with NDR200 (200 Gb/s) connectivity to the GG compute nodes. A fat tree topology connects the compute nodes and the GPU nodes within separate trees. Both sets of nodes are connected with NDR to the $HOME and $SCRATCH file systems.
File Systems
Vista will use a shared VAST file system for the $HOME and $SCRATCH directories.
Important
Vista's $HOME and $SCRATCH file systems are NOT Lustre file systems and do not support setting a stripe count or stripe size.
As with Stampede3, the $WORK file system will also be mounted. Unlike $HOME and $SCRATCH, the $WORK file system is a Lustre file system and supports Lustre's lfs commands. All three file systems, $HOME, $SCRATCH, and $WORK, are available from all Vista nodes. The /tmp partition is also available to users but is local to each node. The $WORK file system is available on most other TACC HPC systems as well.
Table 3. File Systems
File System | Type | Quota | Key Features |
---|---|---|---|
$HOME | VAST | 23 GB, 500,000 files | Not intended for parallel or high-intensity file operations. Backed up regularly. |
$WORK | Lustre | 1 TB, 3,000,000 files across all TACC systems | Not intended for parallel or high-intensity file operations. See the Stockyard system description for more information. Not backed up. |
$SCRATCH | VAST | No quota; overall capacity ~10 PB | Not backed up. Files are subject to purge if access time* is more than 10 days old. See TACC's Scratch File System Purge Policy below. |
Scratch File System Purge Policy
Warning
The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed* in ten days are subject to purge. Deliberately modifying file access time (using any method, tool, or program) for the purpose of circumventing purge policies is prohibited.
*The operating system updates a file's access time when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing on a compute node does update the access time. This approach helps us distinguish between routine management tasks (e.g. tar, scp) and production use. Use the command ls -ul to view access times.
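To see which of your files are candidates for purge, you can list files by access time. The following is a hedged sketch using GNU find; the temporary directory and the 20-day-old timestamp are fabricated here so the example runs anywhere (on Vista you would point find at $SCRATCH instead):

```shell
# Create a throwaway directory with one file whose access time is 20 days old
DEMO_DIR=$(mktemp -d)
touch "$DEMO_DIR/old_result.dat"
touch -a -d "20 days ago" "$DEMO_DIR/old_result.dat"

# List files not accessed in more than 10 days -- purge candidates on $SCRATCH
find "$DEMO_DIR" -type f -atime +10
```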
Running Jobs
Slurm Partitions (Queues)
Vista's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.
Important
Queue limits are subject to change without notice.
TACC Staff will occasionally adjust the QOS settings in order to ensure fair scheduling for the entire user community.
Use TACC's qlimits utility to see the latest queue configurations.
Table 4. Production Queues
Queue Name | Node Type | Max Nodes per Job (assoc'd cores) | Max Duration | Max Jobs in Queue | Charge Rate (per node-hour) |
---|---|---|---|---|---|
gg | Grace/Grace | 32 nodes (4608 cores) | 48 hrs | 20 | 0.33 SU |
gh | Grace/Hopper | 64 nodes (4608 cores/64 GPUs) | 48 hrs | 20 | 1 SU |
gh-dev | Grace/Hopper | 8 nodes (576 cores) | 2 hrs | 8 | 1 SU |
Job Accounting
Like all TACC systems, Vista's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). For any given job, the total cost in SUs is the use of one compute node for one hour of wall clock time plus any charges or discounts for the use of specialized queues, e.g. Stampede3's pvc queue, Lonestar6's gpu-a100 queue, and Frontera's flex queue. The queue charge rates are determined by the supply and demand for that particular queue or type of node used and are subject to change.
Vista SUs billed = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)
The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. If your job finishes early and exits properly, Slurm will release the nodes back into the pool of available nodes. Your job will only be charged for as long as you are using the nodes.
Note
TACC does not implement node-sharing on any compute resource. Each Vista node can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all the node's cores whether or not all cores are used.
Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.
Tip
To display a summary of your TACC project balances and disk quotas at any time, execute:
login1$ /usr/local/etc/taccinfo # Generally more current than balances displayed on the portals.
Submitting Batch Jobs with sbatch
Use Slurm's sbatch command to submit a batch job to one of the Vista queues:
login1$ sbatch myjobscript
Where myjobscript is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.
In your job script you (1) use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs); and then (2) use shell commands to specify what work you're going to do once your job begins. There are many possibilities: you might elect to launch a single application, or you might want to accomplish several steps in a workflow. You may even choose to launch more than one application at the same time. Whatever the details, your own job script will probably include at least one launch line that is a variation of one of the examples described here.
Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your application(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm --export option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.
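Putting these pieces together, a minimal MPI job script for the gg queue might look like the following sketch (the project code and executable name are placeholders):

```bash
#!/bin/bash
#SBATCH -J myjob            # job name
#SBATCH -o myjob.o%j        # stdout; %j expands to the numerical job ID
#SBATCH -e myjob.e%j        # stderr
#SBATCH -p gg               # queue (partition)
#SBATCH -N 2                # number of nodes
#SBATCH -n 288              # total MPI tasks (2 nodes x 144 GG cores)
#SBATCH -t 02:00:00         # wall time (hh:mm:ss)
#SBATCH -A myproject        # allocation to charge (placeholder)

module list                 # record the loaded modules in the job output
ibrun ./myprogram input1    # placeholder executable and input
```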
Table 8 below describes some of the most common sbatch command options. Slurm directives begin with #SBATCH; most have a short form (e.g. -N) and a long form (e.g. --nodes). You can pass options to sbatch using either the command line or the job script; most users find the job script the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases #!/bin/bash or #!/bin/csh is the right choice. Avoid #!/bin/sh (its startup behavior can lead to subtle problems on Vista), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Vista.
By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the -o option. To save stdout (standard out) and stderr (standard error) to separate files, specify both -o and -e options.
Tip
The maximum runtime for any individual job is 48 hours. However, if you have good checkpointing implemented, you can easily chain jobs such that the outputs of one job are the inputs of the next, effectively running indefinitely for as long as needed. See Slurm's -d option.
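A sketch of such a chain (the job script names are placeholders; --parsable makes sbatch print just the job ID so it can be captured):

```
login1$ jobid=$(sbatch --parsable job1.slurm)
login1$ sbatch --dependency=afterok:${jobid} job2.slurm
```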
Table 8. Common sbatch Options

Option | Argument | Comments |
---|---|---|
-A | projectid | Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects. |
-a or --array | =tasklist | Vista supports Slurm job arrays. See the Slurm documentation on job arrays for more information. |
-d | afterok:jobid | Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes. |
--export= | N/A | Avoid this option on Vista. Using it is rarely necessary and can interfere with the way the system propagates your environment. |
--gres | N/A | TACC does not support this option. |
--gpus-per-task | N/A | TACC does not support this option. |
-p | queue_name | Submits to queue (partition) designated by queue_name. |
-J | job_name | Job name. |
-N | total_nodes | Required. Define the resources you need by specifying either: (1) -N and -n; or (2) -N and --ntasks-per-node. |
-n | total_tasks | Total MPI tasks in this job. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as -N. |
--ntasks-per-node or --tasks-per-node | tasks_per_node | MPI tasks per node. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1. |
-t | hh:mm:ss | Required. Wall clock time for job. |
--mail-type= | begin, end, fail, or all | Specify when user notifications are to be sent (one option per line). |
--mail-user= | email_address | Specify the email address to use for notifications. Use with the --mail-type= flag above. |
-o | output_file | Direct job standard output to output_file (without the -e option, error output also goes to this file). |
-e | error_file | Direct job error output to error_file. |
--mem | N/A | Not available. If you attempt to use this option, the scheduler will not accept your job. |
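As an illustration of the --array option above, the shell portion of an array job script typically keys off SLURM_ARRAY_TASK_ID, which Slurm sets for each array element. The variable is hard-coded below so the sketch runs outside a job, and the input-file naming scheme is hypothetical:

```shell
# Slurm would set this for each element of, e.g., "sbatch --array=1-10 myscript"
SLURM_ARRAY_TASK_ID=3

# Each array task selects its own input file (hypothetical naming scheme)
INPUT="input.${SLURM_ARRAY_TASK_ID}"
echo "this array task would process $INPUT"
```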
Launching Applications
The primary purpose of your job script is to launch your research application. How you do so depends on several factors, especially (1) the type of application (e.g. MPI, OpenMP, serial), and (2) what you're trying to accomplish (e.g. launch a single instance, complete several steps in a workflow, run several applications simultaneously within the same job). While there are many possibilities, your own job script will probably include a launch line that is a variation of one of the examples described in this section:
Note that the following examples demonstrate launching within a Slurm job script or an idev session. Do not launch jobs on the login nodes.
One Serial Application
To launch a serial application, simply call the executable. Specify the path to the executable in either the $PATH environment variable or in the call to the executable itself:
myprogram # executable in a directory listed in $PATH
$WORK/apps/myprov/myprogram # explicit full path to executable
./myprogram # executable in current directory
./myprogram -m -k 6 input1 # executable with notional input options
One Multi-Threaded Application
Launch a threaded application the same way. Be sure to specify the number of threads. Note that the default OpenMP thread count is 1.
export OMP_NUM_THREADS=144 # 144 total OpenMP threads (1 per GG core)
./myprogram
One MPI Application
To launch an MPI application, use the TACC-specific MPI launcher ibrun, which is a Vista-aware replacement for generic MPI launchers like mpirun and mpiexec. In most cases the only arguments you need are the name of your executable followed by any arguments your executable needs. When you call ibrun without other arguments, your Slurm #SBATCH directives will determine the number of ranks (MPI tasks) and number of nodes on which your program runs.
#SBATCH -N 4
#SBATCH -n 576
ibrun ./myprogram # ibrun uses the #SBATCH directives to properly allocate nodes and tasks
To use ibrun interactively, say within an idev session, you can specify:
login1$ idev -N 2 -n 80 -p gg
c123-456$ ibrun ./myprogram # ibrun uses idev's arguments to properly allocate nodes and tasks
One Hybrid (MPI+Threads) Application
When launching a single application you generally don't need to worry about affinity: both OpenMPI and MVAPICH2 will distribute and pin tasks and threads in a sensible way.
export OMP_NUM_THREADS=8 # 8 OpenMP threads per MPI rank
ibrun ./myprogram # use ibrun instead of mpirun or mpiexec
As a practical guideline, the product of $OMP_NUM_THREADS and the maximum number of MPI processes per node should not be greater than the total number of cores available per node (GG nodes have 144 cores; GH nodes have 72 cores).
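That guideline can be checked with a few lines of shell arithmetic. The rank and thread counts below are one workable example for a GG node, not a recommendation:

```shell
CORES_PER_NODE=144       # GG node; use 72 for a GH node
OMP_NUM_THREADS=8        # OpenMP threads per MPI rank
TASKS_PER_NODE=18        # MPI ranks per node

if [ $((OMP_NUM_THREADS * TASKS_PER_NODE)) -le "$CORES_PER_NODE" ]; then
    echo "layout fits: ${TASKS_PER_NODE} ranks x ${OMP_NUM_THREADS} threads = $((OMP_NUM_THREADS * TASKS_PER_NODE)) cores"
else
    echo "oversubscribed: reduce ranks or threads" >&2
fi
```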
MPI Applications - Consecutive
To run one MPI application after another (or any sequence of commands one at a time), simply list them in your job script in the order in which you'd like them to execute. When one application/command completes, the next one will begin.
module load git
module list
./preprocess.sh
ibrun ./myprogram input1 # runs after preprocess.sh completes
ibrun ./myprogram input2 # runs after previous MPI app completes
MPI Application - Concurrent
Coming soon.
More than One OpenMP Application Running Concurrently
You can also run more than one OpenMP application simultaneously on a single node, but you will need to distribute and pin tasks appropriately. In the example below, numactl -C specifies virtual CPUs (hardware threads). According to the numbering scheme for GG cores, CPU numbers 0-143 are spread across the 144 cores, 1 thread per core.
export OMP_NUM_THREADS=2
numactl -C 0-1 ./myprogram inputfile1 & # HW threads (hence cores) 0-1. Note ampersand.
numactl -C 2-3 ./myprogram inputfile2 & # HW threads (hence cores) 2-3. Note ampersand.
wait
Interactive Sessions
Interactive Sessions with idev and srun
TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. To launch a thirty-minute session on a single node in the development queue, simply execute:
login1$ idev
You'll then see output that includes the following excerpts:
...
-----------------------------------------------------------------
Welcome to the Vista Supercomputer
-----------------------------------------------------------------
...
-> After your `idev` job begins to run, a command prompt will appear,
-> and you can begin your interactive development session.
-> We will report the job status every 4 seconds: (PD=pending, R=running).
->job status: PD
->job status: PD
...
c449-001$
The job status messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c449-001). If this is the first time you launch idev, the prompts may invite you to choose a default project and a default number of tasks per node for future idev sessions.
For command line options and other information, execute idev --help. It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:
login1$ idev -p gg -N 2 -n 8 -m 150 # gg queue, 2 nodes, 8 total tasks, 150 minutes
For more information see the idev documentation.
Interactive Sessions using ssh
If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: login to a compute node that you own, execute top, then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.
There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:
login1$ squeue -u bjones
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
858811    gh-dev idv46796   bjones  R       0:39      1 c448-004
login1$ ssh c448-004
...
c448-004$
Slurm Environment Variables
Be sure to distinguish between internal Slurm replacement symbols (e.g. %j described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute env | grep SLURM from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like %j only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an #SBATCH directive.
Warning
For example, the following directive will not work the way you might think:
#SBATCH -o myMPI.o${SLURM_JOB_ID} # incorrect
Tip
Instead, use the following directive:
#SBATCH -o myMPI.o%j # "%j" expands to your job's numerical job ID
Similarly, you cannot use paths like $WORK or $SCRATCH in an #SBATCH directive.
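The distinction matters in practice: in the shell portion of the same script, the environment-variable form does work. A sketch (SLURM_JOB_ID is set by Slurm at run time; it is hard-coded here so the snippet runs outside a job):

```shell
SLURM_JOB_ID=123456                  # Slurm sets this automatically inside a job
outfile="myMPI.o${SLURM_JOB_ID}"     # fine in shell commands, not in #SBATCH lines
echo "per-job output file: $outfile"
```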
For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself (man slurm) and its individual commands (e.g. man sbatch); as well as numerous other online resources.
NVIDIA Performance Libraries (NVPL)
The NVIDIA Performance Libraries (NVPL) are a collection of high-performance mathematical libraries optimized for the NVIDIA Grace Armv9.0 architecture. These CPU-only libraries implement standard C and Fortran mathematical APIs, allowing HPC applications to achieve maximum performance on the Grace platform. The collection includes:
NVIDIA Documentation
- NVPL BLAS Documentation
- NVPL FFT Documentation
- NVPL LAPACK Documentation
- NVPL RAND Documentation
- NVPL ScaLAPACK Documentation
- NVPL Sparse Documentation
- NVPL Tensor Documentation
Consult the above documents for the details of each library and its API for building and linking codes. The libraries work with both NVHPC and GCC compilers, as well as their corresponding MPI wrappers. All libraries support the OpenMP runtime libraries. Refer to the individual libraries' documentation for details and API extensions supporting nested parallelism.
Compiler Examples
Example: A compile/link process on Vista may look like the following. This links the code against the NVPL FFT library using the GNU g++ compiler. The features in NVPL FFT are still evolving; please pay close attention to, and follow, the latest NVPL FFT documentation.
$ module load nvpl
$ g++ mycode.cpp -I${TACC_NVPL_DIR}/include \
      -L${TACC_NVPL_DIR}/lib \
      -lnvpl_fftw \
      -o myprogram
Example: This links the code against the NVPL OpenMP-threaded BLAS, LAPACK, and ScaLAPACK libraries with the 32-bit integer interface, using the NVHPC mpif90 wrapper. The cluster (BLACS) support in the current NVPL release shipped with NVHPC SDK 24.5 includes openmpi3, openmpi4, openmpi5, and mpich; choose the one that matches the MPI version used by mpif90.
$ module load nvpl
$ mpif90 -mp -I${TACC_NVPL_DIR}/include \
      -L${TACC_NVPL_DIR}/lib \
      -lnvpl_blas_lp64_gomp \
      -lnvpl_lapack_lp64_gomp \
      -lnvpl_blacs_lp64_openmpi5 \
      -lnvpl_scalapack_lp64 \
      mycode.f90
When linking with the NVHPC compiler, the convenience flags -Mnvpl and -Mscalapack are provided. As the behavior of these flags may change during active development, please refer to the latest NVHPC compiler guide for more details.
Using NVPL as BLAS/LAPACK with Third-Party Software
When your third-party software requires BLAS or LAPACK, we recommend that you use NVPL to supply this functionality. Replace generic instructions that include link options like -lblas or -llapack with the NVPL approach described above. Generally there is no need to download and install alternatives like OpenBLAS. However, since NVPL is a relatively new math library suite targeting aarch64, its interoperability with other software requiring a special 32- or 64-bit integer interface, or particular OpenMP runtime support, is not yet fully tested. If you have issues with NVPL and need alternative BLAS or LAPACK libraries, OpenBLAS-based versions are available as part of the NVHPC compiler libraries.
Controlling Threading in NVPL
All NVPL libraries support both the GCC and NVHPC OpenMP runtime libraries. See the individual libraries' documentation for details and API extensions supporting nested parallelism. NVPL libraries do not explicitly link any particular OpenMP runtime; they rely on runtime loading of the OpenMP library as determined by the application and environment. Applications linked to NVPL should always use at runtime the same OpenMP distribution the application was compiled with. Mixing OpenMP distributions between compile time and runtime may result in anomalous performance. Please note that the default library linked with the -Mnvpl flag is single-threaded as of NVHPC 24.5; the -mp flag is needed to link with the threaded version.
NVIDIA HPC modules provide a libgomp.so symlink to libnvomp.so. This symlink will be on LD_LIBRARY_PATH if NVHPC environment modules are loaded. Use ldd to ensure that applications built with GCC do not accidentally load the libgomp.so symlink from the HPC SDK via LD_LIBRARY_PATH. Use libnvomp.so if and only if the application was built with NVHPC compilers.
$OMP_NUM_THREADS defaults to 1 on TACC systems. If you use the default value you will get no thread-based parallelism from NVPL. Set the environment variable $OMP_NUM_THREADS to control the number of threads for optimal performance.
Using NVPL with MATLAB, Python and R
TACC's MATLAB, Python and R modules need BLAS, LAPACK, and other math libraries for performance. How to use NVPL with them is under investigation; we will update this section as information becomes available.
Help Desk
Important
Submit a help desk ticket at any time via the TACC User Portal. Be sure to include "Vista" in the Resource field.
TACC Consulting operates from 8am to 5pm CST, Monday through Friday, except for holidays. Help the consulting staff help you by following these best practices when submitting tickets.
- Do your homework before submitting a help desk ticket. What do the user guide and other documentation say? Search the internet for key phrases in your error logs; that's probably what the consultants answering your ticket are going to do. What have you changed since the last time your job succeeded?
- Describe your issue as precisely and completely as you can: what you did, what happened, verbatim error messages, other meaningful output.
Tip
When appropriate, include as much meta-information about your job and workflow as possible including:
- directory containing your build and/or job script
- all modules loaded
- relevant job IDs
- any recent changes in your workflow that could affect or explain the behavior you're observing.
- Subscribe to Vista User News. This is the best way to keep abreast of maintenance schedules, system outages, and other general interest items.
- Have realistic expectations. Consultants can address system issues and answer questions about Vista. But they can't teach parallel programming in a ticket, and may know nothing about the package you downloaded. They may offer general advice that will help you build, debug, optimize, or modify your code, but you shouldn't expect them to do these things for you.
- Be patient. It may take a business day for a consultant to get back to you, especially if your issue is complex. It might take an exchange or two before you and the consultant are on the same page. If the admins disable your account, it's not punitive. When the file system is in danger of crashing, or a login node hangs, they don't have time to notify you before taking action.