Vista User Guide

Last update: September 9, 2024

Notices

  • Subscribe to Vista User News. Stay up-to-date on Vista's status, scheduled maintenances and other notifications.

Introduction

Vista is funded by the National Science Foundation (NSF) via a supplement to the Computing for the Endless Frontier award, Award Abstract #1818253. Vista expands the Frontera project's support of Machine Learning and GPU-enabled applications with a system based on the NVIDIA Grace Hopper architecture, and provides a path to more power-efficient computing with NVIDIA's Grace ARM CPUs.

The Grace Hopper Superchip introduces a novel architecture that combines the GPU and CPU in one module. This technology removes the bottleneck of the PCIe bus by connecting the CPU and GPU directly with NVLink and exposing the CPU and GPU memory spaces as separate NUMA nodes, allowing the programmer to easily access CPU or GPU memory from either device. This greatly reduces the programming complexity of GPU programs while providing increased bandwidth and reduced latency between CPU and GPU.

The Grace Superchip connects two 72-core Grace CPUs using the same NVLink technology used in the Grace Hopper Superchip to provide 144 ARM cores in 2 NUMA nodes. Using LPDDR memory, each Superchip offers over 850 GiB/s of memory bandwidth and up to 7 TFlops of double precision performance.

Allocations

Coming soon.

System Architecture

Grace Grace Compute Nodes

Vista hosts 256 "Grace Grace" (GG) nodes with 144 cores each. Each GG node provides a 1.5 - 2x performance increase over Stampede3's CLX nodes due to its increased core count and memory bandwidth. Each GG node provides over 7 TFlops of double precision performance and 850 GiB/s of memory bandwidth.

Table 1. GG Specifications

Specification Value
CPU: NVIDIA Grace CPU Superchip
Total cores per node: 144 cores on two sockets (2 x 72 cores)
Hardware threads per core: 1
Hardware threads per node: 2x72 = 144
Clock rate: 3.4 GHz
Memory: 237 GB LPDDR
Cache: 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket.
Each socket can cache up to 186 MB (sum of L2 and L3 capacity).
Local storage: 286 GB /tmp partition

Grace Hopper Compute Nodes

Vista hosts 600 Grace Hopper (GH) nodes. Each GH node has one H100 GPU with 96 GB of HBM3 memory and one Grace CPU with 116 GB of LPDDR memory. The GH node provides 34 TFlops of FP64 performance and 1979 TFlops of FP16 performance for ML workflows on the H100 chip.

Table 2. GH Specifications

Specification Value
GPU: NVIDIA H100 GPU
GPU Memory: 96 GB HBM 3
CPU: NVIDIA Grace CPU
Total cores per node: 72 cores on one socket
Hardware threads per core: 1
Hardware threads per node: 1x72 = 72
Clock rate: 3.1 GHz
Memory: 116 GB LPDDR
Cache: 64 KB L1 data cache per core; 1MB L2 per core; 114 MB L3 per socket.
Each socket can cache up to 186 MB (sum of L2 and L3 capacity).
Local storage: 286 GB /tmp partition

Login Nodes

The Vista login nodes are NVIDIA Grace Grace (GG) nodes, each with 144 cores on two sockets (72 cores/socket) with 237 GB of LPDDR.

Network

The interconnect is based on Mellanox NDR technology with full NDR (400 Gb/s) connectivity between the switches and the GH GPU nodes and with NDR200 (200 Gb/s) connectivity to the GG compute nodes. A fat tree topology connects the compute nodes and the GPU nodes within separate trees. Both sets of nodes are connected with NDR to the $HOME and $SCRATCH file systems.

File Systems

Vista will use a shared VAST file system for the $HOME and $SCRATCH directories.

Important

Vista's $HOME and $SCRATCH file systems are NOT Lustre file systems and do not support setting a stripe count or stripe size.

As with Stampede3, the $WORK file system will also be mounted. Unlike $HOME and $SCRATCH, the $WORK file system is a Lustre file system and supports Lustre's lfs commands. All three file systems, $HOME, $SCRATCH, and $WORK are available from all Vista nodes. The /tmp partition is also available to users but is local to each node. The $WORK file system is available on most other TACC HPC systems as well.
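For example, you can check your Stockyard ($WORK) usage and quota from a login node with Lustre's lfs quota command; the username below is a placeholder:

login1$ lfs quota -u bjones $WORK    # report bjones' usage and limits on the Lustre $WORK file system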

Table 3. File Systems

File System Type Quota Key Features
$HOME VAST 23 GB, 500,000 files
    Not intended for parallel or high-intensity file operations.
    Backed up regularly.
$WORK Lustre 1 TB, 3,000,000 files across all TACC systems
    Not intended for parallel or high-intensity file operations.
    See the Stockyard system description for more information.
    Not backed up.
$SCRATCH VAST no quota
    Overall capacity ~10 PB.
    Not backed up.
    Files are subject to purge if access time* is more than 10 days old. See TACC's Scratch File System Purge Policy below.

Scratch File System Purge Policy

Warning

The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed* in ten days are subject to purge. Deliberately modifying file access time (using any method, tool, or program) for the purpose of circumventing purge policies is prohibited.

*The operating system updates a file's access time when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing on a compute node does update the access time. This approach helps us distinguish between routine management tasks (e.g. tar, scp) and production use. Use the command ls -ul to view access times.

Running Jobs

Slurm Partitions (Queues)

Vista's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.

Important

Queue limits are subject to change without notice.
TACC Staff will occasionally adjust the QOS settings in order to ensure fair scheduling for the entire user community.
Use TACC's qlimits utility to see the latest queue configurations.

Table 4. Production Queues

Queue Name   Node Type      Max Nodes per Job (assoc'd cores)   Max Duration   Max Jobs in Queue   Charge Rate (per node-hour)
gg           Grace/Grace    32 nodes (4608 cores)               48 hrs         20                  0.33 SU
gh           Grace/Hopper   64 nodes (4608 cores/64 GPUs)       48 hrs         20                  1 SU
gh-dev       Grace/Hopper   8 nodes (576 cores)                 2 hrs          8                   1 SU

Job Accounting

Like all TACC systems, Vista's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). For any given job, the total cost in SUs is the use of one compute node for one hour of wall clock time plus any charges or discounts for the use of specialized queues, e.g. Stampede3's pvc queue, Lonestar6's gpu-a100 queue, and Frontera's flex queue. The queue charge rates are determined by the supply and demand for that particular queue or type of node used and are subject to change.

Vista SUs billed = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)

The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. If your job finishes early and exits properly, Slurm will release the nodes back into the pool of available nodes. Your job will only be charged for as long as you are using the nodes.
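For example, using the charge rates in Table 4, a hypothetical job that runs on 10 GG nodes in the gg queue for 12 wall clock hours would be charged:

10 nodes x 12 hours x 0.33 SU per node-hour = 39.6 SUs

The same job in the gh queue (charge rate 1 SU per node-hour) would be charged 120 SUs.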

Note

TACC does not implement node-sharing on any compute resource. Each Vista node can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all the node's cores whether or not all cores are used.

Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.

Tip

To display a summary of your TACC project balances and disk quotas at any time, execute:

login1$ /usr/local/etc/taccinfo        # Generally more current than balances displayed on the portals.

Submitting Batch Jobs with sbatch

Use Slurm's sbatch command to submit a batch job to one of the Vista queues:

login1$ sbatch myjobscript

Where myjobscript is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.

In your job script you (1) use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs); and then (2) use shell commands to specify what work you're going to do once your job begins. There are many possibilities: you might elect to launch a single application, or you might want to accomplish several steps in a workflow. You may even choose to launch more than one application at the same time. The details will vary, and there are many possibilities. But your own job script will probably include at least one launch line that is a variation of one of the examples described here.

Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your application(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm --export option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.

Table 8 below describes some of the most common sbatch command options. Slurm directives begin with #SBATCH; most have a short form (e.g. -N) and a long form (e.g. --nodes). You can pass options to sbatch using either the command line or the job script; most users find that the job script is the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases #!/bin/bash or #!/bin/csh is the right choice. Avoid #!/bin/sh (its startup behavior can lead to subtle problems on Vista), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Vista.

By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the -o option. To save stdout (standard out) and stderr (standard error) to separate files, specify both -o and -e options.
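The following is a minimal sketch of a batch script for an MPI job in the gg queue; the project name, resource requests, and executable are placeholders to replace with your own values.

#!/bin/bash
#SBATCH -J myjob              # job name
#SBATCH -o myjob.o%j          # standard output file (%j expands to the numerical job ID)
#SBATCH -e myjob.e%j          # standard error file
#SBATCH -p gg                 # queue (partition)
#SBATCH -N 2                  # number of nodes requested
#SBATCH -n 288                # total MPI tasks (144 per GG node)
#SBATCH -t 02:00:00           # run time (hh:mm:ss)
#SBATCH -A myproject          # project/allocation name (placeholder)

module list                   # record the loaded modules in the job output
pwd
date

ibrun ./myprogram             # launch the MPI application across all allocated tasks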

Tip

The maximum runtime for any individual job is 48 hours. However, if you have good checkpointing implemented, you can easily chain jobs such that the outputs of one job are the inputs of the next, effectively running indefinitely for as long as needed. See Slurm's -d option.
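For example, assuming a hypothetical job ID of 123456 returned by an earlier submission, you could submit a dependent job like this:

login1$ sbatch -d afterok:123456 myjobscript    # starts only after job 123456 finishes successfully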

Table 8. Common sbatch Options

Option Argument Comments
-A projectid Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects.
-a or --array tasklist Vista supports Slurm job arrays. See the Slurm documentation on job arrays for more information.
-d afterok:jobid Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes.
--export N/A Avoid this option on Vista. Using it is rarely necessary and can interfere with the way the system propagates your environment.
--gres N/A TACC does not support this option.
--gpus-per-task N/A TACC does not support this option.
-p queue_name Submits to the queue (partition) designated by queue_name.
-J job_name Job name.
-N total_nodes Required. Define the resources you need by specifying either: (1) -N and -n; or (2) -N and --ntasks-per-node.
-n total_tasks Total MPI tasks in this job. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as -N.
--ntasks-per-node or --tasks-per-node tasks_per_node MPI tasks per node. See -N above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1.
-t hh:mm:ss Required. Wall clock time for job.
--mail-type= begin, end, fail, or all Specify when user notifications are to be sent (one option per line).
--mail-user= email_address Specify the email address to use for notifications. Use with the --mail-type= flag above.
-o output_file Direct job standard output to output_file (without the -e option, error output also goes to this file).
-e error_file Direct job error output to error_file.
--mem N/A Not available. If you attempt to use this option, the scheduler will not accept your job.

Launching Applications

The primary purpose of your job script is to launch your research application. How you do so depends on several factors, especially (1) the type of application (e.g. MPI, OpenMP, serial), and (2) what you're trying to accomplish (e.g. launch a single instance, complete several steps in a workflow, run several applications simultaneously within the same job). While there are many possibilities, your own job script will probably include a launch line that is a variation of one of the examples described in this section:

Note that the following examples demonstrate launching within a Slurm job script or an idev session. Do not launch jobs on the login nodes.

One Serial Application

To launch a serial application, simply call the executable. Specify the path to the executable in either the $PATH environment variable or in the call to the executable itself:

myprogram                      # executable in a directory listed in $PATH
$WORK/apps/myprov/myprogram    # explicit full path to executable
./myprogram                    # executable in current directory
./myprogram -m -k 6 input1     # executable with notional input options

One Multi-Threaded Application

Launch a threaded application the same way. Be sure to specify the number of threads. Note that the default OpenMP thread count is 1.

export OMP_NUM_THREADS=144     # 144 total OpenMP threads (1 per GG core)
./myprogram

One MPI Application

To launch an MPI application, use the TACC-specific MPI launcher ibrun, which is a Vista-aware replacement for generic MPI launchers like mpirun and mpiexec. In most cases the only arguments you need are the name of your executable followed by any arguments your executable needs. When you call ibrun without other arguments, your Slurm #SBATCH directives will determine the number of ranks (MPI tasks) and number of nodes on which your program runs.

#SBATCH -N 4
#SBATCH -n 576
ibrun ./myprogram              # ibrun uses the #SBATCH directives to properly allocate nodes and tasks

To use ibrun interactively, say within an idev session, you can specify:

login1$ idev -N 2 -n 80 -p gg
c123-456$ ibrun ./myprogram    # ibrun uses idev's arguments to properly allocate nodes and tasks

One Hybrid (MPI+Threads) Application

When launching a single application you generally don't need to worry about affinity: both OpenMPI and MVAPICH2 will distribute and pin tasks and threads in a sensible way.

export OMP_NUM_THREADS=8    # 8 OpenMP threads per MPI rank
ibrun ./myprogram           # use ibrun instead of mpirun or mpiexec

As a practical guideline, the product of $OMP_NUM_THREADS and the maximum number of MPI processes per node should not be greater than the total number of cores available per node (GG nodes have 144 cores, GH nodes have 72 cores).
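For example, the following sketch runs a hybrid job on two GG nodes with 18 MPI tasks per node and 8 OpenMP threads per task, so each node uses 18 x 8 = 144 threads, one per core (the executable is a placeholder):

#SBATCH -N 2                # 2 GG nodes
#SBATCH -n 36               # 36 total MPI tasks (18 per node)

export OMP_NUM_THREADS=8    # 8 threads per task: 18 tasks/node x 8 threads = 144 cores/node
ibrun ./myprogram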

MPI Applications - Consecutive

To run one MPI application after another (or any sequence of commands one at a time), simply list them in your job script in the order in which you'd like them to execute. When one application/command completes, the next one will begin.

module load git
module list
./preprocess.sh
ibrun ./myprogram input1    # runs after preprocess.sh completes
ibrun ./myprogram input2    # runs after previous MPI app completes

MPI Application - Concurrent

Coming soon.

More than One OpenMP Application Running Concurrently

You can also run more than one OpenMP application simultaneously on a single node, but you will need to distribute and pin tasks appropriately. In the example below, numactl -C specifies the virtual CPUs (hardware threads) on which to run. According to the numbering scheme for GG cores, CPU (hardware thread) numbers 0-143 are spread across the 144 cores, 1 thread per core.

export OMP_NUM_THREADS=2
numactl -C 0-1 ./myprogram inputfile1 &  # HW threads (hence cores) 0-1. Note ampersand.
numactl -C 2-3 ./myprogram inputfile2 &  # HW threads (hence cores) 2-3. Note ampersand.

wait

Interactive Sessions

Interactive Sessions with idev and srun

TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. To launch a thirty-minute session on a single node in the development queue, simply execute:

login1$ idev

You'll then see output that includes the following excerpts:

...
-----------------------------------------------------------------
      Welcome to the Vista Supercomputer          
-----------------------------------------------------------------
...

-> After your `idev` job begins to run, a command prompt will appear,
-> and you can begin your interactive development session. 
-> We will report the job status every 4 seconds: (PD=pending, R=running).

->job status:  PD
->job status:  PD
...
c449-001$

The job status messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c449-001). If this is the first time you launch idev, the prompts may invite you to choose a default project and a default number of tasks per node for future idev sessions.

For command line options and other information, execute idev --help. It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:

login1$ idev -p gg -N 2 -n 8 -m 150 # gg queue, 2 nodes, 8 total tasks, 150 minutes

For more information see the idev documentation.

Interactive Sessions using ssh

If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: login to a compute node that you own, execute top, then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.

There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:

login1$ squeue -u bjones
 JOBID       PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
858811     skx-dev     idv46796   bjones  R       0:39      1 c448-004
login1$ ssh c448-004
...
c448-004$

Slurm Environment Variables

Be sure to distinguish between internal Slurm replacement symbols (e.g. %j described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute env | grep SLURM from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like %j only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an #SBATCH directive.

Warning

For example, the following directive will not work the way you might think:

#SBATCH -o myMPI.o${SLURM_JOB_ID}   # incorrect

Tip

Instead, use the following directive:

#SBATCH -o myMPI.o%j     # "%j" expands to your job's numerical job ID

Similarly, you cannot use paths like $WORK or $SCRATCH in an #SBATCH directive.
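Putting these rules together, a job script fragment might look like the following sketch: Slurm replacement symbols appear only in the #SBATCH directives, while Slurm-defined environment variables and paths like $SCRATCH appear only in the shell commands (the run directory name is illustrative).

#SBATCH -o myMPI.o%j             # OK: %j is a Slurm replacement symbol in a directive

cd $SCRATCH                      # OK: environment variables in the shell portion
mkdir -p run_${SLURM_JOB_ID}     # OK: Slurm-defined environment variable
cd run_${SLURM_JOB_ID}
ibrun ./myprogram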

For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself (man slurm) and its individual commands (e.g. man sbatch); as well as numerous other online resources.

NVIDIA Performance Libraries (NVPL)

The NVIDIA Performance Libraries (NVPL) are a collection of high-performance mathematical libraries optimized for the NVIDIA Grace Armv9.0 architecture. These CPU-only libraries provide standard C and Fortran mathematical APIs, allowing HPC applications to achieve maximum performance on the Grace platform. The collection includes, among others, the BLAS, LAPACK, ScaLAPACK, and FFT libraries used in the examples below.

NVIDIA Documentation

Consult NVIDIA's documentation for the details of each library and its API for building and linking codes. The libraries work with both the NVHPC and GCC compilers, as well as their corresponding MPI wrappers. All libraries support the OpenMP runtime libraries. Refer to each library's documentation for details and API extensions supporting nested parallelism.

Compiler Examples

Example: A compile/link process on Vista may look like the following. This links the code against the NVPL FFT library using the GNU g++ compiler. The features in NVPL FFT are still evolving; pay close attention to and follow the latest NVPL FFT documentation.

$ module load nvpl
$ g++ mycode.cpp -I${TACC_NVPL_DIR}/include \
        -L${TACC_NVPL_DIR}/lib \
        -lnvpl_fftw \
        -o myprogram

Example: This links the code against the OpenMP-threaded NVPL BLAS, LAPACK, and ScaLAPACK libraries with the 32-bit integer (LP64) interface, using the NVHPC mpif90 wrapper. The BLACS cluster support in the current NVPL release (from NVHPC SDK 24.5) covers OpenMPI 3, 4, and 5 and MPICH; choose the library that matches the MPI version behind mpif90.

$ module load nvpl            
$ mpif90 -mp -I${TACC_NVPL_DIR}/include \
       -L${TACC_NVPL_DIR}/lib \
       -lnvpl_blas_lp64_gomp \
       -lnvpl_lapack_lp64_gomp \
       -lnvpl_blacs_lp64_openmpi5 \
       -lnvpl_scalapack_lp64 \
       mycode.f90

When linking with the NVHPC compilers, the convenience flags -Mnvpl and -Mscalapack are provided. As the behavior of these flags may change during active development, please refer to the latest NVHPC compiler guide for more details.
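For instance, builds using these convenience flags might look like the following sketch; verify the exact flag behavior against the current NVHPC compiler guide.

$ nvfortran -mp -Mnvpl mycode.f90 -o myprogram       # threaded NVPL BLAS/LAPACK via the convenience flag
$ mpif90 -mp -Mscalapack mycode.f90 -o myprogram     # ScaLAPACK through the NVHPC MPI wrapper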

Using NVPL as BLAS/LAPACK with Third-Party Software

When your third-party software requires BLAS or LAPACK, we recommend that you use NVPL to supply this functionality. Replace generic instructions that include link options like -lblas or -llapack with the NVPL approach described above. Generally there is no need to download and install alternatives like OpenBLAS. However, since NVPL is a relatively new math library suite targeting aarch64, its interoperability with other software that requires a particular 32- or 64-bit integer interface or specific OpenMP runtime support is not yet fully tested. If you run into issues with NVPL and need alternative BLAS or LAPACK libraries, OpenBLAS-based versions are available as part of the NVHPC compiler libraries.
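For instance, a build line for a third-party code that originally used -lblas -llapack might instead look like the following sketch (app.cpp is a hypothetical source file):

$ module load nvpl
$ g++ -fopenmp app.cpp -L${TACC_NVPL_DIR}/lib \
      -lnvpl_blas_lp64_gomp -lnvpl_lapack_lp64_gomp \
      -o app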

Controlling Threading in NVPL

All NVPL libraries support both the GCC and NVHPC OpenMP runtime libraries. See each library's documentation for details and API extensions supporting nested parallelism. NVPL libraries do not explicitly link any particular OpenMP runtime; they rely on runtime loading of the OpenMP library as determined by the application and environment. Applications linked to NVPL should always use at runtime the same OpenMP distribution the application was compiled with; mixing OpenMP distributions between compile time and runtime may result in anomalous performance. Note that the default library linked with the -Mnvpl flag is single-threaded as of NVHPC 24.5; the -mp flag is needed to link the threaded version.

NVIDIA HPC modules provide a libgomp.so symlink to libnvomp.so. This symlink will be on LD_LIBRARY_PATH if NVHPC environment modules are loaded. Use ldd to ensure that applications built with GCC do not accidentally load the libgomp.so symlink from the HPC SDK via LD_LIBRARY_PATH. Use libnvomp.so if and only if the application was built with the NVHPC compilers.
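A quick way to check which OpenMP runtime an executable will load is to inspect it with ldd, for example:

$ ldd ./myprogram | grep -E 'gomp|nvomp'    # confirm the expected OpenMP runtime library is resolved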

$OMP_NUM_THREADS defaults to 1 on TACC systems. If you use the default value you will get no thread-based parallelism from NVPL. Set the environment variable OMP_NUM_THREADS to control the number of threads for optimal performance.

Using NVPL with MATLAB, Python, and R

TACC's MATLAB, Python, and R modules need BLAS, LAPACK, and other math libraries for performance. How to use NVPL with them is under investigation; we will update this guide as more information becomes available.

Help Desk

Important

Submit a help desk ticket at any time via the TACC User Portal. Be sure to include "Vista" in the Resource field.

TACC Consulting operates from 8am to 5pm CST, Monday through Friday, except for holidays. Help the consulting staff help you by following these best practices when submitting tickets.

  • Do your homework before submitting a help desk ticket. What do the user guide and other documentation say? Search the internet for key phrases in your error logs; that's probably what the consultants answering your ticket are going to do. What have you changed since the last time your job succeeded?

  • Describe your issue as precisely and completely as you can: what you did, what happened, verbatim error messages, other meaningful output.

Tip

When appropriate, include as much meta-information about your job and workflow as possible including:

  • directory containing your build and/or job script
  • all modules loaded
  • relevant job IDs
  • any recent changes in your workflow that could affect or explain the behavior you're observing.

  • Subscribe to Vista User News. This is the best way to keep abreast of maintenance schedules, system outages, and other general interest items.

  • Have realistic expectations. Consultants can address system issues and answer questions about Vista. But they can't teach parallel programming in a ticket, and may know nothing about the package you downloaded. They may offer general advice that will help you build, debug, optimize, or modify your code, but you shouldn't expect them to do these things for you.

  • Be patient. It may take a business day for a consultant to get back to you, especially if your issue is complex. It might take an exchange or two before you and the consultant are on the same page. If the admins disable your account, it's not punitive. When the file system is in danger of crashing, or a login node hangs, they don't have time to notify you before taking action.