How to Use NVIDIA NIMs with Apptainer on HPC Clusters

NVIDIA NIMs (NVIDIA Inference Microservices) are pre-built, GPU-optimized containers that bundle AI models with their inference engines and APIs, making the deployment of advanced AI models simple, fast, and reproducible.

Many NIMs now offer a “Run Anywhere” option, allowing you to pull them directly from NVIDIA’s registry as Docker containers and start them instantly.

However, on HPC clusters (like DAIS or Raven), Docker is typically not permitted because it requires root privileges. Instead, we use Apptainer (formerly Singularity), a container runtime designed for HPC systems that runs without elevated privileges.

Inspired by the Protein Binder Design Blueprint, this guide shows how to run two NVIDIA NIMs - RFDiffusion and ProteinMPNN - simultaneously within a single Slurm job on DAIS.

Step 1: Get your NVIDIA API key

  1. Go to build.nvidia.com and log in.
  2. Click Profile → API Keys → Generate API Key.
  3. Copy the generated token - you’ll need it in the next steps.

Step 2: Set up your environment on Dais

  • Load Apptainer and log in to NVIDIA’s registry using your API key:

    module load apptainer/1.4.1
    export NGC_API_KEY=<PASTE_API_KEY_HERE>
    apptainer registry login --username '$oauthtoken' --password "$NGC_API_KEY" docker://nvcr.io
    
  • Create a working directory (we’ll use the /dais/fs/scratch filesystem for more space):

    mkdir -p /dais/fs/scratch/$USER/Protein_Binder_Design_Pipeline
    cd /dais/fs/scratch/$USER/Protein_Binder_Design_Pipeline
    
  • Pull both NIM containers directly from NVIDIA’s registry:

    apptainer pull rfdiffusion.sif docker://nvcr.io/nim/ipd/rfdiffusion:latest
    apptainer pull proteinmpnn.sif docker://nvcr.io/nim/ipd/proteinmpnn:latest
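
Before submitting a job, it can help to confirm that both images actually landed in the working directory. A minimal, illustrative check in Python (check_images is a hypothetical helper; the filenames match the pull commands above):

```python
from pathlib import Path

def check_images(workdir=".", images=("rfdiffusion.sif", "proteinmpnn.sif")):
    """Return the NIM images that are missing from workdir."""
    return [img for img in images if not (Path(workdir) / img).exists()]

missing = check_images()
if missing:
    print("Re-run 'apptainer pull' for:", ", ".join(missing))
```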
    

Step 3: Running two NIMs in one Slurm job

Now we’ll start both containers on separate GPUs and run a Python script that communicates with them.
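
Each container listens on its own HTTP port, derived from the Slurm job ID so that two jobs scheduled on the same node do not collide. The same arithmetic as in the batch script, sketched in Python for illustration (derive_ports is a hypothetical helper, not part of any NIM API):

```python
def derive_ports(job_id: int) -> dict:
    """Mirror the batch script's port scheme:
    MASTER_PORT = 10000 + (job_id % 1000), then one port per NIM."""
    master = 10000 + (job_id % 1000)
    return {"rfdiffusion": master + 1, "proteinmpnn": master + 2}

print(derive_ports(123456))  # {'rfdiffusion': 10457, 'proteinmpnn': 10458}
```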

  1. Let’s create the Slurm submission script to launch both NIM containers and run the Python client. Replace <PASTE_API_KEY_HERE> with your actual API key, and create the log directory first (mkdir -p logs) - Slurm will not create it for you:
#!/bin/bash -l
#SBATCH -o logs/%j.log
#SBATCH -e logs/%j.log
#SBATCH -D ./
#SBATCH -J nims
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=gpu1
#SBATCH --gres=gpu:h200:2
#SBATCH --mem=500000
#SBATCH --cpus-per-task=24
#SBATCH --time=08:15:00
  
##################################################################
#                         JOB CONFIGURATION                      #
##################################################################
  
# Exit immediately on any error or undefined variable; fail if any pipe command fails
set -euo pipefail
  
# Ensure that all background jobs are terminated if anything fails or job is cancelled
cleanup() {
    echo "Force killing containers and any leftovers..."

    # --- Kill main launch PIDs if they exist ---
    for pid_var in rfdiffusion_pid proteinmpnn_pid; do
        pid="${!pid_var:-}"
        if [[ -n "$pid" ]]; then
            kill -KILL "$pid" 2>/dev/null || true
            pkill -P "$pid" 2>/dev/null || true
        fi
    done

    # --- Kill remaining processes by image name ---
    pkill -f rfdiffusion.sif 2>/dev/null || true
    pkill -f proteinmpnn.sif 2>/dev/null || true

    # --- Kill remaining apptainer runtimes (rare, but safest) ---
    pkill -f apptainer 2>/dev/null || true
}

trap cleanup EXIT ERR INT TERM
  
##################################################################
#                        ENVIRONMENT SETUP                      #
##################################################################
  
module purge
module load apptainer/1.4.1
cd /dais/fs/scratch/$USER/Protein_Binder_Design_Pipeline
  
# --- Authentication token for NVIDIA NIM ---
export NGC_API_KEY="<PASTE_API_KEY_HERE>"

  
# --- Directories for temporary files and caching ---

export LOCAL_CACHE="/dais/fs/scratch/$USER/Protein_Binder_Design_Pipeline/cache"
export LOCAL_NIM_CACHE="$LOCAL_CACHE/nim"
  
mkdir -p "$LOCAL_CACHE" "$LOCAL_NIM_CACHE"
  
##################################################################
#                        START BACKGROUND CONTAINERS            #
##################################################################
# Generate unique port offsets based on job ID
export MASTER_PORT=$((10000 + (${SLURM_JOBID:-0} % 1000)))
  
# --- RFDiffusion server (GPU 1) ---
export rfdiffusion_home="$LOCAL_CACHE/rfdiffusion_home"
mkdir -p "$rfdiffusion_home"
  
export RFDIFFUSION_PORT=$((1 + MASTER_PORT))
  
echo "Starting RFDiffusion container..."
apptainer run --nv \
  --home "$rfdiffusion_home" \
  --bind "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  --env NGC_API_KEY="$NGC_API_KEY" \
  --env NIM_HTTP_API_PORT=$RFDIFFUSION_PORT \
  --env CUDA_VISIBLE_DEVICES=1 \
  --compat rfdiffusion.sif &
  
rfdiffusion_pid=$!
echo "   -> RFDiffusion started (PID: $rfdiffusion_pid)"
  
# --- ProteinMPNN server (GPU 0) ---
export proteinmpnn_home="$LOCAL_CACHE/proteinmpnn_home"
mkdir -p "$proteinmpnn_home"
  
export PROTEINMPNN_PORT=$((2 + MASTER_PORT))
echo "Starting ProteinMPNN container..."
apptainer run --nv \
  --home "$proteinmpnn_home" \
  --bind "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  --env NGC_API_KEY="$NGC_API_KEY" \
  --env NIM_HTTP_API_PORT=$PROTEINMPNN_PORT \
  --env CUDA_VISIBLE_DEVICES=0 \
  --compat proteinmpnn.sif &
  
proteinmpnn_pid=$!
echo "   -> ProteinMPNN started (PID: $proteinmpnn_pid)"
  
##################################################################
#                        SERVER HEALTH CHECKS                   #
##################################################################
  
# Function to wait for a service to become ready
wait_for_service() {
    local url="$1"
    local name="$2"
    local pid="$3"
    local timeout=600
    local check_interval=2
    local elapsed=0

    echo "Waiting for $name at $url ..."
    until curl -sf "$url" > /dev/null 2>&1; do
        sleep $check_interval
        elapsed=$((elapsed + check_interval))

        # Check if corresponding container process is still alive
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "ERROR: $name container (PID $pid) has terminated!"
            exit 1
        fi

        if [ $elapsed -ge $timeout ]; then
            echo "TIMEOUT: $name did not become ready."
            exit 1
        fi
    done

    echo "$name is ready!"
}
  
# Check both servers
wait_for_service http://127.0.0.1:${RFDIFFUSION_PORT}/v1/health/ready "RFDiffusion" "$rfdiffusion_pid"
wait_for_service http://127.0.0.1:${PROTEINMPNN_PORT}/v1/health/ready "ProteinMPNN" "$proteinmpnn_pid"
  
##################################################################
#                        MAIN TASK EXECUTION                     #
##################################################################
  
echo "All servers are up. Launching main Python script..."
  
module load python-waterboa/2024.06
python main.py
  
##################################################################
#                        CLEANUP (handled by trap)              #
##################################################################
  
echo "Job finished successfully. Containers will now shut down."
  2. Create the Python client (main.py).
     This script queries both NIMs via their REST APIs to generate protein backbones and predict sequences.
     (Examples were adapted from the NVIDIA BioNeMo Blueprints GitHub repository, as well as the RFDiffusion and ProteinMPNN deploy guides.):

    
    import json
    import os
    import requests
    from enum import StrEnum, Enum
    from typing import Tuple, Dict, Any
    from pathlib import Path
    
    NVIDIA_API_KEY = os.getenv("NGC_API_KEY")
    
    HEADERS = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {NVIDIA_API_KEY}",
        "poll-seconds": "900"
        }
    
    NIM_HOST_URL_BASE = "http://localhost"
    
    class NIM_PORTS(Enum):
        RFDIFFUSION_PORT = int(os.environ["RFDIFFUSION_PORT"])
        PROTEINMPNN_PORT = int(os.environ["PROTEINMPNN_PORT"])
    
    
    class NIM_ENDPOINTS(StrEnum):
        RFDIFFUSION =  "biology/ipd/rfdiffusion/generate"
        PROTEINMPNN =  "biology/ipd/proteinmpnn/predict"
        
        
    def check_nim_readiness(nim_port: int,
                            base_url: str = NIM_HOST_URL_BASE,
                            endpoint: str = "v1/health/ready") -> bool:
        """
        Return true if a NIM is ready.
        """
        try:
            response = requests.get(f"{base_url}:{nim_port}/{endpoint}")
            d = response.json()
            if "status" in d:
                if d["status"] == "ready":
                    return True
            return False
        except Exception as e:
            print(e)
            return False
        
    def get_reduced_pdb(pdb_id: str = "1R42.pdb") -> str:
        pdb = Path(pdb_id)
        if not pdb.exists():
            pdb.write_text(requests.get(f"https://files.rcsb.org/download/{pdb}").text)
        lines = filter(lambda line: line.startswith("ATOM"), pdb.read_text().split("\n"))
        return "\n".join(list(lines))
    
    def query_nim(
                payload: Dict[str, Any],
                nim_endpoint: str,
                headers: Dict[str, str] = HEADERS,
                base_url: str = NIM_HOST_URL_BASE,
                nim_port: int = 8080,
                echo: bool = False) -> Tuple[int, Dict]:
        function_url = f"{base_url}:{nim_port}/{nim_endpoint}"
        if echo:
            print("*"*80)
            print(f"\tURL: {function_url}")
            print(f"\tPayload: {payload}")
            print("*"*80)
        response = requests.post(function_url,
                                json=payload,
                                headers=headers)
        if response.status_code == 200:
            return response.status_code, response.json()
        else:
            raise Exception(f"Error: response returned code [{response.status_code}], with text: {response.text}")
    
    if __name__=="__main__":
          
        status = check_nim_readiness(NIM_PORTS.PROTEINMPNN_PORT.value)
        print(f"ProteinMPNN NIM is ready: {status}", flush = True)
    
        status = check_nim_readiness(NIM_PORTS.RFDIFFUSION_PORT.value)
        print(f"RFDiffusion NIM is ready: {status}", flush = True)
        
        # run RFDiffusion - example params from https://build.nvidia.com/ipd/rfdiffusion/deploy
        rfdiffusion_query = {
            "input_pdb" : get_reduced_pdb(),  # ATOM records of PDB 1R42, fetched from RCSB
            "contigs": "A20-60/0 50-100",
            "hotspot_res": ["A50","A51","A52","A53","A54"],
            "diffusion_steps": 15,
        }
    
        rc, rfdiffusion_response = query_nim(
            payload=rfdiffusion_query,
            nim_endpoint=NIM_ENDPOINTS.RFDIFFUSION.value,
            nim_port=NIM_PORTS.RFDIFFUSION_PORT.value
        )
        
        ## Print the first 160 characters of the RFDiffusion PDB output
        print(rfdiffusion_response["output_pdb"][0:160], flush = True)
        
        
        #Run ProteinMPNN - example params from https://build.nvidia.com/ipd/proteinmpnn/deploy
        
        proteinmpnn_query = {
            "input_pdb" : get_reduced_pdb(),
            "ca_only" : False,
            "use_soluble_model" : False,
            "sampling_temp" : [0.1]
        }
    
        rc, proteinmpnn_response = query_nim(
            payload=proteinmpnn_query,
            nim_endpoint=NIM_ENDPOINTS.PROTEINMPNN.value,
            nim_port=NIM_PORTS.PROTEINMPNN_PORT.value
        )
        
        fasta_sequences = [x.strip() for x in proteinmpnn_response["mfasta"].split("\n") if '>' not in x][2:]
    
        print(f"Generated {len(fasta_sequences)} FASTA sequences")
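
To keep the designed sequences for downstream analysis, you might extend main.py to write them to disk. A minimal sketch (write_fasta and the designs.fasta filename are illustrative, not part of the NIM API):

```python
def write_fasta(sequences, path="designs.fasta"):
    """Write each sequence as its own FASTA record, numbered in order."""
    with open(path, "w") as fh:
        for i, seq in enumerate(sequences):
            fh.write(f">design_{i}\n{seq}\n")
```

Called as write_fasta(fasta_sequences) at the end of main.py, this produces one FASTA record per ProteinMPNN sequence.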
    
