Do you want to run a giant LLM on Viper to perform inference? Then this post is for you!
With 671B total and 37B active parameters, DeepSeek-V3 is one of the largest open-weight models available. Running the original FP8 version requires roughly 800 GB of GPU memory: about 650 GB for the weights alone, plus headroom for the KV cache and runtime overhead. On our Viper system, this means requesting at least 4 nodes with 2 AMD MI300A GPUs each, which provides 8 × 128 GB = 1024 GB of HBM3 memory.
Optimizing LLM inference workloads is a hot topic. There are several open‑source frameworks that make it relatively easy to set up a performant inference server. One of them is SGLang. The SGLang team regularly releases Docker images with ROCm and AMD accelerator support.
We start by converting the latest SGLang image with ROCm support to an Apptainer SIF file:
module load apptainer/1.4.1
apptainer pull docker://lmsysorg/sglang:v0.5.4.post1-rocm700-mi30x
This may take a while due to the size of the container.
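A practical note: Apptainer keeps its download cache under ~/.apptainer/cache by default, and an image of this size can easily exhaust a home-directory quota. If that is a concern on your system, one option is to redirect the cache to the large /ptmp filesystem before running the pull above (the path below is only a suggestion):
# Optional: keep the Apptainer image cache on /ptmp instead of $HOME
export APPTAINER_CACHEDIR=/ptmp/$USER/apptainer_cache
mkdir -p "$APPTAINER_CACHEDIR"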
Next, we download the model weights from the Hugging Face Hub up front, so that we do not waste precious time on the accelerated worker nodes during our SLURM job:
mkdir /ptmp/$USER/huggingface
apptainer exec \
-B /ptmp/$USER/huggingface:/root/.cache/huggingface \
--env HF_HOME=/root/.cache/huggingface \
--env HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN> \
sglang_v0.5.4.post1-rocm700-mi30x.sif hf download deepseek-ai/DeepSeek-V3
This may also take some time, since we need to download almost 650 GB.
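Before submitting the job, it can be worth checking that the full snapshot actually arrived. Assuming the standard Hugging Face cache layout under HF_HOME, a rough sanity check could look like this:
# The DeepSeek-V3 snapshot should add up to roughly 650 GB
du -sh /ptmp/$USER/huggingface/hub/models--deepseek-ai--DeepSeek-V3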
Now we have everything ready to submit our inference workload.
We start our SLURM batch script by requesting the necessary resources and loading the Apptainer module:
#!/bin/bash -l
#SBATCH -D ./
#SBATCH -J sglang
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00
module load apptainer/1.4.1
Newer ROCm versions support AITER, which performs JIT compilation and stores the results on disk. Since Apptainer SIF files are immutable, read-only containers, we need to create writable persistent overlays so that AITER can do its magic:
# Need to create an overlay for each process (node)
srun bash -c "apptainer overlay create --size 1024 overlay\$SLURM_NODEID.img"
Our full Apptainer command looks like this:
apptainer_cmd="apptainer exec \
--overlay overlay\$SLURM_NODEID.img \
-B /ptmp/$USER/huggingface:/root/.cache/huggingface \
--env HF_HOME=/root/.cache/huggingface \
--env HF_HUB_OFFLINE=1 \
sglang_v0.5.4.post1-rocm700-mi30x.sif"
For the SGLang command, we need to provide the IP address of the head node and the port used for inter-node communication. We have 4 nodes with 2 accelerators each, which results in a tensor-parallel size of 8 (one rank per GPU).
HEAD_IPADDRESS="$(hostname --ip-address)"
PORT=8998
sglang_cmd="python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--disable-cuda-graph \
--tp 8 \
--nccl-init-addr $HEAD_IPADDRESS:$PORT \
--nnodes 4 \
--node-rank \$SLURM_NODEID \
--trust-remote-code"
Putting everything together, we start the inference server and wait for it to become operational. Loading the weights onto the accelerators may take some time, but you can follow the progress in the server log (written to server.<jobid>.log by the srun command below).
srun -o "./server.%j.log" bash -c "$apptainer_cmd $sglang_cmd" &
PID=$!
apptainer exec sglang_v0.5.4.post1-rocm700-mi30x.sif \
python3 -c "from sglang.utils import wait_for_server; wait_for_server('http://localhost:30000')"
When the inference server is up and running, we can send requests to its OpenAI-compatible REST API (listening on SGLang's default port, 30000) via curl, for example:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "What is the physics behind the blue sky?"}]}'
That’s it!
For the complete SLURM script, have a look at our LLMs-meet-MPCDF repository. There you will also find more recipes for running LLM workloads on our HPC systems.