How-To Run an LLM on the DFKI Cluster
IMPORTANT
If you run a new model on the cluster, you need to ensure that all other users can run it too. Running a new model causes vLLM to download and cache the model on the cluster: it creates a new subdirectory and files underneath VLLM_DOWNLOAD_DIR, which in our configuration usually points to /ds/models/llms/cache. The access rights of this new subdirectory and its files are set to "rwxr-xr-x", which does not allow other users to write. So, whenever you have run a new model on the cluster, run the following command afterwards:
cd /ds/models/llms && chmod -R 777 ./*
This ensures that the downloaded files (or .lock files left over from crashes) can be used or, if necessary, fixed by other users.
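The effect of the fix can be sketched locally. This is a minimal demonstration using a temporary directory as a stand-in for /ds/models/llms/cache:

```shell
# Local sketch of what the chmod fix does (the temp dir stands in for
# /ds/models/llms/cache):
demo=$(mktemp -d)
mkdir -p "$demo/new-model"
touch "$demo/new-model/model.lock"
chmod 755 "$demo/new-model"      # the restrictive default, rwxr-xr-x
chmod -R 777 "$demo"/*           # the fix: rwxrwxrwx for everything
stat -c '%a' "$demo/new-model"   # prints 777
rm -rf "$demo"
```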
Quickstart
Prerequisites
Based on DFKI-NLP/vLLM-Starter#2 (comment).
- Install uv: https://docs.astral.sh/uv/getting-started/installation/
- Open a new shell (or create a new screen session:
screen -S vLLM-Starter)
Run gpt-oss-20b
Based on instructions from https://github.com/DFKI-NLP/vLLM-Starter.
start the service:
export HF_HOME="/netscratch/$USER/cache/hf"
export VLLM_CACHE_ROOT="/netscratch/$USER/cache/vllm"
srun --partition=RTXA6000-SLT \
--job-name=vllm_serve \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
--time=0-04:00:00 \
uvx --cache-dir /netscratch/$USER/cache/uv \
vllm@0.12.0 serve "openai/gpt-oss-20b" \
--download-dir=/ds/models/llms/cache \
--port=18000
If you need a different version of vLLM, change @0.12.0 to @your.version.here or @latest (see the uv docs).
Note: this may take some time; wait for the log line "Application startup complete".
Important: For models other than gpt-oss-20b, please check the respective configs in configs/extractor/llm for additional command-line arguments to vllm serve. For example, Qwen3 requires specifying the additional argument '--reasoning-parser deepseek_r1', Nemotron-Nano uses '--trust-remote-code', etc.
query:
# Note: you may need to select a different node than `serv-9220`; find yours with `squeue -u $USER`.
curl http://serv-9220.kl.dfki.de:18000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-20b",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
result:
{
"id":"cmpl-daceb4b0e6cf49338a588eff564ad85a",
"object":"text_completion",
"created":1759957476,
"model":"openai/gpt-oss-20b",
"choices":[
{
"index":0,
"text":" city in California, USA. It",
"logprobs":null,
"finish_reason":"length",
"stop_reason":null,
"token_ids":null,
"prompt_logprobs":null,
"prompt_token_ids":null
}
],
"service_tier":null,
"system_fingerprint":null,
"usage":{
"prompt_tokens":4,
"total_tokens":11,
"completion_tokens":7,
"prompt_tokens_details":null
},
"kv_transfer_params":null
}
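Besides /v1/completions, stock vLLM also serves the OpenAI-compatible chat completions endpoint. A hedged sketch of the corresponding payload (serv-9220 is just the example node from above):

```shell
# Payload for /v1/chat/completions (vLLM serves this endpoint
# alongside /v1/completions):
payload='{
  "model": "openai/gpt-oss-20b",
  "messages": [{"role": "user", "content": "San Francisco is a"}],
  "max_tokens": 7
}'
# Send it to your own node (find it via `squeue -u $USER`):
#   curl http://serv-9220.kl.dfki.de:18000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$payload"
echo "$payload" | python3 -m json.tool >/dev/null && echo "payload is valid JSON"
```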
The two ways to use uv on Pegasus
In general:
- If you want to run a gated model, log into Hugging Face and make sure you have the correct access permissions.
- Choose the model, resources, and maximum job runtime carefully.
- Check for special requirements, such as chat templates, tokenizers, and allowing local code execution.
Single package srun
If you want minimal setup and all you need is one package, for example when using vllm serve, use uvx with the following flag and environment variables:
- export HF_HOME="/netscratch/$USER/cache/hf": prevents the Hugging Face cache from filling up your home directory.
- export VLLM_CACHE_ROOT="/netscratch/$USER/cache/vllm": prevents the vLLM cache from filling up your home directory.
- --cache-dir /netscratch/$USER/cache/uv: prevents the uv cache from filling up your home directory.
This approach is used in the Quickstart section.
If the tool you want to use is invoked under a different name than the package it is installed from, use --from <install-name>.
example:
uvx --from rust-just just
The rust-just package is invoked by calling just.
Full project srun
To run a uv project with any number of custom python packages in your environment, you need to set up a few things.
Firstly, you need to create directories on netscratch where the uv virtual environment and caches can live:
mkdir -p /netscratch/$USER/cache/uv
mkdir -p /netscratch/$USER/cache/uv-venvs
mkdir -p /netscratch/$USER/cache/hf
mkdir -p /netscratch/$USER/cache/vllm
Secondly, you need to set the environment variables for the uv cache and virtual environment to the directories you just created. This points uv there and prevents it from filling up your home directory.
export UV_CACHE_DIR="/netscratch/$USER/cache/uv"
export UV_PROJECT_ENVIRONMENT="/netscratch/$USER/cache/uv-venvs/kibad-llm"
Thirdly, create the virtual environment and symlink the directories so that you don't need to set the environment variables each time you open a new shell.
# create the .venv
uv sync
# link the .venv
ln -s /netscratch/$USER/cache/uv-venvs/kibad-llm ./.venv
# link the caches
ln -s /netscratch/$USER/cache/uv ~/.cache/uv
ln -s /netscratch/$USER/cache/hf ~/.cache/huggingface
ln -s /netscratch/$USER/cache/vllm ~/.cache/vllm
Now you can run all of your Python code without worrying about the uv cache, the virtual environment, or the Hugging Face/vLLM caches.
srun --your-srun-flag \
uv run -m your.file.here
Important: When running experiments for KIBA-D, it is highly (!) recommended to symlink output directories
to the folders in /netscratch/hennig/kiba-d/, to ensure that everyone has access to the experiment results. Consider
executing the following on a fresh clone of the kibad-llm repository:
ln -s /netscratch/hennig/kiba-d/logs ./logs
ln -s /netscratch/hennig/kiba-d/predictions ./predictions
(If these folders already exist in your kibad-llm repository because you ran inference previously, you might want to delete or rename them, and then execute the above commands.)
All-in-one run script for in_process vLLM configs
To host an LLM on the cluster and run uv code against it in a Python-internal setup, without the use of an external
vLLM server, use the all-in-one run script run_in_process.sh. Note that this requires the use of the *_in_process.yaml configs
in configs/extractor/llm when executing uv run -m kibad_llm.predict. Using the run_in_process.sh script also
allows running the OpenAI models via their API, e.g. GPT-5; see configs/extractor/llm/gpt_5.yaml. This can even be done in
combination with vLLM-served models in a single experiment run; see for example the template commands in
https://github.com/DFKI-NLP/kibad-llm/issues/32 .
Prerequisites
In order to use run_in_process.sh you need to have followed the steps in Full project srun!
In addition, set up your .env file:
cp .env.example .env
Edit .env and fill out the following mandatory variables:
HF_TOKEN=<your_hf_token>
OPENAI_API_KEY=<your_openai_api_key>
VLLM_DOWNLOAD_DIR=/ds/models/llms/cache
You can create an OpenAI API key at https://platform.openai.com/api-keys and Hugging Face access tokens at https://huggingface.co/settings/tokens.
Usage
run_in_process.sh uses flags with command-line arguments:
- -h | --help: displays a help text very similar to this section.
- -pa | --partition: the Slurm partition to submit the job to. Optional; defaults to "RTX6000-SLT".
- -t | --time: the maximum time the Slurm job is allowed to run. Optional; defaults to one hour.
- -u | --uv: all uv run arguments. If there are multiple, make sure to wrap them in quotes, e.g. "-m some.code", which results in uv run -m some.code. Required.
The script takes care of everything from start to finish and executes all code on the compute node. As soon as the job gets resources, the uv run command (e.g. predict.py) starts.
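How the quoted -u value expands can be sketched with a plain shell example (some.code is a placeholder module; the expansion mechanism is an assumption about the script's internals):

```shell
# The quoted string passed to -u/--uv becomes the arguments of `uv run`:
uv_args="-m some.code"
# Unquoted expansion splits it back into separate arguments:
echo uv run $uv_args   # prints: uv run -m some.code
```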
The alternatives
All-in-one run script with external vLLM
To host an LLM on the cluster and run uv code against it as soon as the model is ready, use the all-in-one run script run_with_llm.sh.
Prerequisites
In order to use run_with_llm.sh you need to have followed the steps in Full project srun!
Usage
run_with_llm.sh uses flags with command-line arguments:
- -h | --help: displays a help text very similar to this section.
- -v | --vllm: used for almost all arguments relevant to vLLM. If there are multiple, make sure to wrap them in quotes, e.g. "some/mistral --trust-remote-code". Required.
- -vv | --vllm-version: the vLLM version to run. Optional; defaults to 0.12.0.
- -po | --port: the port vLLM and the uv code communicate on. Optional; defaults to a random port.
- -pa | --partition: the Slurm partition to submit the job to. Optional; defaults to "RTX6000-SLT".
- -t | --time: the maximum time the Slurm job is allowed to run. Optional; defaults to one hour.
- -u | --uv: all uv run arguments. If there are multiple, make sure to wrap them in quotes, e.g. "-m some.code", which results in uv run -m some.code. Required.
The script takes care of everything from start to finish and executes all code on the compute node. As soon as the job gets resources, vLLM starts. The script then waits until vLLM is ready and starts your code with uv right after. This allows you to run heavy jobs without straining the login node. You can cancel the job with Ctrl-C or scancel at any time and don't need to worry about residual processes.
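A hypothetical invocation, assembled from the flag list above (the model and module names are placeholders, not verified against the script itself):

```shell
# Hypothetical call (placeholders throughout):
#   ./run_with_llm.sh \
#     -v "openai/gpt-oss-20b --download-dir=/ds/models/llms/cache" \
#     -u "-m kibad_llm.predict"
# The quoted -v value becomes the arguments of `vllm serve`
# (an assumption about the script's internals):
vllm_args="openai/gpt-oss-20b --download-dir=/ds/models/llms/cache"
echo vllm serve $vllm_args
```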
Run with vLLM but on the login node
run_with_llm_login_node_exec.sh is very similar to run_with_llm.sh. The main difference is that this script puts vLLM on the compute node and runs your uv code locally on the login node. You need to be careful when using this script because of the strain you put on the login node.
Depending on how your uv code fails or how the script is cancelled, the Slurm job may need to be cancelled separately.
This script may make working on your code easier, depending on your needs, but in that case you should have a quick look at how the script works first.