Single FANTASIA Job Execution on CESGA (GPU + Apptainer)

This section describes how to launch a single, non-array job of the FANTASIA pipeline on CESGA using SLURM with GPU acceleration and containerized execution via Apptainer.

This mode is suitable for individual experiments with one input file and fixed parameters.

SLURM Script Overview

The job script performs the following actions:

Launches PostgreSQL with pgvector via Apptainer
Launches RabbitMQ via Apptainer
Runs FANTASIA inside an Apptainer container with GPU support
Mounts required volumes using bind mounts
Uses LUSTRE for persistent storage of containers, models, and caches

SLURM Directives

#SBATCH -p gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -c 32
#SBATCH --mem=64G
#SBATCH -t 08:00:00
#SBATCH --job-name=fantasia_job
#SBATCH --output=fantasia_%j.out
#SBATCH --error=fantasia_%j.err

Resources requested:

GPU: 1 × A100
CPU: 32 cores
Memory: 64 GB
Walltime: 8 hours
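To confirm that the job actually received these resources, the script can print the values SLURM exports into the job environment. The following is a minimal sketch, not part of the original script; the helper name report_allocation is hypothetical, and whether CUDA_VISIBLE_DEVICES is set depends on how the cluster configures GPU allocation.

```shell
#!/bin/bash
# Hypothetical helper: print the resources SLURM granted to this job.
# Outside a SLURM allocation the variables are unset, so "unset" is printed.
report_allocation() {
    echo "Job ID: ${SLURM_JOB_ID:-unset}"
    echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"
    echo "Memory per node (MB): ${SLURM_MEM_PER_NODE:-unset}"
    # Assumption: the cluster exports CUDA_VISIBLE_DEVICES for --gres=gpu jobs.
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-unset}"
}

report_allocation
```

Calling this near the top of the job script puts the allocation into fantasia_<jobid>.out, which helps when comparing runs.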

Input Parameters

For single-job execution, input parameters can either be hardcoded in the job script or set in the FANTASIA configuration file.
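When parameters are hardcoded, it can help to validate the input file before the GPU allocation is spent on a run that cannot start. The sketch below is illustrative only: INPUT_FASTA and check_input are hypothetical names, and FANTASIA's real configuration keys may differ.

```shell
#!/bin/bash
# Hypothetical parameter name; not taken from the FANTASIA configuration.
INPUT_FASTA="${INPUT_FASTA:-$PWD/input.fasta}"

# Return 0 only if the file exists and is non-empty.
check_input() {
    local path="$1"
    if [ ! -s "$path" ]; then
        echo "Input file missing or empty: $path" >&2
        return 1
    fi
    echo "Using input: $path"
}

check_input "$INPUT_FASTA" || echo "Fix the input path before submitting." >&2
```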

Persistent Storage Paths (LUSTRE)

The following environment variables set where models, containers, and intermediate files are cached:

export TRANSFORMERS_CACHE="/mnt/lustre/.../.transformers_cache"
export HF_HOME="/mnt/lustre/.../.hf_cache"
export UDOCKER_DIR="/mnt/lustre/.../.udocker_repo"
export APPTAINER_CACHEDIR="/mnt/lustre/.../.singularity/cache"
export APPTAINER_TMPDIR="/mnt/lustre/.../.singularity/tmp"
export APPTAINER_LOCALCACHEDIR="/mnt/lustre/.../.singularity/local"
export PIP_CACHE_DIR="/mnt/lustre/.../.pip_cache"

Point each of these at a directory you own on LUSTRE; the ... in the paths above is a placeholder.
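One way to keep these paths consistent is to derive them from a single base directory and create them before any container is built. This is a sketch under assumptions: LUSTRE_BASE is a hypothetical variable that you would set to your own LUSTRE directory, and the fallback to a throwaway directory exists only so the snippet can be tried outside a job. Only a subset of the variables is shown.

```shell
#!/bin/bash
# Hypothetical base variable; set it to your own LUSTRE directory.
# The mktemp fallback is only for trying the sketch outside a job.
LUSTRE_BASE="${LUSTRE_BASE:-$(mktemp -d)}"

export HF_HOME="$LUSTRE_BASE/.hf_cache"
export APPTAINER_CACHEDIR="$LUSTRE_BASE/.singularity/cache"
export APPTAINER_TMPDIR="$LUSTRE_BASE/.singularity/tmp"
export PIP_CACHE_DIR="$LUSTRE_BASE/.pip_cache"

# Create the directories up front so later steps do not fail on a missing path.
for dir in "$HF_HOME" "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR" "$PIP_CACHE_DIR"; do
    mkdir -p "$dir"
done
```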

Additional directories used during execution:

PROJECT_DIR="$STORE/FANTASIA"
EXECUTION_DIR="$STORE/fantasia"
SHARED_MEM_DIR="/tmp/fantasia"
POSTGRESQL_DATA="$SHARED_MEM_DIR/data"
POSTGRESQL_SOCKET="$SHARED_MEM_DIR/socket"
RABBITMQ_DATA_DIR="$STORE/fantasia_rabbitmq"
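The node-local directories have to exist before the services start, and PostgreSQL refuses to start when its data directory is readable by other users. A minimal sketch of that preparation step follows; prepare_dirs is a hypothetical helper, and the sketch defaults to a throwaway directory so it can be tried outside a job, whereas the variables above use /tmp/fantasia.

```shell
#!/bin/bash
# Hypothetical helper: create the PostgreSQL data and socket directories
# under the given base directory.
prepare_dirs() {
    local base="$1"
    mkdir -p "$base/data" "$base/socket"
    # PostgreSQL requires a data directory that other users cannot access.
    chmod 700 "$base/data"
}

# Assumption: fall back to a temporary location when SHARED_MEM_DIR is unset.
SHARED_MEM_DIR="${SHARED_MEM_DIR:-$(mktemp -d)/fantasia}"
prepare_dirs "$SHARED_MEM_DIR"
```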

Apptainer Container Setup

The following container images are used:

fantasia.sif — main pipeline (GPU-enabled)
pgvector.sif — PostgreSQL database with pgvector extension
rabbitmq.sif — message broker

Images are built automatically if missing:

apptainer build fantasia.sif docker://frapercan/fantasia:latest
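The "build if missing" check can be sketched as a small function that skips the build when the .sif file is already present. The function name build_if_missing is hypothetical and this is not necessarily how the original script implements the check; only the FANTASIA image URI shown above comes from the source.

```shell
#!/bin/bash
# Hypothetical helper: build an image only when the .sif file is absent.
# Arguments: target .sif path, source URI (for example a docker:// reference).
build_if_missing() {
    local image="$1" source_uri="$2"
    if [ -f "$image" ]; then
        echo "Found $image, skipping build"
        return 0
    fi
    if ! command -v apptainer >/dev/null 2>&1; then
        echo "apptainer not found in PATH" >&2
        return 1
    fi
    apptainer build "$image" "$source_uri"
}

# Usage inside the job script (URI taken from the command above):
# build_if_missing "fantasia.sif" "docker://frapercan/fantasia:latest"
```

Skipping the build on later runs avoids pulling the image again and keeps the job independent of network access once the image exists.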

Execution Phase

After services are launched in the background, the FANTASIA pipeline is initialized and executed:

apptainer exec --nv --bind "$EXECUTION_DIR:/fantasia" "$FANTASIA_IMAGE" \
    fantasia initialize

apptainer exec --nv --bind "$EXECUTION_DIR:/fantasia" "$FANTASIA_IMAGE" \
    fantasia run

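The --nv flag exposes the host's NVIDIA GPU and driver libraries to the container.
$EXECUTION_DIR is bind-mounted at /fantasia inside the container, so all outputs are written under $EXECUTION_DIR on the host.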
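Because the services are started in the background, the pipeline may begin before they are ready. One possible safeguard is to wait for a readiness marker before calling fantasia initialize. The sketch below polls for a path with a timeout; wait_for_path is a hypothetical helper, and the usage comment assumes PostgreSQL listens on its default port 5432 with its Unix socket directory set to $POSTGRESQL_SOCKET, in which case the socket file is named .s.PGSQL.5432.

```shell
#!/bin/bash
# Hypothetical helper: wait until a path exists or the timeout (seconds) expires.
# Returns 0 when the path appears, 1 on timeout.
wait_for_path() {
    local path="$1" timeout="${2:-30}" waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Usage inside the job script (assumes the default PostgreSQL port):
# wait_for_path "$POSTGRESQL_SOCKET/.s.PGSQL.5432" 60 || exit 1
```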

Shutdown and Cleanup

A cleanup routine is executed at the end to terminate services and remove temporary data:

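# ${VAR:?} aborts if the variable is unset or empty, so neither
# pkill nor rm can run against an empty pattern or path.
pkill -f "rabbitmq-server"
pkill -f "${POSTGRESQL_DATA:?}"
rm -rf "${SHARED_MEM_DIR:?}"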
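Cleanup placed at the end of a script is skipped if an earlier step fails and the script exits. A common alternative is to register the routine with trap so that it runs on any exit. The sketch below illustrates the idea with stand-ins: run_with_cleanup is a hypothetical wrapper, sleep plays the role of a background service, and the service is stopped by its recorded PID rather than with pkill -f as in the commands above.

```shell
#!/bin/bash
# Hypothetical wrapper: start a stand-in service, run a command, and
# always clean up, even when the command fails.
# Arguments: scratch directory, then the command to run.
run_with_cleanup() {
    local work_dir="$1"; shift
    (
        service_pid=""
        cleanup() {
            # Stop the background service by PID, then remove scratch data.
            [ -n "$service_pid" ] && kill "$service_pid" 2>/dev/null
            rm -rf "${work_dir:?}"
        }
        trap cleanup EXIT        # runs on success and on failure
        mkdir -p "$work_dir"
        sleep 300 &              # stand-in for a background service
        service_pid=$!
        "$@"                     # stand-in for the pipeline step
    )
}

demo_dir="$(mktemp -d)"
run_with_cleanup "$demo_dir" false || echo "step failed, cleanup still ran"
```

Running the body in a subshell keeps the trap local to the wrapped step; in a real job script the trap would more likely be set once at the top level.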

Launching the Job

To launch the job:

sbatch fantasia_single.sh

Here, fantasia_single.sh is the name of your job script.
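To follow the job after submission, it can be convenient to capture the job ID. With --parsable, sbatch prints only the job ID, optionally followed by a semicolon and the cluster name. The sketch below extracts the ID and queries the queue; parse_job_id is a hypothetical helper, and the submission is skipped when sbatch or the script is not available.

```shell
#!/bin/bash
# Hypothetical helper: sbatch --parsable prints "jobid" or "jobid;cluster";
# keep only the job ID.
parse_job_id() {
    local raw="$1"
    echo "${raw%%;*}"
}

if command -v sbatch >/dev/null 2>&1 && [ -f fantasia_single.sh ]; then
    job_id="$(parse_job_id "$(sbatch --parsable fantasia_single.sh)")"
    echo "Submitted job $job_id"
    squeue -j "$job_id"      # show the job's current state
else
    echo "sbatch or fantasia_single.sh not available; skipping submission"
fi
```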

Log files are created as:

fantasia_<jobid>.out — SLURM standard output
fantasia_<jobid>.err — SLURM standard error
postgres.log — PostgreSQL log
rabbitmq.log — RabbitMQ log

Summary

This job script enables reproducible, self-contained execution of the FANTASIA pipeline on CESGA’s GPU nodes using Apptainer. No root access or external services are required: PostgreSQL and RabbitMQ run on the compute node for the duration of the job, while containers, models, and caches persist on LUSTRE.