Performance on HPC

For detailed instructions on deploying FANTASIA in an HPC environment, please refer to the HPC Deployment Guide.

Input Data

The input dataset used for this performance evaluation consists of all protein sequences from Mus musculus (house mouse) available in UniProt:

Source: UniProtKB REST API
Taxonomy: Mus musculus (species-level dataset)
Total sequences: 87,492

The dataset was processed with FANTASIA on an HPC system to evaluate performance in terms of execution time, resource utilization, and scalability.

Execution Parameters

General Settings

Maximum number of worker threads for parallel processing: max_workers: 50
Reference tag used for lookup operations: lookup_reference_tag: GOA2022
Number of closest reference proteins (k) to consider per query during lookup: limit_per_entry: 1
Prefix for output file names: fantasia_prefix: uniprotkb_taxonomy_id_10090_2025_03_13
Threshold for sequence length filtering: length_filter: 5000000
Threshold for redundancy filtering: redundancy_filter: 0.95
Number of sequences to package in each queue batch: sequence_queue_package: 1024
Delete queues after processing: delete_queues: True
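As a rough illustration of what sequence_queue_package: 1024 implies for this dataset, the sketch below splits a list of sequences into consecutive packages of at most 1,024 entries. The function name package_sequences and the dummy sequence identifiers are hypothetical and are not taken from FANTASIA; the count also assumes that all 87,492 sequences are queued, which may not hold if filtering removes some of them first.

```python
from typing import Iterator, List, Sequence


def package_sequences(sequences: Sequence[str], package_size: int = 1024) -> Iterator[List[str]]:
    """Yield consecutive packages of at most `package_size` sequences."""
    if package_size <= 0:
        raise ValueError("package_size must be positive")
    for start in range(0, len(sequences), package_size):
        yield list(sequences[start:start + package_size])


# Hypothetical stand-in for the 87,492 input sequences (assumption: none are filtered out).
dummy_sequences = [f"SEQ{i}" for i in range(87_492)]
packages = list(package_sequences(dummy_sequences, package_size=1024))

n_packages = len(packages)    # 85 full packages plus one partial package
last_size = len(packages[-1])  # size of the final, partial package
print(n_packages, last_size)   # prints: 86 452
```

Under these assumptions the input would be queued as 86 packages, the last one holding 452 sequences.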

Embedding Configuration

Distance metric: distance_metric: "<->" (options: "<=>" for cosine or "<->" for Euclidean)
Models:
- ESM:
  - Enabled: True
  - Distance threshold: 1.5
  - Batch size: 256
- Prost-T5:
  - Enabled: True
  - Distance threshold: 1.5
  - Batch size: 256
- Prot-T5:
  - Enabled: True
  - Distance threshold: 3
  - Batch size: 256
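To make the two distance_metric options and the per-model distance thresholds concrete, here is a minimal sketch in plain Python. The operator strings match those of the pgvector PostgreSQL extension, where "<->" is Euclidean (L2) distance and "<=>" is cosine distance; whether FANTASIA evaluates them this way internally is an assumption. The helper is_hit, and the choice of "less than or equal" for the threshold comparison, are illustrative only.

```python
import math
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 distance, the metric selected by "<->"."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity, the metric selected by "<=>"."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


METRICS = {"<->": euclidean_distance, "<=>": cosine_distance}


def is_hit(query: Sequence[float], reference: Sequence[float],
           threshold: float, metric: str = "<->") -> bool:
    """Hypothetical filter: keep a reference whose distance is within the threshold."""
    return METRICS[metric](query, reference) <= threshold


# Toy 3-dimensional embeddings (real embeddings have far more dimensions).
query = [1.0, 0.0, 0.0]
reference = [0.0, 1.0, 0.0]

print(euclidean_distance(query, reference))   # sqrt(2), about 1.414
print(cosine_distance(query, reference))      # 1.0 for orthogonal vectors
print(is_hit(query, reference, threshold=1.5))  # True: 1.414 <= 1.5
```

Because the two metrics live on different scales (cosine distance ranges from 0 to 2, Euclidean distance is unbounded), a threshold such as 1.5 or 3 is only meaningful for the metric it was chosen for.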

Functional Analysis

Enable Gene Ontology enrichment analysis using TopGO: topgo: True

Hardware Configuration

The execution was performed on an HPC node equipped with:

CPU: 256 cores
Total RAM: 100 GB
GPU Model: NVIDIA A100-SXM4-80GB
CUDA Version: 12.2
Driver Version: 535.230.02
Available GPUs: 4
GPUs in use: 1

Although more CPU cores were available, the execution was limited to 50 worker threads for parallel processing.

Summary

50 worker threads allow parallel execution of queries.
No sequence length filtering (value set extremely high).
CD-HIT at 95% sequence identity to remove redundancy.
Only proteins from GOA2022 are used as reference.
Euclidean distance metric is applied.
Batch size of 256 for all three embedding models.
Execution performed on an NVIDIA A100-SXM4-80GB GPU with CUDA 12.2.
Only 1 GPU was used, despite 4 being available.
256 CPU cores available, but only 50 were used.
100 GB of RAM available during execution.

Execution Times

Embedding Generation

The execution times for generating embeddings with different models are summarized below:

Model	Total Time	Time per Sample
ESM	18 min 21 sec	12.59 ms/sample
ProSTT5	1 hr 51 min 37 sec	76.55 ms/sample
ProtT5	2 hr 1 min 6 sec	83.05 ms/sample
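The per-sample figures in this table appear to be the total time divided by the 87,492 input sequences. The sketch below checks that reading, using the totals as printed (rounded to whole seconds); the function name ms_per_sample is illustrative.

```python
N_SEQUENCES = 87_492  # total sequences in the input dataset


def ms_per_sample(hours: int, minutes: int, seconds: int, n_samples: int = N_SEQUENCES) -> float:
    """Convert a wall-clock total into milliseconds per sample."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return 1000.0 * total_seconds / n_samples


# Totals copied from the table above as (hours, minutes, seconds).
embedding_totals = {
    "ESM": (0, 18, 21),
    "ProSTT5": (1, 51, 37),
    "ProtT5": (2, 1, 6),
}

per_sample = {model: ms_per_sample(*total) for model, total in embedding_totals.items()}
for model, value in per_sample.items():
    print(f"{model}: {value:.2f} ms/sample")
```

The recomputed values agree with the table to within about 0.01 ms per sample; the small differences are consistent with the totals having been rounded to whole seconds before being reported.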

Lookup Processing

Lookup operations were performed after embedding generation, with the following performance:

Operation	Total Time	Time per Sample
Lookup	9 hr 25 min 7 sec	387.54 ms/sample
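How much slower lookup is than embedding generation depends on the baseline chosen. The sketch below compares the lookup cost per sample against the three embedding models combined and against the slowest single model, using the per-sample values reported in the tables. Summing the three models assumes they ran one after another, which is not stated explicitly in this report.

```python
# Per-sample times in milliseconds, copied from the tables above.
embedding_ms = {"ESM": 12.59, "ProSTT5": 76.55, "ProtT5": 83.05}
lookup_ms = 387.54

# Assumption: the three models ran sequentially, so their costs add up.
combined_embedding_ms = sum(embedding_ms.values())
slowest_model = max(embedding_ms, key=embedding_ms.get)

ratio_vs_combined = lookup_ms / combined_embedding_ms
ratio_vs_slowest = lookup_ms / embedding_ms[slowest_model]

print(f"Combined embedding: {combined_embedding_ms:.2f} ms/sample")
print(f"Lookup vs combined embedding: {ratio_vs_combined:.2f}x")
print(f"Lookup vs {slowest_model}: {ratio_vs_slowest:.2f}x")
```

With these inputs, lookup costs about 2.25 times the combined embedding time (172.19 ms per sample) and about 4.67 times the slowest single model.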

Conclusions

ESM is the fastest model, processing each sample in 12.59 ms.
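ProSTT5 and ProtT5 are roughly six to seven times slower than ESM, at 76.55 ms and 83.05 ms per sample, respectively.
Lookup is the bottleneck: at 387.54 ms per sample it takes about 2.25 times longer than the three embedding models combined (172.19 ms per sample) and about 4.7 times longer than the slowest single model, ProtT5.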
Further optimization of lookup operations (e.g., parallelization improvements or better GPU utilization) could significantly reduce processing time.