.. _performance_hpc:

Performance on HPC
==================

For detailed instructions on deploying FANTASIA in an HPC environment, please refer to the :ref:`HPC Deployment Guide `.

Input Data
----------

The input dataset used for this performance evaluation consists of all protein sequences from *Mus musculus* (house mouse) available in UniProt:

- **Source:** `UniProtKB REST API `_
- **Taxonomy:** *Mus musculus* (species-level dataset)
- **Total sequences:** 87,492

The dataset was processed with FANTASIA on an HPC system to evaluate performance in terms of execution time, resource utilization, and scalability.

Execution Parameters
--------------------

General Settings
~~~~~~~~~~~~~~~~

- **Maximum number of worker threads for parallel processing:** ``max_workers``: ``50``
- **Reference tag used for lookup operations:** ``lookup_reference_tag``: ``GOA2022``
- **Number of closest proteins (k) to consider for lookup:** ``limit_per_entry``: ``1``
- **Prefix for output file names:** ``fantasia_prefix``: ``uniprotkb_taxonomy_id_10090_2025_03_13``
- **Threshold for sequence length filtering:** ``length_filter``: ``5000000``
- **Threshold for redundancy filtering:** ``redundancy_filter``: ``0.95``
- **Number of sequences to package in each queue batch:** ``sequence_queue_package``: ``1024``
- **Delete queues after processing:** ``delete_queues``: ``True``

Embedding Configuration
~~~~~~~~~~~~~~~~~~~~~~~

- **Distance metric:** ``distance_metric``: ``"<->"`` (options: ``"<=>"`` for cosine or ``"<->"`` for Euclidean)
- **Models:**

  - **ESM:**

    - Enabled: ``True``
    - Distance threshold: ``1.5``
    - Batch size: ``256``

  - **Prost-T5:**

    - Enabled: ``True``
    - Distance threshold: ``1.5``
    - Batch size: ``256``

  - **Prot-T5:**

    - Enabled: ``True``
    - Distance threshold: ``3``
    - Batch size: ``256``

Functional Analysis
~~~~~~~~~~~~~~~~~~~

- **Enable Gene Ontology enrichment analysis using TopGO:** ``topgo``: ``True``

Hardware Configuration
----------------------

The execution was performed on an HPC node equipped with:

- **CPU:** 256 cores
- **Total RAM:** 100 GB
- **GPU Model:** NVIDIA A100-SXM4-80GB
- **CUDA Version:** 12.2
- **Driver Version:** 535.230.02
- **Available GPUs:** 4
- **GPUs in use:** 1

Although more CPU cores were available, the execution was limited to **50 worker threads** for parallel processing.

Summary
-------

- **50 workers** allow parallel execution of queries.
- **No sequence length filtering** (the threshold is set extremely high).
- **CD-HIT at 95% sequence identity** to remove redundancy.
- **Only proteins from GOA2022** are used as reference.
- **Euclidean distance metric** is applied.
- **Batch size of 256** for all three embedding models.
- **Execution performed on an NVIDIA A100 GPU with CUDA 12.2**.
- **Only 1 GPU was used**, despite 4 being available.
- **256 CPU cores available**, but only 50 worker threads were used.
- **100 GB of RAM** available during execution.

Execution Times
---------------

Embedding Generation
~~~~~~~~~~~~~~~~~~~~

The execution times for generating embeddings with the different models are summarized below:

+-------------------+--------------------+-------------------+
| Model             | Total Time         | Time per Sample   |
+===================+====================+===================+
| ESM               | 18 min 21 sec      | 12.59 ms/sample   |
+-------------------+--------------------+-------------------+
| ProSTT5           | 1 hr 51 min 37 sec | 76.55 ms/sample   |
+-------------------+--------------------+-------------------+
| ProtT5            | 2 hr 1 min 6 sec   | 83.05 ms/sample   |
+-------------------+--------------------+-------------------+

Lookup Processing
~~~~~~~~~~~~~~~~~

Lookup operations were performed after embedding generation, with the following performance:

+-------------------+--------------------+-------------------+
| Operation         | Total Time         | Time per Sample   |
+===================+====================+===================+
| Lookup            | 9 hr 25 min 7 sec  | 387.54 ms/sample  |
+-------------------+--------------------+-------------------+

Conclusions
-----------

- ESM is the fastest model, processing each sample in 12.59 ms.
- ProSTT5 and ProtT5 are significantly slower, at 76.55 ms and 83.05 ms per sample, respectively (roughly 6.1 and 6.6 times the ESM time).
- Lookup is the bottleneck: at 387.54 ms per sample (9 hr 25 min 7 sec in total), it takes about 2.25 times as long as the three embedding runs combined (4 hr 11 min 4 sec) and about 4.7 times as long as the slowest embedding model, ProtT5.
- Further optimization of lookup operations (e.g., parallelization improvements or better GPU utilization) could significantly reduce processing time.
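
The per-sample figures and the lookup-to-embedding ratio can be re-derived from the reported total times and the dataset size. The following is a minimal sketch for checking that arithmetic; it is not part of FANTASIA, and the names ``to_seconds``, ``ms_per_sample``, ``embedding_seconds`` and ``lookup_seconds`` are illustrative assumptions. It also assumes that the per-sample times were computed over all 87,492 input sequences.

```python
# Illustrative check of the reported timings; not part of FANTASIA.
TOTAL_SEQUENCES = 87_492  # dataset size reported under "Input Data"


def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def ms_per_sample(total_seconds: int, n: int = TOTAL_SEQUENCES) -> float:
    """Average wall-clock time per sequence, in milliseconds."""
    return 1000 * total_seconds / n


# Total times copied from the "Execution Times" tables.
embedding_seconds = {
    "ESM": to_seconds(minutes=18, seconds=21),
    "ProSTT5": to_seconds(hours=1, minutes=51, seconds=37),
    "ProtT5": to_seconds(hours=2, minutes=1, seconds=6),
}
lookup_seconds = to_seconds(hours=9, minutes=25, seconds=7)

# Compare lookup against all embedding runs and against the slowest one.
embedding_total = sum(embedding_seconds.values())
ratio_vs_all = lookup_seconds / embedding_total
ratio_vs_slowest = lookup_seconds / max(embedding_seconds.values())

for model, secs in embedding_seconds.items():
    print(f"{model}: {ms_per_sample(secs):.2f} ms/sample")
print(f"Lookup: {ms_per_sample(lookup_seconds):.2f} ms/sample")
print(f"Lookup vs. all embeddings combined: {ratio_vs_all:.2f}x")
print(f"Lookup vs. slowest embedding model: {ratio_vs_slowest:.2f}x")
```

Because the reported totals are rounded to whole seconds, the recomputed per-sample values may differ from the tables in the last decimal place. Lookup comes out at roughly 2.25 times the combined embedding time.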