Execution Benchmarks¶

The following table summarizes execution times for FANTASIA functional annotation experiments. Each experiment corresponds to the complete proteome of a model organism. All were executed on a single NVIDIA A100 GPU with identical pipeline configuration. The main difference between runs is the number of nearest neighbors (k) used in the lookup phase.

Log file	Org.	#Sequences	k	ESM (s)	ProSTT5 (s)	ProtT5 (s)	Lookup (s)	Postproc (s)	Total (s)
fantasia_5634510_1.err	MOUSE	54,727	1	926.9	943.1	937.6	2929.2	954.1	6691.0
fantasia_5634510_2.err	YEAST	6,066	1	127.4	129.3	128.5	485.3	84.3	954.8
fantasia_5634510_3.err	DROME	22,010	1	466.4	473.5	469.4	1518.3	452.1	3379.6
fantasia_5634510_4.err	ARATH	39,275	1	833.5	846.2	844.6	2648.9	594.0	5767.2
fantasia_5634510_5.err	MOUSE	54,727	5	963.7	978.7	970.4	3036.3	3079.9	9028.9
fantasia_5634510_6.err	YEAST	6,066	5	131.2	132.9	131.3	493.7	340.9	1230.0
fantasia_5634510_7.err	DROME	22,010	5	517.4	522.3	511.9	1663.0	1654.7	4869.2
fantasia_5634510_8.err	ARATH	39,275	5	872.0	885.8	870.9	2755.6	2210.0	7594.3

Note

Embeddings were generated sequentially for each model (ESM, ProSTT5, ProtT5).
Lookup and post-processing duration increases substantially with higher k values.
Post-processing includes hit collapsing, reliability scoring and pairwise alignments.

Warning

The lookup phase operates over the combined predictions of the three embedding models (ESM, ProSTT5, ProtT5). Therefore, overall execution time—particularly in post-processing—will vary if models are added or removed from the pipeline.

Performance Analysis¶

All experiments queried a shared reference table containing 126,582 sequence embeddings and 623,134 GO term annotations, corresponding exclusively to proteins with experimental evidence (i.e., excluding annotations inferred electronically).

The number of neighbors (k) used during the lookup step has a direct impact on execution time, especially in the post-processing phase. Increasing k results in:

More hits to evaluate per query.
Greater redundancy in annotations.
A larger number of pairwise alignments.

A clear linear trend is observed in the lookup and post-processing durations as k increases from 1 to 5, while the embedding generation time remains unaffected.

Key Observations:

For large proteomes like Mus musculus (~55k sequences), increasing k from 1 to 5 adds over 2000 seconds to the post-processing step alone.
For small proteomes like S. cerevisiae (~6k sequences), this overhead is proportionally smaller but still significant (~+250s in post-processing).

This suggests a trade-off between annotation depth (more neighbors = more GO candidates) and execution efficiency, depending on the target proteome and available computational resources.

General Statistics¶

The following metrics summarize the computational scope and cost of the full set of benchmarking experiments:

Total number of sequences processed: 244,156
Total runtime across all experiments: 39,515 seconds (~11 h)
Average runtime per sequence: 0.16 seconds

Average time per sequence (ms):

ESM embedding: 19.8 ms
ProSTT5 embedding: 20.1 ms
ProtT5 embedding: 19.9 ms
Lookup phase: 63.6 ms
Post-processing: 38.4 ms

These values reflect the average per-sequence cost of each model and pipeline stage when annotating full proteomes from different model organisms under controlled hardware conditions.