Evaluation

This section reports the evaluation of FANTASIA v4.0 on the CAFA3 benchmark. Results are presented both per model and aggregated, following the official CAFA-evaluator methodology.

Data availability

All prediction files, evaluation outputs, and ground-truth references used in this study are openly available on Zenodo:

FANTASIA v4.0 Results on CAFA3 Evaluation

This record includes:

  • fantasia_results_cafa3.zip — Predicted annotations generated with FANTASIA.

  • cafa3_evaluation.zip — Processed evaluation outputs.

  • cafa3_gt.tsv — Ground truth annotations.

  • cafa3_gt.fasta — Ground truth protein sequences.

Fmax scores

The main evaluation metric is Fmax, computed separately for each model and aggregation strategy:

Fmax by model

Model          Fmax
Prot-T5        0.822
ESM3c          0.815
Ankh3-Large    0.810
all_models     0.810
Prost-T5       0.809
ESM            0.800
ensemble       0.791

Note

Prot-T5 achieved the highest Fmax (0.822), followed closely by ESM3c (0.815). The ensemble strategy scored lowest (0.791), below every individual model and the all_models aggregation.
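For readers unfamiliar with the metric, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined for CAFA: the F-score is computed at each score threshold and the maximum is reported. The function name `fmax`, the dictionary layout, the threshold grid, and the toy data are all assumptions made for illustration. The official CAFA evaluator additionally accounts for the ontology structure and evaluates each ontology separately, which this sketch omits.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: predictions map protein -> {GO term: score},
# ground truth maps protein -> set of GO terms.
Predictions = Dict[str, Dict[str, float]]
Truth = Dict[str, Set[str]]


def fmax(pred: Predictions, truth: Truth, steps: int = 100) -> Tuple[float, float]:
    """Return (best F-score, threshold at which it is first reached)."""
    best_f, best_t = 0.0, 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            # Keep only the terms predicted with a score of at least t.
            kept = {term for term, s in pred.get(protein, {}).items() if s >= t}
            hits = len(kept & true_terms)
            # Recall is averaged over all benchmark proteins.
            recalls.append(hits / len(true_terms))
            # Precision is averaged only over proteins with a prediction at t.
            if kept:
                precisions.append(hits / len(kept))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            f = 2 * p * r / (p + r)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Toy data, invented for this example only.
toy_pred = {"P1": {"GO:1": 0.9, "GO:2": 0.4}, "P2": {"GO:3": 0.8}}
toy_truth = {"P1": {"GO:1"}, "P2": {"GO:3", "GO:4"}}
toy_f, toy_t = fmax(toy_pred, toy_truth)
print(f"Fmax = {toy_f:.3f} at threshold {toy_t:.2f}")
```

On the toy data, the best F-score is reached once the threshold is high enough to drop the incorrect low-score term for P1 while still keeping the correct term for P2.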

Per-ontology performance

The following figures illustrate F-score distributions by ontology category:

  • BP (Biological Process): [Figure: F-score per model in BP]

  • CC (Cellular Component): [Figure: F-score per model in CC]

  • MF (Molecular Function): [Figure: F-score per model in MF]

Overall F-score

The aggregated F-scores across all ontology categories are summarized below:

[Figure: Overall F-score per model]

all_models vs ensemble

It is important to distinguish between the two aggregation strategies:

  • Ensemble (FANTASIA internal): Combines predictions from all enabled models, but for each protein keeps only the annotation with the best score. This favors high-confidence transfer at the cost of discarding secondary annotations.

  • All models (external CAFA evaluation): Brute-force evaluation of all prediction files (per-model outputs and the ensemble together). This effectively merges all annotations, leading to broader coverage but also potential redundancy.
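The difference between the two strategies can be sketched as follows. This is one possible reading of the descriptions above, not FANTASIA's actual implementation: the record layout `(protein_id, go_term, score)`, the function names, and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (protein_id, go_term, score).
Record = Tuple[str, str, float]


def ensemble_best_only(per_model: Dict[str, List[Record]]) -> List[Record]:
    """For each protein, keep only the single best-scoring annotation."""
    best: Dict[str, Record] = {}
    for records in per_model.values():
        for rec in records:
            protein, _, score = rec
            if protein not in best or score > best[protein][2]:
                best[protein] = rec
    return sorted(best.values())


def all_models_merge(per_model: Dict[str, List[Record]]) -> List[Record]:
    """Concatenate every prediction from every model, duplicates included."""
    merged: List[Record] = []
    for records in per_model.values():
        merged.extend(records)
    return merged


# Toy data, invented for this example only.
toy = {
    "model_a": [("P1", "GO:1", 0.9), ("P1", "GO:2", 0.6)],
    "model_b": [("P1", "GO:1", 0.7), ("P2", "GO:3", 0.8)],
}
ensemble = ensemble_best_only(toy)
merged = all_models_merge(toy)
print(len(ensemble), "annotations in the ensemble;", len(merged), "after merging")
```

In the toy data, the ensemble drops the secondary annotation GO:2 for P1, while the merged set keeps it and also contains the pair (P1, GO:1) twice, which is the redundancy mentioned above.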

Interpretation

  • Every individual protein language model (PLM), especially Prot-T5 and ESM3c, outperforms the internal ensemble in terms of Fmax.

  • The internal ensemble achieves higher recall (0.86–0.93) but at the cost of lower precision (0.51–0.73), which explains its lower Fmax compared to the best single models.

  • In contrast, the all_models evaluation (performed externally with the CAFA evaluator) merges the predictions from all PLMs together with the ensemble. It indicates an upper bound on annotation capacity, with even broader coverage but also more redundancy, and is therefore not directly comparable to the internal ensemble.
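The effect of the precision and recall trade-off on the F-score can be checked with a short calculation. The F-score is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two. The sketch below combines the lower precision bound reported for the internal ensemble with both reported recall bounds. These pairings are assumptions made only for illustration, since the source gives ranges rather than matched precision and recall values.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative pairings only: the source reports ranges, not matched pairs.
f_low_recall = f_score(precision=0.51, recall=0.86)
f_high_recall = f_score(precision=0.51, recall=0.93)
f_perfect_recall = f_score(precision=0.51, recall=1.0)  # hypothetical limit

print(f"{f_low_recall:.3f} {f_high_recall:.3f} {f_perfect_recall:.3f}")
```

With precision fixed at 0.51, raising recall from 0.86 to 0.93 moves the F-score only from about 0.640 to about 0.659, and even perfect recall would stay below 0.68. High recall alone therefore cannot offset low precision.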