Benchmarking

Objective

This use case provides a basic benchmarking setup for functional protein annotation. The goal is to generate annotations for proteins that are not present in the reference sets, so that the performance of different methods can be compared in downstream analyses.

An example of this approach was published in NAR Genomics and Bioinformatics (NARGAB) [NARGAB2024], where annotations were evaluated using several metrics.

The reference datasets consist of proteins annotated in the three Gene Ontology (GO) domains:

F: Molecular Function
B: Biological Process
C: Cellular Component

The evidence codes considered follow the CAFA standards, as defined by the CAFA Initiative [CAFA] and the Gene Ontology Annotation Database (GOA) [GOA].

Step-by-Step Procedure

Extract a reference proteome, e.g., mouse (*Mus musculus*).
Remove mouse sequences from the reference dataset (both sequences and embeddings) to avoid model biases.
Remove identical sequences from the reference dataset (both sequences and embeddings) to minimize the impact of closely related species.
Execute the analysis pipeline and transfer annotations based on embedding similarity (GOA2024). GOA2022 is also available for comparison with previously published analyses.
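
The two removal steps above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: the helper names `parse_fasta` and `filter_reference` are hypothetical, the species is matched through the UniProt-style `OX=` taxonomy field in the header, and "identical sequences" is assumed here to mean reference sequences that exactly match a query sequence.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_reference(reference, queries, taxon_tag="OX=10090"):
    """Drop reference records from the benchmark species or identical to a query.

    taxon_tag is an assumption: UniProt headers carry the NCBI taxonomy ID
    as an OX= field, and 10090 is the ID shown in the mouse example headers.
    """
    query_seqs = {seq.upper() for _, seq in queries}
    kept = []
    for header, seq in reference:
        if taxon_tag in header.split():
            continue  # same species as the benchmark proteome
        if seq.upper() in query_seqs:
            continue  # sequence identical to a query protein
        kept.append((header, seq))
    return kept


# Toy data, invented for illustration only.
queries = parse_fasta(">q1 OS=Mus musculus OX=10090\nMSCAGEALAPG\n")
reference = parse_fasta(
    ">r1 OS=Mus musculus OX=10090\nMKKLLPT\n"
    ">r2 OS=Rattus norvegicus OX=10116\nMSCAGEALAPG\n"
    ">r3 OS=Homo sapiens OX=9606\nMAAAQRT\n"
)
kept = filter_reference(reference, queries)
print([header.split()[0] for header, _ in kept])  # ['r3']
```

Because the procedure removes both sequences and embeddings, the identifiers dropped here would also need to be removed from wherever the reference embeddings are stored.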

We use the default settings from GoPredSim [GoPredSim], as implemented in the GoPredSim repository, for consistency with the original methods, though many parameters can be adjusted.

Input Data

Input data must be protein sequences in FASTA format, as described in the EMBOSS documentation [EMBOSS], concatenated into a single file.

Example of FILENAME_test.fasta:

>tr|A0A087WPB2|A0A087WPB2_MOUSE MyoD family inhibitor domain containing OS=Mus musculus OX=10090 GN=Mdfic PE=1 SV=1
MSCAGEALAPGPAEQQCPVEAGGGRLGSPAHEACNEDNTEKDKRPATSGHTRCGLMRDQS
...
>tr|A0A087WQA5|A0A087WQA5_MOUSE TAR DNA binding protein (Fragment) OS=Mus musculus OX=10090 GN=Tardbp PE=1 SV=1
XDETDASSAVKVKRAVQKTSDLIVLGLPWKTTEQDLKDYFSTFGEVLMVQVKKDLKTGHS
...

Note: The pipeline automatically validates that input files contain protein sequences rather than DNA.
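
For a quick check of a file before submitting it, a simple heuristic can flag sequences that look like nucleotides. The sketch below is not the pipeline's validation logic, which is not described here; the function name `looks_like_protein`, the accepted alphabet, and the 90% threshold are all assumptions chosen for illustration.

```python
# Letters that also occur in nucleotide sequences (including U and N).
DNA_LETTERS = set("ACGTUN")
# 20 standard amino acids plus common ambiguity/extension codes and stop (*).
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def looks_like_protein(sequence, dna_fraction=0.9):
    """Return True if the sequence is plausibly a protein.

    A sequence is rejected when it is empty, contains characters outside
    the protein alphabet, or consists almost entirely of nucleotide letters.
    The 0.9 cutoff is an arbitrary illustrative value.
    """
    seq = sequence.upper()
    if not seq or not set(seq) <= PROTEIN_LETTERS:
        return False
    dna_like = sum(ch in DNA_LETTERS for ch in seq) / len(seq)
    return dna_like < dna_fraction


print(looks_like_protein("MSCAGEALAPGPAEQQCPVEAGGGRLGSPAHEACNEDNTEKDKRPATSGHTRCGLMRDQS"))  # True
print(looks_like_protein("ATGCGTACGTTAGC"))  # False
```

A composition heuristic of this kind can misclassify very short peptides made up only of A, C, G, and T, so it is best treated as a sanity check rather than a guarantee.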

Configuration Parameters

Pipeline Configuration

Copy the configuration below into a file named `benchmark_config.yaml` using a text editor. The full configuration file, which includes additional essential system parameters, is available at [protein-metamorphisms-is/config/config.yaml](https://github.com/CBBIO/protein-metamorphisms-is/blob/main/protein_metamorphisms_is/config/config.yaml).

# Path to the input FASTA file containing protein sequences
input: data_sample/FILENAME_test.fasta

# Reference tag used for lookup operations.
lookup_reference_tag: GOA2022  # "0" (enables GOA2024) | "GOA2024" (includes GOA2022)

# Number of closest proteins to consider in the lookup.
limit_per_entry: 1  # k=1 is used in the GoPredSim method.

# Prefix for output file names.
fantasia_prefix: FILENAME_test_Prot_100_1.2

# Sequence length filtering threshold.
length_filter: 5000000  # Refers to the length of the sequence in amino acids. A high value means no filtering.

# Redundancy filtering threshold (removes identical sequences).
redundancy_filter: 1  # "0" (no filtering) | "1-0.5" (100%-50% redundancy removal)

Description of Parameters:

lookup_reference_tag: Defines which reference database is used for annotation lookup. Allows switching between GOA2022 and GOA2024 [GOA] to assess differences in methods.
limit_per_entry: Determines how many similar proteins are considered for annotation transfer. k=1 follows GoPredSim [GoPredSim] but can be adjusted.
length_filter: Set to a high value to avoid sequence length filtering by default. It can be adjusted to remove abnormally long proteins if needed. Our new implementation correctly handles sequences longer than 5K amino acids.
redundancy_filter: Controls the removal of identical sequences to prevent biases in method comparisons.
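
To make the role of `limit_per_entry` concrete, the following sketch transfers GO terms from the k nearest reference proteins in embedding space. It is a toy, in-memory illustration under stated assumptions: the function `transfer_annotations`, the two-dimensional embeddings, and the reference entries are invented, and the actual pipeline may perform the lookup differently.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotations(query, reference, k=1):
    """Return {go_term: (source_protein, distance)} from the k nearest references.

    reference is a list of (protein_id, embedding, go_terms) tuples.
    k plays the role of limit_per_entry in the configuration above.
    """
    scored = sorted(
        ((euclidean(query, emb), pid, terms) for pid, emb, terms in reference),
        key=lambda item: item[0],
    )
    transferred = {}
    for distance, pid, terms in scored[:k]:
        for term in terms:
            # Keep the closest source if a term is seen more than once.
            transferred.setdefault(term, (pid, distance))
    return transferred


# Toy reference set, invented for illustration only.
reference = [
    ("P1", [0.0, 0.0], ["GO:0003677"]),
    ("P2", [1.0, 0.0], ["GO:0005634"]),
    ("P3", [5.0, 5.0], ["GO:0006355"]),
]
query = [0.9, 0.1]

print(sorted(transfer_annotations(query, reference, k=1)))  # ['GO:0005634']
print(sorted(transfer_annotations(query, reference, k=2)))  # ['GO:0003677', 'GO:0005634']
```

With k=1 only the single closest protein contributes annotations; larger values widen the set of transferred terms at the cost of including more distant neighbors.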

Embedding Configuration

embedding:
  distance_metric: "<->"  # "<=>" (cosine) | "<->" (Euclidean, default)
  models:
    esm:
      enabled: True
      distance_threshold: 0
      batch_size: 32
    prost_t5:
      enabled: True
      distance_threshold: 0
      batch_size: 32
    prot_t5:
      enabled: True
      distance_threshold: 0
      batch_size: 32

Model and tool references:

ESM2 [ESM2]
ProtT5 [ProtT5]
ProstT5 [ProstT5]
CD-HIT [CDHIT]
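
The choice of `distance_metric` can change which reference protein counts as the nearest neighbor. The sketch below implements the two metrics named in the configuration comment, Euclidean for `"<->"` and cosine distance for `"<=>"`, in plain Python. The `METRICS` mapping and the `nearest` helper are hypothetical names used only for this example, and the vectors are toy values.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance, sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, sensitive only to vector direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Map the operator strings from the configuration to distance functions.
METRICS = {"<->": l2_distance, "<=>": cosine_distance}


def nearest(query, reference, metric="<->"):
    """Return the ID of the reference embedding closest to the query."""
    distance = METRICS[metric]
    return min(reference, key=lambda pid: distance(query, reference[pid]))


# Toy embeddings: A points in the same direction as the query but is far away,
# B is close in space but points in a different direction.
reference = {"A": [10.0, 10.0], "B": [1.0, 0.0]}
query = [1.0, 1.0]

print(nearest(query, reference, "<->"))  # B
print(nearest(query, reference, "<=>"))  # A
```

Since the two metrics can rank neighbors differently, benchmark results obtained with one setting are not directly comparable to results obtained with the other.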

Functional Analysis

# Enable or disable file formatting for TOPGO downstream analyses
topgo: True  # "True" (enabled) | "False" (disabled)

Reference: TOPGO [TOPGO]
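
As a rough illustration of what formatting for topGO can involve: topGO's `readMappings` function reads a gene-to-GO mapping file with one identifier per line, a tab, and a comma-separated list of GO terms. The sketch below writes annotations in that layout. Whether the pipeline's own output file follows exactly this layout is an assumption, and `to_topgo_lines` is a hypothetical helper.

```python
def to_topgo_lines(annotations):
    """Format {protein_id: [go_terms]} as 'ID<TAB>GO:..., GO:...' lines.

    Proteins without any GO term are skipped, and terms are de-duplicated
    and sorted so the output is deterministic.
    """
    lines = []
    for protein_id in sorted(annotations):
        terms = sorted(set(annotations[protein_id]))
        if terms:
            lines.append(f"{protein_id}\t{', '.join(terms)}")
    return lines


# Toy annotations, invented for illustration only.
annotations = {
    "P2": ["GO:0006355"],
    "P1": ["GO:0005634", "GO:0003677"],
    "P3": [],
}
for line in to_topgo_lines(annotations):
    print(line)
```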

Results

Two main output files are generated:

FILENAME_test.csv → Contains detailed information for each query protein.
FILENAME_test.TOPGO.txt → Contains annotations formatted for TOPGO software.

These results can be used to evaluate prediction accuracy and compare the performance of different methods.
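
One simple way to score predictions is per-protein precision, recall, and F1 over sets of GO terms, averaged across proteins. The sketch below is a minimal example of that idea, not the evaluation used in the cited publication or in CAFA: it ignores the GO hierarchy and prediction scores, and the function names and toy data are invented.

```python
def precision_recall_f1(predicted, truth):
    """Set-based precision, recall, and F1 for one protein."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(predictions, ground_truth):
    """Average the per-protein scores over all proteins in ground_truth.

    Proteins with no prediction count as precision = recall = F1 = 0.
    """
    scores = [
        precision_recall_f1(predictions.get(pid, ()), terms)
        for pid, terms in ground_truth.items()
    ]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Toy data, invented for illustration only.
ground_truth = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0006355"},
}
predictions = {
    "P1": {"GO:0003677"},
    "P2": {"GO:0006355", "GO:0005634"},
}

precision, recall, f1 = macro_average(predictions, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Scoring each GO domain (F, B, C) separately is usually more informative than pooling them, since the domains differ in size and annotation depth.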

References

[NARGAB2024]

Example of the benchmarking approach published in NAR Genomics and Bioinformatics, DOI: 10.1093/nargab/lqae078.

[CAFA]

CAFA Initiative, available at: https://biofunctionprediction.org/cafa/.

[GOA]

Gene Ontology Annotation Database (GOA), available at: https://www.ebi.ac.uk/GOA/downloads.

[GoPredSim]

GoPredSim repository, available at: https://github.com/Rostlab/goPredSim/blob/master/file_utils.py.

[EMBOSS]

EMBOSS documentation, available at: http://emboss.open-bio.org/html/use/apas01.html.

[ESM2]

ESM2 model on Hugging Face, available at: https://huggingface.co/facebook/esm2_t36_3B_UR50D.

[ProtT5]

ProtT5 model on Hugging Face, available at: https://huggingface.co/Rostlab/prot_t5_xl_uniref50.

[ProstT5]

ProstT5 model on Hugging Face, available at: https://huggingface.co/Rostlab/ProstT5.

[CDHIT]

CD-HIT tool, available at: https://www.bioinformatics.org/cd-hit/.

[TOPGO]

TOPGO software, available at: https://bioconductor.org/packages/release/bioc/html/topGO.html.