Functional Annotation

Objective

This use case describes the functional annotation process in FANTASIA. The goal is to predict functional annotations for uncharacterized protein sequences, so that they can be classified by their similarity to proteins of known function.

FANTASIA transfers functional information from well-characterized proteins to unannotated sequences by comparing protein language model embeddings. Because matches are found in embedding space rather than by sequence alignment, the approach remains applicable to proteins with no clear homologs.

The annotation is performed using the three Gene Ontology (GO) domains:

  • F: Molecular Function

  • B: Biological Process

  • C: Cellular Component

Annotations are transferred from reference datasets. Following CAFA standards, only reference annotations supported by these GO evidence codes are used:

  • EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC

Functional Annotation Procedure

  1. Input a set of unknown protein sequences.

  2. Generate an embedding for each sequence using one or more protein language models (for example ESM, ProstT5, or ProtT5).

  3. Compare embeddings against reference datasets with known functional annotations.

  4. Assign GO terms to unknown sequences based on the closest matches.

  5. Export annotation results for further analysis or integration into biological workflows.
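Steps 3 and 4 can be illustrated with a minimal sketch. It is not FANTASIA's implementation: the function names, the in-memory reference list, and the 3-dimensional toy embeddings are all assumptions made for illustration, and the GO terms are attached to the toy proteins arbitrarily. It ranks reference proteins by Euclidean distance to a query embedding and collects the GO terms of the k closest ones, where k plays the role of limit_per_entry.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_go_terms(query_emb, reference, k=5):
    """Collect GO terms from the k reference proteins closest to query_emb.

    reference: list of (protein_id, embedding, set_of_go_terms) tuples.
    Returns a dict mapping each GO term to the smallest distance at which
    it was observed among the k nearest reference proteins.
    """
    ranked = sorted(reference, key=lambda rec: euclidean(query_emb, rec[1]))
    result = {}
    for _protein_id, emb, terms in ranked[:k]:
        dist = euclidean(query_emb, emb)
        for term in terms:
            result[term] = min(dist, result.get(term, dist))
    return result


# Hypothetical toy data; real embeddings have far more dimensions.
reference = [
    ("refA", [0.0, 0.0, 0.0], {"GO:0003677"}),
    ("refB", [1.0, 0.0, 0.0], {"GO:0005634", "GO:0003677"}),
    ("refC", [5.0, 5.0, 5.0], {"GO:0016020"}),
]

hits = transfer_go_terms([0.1, 0.0, 0.0], reference, k=2)
for term, dist in sorted(hits.items(), key=lambda item: item[1]):
    print(term, round(dist, 3))
```

With k=2 the distant protein refC contributes nothing, so its term is not transferred. Recording the distance next to each term is one possible design choice that lets later steps filter or rank the transferred annotations.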

Input Data

The input must be protein sequences in FASTA format, concatenated into a single file.

Example of FILENAME_query.fasta:

>query1 Unknown protein sequence
MVKFTASDLKQGERTSLP...
>query2 Hypothetical protein
MLFTGASDVKNQTWPAL...

Note: Ensure the input consists of amino acid sequences, not DNA.
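A quick pre-flight check can catch nucleotide sequences before the pipeline runs. The sketch below is a hypothetical helper, not part of FANTASIA, and the sequences are toy data. It parses FASTA text and flags any record composed only of nucleotide letters. This is a heuristic: a short peptide made up solely of A, C, G, T, and N residues would be flagged too.

```python
DNA_LETTERS = set("ACGTUN")


def read_fasta(text):
    """Parse FASTA text into a dict of {sequence_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def looks_like_dna(seq):
    """Heuristic: True if seq is non-empty and uses only nucleotide letters."""
    return bool(seq) and set(seq) <= DNA_LETTERS


# Toy input: one protein-like record and one nucleotide-like record.
example = """>query1 Unknown protein sequence
MVKFTASDLKQGERTSLP
>query2 Suspicious entry
ATGGCGTACGTTAGC
"""

records = read_fasta(example)
suspicious = [sid for sid, seq in records.items() if looks_like_dna(seq)]
print("Possible DNA records:", suspicious)
```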

Functional Annotation Configuration

Pipeline Configuration

# Path to the input FASTA file containing unknown protein sequences
input: data_sample/FILENAME_query.fasta

# Reference tag used for lookup operations.
lookup_reference_tag: GOA2024  # Accepted values: "0" (no filtering) | "GOA2024" (excludes GOA2022)

# Number of closest proteins to consider for annotation transfer.
limit_per_entry: 5  # Default: 5; can be tuned.

# Prefix for output file names.
fantasia_prefix: FILENAME_query_annotated

Embedding Configuration

embedding:
  distance_metric: "<->"  # Options: "<=>" (cosine) | "<->" (Euclidean, default)
  models:
    esm:
      enabled: True
      distance_threshold: 0
      batch_size: 32
    prost_t5:
      enabled: True
      distance_threshold: 0
      batch_size: 32
    prot_t5:
      enabled: True
      distance_threshold: 0
      batch_size: 32
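The two distance_metric options can rank the same reference proteins differently. The sketch below uses toy vectors and hypothetical function names to compute both distances in plain Python. The operator strings resemble the distance operators of the pgvector PostgreSQL extension, where `<->` is Euclidean distance and `<=>` is cosine distance; that the pipeline relies on pgvector is an assumption here, not something stated above.

```python
import math


def euclidean_distance(a, b):
    """L2 distance: sensitive to both direction and vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u = [1.0, 0.0]
v = [3.0, 0.0]  # same direction as u, three times longer
w = [0.0, 1.0]  # orthogonal to u, same length

d_uv_euclid = euclidean_distance(u, v)
d_uv_cos = cosine_distance(u, v)
d_uw_euclid = euclidean_distance(u, w)
d_uw_cos = cosine_distance(u, w)

print("u vs v:", d_uv_euclid, d_uv_cos)
print("u vs w:", d_uw_euclid, d_uw_cos)
```

Under Euclidean distance w is closer to u than v is, while under cosine distance v is at distance zero from u. A distance_threshold value tuned for one metric therefore does not carry over to the other.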

Functional Analysis

# Enable or disable file formatting for TOPGO downstream analyses
topgo: true  # Accepted values: "true" (enabled) | "false" (disabled)

Results

Two main output files are generated:

  1. FILENAME_query.csv → Contains predicted annotations for each sequence.

  2. FILENAME_query.TOPGO.txt → Contains annotations formatted for TOPGO software.

These files can feed downstream analyses such as GO enrichment studies and pathway prediction.