.. _functional_annotation: ========================================== Functional Annotation ========================================== Objective --------- This use case describes the **functional annotation process** in **FANTASIA**. The goal is to predict **functional annotations for unknown sequences**, enabling their classification based on similarity to known protein functions. FANTASIA leverages **embedding-based approaches** to transfer functional information from well-characterized proteins to unannotated sequences. This method provides a reliable annotation strategy, especially for proteins with no clear homologs. The annotation is performed using the three **Gene Ontology (GO)** domains: - **F**: Molecular Function - **B**: Biological Process - **C**: Cellular Component Annotations are assigned based on similarity to reference datasets following **CAFA** standards: - **EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC** Functional Annotation Procedure -------------------------------- 1. **Input a set of unknown protein sequences**. 2. **Generate embeddings** for each sequence using **ESM, ProtT5, or other models**. 3. **Compare embeddings** against reference datasets with known functional annotations. 4. **Assign GO terms** to unknown sequences based on the closest matches. 5. **Export annotation results** for further analysis or integration into biological workflows. Input Data ---------- The input must be **protein sequences in FASTA format**, concatenated into a single file. Example of **FILENAME_query.fasta**: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: text >query1 Unknown protein sequence MVKFTASDLKQGERTSLP... >query2 Hypothetical protein MLFTGASDVKNQTWPAL... **Note:** Ensure the input consists of amino acid sequences, not DNA. Functional Annotation Configuration ----------------------------------- Pipeline Configuration ^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml # Path to the input FASTA file containing unknown protein sequences input: data_sample/FILENAME_query.fasta # Reference tag used for lookup operations. lookup_reference_tag: GOA2024 # Accepted values: "0" (no filtering) | "GOA2024" (excludes GOA2022) # Number of closest proteins to consider for annotation transfer. limit_per_entry: 5 # Default is 5, can be optimized. # Prefix for output file names. fantasia_prefix: FILENAME_query_annotated Embedding Configuration ^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml embedding: distance_metric: "<->" # Options: "<=>" (cosine) | "<->" (Euclidean, default) models: esm: enabled: True distance_threshold: 0 batch_size: 32 prost_t5: enabled: True distance_threshold: 0 batch_size: 32 prot_t5: enabled: True distance_threshold: 0 batch_size: 32 Functional Analysis ^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml # Enable or disable file formatting for TOPGO downstream analyses topgo: true # Accepted values: "true" (enabled) | "false" (disabled) Results ------------------ Two main output files are generated: 1. **FILENAME_query.csv** → Contains predicted annotations for each sequence. 2. **FILENAME_query.TOPGO.txt** → Contains annotations formatted for **TOPGO** software. These results enable further downstream analysis, including enrichment studies and pathway predictions.