Data Source
===========

FANTASIA requires two types of data:

- The input data, provided by the user (protein sequences in FASTA format).
- The reference data (lookup table), which contains precomputed embeddings and GO annotations used for nearest-neighbor annotation transfer.

This reference table was generated using the Protein Information System (PIS) [1]_, an integrated and automated platform that extracts protein data from UniProt, PDB, and GOA, and computes protein embeddings using modern Protein Language Models (PLMs).

.. figure:: _static/PIS.png
   :alt: FANTASIA Pipeline Overview
   :align: center
   :width: 80%

   Embeddings from multiple models are computed for each protein sequence.

Default Reference Dataset – FANTASIA V3
---------------------------------------

The lookup table used by FANTASIA V3 was generated in **late July 2025** using version **2.0.0** of the Protein Information System (PIS) [1]_. It consists of a **PostgreSQL database backup** using the pgvector extension to store protein embeddings.

This reference includes only **experimentally supported annotations**, extracted directly from UniProt. It is the default and recommended dataset for functional annotation in FANTASIA.

Key improvements over previous versions (GOA2022, GOA2024, GOA2025 APRIL):

- Fixed a bug that truncated embeddings to 512 dimensions.
- Expanded model coverage from 3 to 5 PLMs, now including **Ankh3-Large** and **ESM3c**.
- Replaced **ESM-1b (8M parameters)** with **ESM-2 (650M parameters)**.
- Removed computational annotations; includes only **GO terms with experimental evidence codes**.

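
The nearest-neighbor annotation transfer mentioned above can be illustrated with a small, self-contained sketch. It is not FANTASIA's implementation: the in-memory ``LOOKUP`` dictionary, the three-dimensional toy embeddings, the accession names, and the ``transfer_annotations`` helper are all hypothetical, and cosine distance is assumed as the similarity measure. In the real pipeline the embeddings live in the PostgreSQL database and have far more dimensions.

.. code-block:: python

   import math

   def cosine_distance(a, b):
       """Return 1 - cosine similarity between two equal-length vectors."""
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(y * y for y in b))
       return 1.0 - dot / (norm_a * norm_b)

   # Hypothetical toy lookup table: accession -> embedding and GO terms.
   LOOKUP = {
       "REF_A": {"embedding": [1.0, 0.0, 0.0], "go_terms": ["GO:0003677"]},
       "REF_B": {"embedding": [0.0, 1.0, 0.0], "go_terms": ["GO:0005524"]},
       "REF_C": {"embedding": [0.9, 0.1, 0.0],
                 "go_terms": ["GO:0003677", "GO:0006355"]},
   }

   def transfer_annotations(query_embedding, lookup, k=1):
       """Return (accession, distance, go_terms) for the k closest references."""
       scored = [
           (accession, cosine_distance(query_embedding, entry["embedding"]),
            entry["go_terms"])
           for accession, entry in lookup.items()
       ]
       # Smallest distance first: the closest reference is the best candidate.
       scored.sort(key=lambda hit: hit[1])
       return scored[:k]

   hits = transfer_annotations([1.0, 0.05, 0.0], LOOKUP, k=2)
   for accession, distance, go_terms in hits:
       print(accession, round(distance, 4), go_terms)

Each hit carries the GO terms of the matched reference protein, so the query inherits annotations from its closest neighbors in embedding space. A distance cutoff or a different value of ``k`` would change how permissive the transfer is.
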
Dataset Details
---------------

- Total proteins: 127,546
- Total sequences: 124,397
- Total embeddings: 621,849
- Total GO annotations: 627,932

Included GO evidence codes (experimental only):

- ``EXP`` – Inferred from Experiment
- ``IDA`` – Inferred from Direct Assay
- ``IPI`` – Inferred from Physical Interaction
- ``IMP`` – Inferred from Mutant Phenotype
- ``IGI`` – Inferred from Genetic Interaction
- ``IEP`` – Inferred from Expression Pattern
- ``TAS`` – Traceable Author Statement
- ``IC`` – Inferred by Curator

Supported Embedding Models
--------------------------

- ESM-2 (650M parameters)
- ProtT5-XL-UniRef50 (~1.2B parameters)
- ProstT5 (~1.2B parameters)
- Ankh3-Large (620M parameters)
- ESM3c (Cambrian 600M)

Each model provides high-dimensional representations of protein sequences used for functional similarity comparisons.

Missing Proteins
----------------

A small number of proteins could not be processed on the Finisterrae III (CESGA) supercomputer due to memory limitations on 40 GB A100 GPUs.

References
----------

.. [1] Protein Information System (PIS): https://github.com/frapercan/protein_information_system
.. [2] GOA2025 reference database (default): https://zenodo.org/records/16582433