Key Features
------------

- **✅ Availability of different Embedding Models**

  Currently supports the protein language models: **ProtT5**, **ProstT5**, and **ESM2** for sequence representation.

- **🔍 Filtering by sequence similarity**

  Removes redundant sequences with the standard tool **CD-HIT**, using an adjustable sequence-identity threshold to control the level of redundancy. This matters for reliable benchmarking and evaluation of the methods.

- **💾 Optimized Data Storage**

  Embeddings of the input sequences are stored in **HDF5 format**, while similarity lookups are performed in a vector database (**pgvector in PostgreSQL**) for fast retrieval.

- **🚀 Efficient Similarity Lookup**

  Performs high-speed searches using **pgvector**, enabling accurate annotation based on embedding similarity.

- **🔬 Functional Annotation by Similarity in the Embedding space**

  Assigns Gene Ontology (GO) terms (Molecular Function, Biological Process, and Cellular Component) to proteins based on **embedding space similarity**. Only the most specific terms are transferred, and only annotations supported by the experimental evidence codes used in CAFA (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC) are considered.
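
The annotation-by-similarity idea in the last feature can be illustrated with a small, self-contained sketch. This is not the project's implementation: it replaces the pgvector lookup with an in-memory search, uses cosine distance as an assumed metric, and does not implement the "most specific term" filtering. The function names, the `max_distance` parameter, the dictionary keys, and the toy data are all hypothetical.

```python
import math

# Experimental evidence codes listed in the feature description above.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_go_terms(query, references, max_distance=0.5):
    """Copy experimentally supported GO terms from the nearest reference.

    `references` is a list of dicts with the hypothetical keys "embedding"
    (a list of floats) and "annotations" (a list of (go_id, evidence_code)
    pairs). Returns a sorted list of GO ids, or [] if no reference lies
    within `max_distance` of the query.
    """
    if not references:
        return []
    # Linear scan for the closest reference embedding (stand-in for pgvector).
    nearest = min(references, key=lambda r: cosine_distance(query, r["embedding"]))
    if cosine_distance(query, nearest["embedding"]) > max_distance:
        return []
    # Keep only annotations backed by an experimental evidence code.
    return sorted(
        {go_id for go_id, code in nearest["annotations"] if code in EXPERIMENTAL_CODES}
    )


# Toy reference set: three-dimensional vectors instead of real embeddings.
references = [
    {"embedding": [1.0, 0.0, 0.0],
     "annotations": [("GO:0003677", "IDA"), ("GO:0005634", "IEA")]},
    {"embedding": [0.0, 1.0, 0.0],
     "annotations": [("GO:0016301", "EXP")]},
]

print(transfer_go_terms([0.9, 0.1, 0.0], references))  # ['GO:0003677']
```

In this toy run the query is closest to the first reference, and the `IEA` annotation is dropped because it is not an experimental evidence code. A query that is far from every reference yields an empty list rather than a low-confidence transfer.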