Key Features¶

✅ Availability of different Embedding Models

Currently supports the protein language models: ProtT5, ProstT5, and ESM2 for sequence representation.
🔍 Filtering by sequence similarity

Filters out sequences by sequence similarity using the standard CD-HIT, enabling redundancy levels through an adjustable threshold. This is relevant for reliable benchmarking and evaluation of the methods.
💾 Optimized Data Storage

Embeddings are stored in HDF5 format for input sequences, while similarity lookups are performed in a vector database (pgvector in PostgreSQL) for fast retrieval.
🚀 Efficient Similarity Lookup

Performs high-speed searches using pgvector, enabling accurate annotation based on embedding similarity.
🔬 Functional Annotation by Similarity in the Embedding space

Assigns Gene Ontology (GO) terms (Molecular Function, Biological Process, and Cellular Component) to proteins based on embedding space similarity. Only the most specific term is transferred and only the CAFA’s standards for Experimental evidence are transferred (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC).