Key Features
------------

- **✅ Multiple embedding models**
  Currently supports the protein language models **ProtT5**, **ProstT5**, and **ESM2** for sequence representation.

- **🔍 Filtering by sequence similarity**
  Removes redundant sequences with the standard tool **CD-HIT**, using an adjustable similarity threshold to control the redundancy level. This matters for reliable benchmarking and evaluation of methods.

- **💾 Optimized data storage**
  Embeddings of the input sequences are stored in **HDF5 format**, while similarity lookups run against a vector database (**pgvector in PostgreSQL**) for fast retrieval.

- **🚀 Efficient similarity lookup**
  Performs high-speed searches with **pgvector**, enabling annotation based on embedding similarity.

- **🔬 Functional annotation by similarity in the embedding space**
  Assigns Gene Ontology (GO) terms (Molecular Function, Biological Process, and Cellular Component) to proteins based on **embedding space similarity**. Only the most specific term is transferred, and only annotations backed by the evidence codes that CAFA treats as experimental are transferred (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC).
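The annotation transfer described in the last feature can be illustrated with a minimal, pure-Python sketch. It is not the project's implementation: the function names, the dictionary keys (`embedding`, `annotations`), the `max_distance` threshold, and the toy three-dimensional vectors are all assumptions made for illustration. The real pipeline performs the lookup in pgvector, and the "most specific term" filtering over the GO hierarchy is omitted here.

```python
from math import sqrt

# Evidence codes listed above as the CAFA experimental standard.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_annotations(query, references, max_distance=0.5):
    """Copy experimentally supported GO terms from the nearest reference.

    `references` is a list of dicts with the hypothetical keys "embedding"
    (a vector) and "annotations" (a list of (go_id, evidence_code) pairs).
    Returns an empty list when the nearest reference is farther away than
    `max_distance` (an assumed, illustrative threshold).
    """
    best = min(references, key=lambda r: cosine_distance(query, r["embedding"]))
    if cosine_distance(query, best["embedding"]) > max_distance:
        return []
    return sorted(
        go_id
        for go_id, code in best["annotations"]
        if code in EXPERIMENTAL_CODES
    )


# Toy reference set; vectors are illustrative, not real model embeddings.
references = [
    {
        "embedding": [1.0, 0.0, 0.0],
        "annotations": [("GO:0003677", "IDA"), ("GO:0005634", "IEA")],
    },
    {
        "embedding": [0.0, 1.0, 0.0],
        "annotations": [("GO:0016301", "EXP")],
    },
]

# The IEA (electronic) annotation is dropped; only the IDA term is kept.
print(transfer_annotations([0.9, 0.1, 0.0], references))  # ['GO:0003677']
```

In this sketch the evidence filter is applied after the nearest neighbour is found. Filtering the reference set before the search would be an equally reasonable design, and which one the project uses is not stated above.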