Key Features

  • ✅ Multiple Embedding Models

    Sequences are represented with one of the currently supported protein language models: ProtT5, ProstT5, or ESM2.

  • 🔍 Filtering by sequence similarity

    Removes redundant sequences with the standard tool CD-HIT; an adjustable sequence identity threshold controls how much redundancy is kept. This matters for reliable benchmarking and evaluation of annotation methods.

  • 💾 Optimized Data Storage

    Embeddings of the input sequences are stored in HDF5 format, while similarity lookups run against a vector database (pgvector in PostgreSQL) for fast retrieval.

  • 🚀 Efficient Similarity Lookup

    Performs fast nearest-neighbor searches over embeddings with pgvector; the closest matches are the basis for annotation transfer.

  • 🔬 Functional Annotation by Similarity in the Embedding Space

    Assigns Gene Ontology (GO) terms (Molecular Function, Biological Process, and Cellular Component) to proteins based on similarity in the embedding space. Only the most specific terms are transferred, and only annotations supported by experimental evidence codes, following the CAFA standard (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC), are considered.
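
The annotation transfer described in the last feature can be illustrated with a minimal, in-memory sketch. It is not the project's implementation: the real lookup runs in pgvector, whereas this example assumes cosine distance, a hypothetical `max_distance` cutoff, and hypothetical helper names (`cosine_distance`, `most_specific`, `transfer_annotations`). The GO hierarchy is a three-term toy fragment supplied by hand.

```python
import math

# Experimental evidence codes listed in the feature description.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}

# Toy fragment of the GO hierarchy (child -> parents), for illustration only.
PARENTS = {
    "GO:0016787": ["GO:0003824"],  # hydrolase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
}


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def most_specific(terms, parents):
    """Drop every term that is an ancestor of another term in the set."""
    ancestors = set()
    for term in terms:
        stack = list(parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in ancestors:
                ancestors.add(parent)
                stack.extend(parents.get(parent, ()))
    return set(terms) - ancestors


def transfer_annotations(query, references, parents, max_distance):
    """Collect experimentally supported GO terms from nearby references."""
    candidates = set()
    for ref in references:
        if cosine_distance(query, ref["embedding"]) > max_distance:
            continue  # reference is too far away in embedding space
        for go_id, evidence in ref["annotations"]:
            if evidence in EXPERIMENTAL_CODES:
                candidates.add(go_id)
    return most_specific(candidates, parents)


# Made-up 2-D embeddings; real protein embeddings have far more dimensions.
query = [1.0, 0.0]
references = [
    {
        "embedding": [0.9, 0.1],
        "annotations": [
            ("GO:0003824", "IDA"),
            ("GO:0016787", "EXP"),
            ("GO:0003674", "IEA"),  # electronic annotation, filtered out
        ],
    },
    {"embedding": [0.0, 1.0], "annotations": [("GO:0016787", "IDA")]},
]

result = transfer_annotations(query, references, PARENTS, max_distance=0.2)
print(result)
```

Here only the first reference lies within the cutoff. Its `IEA` annotation is discarded, and `GO:0003824` is dropped because the more specific `GO:0016787` is also present.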