Key Features
------------

- **✅ Advanced Embedding Models**
  Supports protein language models: **ProtT5**, **ProstT5**, and **ESM2** for sequence representation.

- **🔁 Redundancy Filtering**
  Reduces bias by removing highly similar sequences from the reference database using **MMseqs2**.
  Configurable identity and coverage thresholds control how proteins are clustered.
  This improves generalization by preventing annotation transfer from near-identical entries.

- **🌿 Taxonomy-Based Filtering**
  Includes or excludes specific taxa in the annotation reference set based on **NCBI Taxonomy IDs**.
  Supports descendant expansion for clade-level filtering. This is essential for studies that target particular
  lineages or exclude over-represented model organisms.

- **💾 Optimized Data Storage**
  Embeddings of input sequences are stored in **HDF5 format**, while the reference table is hosted in a **public
  relational PostgreSQL database** using **pgvector**.

- **🚀 Efficient Similarity Lookup**
  Performs high-speed similarity searches using **in-memory computations**. Reference vectors are retrieved from the
  **PostgreSQL database with pgvector** and compared against the input embeddings.

- **🔬 Functional Annotation by Similarity**
  Assigns Gene Ontology (GO) terms to proteins based on **embedding space similarity**, using embeddings produced by
  pre-trained protein language models.
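
The similarity lookup and annotation transfer described above can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the function names, the `k` and `max_distance` parameters, the choice of cosine distance, and the toy vectors are all assumptions made for illustration. In practice the reference vectors would come from the PostgreSQL/pgvector database and the query vectors from the HDF5 file.

```python
import numpy as np


def cosine_distances(queries, references):
    """Return an (n_queries, n_references) matrix of cosine distances."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    return 1.0 - q @ r.T


def transfer_go_terms(queries, references, reference_go, k=1, max_distance=0.5):
    """Collect GO terms from each query's k nearest references.

    Only neighbours whose distance is at most `max_distance` contribute.
    `reference_go[i]` holds the GO terms of reference row i (hypothetical layout).
    """
    distances = cosine_distances(queries, references)
    annotations = []
    for row in distances:
        nearest = np.argsort(row)[:k]  # indices of the k closest references
        terms = set()
        for idx in nearest:
            if row[idx] <= max_distance:
                terms.update(reference_go[idx])
        annotations.append(sorted(terms))
    return annotations


# Toy 2-dimensional "embeddings"; real protein embeddings have far more dimensions.
references = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
reference_go = [["GO:0003677"], ["GO:0005524"], ["GO:0003677", "GO:0016301"]]
queries = np.array([[0.9, 0.1], [-1.0, -1.0]])

result = transfer_go_terms(queries, references, reference_go, k=1, max_distance=0.2)
print(result)
```

The first query lies close to the first reference and inherits its GO term, while the second query has no reference within the distance threshold and stays unannotated. Raising `k` or `max_distance` trades precision for coverage.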