Key Features
------------

- **✅ Advanced Embedding Models**
  Supports the protein language models **ProtT5**, **ProstT5**, and **ESM2** for sequence representation.

- **🔁 Redundancy Filtering**
  Reduces bias by removing highly similar sequences from the reference database using **MMseqs2**. Configurable identity and coverage thresholds control how proteins are clustered. This improves generalization by preventing annotations from being transferred from near-identical entries.

- **🌿 Taxonomy-Based Filtering**
  Excludes specific taxa from the annotation reference set, or restricts the set to them, based on **NCBI Taxonomy IDs**. Supports descendant expansion for clade-level filtering. Essential for studies that target particular lineages or need to exclude over-represented model organisms.

- **💾 Optimized Data Storage**
  Embeddings of input sequences are stored in **HDF5 format**. The reference table is hosted in a **public relational PostgreSQL database** using **pgvector**.

- **🚀 Efficient Similarity Lookup**
  Reference vectors are retrieved from the **PostgreSQL database with pgvector**, and the comparison against query embeddings is performed at high speed using **in-memory computations**.

- **🔬 Functional Annotation by Similarity**
  Assigns Gene Ontology (GO) terms to proteins based on **embedding space similarity**, leveraging pre-trained embeddings.
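
The similarity lookup and annotation transfer described above can be illustrated with a minimal, standard-library sketch. It is not the project's implementation: the function names (``cosine_distance``, ``transfer_go_terms``), the use of cosine distance, the ``max_distance`` and ``k`` parameters, and the toy reference entries are all assumptions made for illustration. In practice the reference vectors would come from the PostgreSQL/pgvector table rather than a hard-coded list.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transfer_go_terms(query, reference, max_distance=0.5, k=1):
    """Copy GO terms from the k closest reference entries within max_distance.

    `reference` is a list of (accession, embedding, go_terms) tuples
    (a hypothetical layout, not the real database schema).
    Returns a dict mapping each GO term to the smallest distance
    at which it was observed.
    """
    # Score every reference entry in memory and keep those under the cutoff.
    hits = []
    for accession, embedding, go_terms in reference:
        dist = cosine_distance(query, embedding)
        if dist <= max_distance:
            hits.append((dist, accession, go_terms))

    # Closest entries first; ties keep their original order.
    hits.sort(key=lambda hit: hit[0])

    annotations = {}
    for dist, _accession, go_terms in hits[:k]:
        for term in go_terms:
            annotations[term] = min(dist, annotations.get(term, dist))
    return annotations


# Toy data: three-dimensional stand-ins for real embeddings.
reference = [
    ("ref_A", [1.0, 0.0, 0.0], ["GO:0003677"]),
    ("ref_B", [0.0, 1.0, 0.0], ["GO:0005524"]),
    ("ref_C", [0.9, 0.1, 0.0], ["GO:0003677", "GO:0005524"]),
]
query = [1.0, 0.05, 0.0]

nearest_only = transfer_go_terms(query, reference, k=1)
two_nearest = transfer_go_terms(query, reference, k=2)
print(sorted(nearest_only))
print(sorted(two_nearest))
```

With ``k=1`` only the terms of the single closest entry are transferred; raising ``k`` or ``max_distance`` trades precision for coverage. Redundancy and taxonomy filtering would be applied to ``reference`` before this step.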