Embedder Module

Sequence Embedding Module

Overview

The Sequence Embedding Module provides a high-throughput workflow for computing and persisting protein sequence embeddings from FASTA input files. It serves as an orchestration layer around model loading, batching, task publication, and structured storage in HDF5.

Responsibilities

This module defines SequenceEmbedder, a concrete implementation built on top of protein_information_system.operation.embedding.sequence_embedding.SequenceEmbeddingManager. Its main responsibilities are:

  • Parsing input sequences from FASTA files, with optional truncation by length.

  • Enqueuing embedding tasks for all configured hidden-layer indices of each model.

  • Executing model-specific embedding routines with dynamic model loading.

  • Writing embeddings to HDF5 with a stable hierarchy and minimal metadata.

Processing Pipeline

  1. Ingest: Parse sequences from the configured FASTA file using Biopython.

  2. Batch: Partition sequences into queue batches (queue_batch_size) to control message size.

  3. Dispatch: For each enabled model, publish a task containing all batch sequences and the full list of requested layer indices.

  4. Embed: Load the appropriate model, tokenizer, and module, then compute embeddings.

  5. Persist: Store results in embeddings.h5 under per-accession, per-model, per-layer groups.
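The ingest, batch, and dispatch steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the helpers `truncate`, `batched`, and `build_messages` are hypothetical, records are plain (accession, sequence) tuples standing in for parsed FASTA records, the model configuration is assumed to be a dict keyed by model name, and the `layers` field name in each task is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional

Record = tuple[str, str]  # (accession, sequence); stand-in for a parsed FASTA record


def truncate(seq: str, max_len: Optional[int]) -> str:
    """Cut a sequence to max_len residues; None or 0 leaves it unchanged."""
    return seq[:max_len] if max_len else seq


def batched(records: Iterable[Record], size: int) -> Iterator[list[Record]]:
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def build_messages(records, models, queue_batch_size, max_len=None):
    """Build one message per (batch, enabled model) carrying all layer indices."""
    messages = []
    for batch in batched(records, queue_batch_size):
        for type_id, model_conf in models.items():
            if not model_conf.get("enabled", False):
                continue  # disabled models are never enqueued
            messages.append([
                {
                    "accession": acc,
                    "sequence": truncate(seq, max_len),
                    "embedding_type_id": type_id,
                    "layers": list(model_conf["layer_index"]),  # assumed field name
                }
                for acc, seq in batch
            ])
    return messages
```

In this sketch, publishing each element of the returned list as one queue message keeps message size bounded by queue_batch_size, independent of the size of the FASTA file.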

Input / Output

Input: A single- or multi-sequence FASTA file.

Output: An HDF5 file named embeddings.h5 with the structure:

/accession_<ID>/
    /type_<embedding_type_id>/
        /layer_<k>/
            embedding   (dataset)
            shape       (attribute)
    sequence            (dataset, optional; stored once per accession)
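
To make the hierarchy concrete, the group paths can be composed from the three identifiers. The helper names below are hypothetical and only illustrate how the path segments shown above fit together.

```python
def layer_group_path(accession: str, embedding_type_id, layer: int) -> str:
    """Return the HDF5 group path for one accession, model, and layer."""
    return f"/accession_{accession}/type_{embedding_type_id}/layer_{layer}"


def sequence_dataset_path(accession: str) -> str:
    """Return the path of the optional per-accession sequence dataset."""
    return f"/accession_{accession}/sequence"
```

For example, `layer_group_path("P12345", 1, 33)` yields `/accession_P12345/type_1/layer_33`, and the `embedding` dataset lives directly inside that group.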

Configuration

The module expects a dictionary conf with at least the following keys:

  • input : Path to the input FASTA file.

  • experiment_path : Output directory where embeddings.h5 will be written.

  • embedding.models : Model-level configuration:

      - enabled (bool) : Whether this model should be enqueued.

      - layer_index (list[int]) : Hidden-layer indices to extract.

  • embedding.batch_size (dict[str,int]) : Per-model batch sizes at embedding time.

  • embedding.queue_batch_size (int) : Number of sequences per published message.

  • embedding.max_sequence_length (int | None) : Optional truncation length.
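
A configuration dictionary with these keys might look like the sketch below. The paths, the model names `model_a` and `model_b`, and all numeric values are placeholders, and the layout of `embedding.models` as a dict keyed by model name is an assumption; only the key names come from the list above.

```python
conf = {
    "input": "data/sequences.fasta",          # placeholder path
    "experiment_path": "experiments/run_01",  # placeholder path
    "embedding": {
        "queue_batch_size": 100,       # sequences per published message
        "max_sequence_length": None,   # no truncation
        "models": {
            # assumed layout: one entry per model, keyed by model name
            "model_a": {"enabled": True, "layer_index": [0, 1]},
            "model_b": {"enabled": False, "layer_index": [0]},
        },
        "batch_size": {"model_a": 8, "model_b": 4},  # per-model batch sizes
    },
}

# Only enabled models would be enqueued.
enabled = [name for name, m in conf["embedding"]["models"].items() if m["enabled"]]
```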

Operational Notes

  • No DB dependency: Enqueueing does not query a database or require sequence IDs in advance.

  • All layers extracted: For each model, all configured layers are included (no aggregation).

  • Device selection: Defaults to "cuda" unless overridden by conf["embedding"]["device"].

  • Idempotency: Existing per-layer datasets are skipped rather than overwritten.
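
The idempotent storage behaviour can be illustrated with h5py. This is a sketch, not the module's implementation: the function name `store_layer` is hypothetical, and attaching the `shape` attribute to the layer group (rather than to the dataset) is an assumption.

```python
import h5py
import numpy as np


def store_layer(h5: h5py.File, accession: str, type_id, layer: int, vector) -> bool:
    """Write one layer embedding; return False if it already exists."""
    # require_group creates missing intermediate groups and reuses existing ones.
    group = h5.require_group(f"accession_{accession}/type_{type_id}/layer_{layer}")
    if "embedding" in group:
        return False  # existing dataset is skipped, never overwritten
    data = np.asarray(vector)
    group.create_dataset("embedding", data=data)
    group.attrs["shape"] = data.shape  # assumed attribute location
    return True
```

Opening the file in append mode (`h5py.File(path, "a")`) lets repeated runs add missing layers without touching datasets written earlier.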

Error Handling & Logging

  • Missing FASTA or I/O errors are raised and logged.

  • Inconsistent batches (multiple embedding_type_id values) trigger a ValueError.

  • Each storage operation logs whether a dataset was created or skipped.

Dependencies

  • Biopython (FASTA parsing via Bio.SeqIO).

  • h5py (structured storage).

  • Model registry and dynamic loading provided by protein_information_system.operation.embedding.sequence_embedding.SequenceEmbeddingManager.

Intended Use

This module is the first stage in an embedding-driven functional annotation pipeline. Downstream consumers typically perform similarity search, annotation transfer, or clustering using the stored embeddings.

Public API

class fantasia.src.embedder.SequenceEmbedder(*args: Any, **kwargs: Any)

Bases: SequenceEmbeddingManager

High-throughput computation of protein sequence embeddings from FASTA input.

The SequenceEmbedder orchestrates model loading, batching, optional sequence truncation, and storage of per-layer embeddings into HDF5. It supports multiple embedding models in parallel and produces structured outputs suitable for downstream similarity search, annotation transfer, or clustering.

Parameters:
  • conf (dict) – Configuration dictionary with input paths, model definitions, batch sizes, and optional filters.

  • current_date (str) – Timestamp string for naming outputs and logs.

fasta_path

Path to the input FASTA file with sequences to embed.

Type:

str

experiment_path

Directory where embeddings.h5 and logs are written.

Type:

str

queue_batch_size

Number of sequences per published task message.

Type:

int

max_sequence_length

Optional truncation length (0 disables truncation).

Type:

int

batch_sizes

Per-model embedding batch sizes.

Type:

dict

model_instances

Dynamically loaded model objects, keyed by embedding_type_id.

Type:

dict

tokenizer_instances

Tokenizer objects, keyed by embedding_type_id.

Type:

dict

types

Metadata for each enabled model (e.g. thresholds, batch size, module).

Type:

dict

results

In-memory embedding results (used for aggregation/debugging).

Type:

list

enqueue() → None

Read the input FASTA and enqueue all sequences for all enabled models, emitting all configured layers for each model in a single message per model.

process(task_data)

Compute embeddings for a batch of protein sequences using a specific model.

Each task in the batch must reference the same embedding_type_id, which is used to retrieve the appropriate model, tokenizer, and embedding module. The method delegates the actual embedding logic to the dynamically loaded module.

Parameters:

task_data (list of dict) – A batch of embedding tasks. Each task should include:

  - 'sequence': str, amino acid sequence.

  - 'accession': str, identifier of the sequence.

  - 'embedding_type_id': str, key for the embedding model.

Returns:

A list of embedding records. Each record includes the embedding vector, shape, accession, and embedding_type_id.

Return type:

list of dict

Raises:
  • ValueError – If the batch includes multiple embedding types.

  • Exception – For any other error during embedding generation.
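
The single-model requirement can be checked before any model is loaded. The following is a minimal sketch of such a check, with a hypothetical helper name; it is not taken from the module's source.

```python
def single_embedding_type(task_data: list[dict]):
    """Return the batch's embedding_type_id, or raise if the batch is mixed."""
    type_ids = {task["embedding_type_id"] for task in task_data}
    if len(type_ids) != 1:
        # An empty batch (zero ids) is rejected here as well.
        found = sorted(map(str, type_ids))
        raise ValueError(f"Expected exactly one embedding_type_id per batch, got {found}")
    return type_ids.pop()
```

Failing early this way avoids loading a model for a batch that could not be processed consistently.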

store_entry(results)

Persist per-layer embeddings into an HDF5 file using a stable, idempotent group hierarchy.