Embedder Module
Sequence Embedding Module
Overview
The Sequence Embedding Module provides a high-throughput workflow for computing and persisting protein sequence embeddings from FASTA input files. It serves as an orchestration layer around model loading, batching, task publication, and structured storage in HDF5.
Responsibilities
This module defines SequenceEmbedder, a concrete implementation built on top of
protein_information_system.operation.embedding.sequence_embedding.SequenceEmbeddingManager.
Its main responsibilities are:
Parsing input sequences from FASTA files, with optional truncation by length.
Enqueuing embedding tasks for all configured hidden-layer indices of each model.
Executing model-specific embedding routines with dynamic model loading.
Writing embeddings to HDF5 with a stable hierarchy and minimal metadata.
Processing Pipeline
Ingest: Parse sequences from the configured FASTA file using Biopython.
Batch: Partition sequences into queue batches (queue_batch_size) to control message size.
Dispatch: For each enabled model, publish a task containing all batch sequences and the full list of requested layer indices.
Embed: Load the appropriate model, tokenizer, and module, then compute embeddings.
Persist: Store results in embeddings.h5 under per-accession, per-model, per-layer groups.
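The Ingest, Batch, and Dispatch steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: `read_fasta`, `batched`, and `build_tasks` are hypothetical helpers, the pure-Python parser stands in for Biopython's reader so the sketch has no dependencies, and the message keys (`model`, `layers`, `sequences`) and model names are invented for the example.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str], max_len: int | None = None) -> Iterator[tuple[str, str]]:
    """Yield (accession, sequence) pairs, truncating to max_len if given.

    Hypothetical stand-in for Biopython parsing; the accession is taken to be
    the first whitespace-delimited token of the header line.
    """
    accession, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                yield accession, "".join(chunks)[:max_len]
            accession, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)[:max_len]


def batched(items: list, size: int) -> Iterator[list]:
    """Split items into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_tasks(records: list, models: dict, queue_batch_size: int) -> list[dict]:
    """Build one message per (batch, enabled model) with all configured layers."""
    tasks = []
    for batch in batched(records, queue_batch_size):
        for name, model_conf in models.items():
            if not model_conf.get("enabled", False):
                continue  # disabled models are never enqueued
            tasks.append({
                "model": name,  # assumed message layout
                "layers": list(model_conf["layer_index"]),
                "sequences": [{"accession": a, "sequence": s} for a, s in batch],
            })
    return tasks


# Usage with made-up data: three sequences, batches of two, one enabled model.
fasta = [">P1 demo", "MKTAYIAK", "QRQ", ">P2", "GAVLI", ">P3", "MST"]
records = list(read_fasta(fasta, max_len=10))
models = {
    "model_a": {"enabled": True, "layer_index": [0, 12]},
    "model_b": {"enabled": False, "layer_index": [0]},
}
tasks = build_tasks(records, models, queue_batch_size=2)
```

With these inputs the sketch produces two messages, both for `model_a`: one carrying the first two sequences and one carrying the third.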
Input / Output
Input: A single- or multi-sequence FASTA file.
Output: An HDF5 file named embeddings.h5 with the structure:
/accession_<ID>/
    /type_<embedding_type_id>/
        /layer_<k>/
            embedding (dataset)
            shape (attribute)
    sequence (dataset, optional; stored once per accession)
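As a small illustration of the hierarchy above, the group and dataset paths can be derived from an accession, an embedding type, and a layer index. The helper names below are hypothetical and are not part of the module; only the path layout is taken from the structure shown.

```python
def layer_group_path(accession: str, embedding_type_id: str, layer: int) -> str:
    """Path of the per-layer group for one accession and one model."""
    return f"/accession_{accession}/type_{embedding_type_id}/layer_{layer}"


def embedding_dataset_path(accession: str, embedding_type_id: str, layer: int) -> str:
    """Path of the `embedding` dataset inside the per-layer group."""
    return layer_group_path(accession, embedding_type_id, layer) + "/embedding"


def sequence_dataset_path(accession: str) -> str:
    """Path of the optional `sequence` dataset, stored once per accession."""
    return f"/accession_{accession}/sequence"


# Example with made-up identifiers.
example = embedding_dataset_path("P12345", "1", 0)
```

Keeping the path construction in one place makes it easier for readers and writers of the file to agree on the layout.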
Configuration
The module expects a dictionary conf with at least the following keys:
input: Path to the input FASTA file.
experiment_path: Output directory where embeddings.h5 will be written.
embedding.models: Model-level configuration:
- enabled (bool): Whether this model should be enqueued.
- layer_index (list[int]): Hidden-layer indices to extract.
embedding.batch_size (dict[str, int]): Per-model batch sizes at embedding time.
embedding.queue_batch_size (int): Number of sequences per published message.
embedding.max_sequence_length (int | None): Optional truncation length.
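A configuration dictionary with these keys might look like the sketch below. The paths, model names, and numeric values are placeholders, and keying `embedding.models` and `embedding.batch_size` by model name is an assumption. `missing_keys` is a hypothetical helper, not part of the module, that reports which of the documented keys are absent.

```python
from __future__ import annotations

conf = {
    "input": "data/sequences.fasta",        # placeholder path
    "experiment_path": "experiments/demo",  # placeholder directory
    "embedding": {
        "models": {                          # assumed: keyed by model name
            "model_a": {"enabled": True, "layer_index": [0, 12]},
            "model_b": {"enabled": False, "layer_index": [0]},
        },
        "batch_size": {"model_a": 8, "model_b": 4},
        "queue_batch_size": 100,
        "max_sequence_length": None,         # no truncation
    },
}

REQUIRED_EMBEDDING = ("models", "batch_size", "queue_batch_size", "max_sequence_length")


def missing_keys(conf: dict) -> list[str]:
    """Return the documented keys that are absent, in dotted notation."""
    missing = [k for k in ("input", "experiment_path") if k not in conf]
    embedding = conf.get("embedding")
    if not isinstance(embedding, dict):
        return missing + ["embedding"]
    missing += [f"embedding.{k}" for k in REQUIRED_EMBEDDING if k not in embedding]
    return missing
```

Calling `missing_keys(conf)` on the dictionary above returns an empty list; a check of this kind before enqueueing surfaces configuration mistakes early.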
Operational Notes
No DB dependency: Enqueueing does not query a database or require sequence IDs in advance.
All layers extracted: For each model, all configured layers are included (no aggregation).
Device selection: Defaults to "cuda" unless overridden by conf["embedding"]["device"].
Idempotency: Existing per-layer datasets are skipped rather than overwritten.
Error Handling & Logging
Missing FASTA or I/O errors are raised and logged.
Inconsistent batches (multiple embedding_type_id values) trigger a ValueError.
Each storage operation logs whether a dataset was created or skipped.
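The batch-consistency rule can be illustrated with a short check. `single_embedding_type` is a hypothetical helper written for this example. Treating an empty batch as an error is also an assumption of the sketch.

```python
from __future__ import annotations


def single_embedding_type(task_data: list[dict]) -> str:
    """Return the one embedding_type_id shared by every task in the batch.

    Raises ValueError if the batch mixes embedding types (or is empty).
    """
    type_ids = {task["embedding_type_id"] for task in task_data}
    if len(type_ids) != 1:
        found = sorted(str(t) for t in type_ids)
        raise ValueError(f"Expected exactly one embedding_type_id per batch, got {found}")
    return next(iter(type_ids))


# Made-up batches: the first is consistent, the second mixes two types.
consistent = [
    {"accession": "P1", "sequence": "MKT", "embedding_type_id": "1"},
    {"accession": "P2", "sequence": "GAV", "embedding_type_id": "1"},
]
mixed = consistent + [{"accession": "P3", "sequence": "MST", "embedding_type_id": "2"}]

try:
    single_embedding_type(mixed)
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Validating the batch before any model is loaded keeps a malformed message from triggering an expensive model load.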
Dependencies
Public API
SequenceEmbedder.enqueue(): Read FASTA, batch sequences, and enqueue per-model tasks with all configured layers.
SequenceEmbedder.process(): Load the appropriate model/tokenizer/module, embed a batch, and return records.
SequenceEmbedder.store_entry(): Persist per-layer embeddings and metadata into embeddings.h5.
Intended Use
This module is the first stage in an embedding-driven functional annotation pipeline. Downstream consumers typically perform similarity search, annotation transfer, or clustering using the stored embeddings.
- class fantasia.src.embedder.SequenceEmbedder(*args: Any, **kwargs: Any)
Bases: SequenceEmbeddingManager
High-throughput computation of protein sequence embeddings from FASTA input.
The SequenceEmbedder orchestrates model loading, batching, optional sequence truncation, and storage of per-layer embeddings into HDF5. It supports multiple embedding models in parallel and produces structured outputs suitable for downstream similarity search, annotation transfer, or clustering.
- Parameters:
conf (dict) – Configuration dictionary with input paths, model definitions, batch sizes, and optional filters.
current_date (str) – Timestamp string for naming outputs and logs.
- fasta_path
Path to the input FASTA file with sequences to embed.
- Type:
str
- experiment_path
Directory where embeddings.h5 and logs are written.
- Type:
str
- queue_batch_size
Number of sequences per published task message.
- Type:
int
- max_sequence_length
Optional truncation length (0 disables truncation).
- Type:
int
- batch_sizes
Per-model embedding batch sizes.
- Type:
dict
- model_instances
Dynamically loaded model objects, keyed by embedding_type_id.
- Type:
dict
- tokenizer_instances
Tokenizer objects, keyed by embedding_type_id.
- Type:
dict
- types
Metadata for each enabled model (e.g. thresholds, batch size, module).
- Type:
dict
- results
In-memory embedding results (used for aggregation/debugging).
- Type:
list
- enqueue() → None
Read the input FASTA and enqueue all sequences for all enabled models, emitting all configured layers for each model in a single message per model.
- process(task_data)
Computes embeddings for a batch of protein sequences using a specific model.
Each task in the batch must reference the same embedding_type_id, which is used to retrieve the appropriate model, tokenizer, and embedding module. The method delegates the actual embedding logic to the dynamically loaded module.
- Parameters:
task_data (list of dict) – A batch of embedding tasks. Each task should include:
- 'sequence': str, amino acid sequence.
- 'accession': str, identifier of the sequence.
- 'embedding_type_id': str, key for the embedding model.
- Returns:
A list of embedding records. Each record includes the embedding vector, shape, accession, and embedding_type_id.
- Return type:
list of dict
- Raises:
ValueError – If the batch includes multiple embedding types.
Exception – For any other error during embedding generation.
- store_entry(results)
Persist per-layer embeddings into an HDF5 file using a stable, idempotent group hierarchy.
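The skip-if-present behaviour can be sketched without HDF5 by using a plain dictionary keyed by group path as a stand-in for the file. Everything here is illustrative: `store_entry_sketch` is a hypothetical function, not the method documented above. The `layer` field on each record is an assumption, since the documented record fields are the embedding, shape, accession, and embedding_type_id.

```python
from __future__ import annotations


def store_entry_sketch(store: dict, results: list[dict]) -> list[str]:
    """Write each per-layer record once; return a created/skipped log.

    `store` stands in for the HDF5 file: keys are per-layer group paths,
    values hold the embedding and its shape.
    """
    log = []
    for rec in results:
        path = (
            f"/accession_{rec['accession']}"
            f"/type_{rec['embedding_type_id']}"
            f"/layer_{rec['layer']}"  # assumed field name
        )
        if path in store:
            log.append(f"skipped {path}")  # existing data is never overwritten
            continue
        store[path] = {"embedding": rec["embedding"], "shape": rec["shape"]}
        log.append(f"created {path}")
    return log


# Storing the same made-up record twice: the second call changes nothing.
record = {
    "accession": "P1",
    "embedding_type_id": "1",
    "layer": 0,
    "embedding": [0.1, 0.2, 0.3],
    "shape": (3,),
}
store: dict = {}
first = store_entry_sketch(store, [record])
second = store_entry_sketch(store, [{**record, "embedding": [9.0, 9.0, 9.0]}])
```

Because existing paths are skipped, rerunning a partially completed job only fills in the missing layers.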