Experiments Setup

Scope

This page explains how to plan and configure a FANTASIA v4 experiment. It focuses on execution modes, similarity-search controls, taxonomy filtering, and embedding settings, with practical guidance for batching and model/layer selection.

Quick sanity check

Run the bundled sample from the project root, so the default relative paths (e.g., ./config/experiment.yaml and ./fantasia/constants.yaml) are found:

cd /path/to/FANTASIA
fantasia initialize
fantasia run

Important

If you execute the CLI outside the project root, you must provide explicit paths (at least to the experiment config), for example:

fantasia initialize --config /abs/path/to/experiment.yaml
fantasia run --config /abs/path/to/experiment.yaml

Plan your experiment

  • Choose the execution mode.

  • Select models and hidden layers.

  • Size lookup batches according to available RAM/VRAM.

  • Decide on neighbor budget and filters (redundancy, taxonomy).

  • Set a meaningful prefix to preserve traceability.

Inputs & execution modes

input / only_lookup
  • only_lookup: falseinput is FASTA; the pipeline computes embeddings (HDF5) and then runs lookup.

  • only_lookup: trueinput is HDF5 containing per-layer embeddings; only lookup runs.

prefix
  • Name used for the experiment folder and outputs.

  • Keep it unique and descriptive to maintain a clean trace of results.

Similarity-search settings

batch_size
  • Lookup processing batch size for distance computation (RAM/VRAM bound).

  • Larger batches increase throughput but may exhaust memory.

limit_per_entry
  • Number of nearest neighbors retained per query (per model/layer).

  • Increasing this typically raises recall; it may lower precision and increase runtime.

embedding.distance_metric
  • Distance metric used by the lookup stage (configured under embedding).

  • Recommended: cosine (scale-invariant, better comparability across models/layers).

  • Note: Euclidean distance depends on vector magnitudes; unless vectors are normalized, cross-model/layer comparisons can be distorted.

redundancy_filter / alignment_coverage
  • Optional MMseqs2-based redundancy control on the reference side.

  • redundancy_filter: identity threshold (e.g., 0.95 = 95%).

  • alignment_coverage: minimum fractional coverage to accept an alignment.

  • Set both to 0 if redundancy control is not required.

Taxonomy filtering

taxonomy_ids_to_exclude
  • List of taxonomy IDs to exclude from the reference set.

taxonomy_ids_included_exclusively
  • If non-empty, only these IDs are included (overrides the exclude list).

get_descendants
  • If true, the filter applies to all descendants of the listed IDs (NCBI Taxonomy expansion).

Examples
  • 559292Saccharomyces cerevisiae

  • 6239Caenorhabditis elegans

Embedding settings (query side)

embedding.device
  • cuda or cpu; GPU is recommended when available.

embedding.queue_batch_size
  • Enqueuing granularity for RabbitMQ/internal queues.

  • This is not the per-model forward-pass batch size.

  • Per-model batches (see below) should be strictly smaller than this value; if not, the effective batch will be bounded by the queue packet size.

embedding.max_sequence_length
  • Truncation length (tokens/residues) applied before embedding.

  • 0 = no truncation; sequences exceeding the model’s internal limit will error out and no embedding will be written.

  • Positive values (e.g., 512, 1024, …) depend on VRAM and your use case.

embedding.models.<ModelKey>.batch_size
  • Per-model embedding batch size (forward pass).

  • Must be strictly less than embedding.queue_batch_size.

embedding.models.<ModelKey>.layer_index
  • Hidden layers to extract for that model.

  • Prefer late layers for function-oriented signals; optionally add 1–2 intermediate layers for robustness.

  • Ensure the indices exist in the selected reference (lookup) table; non-existing layers will not yield neighbors.

embedding.models.<ModelKey>.distance_threshold
  • Optional per-model cutoff applied before capping to limit_per_entry.

  • 0 → pure top-k selection; > 0 → filter by threshold first, then cap by top-k.

Model & layer selection

FANTASIA processes queries per model and per layer. At any time it holds in memory only the current (model, layer): the query embeddings for that layer plus the in-memory reference table for that model.

Memory model (what actually sits in RAM/VRAM)

  • Peak memory is driven by the selected model (and current layer), not by how many models/layers are configured overall.

  • More models/layers mainly increase wall-clock time, not peak memory—provided each (model, layer) is processed individually.

  • Each (model, layer) must fit simultaneously with: the model’s reference table, the lookup batch being compared, and temporary distance buffers.

Selection guidelines

  • Choose models for their predictive value; plan around the largest model’s reference footprint you enable.

  • Prefer late layers first (often stronger functional signal); add 1–2 intermediate layers if you need extra stability across datasets.

  • Keep limit_per_entry reasonable; raising it increases recall but may reduce precision.

Batching & limits

  • Lookup OOM → lower global batch_size (and, if needed, reduce limit_per_entry).

  • Embedding OOM → lower per-model embedding.models.<ModelKey>.batch_size or select fewer layers.

  • Very long sequences cost more memory/time; cap via embedding.max_sequence_length if required.

  • embedding.queue_batch_size controls enqueue granularity (not the forward batch). It should be greater than every per-model batch_size to avoid being bottlenecked by the queue packet size.

Verification

  • Ensure requested layer_index values exist in the chosen lookup table; non-existent layers yield no neighbors. To verify, query PIS (SQL/ORM) for available (model, layer) pairs or consult the loaded reference metadata.