Experiments Setup¶
Scope¶
This page explains how to plan and configure a FANTASIA v4 experiment. It focuses on execution modes, similarity-search controls, taxonomy filtering, and embedding settings, with practical guidance for batching and model/layer selection.
Recommended defaults¶
Note
The default configuration shipped in the repository is the reference for all runs. It is the setup we are most comfortable with and the one that has delivered the best overall performance (accuracy vs. runtime vs. memory) in practice. Unless you have a specific reason to deviate, use the defaults as-is.
Quick sanity check¶
Run the bundled sample from the project root, so the default relative paths
(e.g., ./config/experiment.yaml and ./fantasia/constants.yaml) are found:
cd /path/to/FANTASIA
fantasia initialize
fantasia run
Important
If you execute the CLI outside the project root, you must provide explicit paths (at least to the experiment config), for example:
fantasia initialize --config /abs/path/to/experiment.yaml
fantasia run --config /abs/path/to/experiment.yaml
Plan your experiment¶
Choose the execution mode.
Select models and hidden layers.
Size lookup batches according to available RAM/VRAM.
Decide on neighbor budget and filters (redundancy, taxonomy).
Set a meaningful prefix to preserve traceability.
Inputs & execution modes¶
- input / only_lookup
only_lookup: false→inputis FASTA; the pipeline computes embeddings (HDF5) and then runs lookup.only_lookup: true→inputis HDF5 containing per-layer embeddings; only lookup runs.
- prefix
Name used for the experiment folder and outputs.
Keep it unique and descriptive to maintain a clean trace of results.
Similarity-search settings¶
- batch_size
Lookup processing batch size for distance computation (RAM/VRAM bound).
Larger batches increase throughput but may exhaust memory.
- limit_per_entry
Number of nearest neighbors retained per query (per model/layer).
Increasing this typically raises recall; it may lower precision and increase runtime.
- embedding.distance_metric
Distance metric used by the lookup stage (configured under
embedding).Recommended: cosine (scale-invariant, better comparability across models/layers).
Note: Euclidean distance depends on vector magnitudes; unless vectors are normalized, cross-model/layer comparisons can be distorted.
- redundancy_filter / alignment_coverage
Optional MMseqs2-based redundancy control on the reference side.
redundancy_filter: identity threshold (e.g.,0.95= 95%).alignment_coverage: minimum fractional coverage to accept an alignment.Set both to 0 if redundancy control is not required.
Taxonomy filtering¶
- taxonomy_ids_to_exclude
List of taxonomy IDs to exclude from the reference set.
- taxonomy_ids_included_exclusively
If non-empty, only these IDs are included (overrides the exclude list).
- get_descendants
If
true, the filter applies to all descendants of the listed IDs (NCBI Taxonomy expansion).
- Examples
559292— Saccharomyces cerevisiae6239— Caenorhabditis elegans
Embedding settings (query side)¶
- embedding.device
cudaorcpu; GPU is recommended when available.
- embedding.queue_batch_size
Enqueuing granularity for RabbitMQ/internal queues.
This is not the per-model forward-pass batch size.
Per-model batches (see below) should be strictly smaller than this value; if not, the effective batch will be bounded by the queue packet size.
- embedding.max_sequence_length
Truncation length (tokens/residues) applied before embedding.
0= no truncation; sequences exceeding the model’s internal limit will error out and no embedding will be written.Positive values (e.g., 512, 1024, …) depend on VRAM and your use case.
- embedding.models.<ModelKey>.batch_size
Per-model embedding batch size (forward pass).
Must be strictly less than
embedding.queue_batch_size.
- embedding.models.<ModelKey>.layer_index
Hidden layers to extract for that model.
Prefer late layers for function-oriented signals; optionally add 1–2 intermediate layers for robustness.
Ensure the indices exist in the selected reference (lookup) table; non-existing layers will not yield neighbors.
- embedding.models.<ModelKey>.distance_threshold
Optional per-model cutoff applied before capping to
limit_per_entry.0→ pure top-k selection;> 0→ filter by threshold first, then cap by top-k.
Model & layer selection¶
FANTASIA processes queries per model and per layer. At any time it holds in memory only the current (model, layer): the query embeddings for that layer plus the in-memory reference table for that model.
Memory model (what actually sits in RAM/VRAM)¶
Peak memory is driven by the selected model (and current layer), not by how many models/layers are configured overall.
More models/layers mainly increase wall-clock time, not peak memory—provided each (model, layer) is processed individually.
Each (model, layer) must fit simultaneously with: the model’s reference table, the lookup batch being compared, and temporary distance buffers.
Selection guidelines¶
Choose models for their predictive value; plan around the largest model’s reference footprint you enable.
Prefer late layers first (often stronger functional signal); add 1–2 intermediate layers if you need extra stability across datasets.
Keep
limit_per_entryreasonable; raising it increases recall but may reduce precision.
Batching & limits¶
Lookup OOM → lower global
batch_size(and, if needed, reducelimit_per_entry).Embedding OOM → lower per-model
embedding.models.<ModelKey>.batch_sizeor select fewer layers.Very long sequences cost more memory/time; cap via
embedding.max_sequence_lengthif required.embedding.queue_batch_sizecontrols enqueue granularity (not the forward batch). It should be greater than every per-modelbatch_sizeto avoid being bottlenecked by the queue packet size.
Verification¶
Ensure requested
layer_indexvalues exist in the chosen lookup table; non-existent layers yield no neighbors. To verify, query PIS (SQL/ORM) for available(model, layer)pairs or consult the loaded reference metadata.