PLM Models
Scope
FANTASIA integrates pretrained protein language models (PLMs) through a model registry that standardizes tokenization, batching, device placement, and hidden-layer extraction. Models are enabled and configured at runtime via YAML (see "Default method configuration for main LookUp table" below).
Supported Embedding Models
| Name (registry key) | Model ID | Params (≈) | Architecture | Notes |
|---|---|---|---|---|
| ESM-2 | | 650M | Encoder (33L) | Large-scale encoder without MSAs; strong accuracy across structure/function tasks. |
| ProtT5 | | 1.2B | Encoder–Decoder | Trained on UniRef50; robust transfer for downstream structure/function tasks. |
| ProstT5 | | 1.2B | Multi-modal T5 | Incorporates sequence+3Di states; improves contact/function representations. |
| Ankh3-Large | | 620M | Encoder (T5-style) | Fast inference with solid semantic/structural signals. |
| ESM3c | | 600M | Encoder (36L) | New-generation encoder trained on broad protein corpora; high precision and speed. |
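
One way to picture the registry is as a mapping from the keys in the table to small metadata records. The sketch below is illustrative only: `ModelSpec`, `REGISTRY`, and `resolve` are hypothetical names rather than FANTASIA's API, and `model_id` is left unset because the table does not list one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical registry entry; field names are illustrative."""
    key: str                        # registry key, as used under embedding.models
    params: str                     # approximate parameter count from the table
    architecture: str               # architecture label from the table
    model_id: Optional[str] = None  # not listed in the table, so left unset


# Values copied from the table; the container itself is an assumption.
REGISTRY = {
    spec.key: spec
    for spec in (
        ModelSpec("ESM-2", "650M", "Encoder (33L)"),
        ModelSpec("ProtT5", "1.2B", "Encoder-Decoder"),
        ModelSpec("ProstT5", "1.2B", "Multi-modal T5"),
        ModelSpec("Ankh3-Large", "620M", "Encoder (T5-style)"),
        ModelSpec("ESM3c", "600M", "Encoder (36L)"),
    )
}


def resolve(key: str) -> ModelSpec:
    """Look up a registry key, failing loudly on unknown names."""
    try:
        return REGISTRY[key]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise KeyError(f"unknown model key {key!r}; known keys: {known}") from None
```

Failing on an unknown key, rather than silently skipping it, makes a typo such as `esm2` instead of `ESM-2` visible before any embedding work starts.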
Default method configuration for main LookUp table
The following configuration matches the distributed setup (model keys and layer indices) and the
runner’s expectations (distance_metric at the top level):
    # Top level: the lookup stage reads distance_metric here, not under `embedding`.
    distance_metric: cosine
    embedding:
      device: cuda
      queue_batch_size: 100
      max_sequence_length: 0
      models:
        ESM-2:
          enabled: true
          batch_size: 1
          layer_index: [0]
          distance_threshold: 0
        ESM3c:
          enabled: true
          batch_size: 1
          layer_index: [0]
          distance_threshold: 0
        Ankh3-Large:
          enabled: true
          batch_size: 1
          layer_index: [0]
          distance_threshold: 0
        ProtT5:
          enabled: true
          batch_size: 1
          layer_index: [0]
          distance_threshold: 0
        ProstT5:
          enabled: true
          batch_size: 1
          layer_index: [0]
          distance_threshold: 0
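
As a rough illustration of how these settings combine, the sketch below walks a parsed configuration (the kind of nested dict a YAML loader such as PyYAML's `yaml.safe_load` returns) and lists one extraction job per enabled model and layer. The helper `extraction_jobs` is a hypothetical name, `distance_metric` is placed at the root as the text describes, and only two models are included, with `ProtT5` switched off purely to show the `enabled` filter.

```python
# Parsed form of a reduced configuration. ProtT5 is disabled here only to
# demonstrate filtering; the distributed defaults enable every model.
CONFIG = {
    "distance_metric": "cosine",
    "embedding": {
        "device": "cuda",
        "queue_batch_size": 100,
        "max_sequence_length": 0,
        "models": {
            "ESM-2": {
                "enabled": True,
                "batch_size": 1,
                "layer_index": [0],
                "distance_threshold": 0,
            },
            "ProtT5": {
                "enabled": False,
                "batch_size": 1,
                "layer_index": [0],
                "distance_threshold": 0,
            },
        },
    },
}


def extraction_jobs(config):
    """Yield (model_key, layer_index, batch_size) for every enabled model.

    Hypothetical helper: one job per listed layer, since each layer is
    described as being extracted and stored independently.
    """
    for key, settings in config["embedding"]["models"].items():
        if not settings.get("enabled", False):
            continue  # disabled models contribute no jobs
        for layer in settings["layer_index"]:
            yield key, layer, settings["batch_size"]


if __name__ == "__main__":
    for job in extraction_jobs(CONFIG):
        print(job)
```

Adding a second index to a model's `layer_index` list would add one more job for that model, which is a convenient way to estimate how much embedding work a configuration implies.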
Configuration Notes
- Registry mapping: keys under `embedding.models` (e.g., `ESM-2`, `ProtT5`) must match the registry/type names used by your environment so the embedder can resolve the correct model/tokenizer/module.
- Hidden-layer selection: all indices listed under `layer_index` are extracted per model; each layer is persisted independently in HDF5 and becomes available to lookup.
- Distance metric: set `distance_metric` at the root of the YAML; the lookup stage reads it from there (not from `embedding`).
- Batching & device: `batch_size` (per model) and the global `device` control throughput and memory pressure during embedding; tune to your hardware budget.
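
The notes above can be turned into a small pre-flight check. This is a minimal sketch, not FANTASIA's own validation: `validate` and `known_keys` are hypothetical names, and the rules that `layer_index` is a list of integers and `batch_size` a positive integer are assumptions read off the default configuration.

```python
def validate(config, known_keys):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Distance metric: expected at the root, not under `embedding`.
    if "distance_metric" not in config:
        problems.append("distance_metric must be set at the root of the YAML")
    embedding = config.get("embedding", {})
    if "distance_metric" in embedding:
        problems.append("distance_metric under 'embedding' is not read by lookup")

    for key, settings in embedding.get("models", {}).items():
        # Registry mapping: the key must be resolvable by the embedder.
        if key not in known_keys:
            problems.append(f"{key}: not a known registry key")
        # Hidden-layer selection: assumed to be a list of integer indices.
        layers = settings.get("layer_index")
        if not isinstance(layers, list) or not all(isinstance(i, int) for i in layers):
            problems.append(f"{key}: layer_index must be a list of integers")
        # Batching: assumed to be a positive integer.
        batch = settings.get("batch_size")
        if not isinstance(batch, int) or batch < 1:
            problems.append(f"{key}: batch_size must be a positive integer")
    return problems


if __name__ == "__main__":
    example = {
        "embedding": {
            "distance_metric": "cosine",  # misplaced on purpose
            "models": {"esm2": {"batch_size": 0, "layer_index": 0}},
        }
    }
    for problem in validate(example, known_keys={"ESM-2", "ProtT5"}):
        print(problem)
```

Collecting every problem before reporting, instead of stopping at the first, lets a misconfigured file be fixed in one pass.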