PLM Models

Scope

FANTASIA integrates pretrained protein language models (PLMs) through a model registry that standardizes tokenization, batching, device placement, and hidden-layer extraction. Models are enabled and configured at runtime via YAML (see Defaults aligned with the bundled lookup table below).

Supported Embedding Models

Name (registry key)

Model ID

Params (≈)

Architecture

Notes

ESM-2

facebook/esm2_t33_650M_UR50D

650M

Encoder (33L)

Large-scale encoder without MSAs; strong accuracy across structure/function tasks.

ProtT5

Rostlab/prot_t5_xl_uniref50

1.2B

Encoder–Decoder

Trained on UniRef50; robust transfer for downstream structure/function tasks.

ProstT5

Rostlab/ProstT5

1.2B

Multi-modal T5

Incorporates sequence+3Di states; improves contact/function representations.

Ankh3-Large

ElnaggarLab/ankh3-large

620M

Encoder (T5-style)

Fast inference with solid semantic/structural signals.

ESM3c

esmc_600m

600M

Encoder (36L)

New-generation encoder trained on broad protein corpora; high precision and speed.

Default method configuration for main LookUp table

The following configuration matches the distributed setup (model keys and layer indices) and the runner’s expectations (distance_metric at the top level):

embedding:
  device: cuda
  queue_batch_size: 100
  max_sequence_length: 0
  distance_metric: cosine

  models:
    ESM-2:
      enabled: true
      batch_size: 1
      layer_index: [0]
      distance_threshold: 0

    ESM3c:
      enabled: true
      batch_size: 1
      layer_index: [0]
      distance_threshold: 0

    Ankh3-Large:
      enabled: true
      batch_size: 1
      layer_index: [0]
      distance_threshold: 0

    ProtT5:
      enabled: true
      batch_size: 1
      layer_index: [0]
      distance_threshold: 0

    ProstT5:
      enabled: true
      batch_size: 1
      layer_index: [0]
      distance_threshold: 0

Configuration Notes

  • Registry mapping: keys under embedding.models (e.g., ESM-2, ProtT5) must match the registry/type names used by your environment so the embedder can resolve the correct model/tokenizer/module.

  • Hidden-layer selection: all indices listed under layer_index are extracted per model; each layer is persisted independently in HDF5 and becomes available to lookup.

  • Distance metric: set distance_metric at the root of the YAML; the lookup stage reads it from there (not from embedding).

  • Batching & device: batch_size (per model) and global device control throughput and memory pressure during embedding; tune to your hardware budget.