Result Storage and Post-processing

Scope

This section covers the two final stages of FANTASIA:

Result storage — neighbor hits are expanded and written to disk.
Post-processing — stored results are aggregated, scored, and collapsed into final outputs.

Store Entry

The store_entry method expands lookup hits into full annotation rows and writes them to disk for each query accession, separately per model and layer. It bridges the lookup and post-processing phases by producing reproducible raw result files that can be reused without rerunning the lookup.

Workflow

Input
- Compact hits from lookup: (accession, ref_sequence_id, distance, model, layer).
Expansion
- For each hit, fetch the associated GO annotations (and sequences if enabled).
- Attach metadata such as model_name, layer_index, and a reliability_index derived from the distance.
- Expand into full annotation rows ready for storage.
Persistence
- Write results into per-accession CSVs under raw_results/.
- File naming follows one of two patterns:
  - raw_results_layer_<k>.csv (if layers are distinguished).
  - raw_results.csv (legacy flat file without layer separation).
- A combined sequences.fasta is also written if sequences are kept.
- If redundancy filtering (e.g., MMseqs2) is enabled, cluster definitions are generated.
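
The expansion and persistence steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual store_entry implementation. The helper names (reliability_index, expand_hits, write_raw_results), the column list, the 1 / (1 + distance) transform, and the raw_results/<accession>/ directory layout are all assumptions made for the example; the documentation only states that the index is derived from the distance and that files are written per accession.

```python
import csv
from pathlib import Path

# Assumed column set for a raw result row (hypothetical).
FIELDS = ["accession", "ref_sequence_id", "go_id", "distance",
          "model_name", "layer_index", "reliability_index"]


def reliability_index(distance: float) -> float:
    # Hypothetical transform: the documentation only says the index is
    # "derived from the distance", so this formula is an assumption.
    return 1.0 / (1.0 + distance)


def expand_hits(hits, go_lookup):
    """Expand compact hits into one annotation row per (hit, GO term)."""
    rows = []
    for accession, ref_sequence_id, distance, model, layer in hits:
        # go_lookup maps a reference sequence id to its GO term ids.
        for go_id in go_lookup.get(ref_sequence_id, []):
            rows.append({
                "accession": accession,
                "ref_sequence_id": ref_sequence_id,
                "go_id": go_id,
                "distance": distance,
                "model_name": model,
                "layer_index": layer,
                "reliability_index": reliability_index(distance),
            })
    return rows


def write_raw_results(rows, out_dir):
    """Append rows to raw_results/<accession>/raw_results_layer_<k>.csv."""
    written = set()
    for row in rows:
        path = (Path(out_dir) / "raw_results" / row["accession"]
                / f"raw_results_layer_{row['layer_index']}.csv")
        path.parent.mkdir(parents=True, exist_ok=True)
        is_new = not path.exists()
        with path.open("a", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS)
            if is_new:
                writer.writeheader()  # header only once per file
            writer.writerow(row)
        written.add(path)
    return sorted(written)
```

Appending to the file rather than rewriting it keeps each accession's raw results reusable across runs, which matches the stated goal of avoiding a repeated lookup.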

Outputs

Per-accession raw files under raw_results/.
sequences.fasta with all query and reference sequences (optional).
Cluster structures, only if redundancy filtering is active.

Post-processing

The post_processing method aggregates the raw per-accession CSVs, computes weighted scores, and produces a global summary along with enrichment-ready exports. This is the final consolidation step of the pipeline.

Workflow

Locate inputs
- Collect all CSV shards under raw_results/**.
- Group them by accession.
Load configuration
- Parameters are taken from conf['postprocess']['summary'].
- These parameters define the metrics to aggregate, their aliases, whether counts are included, the weights, and the prefix for weighted columns.
Aggregation per accession
- Concatenate all shards for an accession.
- Group rows by (accession, go_id, model_name, layer_index).
- Compute aggregation metrics (mean, max, min) for each configured column.
- Add normalized support counts (neighbors / k) if enabled.
Weighting and scoring
- Resolve weights from configuration, normalize them, and apply them to aggregated metrics.
- Produce weighted columns (w_<metric>) and a composite final_score.
- Preserve both the global score (final_score) and the per-model/layer scores.
Summarization
- Join aggregated values with counts and protein lists per GO term.
- Pivot metrics into wide format to produce a concise accession–GO table.
- Write incrementally to summary.csv.
Exports
- Generate TopGO-compatible files for per-model/layer and ensemble configurations (topgo/...).
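
The aggregation and weighting steps of this workflow can be sketched as below. This is a standard-library illustration under stated assumptions, not the actual post_processing code. The layout of SUMMARY_CONF, its key names, the <agg>_<column> naming of aggregated metrics, and the definition of final_score as the sum of the weighted columns are hypothetical; only the grouping key, the mean/max/min metrics, the neighbors / k support count, and the w_<metric> prefix come from the description above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for conf['postprocess']['summary'];
# the real key names and structure may differ.
SUMMARY_CONF = {
    "metrics": {"reliability_index": ["mean", "max"]},
    "weights": {"reliability_index": {"mean": 1.0, "max": 3.0}},
    "weighted_prefix": "w_",
    "include_counts": True,
}

AGGREGATORS = {"mean": mean, "max": max, "min": min}
GROUP_KEYS = ("accession", "go_id", "model_name", "layer_index")


def aggregate(rows, conf, k):
    """Group rows, compute the configured metrics, and apply normalized weights."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in GROUP_KEYS)].append(row)

    # Flatten the weights to "<agg>_<column>" names and normalize to sum to 1.
    flat = {f"{agg}_{col}": w
            for col, aggs in conf["weights"].items()
            for agg, w in aggs.items()}
    total = sum(flat.values()) or 1.0  # avoid dividing by zero
    weights = {name: w / total for name, w in flat.items()}

    summary = []
    for key, members in groups.items():
        rec = dict(zip(GROUP_KEYS, key))
        for col, aggs in conf["metrics"].items():
            values = [float(m[col]) for m in members]
            for agg in aggs:
                rec[f"{agg}_{col}"] = AGGREGATORS[agg](values)
        if conf["include_counts"]:
            rec["support"] = len(members) / k  # neighbors / k
        score = 0.0
        for name, w in weights.items():
            weighted = w * rec.get(name, 0.0)
            rec[conf["weighted_prefix"] + name] = weighted
            score += weighted
        rec["final_score"] = score  # assumed: sum of weighted metrics
        summary.append(rec)
    return summary
```

Each returned record corresponds to one (accession, go_id, model_name, layer_index) group. Pivoting these records into the wide accession–GO table and writing them incrementally to summary.csv would follow as a separate step.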