Schemas¶

This section describes the main database schema used by the Protein Information System (PIS). It includes core entities, their attributes, and relationships.

Accession ¶

Represents accession codes (e.g., UniProt identifiers) associated with proteins.

Primary Key: code
Fields:
- primary (boolean): whether this accession is the main one
- tag (text): optional label
- protein_id (FK → protein.id)
- created_at, updated_at (timestamps)

GO Terms ¶

Controlled vocabulary of Gene Ontology (GO) terms.

Primary Key: go_id
Fields:
- category (BP, MF, CC)
- description (text)

Protein ¶

Represents proteins and their metadata.

Primary Key: id (text, e.g., UniProt ID)
Fields:
- sequence_id (FK → sequence.id)
- data_class, molecule_type
- created_date, sequence_update_date, annotation_update_date
- description, gene_name, organism, organelle
- taxonomy_id (text)
- protein_existence (integer)
- comments, seqinfo
- disappeared (boolean)
- created_at, updated_at

Protein–GO Term Annotation ¶

Links proteins with GO terms and evidence codes.

Primary Key: id
Unique Constraint: (protein_id, go_id)
Fields:
- protein_id (FK → protein.id)
- go_id (FK → go_terms.go_id)
- evidence_code (text)

Sequence ¶

Stores raw sequences.

Primary Key: id
Fields:
- sequence (text, required)
- sequence_hash (text, optional)

Note

A single sequence may be linked to multiple proteins. This allows embeddings to be shared, and implies that a single embedding-level hit can expand into multiple protein-level results.

Sequence Embedding Type ¶

Describes available embedding models.

Primary Key: id
Fields:
- name (unique, e.g., ProtT5, ESM2)
- description
- task_name
- model_name

Sequence Embeddings ¶

Stores embeddings per sequence, model, and layer.

Primary Key: id
Unique Constraint: (sequence_id, embedding_type_id, layer_index)
Fields:
- sequence_id (FK → sequence.id)
- embedding_type_id (FK → sequence_embedding_type.id)
- layer_index (integer)
- embedding (halfvec)
- shape (integer[])
- created_at, updated_at

Relationships ¶

One Protein → one Sequence
One Protein → many GO Terms (via Protein–GO Term Annotation)
One Sequence → many Embeddings (across types and layers)
One Sequence → many Proteins (shared sequence reused by different proteins)