Schemas¶
This section describes the main database schema used by the Protein Information System (PIS). It includes core entities, their attributes, and relationships.
Accession¶
Represents accession codes (e.g., UniProt identifiers) associated with proteins.
Primary Key: code
Fields:
- primary (boolean): whether this accession is the main one
- tag (text): optional label
- protein_id (FK → protein.id)
- created_at, updated_at (timestamps)
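As an illustration of how the primary flag might be used, the sketch below models accession rows in memory and looks up the main accession for a protein. The class, the helper name `primary_accession`, and the example codes are hypothetical; the source does not say whether exactly one primary accession per protein is enforced.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Accession:
    """In-memory stand-in for one accession row (illustrative only)."""
    code: str                  # primary key
    protein_id: str            # FK -> protein.id
    primary: bool = False      # whether this accession is the main one
    tag: Optional[str] = None  # optional label


def primary_accession(accessions: Iterable[Accession], protein_id: str) -> Optional[str]:
    """Return the code of the first primary accession for a protein, or None."""
    for acc in accessions:
        if acc.protein_id == protein_id and acc.primary:
            return acc.code
    return None


# Hypothetical rows: two accession codes pointing at the same protein.
accessions = [
    Accession(code="P12345", protein_id="P12345", primary=True),
    Accession(code="Q99999", protein_id="P12345", tag="secondary"),
]
```

Here `primary_accession(accessions, "P12345")` picks the row flagged as primary and ignores the secondary code.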
GO Terms¶
Controlled vocabulary of Gene Ontology (GO) terms.
Primary Key: go_id
Fields:
- category (BP, MF, CC)
- description (text)
Protein¶
Represents proteins and their metadata.
Primary Key: id (text, e.g., UniProt ID)
Fields:
- sequence_id (FK → sequence.id)
- data_class, molecule_type
- created_date, sequence_update_date, annotation_update_date
- description, gene_name, organism, organelle
- taxonomy_id (text)
- protein_existence (integer)
- comments, seqinfo
- disappeared (boolean)
- created_at, updated_at
Protein–GO Term Annotation¶
Links proteins with GO terms and evidence codes.
Primary Key: id
Unique Constraint: (protein_id, go_id)
Fields:
- protein_id (FK → protein.id)
- go_id (FK → go_terms.go_id)
- evidence_code (text)
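The effect of the unique constraint can be sketched with an in-memory SQLite table. This is only an approximation of the real schema: the table name `protein_go_term_annotation`, the helper `annotate`, and the example identifiers are assumptions, and SQLite stands in for whatever database PIS actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; columns follow the field list above.
conn.execute(
    """
    CREATE TABLE protein_go_term_annotation (
        id INTEGER PRIMARY KEY,
        protein_id TEXT NOT NULL,
        go_id TEXT NOT NULL,
        evidence_code TEXT,
        UNIQUE (protein_id, go_id)
    )
    """
)


def annotate(conn, protein_id, go_id, evidence_code):
    """Insert one annotation; return False if the (protein_id, go_id) pair exists."""
    try:
        conn.execute(
            "INSERT INTO protein_go_term_annotation (protein_id, go_id, evidence_code) "
            "VALUES (?, ?, ?)",
            (protein_id, go_id, evidence_code),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = annotate(conn, "P12345", "GO:0005524", "IEA")
duplicate = annotate(conn, "P12345", "GO:0005524", "EXP")  # same pair, new evidence code
other = annotate(conn, "P12345", "GO:0005737", "IEA")      # same protein, different term
row_count = conn.execute(
    "SELECT COUNT(*) FROM protein_go_term_annotation"
).fetchone()[0]
```

Because uniqueness covers only (protein_id, go_id), a second row for the same pair is rejected even when its evidence code differs.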
Sequence¶
Stores raw sequences.
Primary Key: id
Fields:
- sequence (text, required)
- sequence_hash (text, optional)
Note
A single sequence may be linked to multiple proteins. This allows embeddings to be shared, and implies that a single embedding-level hit can expand into multiple protein-level results.
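The expansion described in the note can be sketched as follows. The function `expand_hits`, the mapping, and the scores are hypothetical; the sketch only assumes that each protein row carries a sequence_id, as listed in the Protein fields.

```python
from collections import defaultdict

# Hypothetical rows: protein.id -> protein.sequence_id.
# P1 and P2 share sequence 10; P3 has its own sequence.
protein_to_sequence = {"P1": 10, "P2": 10, "P3": 11}


def expand_hits(sequence_hits, protein_to_sequence):
    """Expand (sequence_id, score) hits into (protein_id, sequence_id, score) rows."""
    proteins_by_sequence = defaultdict(list)
    for protein_id, sequence_id in protein_to_sequence.items():
        proteins_by_sequence[sequence_id].append(protein_id)

    results = []
    for sequence_id, score in sequence_hits:
        # Every protein sharing the sequence inherits the embedding-level score.
        for protein_id in sorted(proteins_by_sequence.get(sequence_id, [])):
            results.append((protein_id, sequence_id, score))
    return results


hits = [(10, 0.93), (11, 0.88)]  # embedding-level hits, one per sequence
expanded = expand_hits(hits, protein_to_sequence)
```

Two embedding-level hits become three protein-level results here, because sequence 10 is shared by P1 and P2.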
Sequence Embedding Type¶
Describes available embedding models.
Primary Key: id
Fields:
- name (unique, e.g., ProtT5, ESM2)
- description
- task_name
- model_name
Sequence Embeddings¶
Stores embeddings per sequence, model, and layer.
Primary Key: id
Unique Constraint: (sequence_id, embedding_type_id, layer_index)
Fields:
- sequence_id (FK → sequence.id)
- embedding_type_id (FK → sequence_embedding_type.id)
- layer_index (integer)
- embedding (halfvec)
- shape (integer[])
- created_at, updated_at
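A minimal in-memory sketch of the composite key is shown below. `EmbeddingStore` and its methods are hypothetical, and the check that the product of shape equals the length of the flattened embedding is an assumption about how the two fields relate, not something the schema states.

```python
from math import prod


class EmbeddingStore:
    """Toy store keyed by (sequence_id, embedding_type_id, layer_index)."""

    def __init__(self):
        self._rows = {}

    def add(self, sequence_id, embedding_type_id, layer_index, embedding, shape):
        key = (sequence_id, embedding_type_id, layer_index)
        if key in self._rows:
            # Mirrors the unique constraint: one row per sequence, model, and layer.
            raise ValueError(f"duplicate embedding for {key}")
        if prod(shape) != len(embedding):
            # Assumed relationship between shape and the flattened vector.
            raise ValueError("shape does not match embedding length")
        self._rows[key] = (list(embedding), list(shape))

    def layers_for(self, sequence_id, embedding_type_id):
        """Return the sorted layer indices stored for one sequence and model."""
        return sorted(
            layer
            for (seq, emb_type, layer) in self._rows
            if seq == sequence_id and emb_type == embedding_type_id
        )


store = EmbeddingStore()
store.add(1, 1, 0, [0.1, 0.2], [2])
store.add(1, 1, 12, [0.3, 0.4], [2])   # same sequence and model, different layer
store.add(1, 2, 0, [0.5, 0.6, 0.7, 0.8], [2, 2])  # same sequence, different model
```

The same sequence can hold one embedding per model and layer, which is why layer_index is part of the key rather than a plain attribute.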
Relationships¶
- One Protein → one Sequence
- One Protein → many GO Terms (via Protein–GO Term Annotation)
- One Sequence → many Embeddings (across types and layers)
- One Sequence → many Proteins (shared sequence reused by different proteins)
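To make these relationships concrete, the sketch below builds a reduced version of the schema in SQLite and walks from a sequence to its proteins and on to their GO terms. The annotation table name, the helper `go_terms_for_sequence`, and all rows are assumptions for illustration; only a subset of columns is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sequence (id INTEGER PRIMARY KEY, sequence TEXT NOT NULL);
    CREATE TABLE protein (
        id TEXT PRIMARY KEY,
        sequence_id INTEGER REFERENCES sequence(id)
    );
    CREATE TABLE go_terms (go_id TEXT PRIMARY KEY, category TEXT, description TEXT);
    -- Hypothetical name for the Protein-GO Term Annotation table.
    CREATE TABLE protein_go_term_annotation (
        id INTEGER PRIMARY KEY,
        protein_id TEXT REFERENCES protein(id),
        go_id TEXT REFERENCES go_terms(go_id),
        evidence_code TEXT,
        UNIQUE (protein_id, go_id)
    );
    INSERT INTO sequence VALUES (1, 'MKT');
    INSERT INTO protein VALUES ('P1', 1), ('P2', 1);
    INSERT INTO go_terms VALUES
        ('GO:0005524', 'MF', 'ATP binding'),
        ('GO:0005737', 'CC', 'cytoplasm');
    INSERT INTO protein_go_term_annotation (protein_id, go_id, evidence_code) VALUES
        ('P1', 'GO:0005524', 'IEA'),
        ('P1', 'GO:0005737', 'IEA'),
        ('P2', 'GO:0005737', 'EXP');
    """
)


def go_terms_for_sequence(conn, sequence_id):
    """Return (protein_id, go_id) pairs for every protein that uses the sequence."""
    return conn.execute(
        """
        SELECT p.id, a.go_id
        FROM protein AS p
        JOIN protein_go_term_annotation AS a ON a.protein_id = p.id
        WHERE p.sequence_id = ?
        ORDER BY p.id, a.go_id
        """,
        (sequence_id,),
    ).fetchall()


pairs = go_terms_for_sequence(conn, 1)
```

Starting from one sequence, the join reaches both proteins that share it and then each protein's GO terms through the annotation table.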