
Gene Ontology (GO)#

Pathways represent interconnected molecular networks of signaling cascades that govern critical cellular processes. They provide an understanding of the mechanisms behind cellular behavior and insights into disease progression and treatment responses. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.

In this notebook, we manage a pathway registry based on the “2023 GO Biological Process” ontology. We’ll walk you through registering pathways and linking them to genes.

In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.

Setup#

Warning

Please ensure that you have created or loaded a LaminDB instance before running the remaining part of this notebook!

This notebook follows the CellTypist notebook, which populates the CellType registry.

!lamin load use-cases-registries
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--use-cases-registries.env
💡 loaded instance: testuser1/use-cases-registries
import lamindb as ln
import lnschema_bionty as lb
import gseapy as gp

lb.settings.organism = "human"  # globally set organism
💡 lamindb instance: testuser1/use-cases-registries

Fetch GO pathways annotated with human genes using Enrichr#

First, we fetch the “GO_Biological_Process_2023” pathways for humans using GSEApy, which wraps GSEA and Enrichr.

go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")
2024-01-03 01:32:53,260:INFO - Downloading and generating Enrichr library gene sets...
2024-01-03 01:33:10,311:INFO - 0001 gene_sets have been filtered out when max_size=2000 and min_size=0
Number of pathways 5406
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']
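
As the lookup above shows, go_bp behaves like a plain dictionary from term label to a list of gene symbols, so standard dictionary operations are enough for a quick sanity check of gene-set sizes. The following is a minimal sketch on a stand-in dictionary. The names `go_bp_demo` and `summarize_gene_sets` and the second "Toy Term" entry are hypothetical and used for illustration only; they are not part of GSEApy.

```python
from statistics import median

# Stand-in for the dictionary returned by gp.get_library.
# The first entry is copied from the output above; the second is a made-up placeholder.
go_bp_demo = {
    "ATF6-mediated Unfolded Protein Response (GO:0036500)": [
        "MBTPS1", "MBTPS2", "XBP1", "ATF6B", "DDIT3", "CREBZF",
    ],
    "Toy Term (GO:0000000)": ["GENE_A", "GENE_B"],
}


def summarize_gene_sets(library):
    """Return the number of gene sets and their min, median and max sizes."""
    sizes = [len(genes) for genes in library.values()]
    return {
        "n_sets": len(sizes),
        "min_size": min(sizes),
        "median_size": median(sizes),
        "max_size": max(sizes),
    }


summary = summarize_gene_sets(go_bp_demo)
print(summary)
```

Passing the real go_bp instead of the stand-in should work the same way, assuming it keeps the term-to-gene-list shape shown above.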

Parse the ontology ID out of each key and convert the library into the format {ontology_id: (name, genes)}.

def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    # `ontology_id` rather than `id`, to avoid shadowing the Python builtin
    ontology_id = key.split(" ")[-1].replace("(", "").replace(")", "")
    name = key.replace(f" ({ontology_id})", "")
    return (ontology_id, name)

go_bp_parsed = {}

for key, genes in go_bp.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
 ['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])
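
The split-based parser above relies on the ID being the last space-separated token of every key and silently returns a wrong result for any key that does not follow that pattern. A stricter variant can be sketched with a regular expression that raises on malformed keys. This is an illustrative alternative, not the notebook's method; `GO_KEY_PATTERN` and `parse_go_key` are hypothetical names, and the pattern assumes IDs always have the form GO: followed by digits.

```python
import re

# Matches "<name> (GO:<digits>)" with the ID anchored at the end of the key.
GO_KEY_PATTERN = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d+)\)$")


def parse_go_key(key):
    """Split an Enrichr-style key into (ontology_id, name).

    Raises ValueError if the key does not end with a parenthesized GO ID.
    """
    match = GO_KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Key does not end with a GO ID in parentheses: {key!r}")
    return match.group("ontology_id"), match.group("name")


parsed = parse_go_key("ATF6-mediated Unfolded Protein Response (GO:0036500)")
print(parsed)
```

Failing loudly on an unexpected key is a design choice: it surfaces format changes in the downloaded library early instead of registering a mangled ontology ID.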

Register pathway ontology in LaminDB#

bionty = lb.Pathway.bionty()
bionty
Pathway
Organism: all
Source: go, 2023-05-10
#terms: 47514

📖 Pathway.df(): ontology reference table
🔎 Pathway.lookup(): autocompletion of terms
🎯 Pathway.search(): free text search of terms
✅ Pathway.validate(): strictly validate values
🧐 Pathway.inspect(): full inspection of values
👽 Pathway.standardize(): convert to standardized names
🪜 Pathway.diff(): difference between two versions
🔗 Pathway.ontology: Pronto.Ontology object

Next, we register all pathways and genes in LaminDB so that we can finally link pathways to genes.

Register pathway terms#

To register the pathways, we use .from_values to parse the annotated GO pathway ontology IDs directly into LaminDB.

pathway_records = lb.Pathway.from_values(go_bp_parsed.keys(), lb.Pathway.ontology_id)
lb.Pathway.from_bionty(ontology_id="GO:0015868")
Pathway(uid='SMqshx3Y', name='purine ribonucleotide transport', ontology_id='GO:0015868', description='The Directed Movement Of A Purine Ribonucleotide, Any Compound Consisting Of A Purine Ribonucleoside (A Purine Organic Base Attached To A Ribose Sugar) Esterified With (Ortho)Phosphate, Into, Out Of Or Within A Cell.', bionty_source_id=44, created_by_id=1)
ln.save(pathway_records, parents=False)  # not recursing through parents

Register gene symbols#

Similarly, we use .from_values to register all pathway-associated genes with LaminDB.

all_genes = {g for genes in go_bp.values() for g in genes}
gene_records = lb.Gene.from_values(all_genes, lb.Gene.symbol)
❗ ambiguous validation in Bionty for 1082 records: 'SOCS7', 'TAS2R14', 'WDR73', 'COL18A1', 'OR10A6', 'ADCY4', 'P2RY8', 'GREM1', 'RAB11FIP3', 'ADAM9', 'CENATAC', 'IFI27L2', 'NCF4', 'SLC39A4', 'MLLT6', 'OR2G6', 'SLC5A8', 'CRLF2', 'TJP1', 'PKLR', ...
did not create Gene records for 37 non-validated symbols: 'AFD1', 'AZF1', 'CCL4L1', 'DGS2', 'DUX3', 'DUX5', 'FOXL3-OT1', 'IGL', 'LOC100653049', 'LOC102723475', 'LOC102723996', 'LOC102724159', 'LOC107984156', 'LOC112268384', 'LOC122319436', 'LOC122513141', 'LOC122539214', 'LOC344967', 'MDRV', 'MTRNR2L1', ...
gene_records[:3]
[Gene(uid='9vOfWoQ0YYk0', symbol='PNRC1', ensembl_gene_id='ENSG00000146278', ncbi_gene_ids='10957', biotype='protein_coding', description='proline rich nuclear receptor coactivator 1 [Source:HGNC Symbol;Acc:HGNC:17278]', synonyms='PROL2|PRR2|B4-2', organism_id=1, bionty_source_id=9, created_by_id=1),
 Gene(uid='MA0FNXJ070ca', symbol='DGKE', ensembl_gene_id='ENSG00000153933', ncbi_gene_ids='8526', biotype='protein_coding', description='diacylglycerol kinase epsilon [Source:HGNC Symbol;Acc:HGNC:2852]', synonyms='DAGK6|DGK', organism_id=1, bionty_source_id=9, created_by_id=1),
 Gene(uid='ynx2rS2KTKL0', symbol='VCPIP1', ensembl_gene_id='ENSG00000175073', ncbi_gene_ids='80124', biotype='protein_coding', description='valosin containing protein interacting protein 1 [Source:HGNC Symbol;Acc:HGNC:30897]', synonyms='VCIP135|KIAA1850|FLJ23132|DUBA3', organism_id=1, bionty_source_id=9, created_by_id=1)]
ln.save(gene_records);
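
The warning above reports symbols for which no Gene record was created, so a pathway's gene list may contain symbols without a registered counterpart. Before linking pathways to genes, it can help to restrict each list to registered symbols and keep track of what was dropped. The following is a minimal sketch in plain Python that deliberately avoids any LaminDB call. It assumes the registered symbols have already been collected into a set; `filter_to_registered`, `parsed_demo`, and `registered_symbols` are hypothetical names, and the toy input is for illustration only.

```python
def filter_to_registered(parsed, registered_symbols):
    """Split each pathway's genes into registered symbols and dropped ones.

    `parsed` has the {ontology_id: (name, genes)} shape built earlier.
    Returns (kept, dropped), both keyed by ontology_id; `dropped` only
    contains pathways that lost at least one symbol.
    """
    kept, dropped = {}, {}
    for ontology_id, (_name, genes) in parsed.items():
        kept[ontology_id] = [g for g in genes if g in registered_symbols]
        missing = [g for g in genes if g not in registered_symbols]
        if missing:
            dropped[ontology_id] = missing
    return kept, dropped


# Toy input: the GO:0036500 entry from above, with 'AFD1' (one of the
# non-validated symbols named in the warning) appended purely for illustration.
parsed_demo = {
    "GO:0036500": (
        "ATF6-mediated Unfolded Protein Response",
        ["MBTPS1", "MBTPS2", "XBP1", "ATF6B", "DDIT3", "CREBZF", "AFD1"],
    )
}
# Hypothetical set of symbols assumed to have Gene records.
registered_symbols = {"MBTPS1", "MBTPS2", "XBP1", "ATF6B", "DDIT3", "CREBZF"}

kept, dropped = filter_to_registered(parsed_demo, registered_symbols)
print(kept)
print(dropped)
```

Keeping the dropped symbols in a separate dictionary, rather than discarding them, makes it easy to review later which pathways were affected.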