Research Paper: Open Biomedical Knowledge Graphs at Scale

We have published a research paper on constructing, federating, and querying biomedical knowledge graphs with Samyama.

Title: Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Authors: Madhulatha Mandarapu (madhulatha@samyama.ai), Sandeep Kunkunuru (sandeep@samyama.ai)

March 2026 | Pathways KG | Clinical Trials KG

Keywords: Knowledge Graphs, Biomedical Data Integration, Graph Databases, Cross-KG Federation, Model Context Protocol, Clinical Trials, Biological Pathways, OpenCypher.


Download PDF

  • Paper PDF: arXiv-ready LaTeX version (10 pages)

Abstract

Biomedical knowledge is fragmented across siloed databases — Reactome for pathways, STRING for protein interactions, Gene Ontology for functional annotations, ClinicalTrials.gov for study registries, and dozens more. We present two open-source biomedical knowledge graphs — Pathways KG (118,686 nodes, 834,785 edges from 5 sources) and Clinical Trials KG (7,711,965 nodes, 27,069,085 edges from 5 sources) — built on Samyama, a high-performance graph database written in Rust.

Our contributions are threefold:

  1. Reproducible KG construction — ETL pipelines for two large-scale KGs using a common pattern: download, parse, deduplicate, batch-load via Cypher, and export as portable .sgsnap snapshots.

  2. Cross-KG federation — loading both snapshots into a single graph tenant enables property-based joins across datasets, answering questions like “Which biological pathways are disrupted by drugs currently in Phase 3 trials for breast cancer?”

  3. Schema-driven MCP server generation — each KG automatically exposes typed tools for LLM agents via the Model Context Protocol, enabling natural-language access without manual tool authoring.
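
The deduplicate and batch-load steps of the first contribution can be sketched roughly as follows. This is a minimal illustration, not the pipeline from the paper: the `Protein` label, the `uniprot_id` key, the helper names, and the sample rows are all assumptions made for the example, and the statement uses generic openCypher `UNWIND`/`MERGE` syntax without claiming anything about Samyama's loader API.

```python
from itertools import islice
from typing import Iterable, Iterator


def deduplicate(records: Iterable[dict], key: str) -> list[dict]:
    """Keep the first record seen for each value of `key`."""
    seen: dict[str, dict] = {}
    for rec in records:
        seen.setdefault(rec[key], rec)
    return list(seen.values())


def batches(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive chunks of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def merge_statement(label: str, key: str) -> str:
    """Build one parameterized statement that upserts a whole batch.

    `label` and `key` are hypothetical schema names, not taken from the paper.
    """
    return (
        f"UNWIND $rows AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        f"SET n += row"
    )


# Illustrative parsed rows; the third repeats the first.
parsed = [
    {"uniprot_id": "P04637", "name": "TP53"},
    {"uniprot_id": "P38398", "name": "BRCA1"},
    {"uniprot_id": "P04637", "name": "TP53"},
]

unique = deduplicate(parsed, "uniprot_id")
statement = merge_statement("Protein", "uniprot_id")
# One parameter payload per batch; each would be sent with `statement`.
payloads = [{"rows": chunk} for chunk in batches(unique, size=1)]
```

Deduplicating before loading keeps each batch small, and `MERGE` on the key property keeps a re-run of the load from creating duplicate nodes.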

The combined federated graph (7.83M nodes, 27.9M edges) loads in under 3 minutes on commodity hardware.
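
The schema-driven tool generation in the third contribution can be pictured with a small sketch. It assumes a schema given as a mapping from node label to property types and emits one lookup-tool descriptor per label, using the `name`, `description`, and `inputSchema` fields of an MCP tool definition. The schema contents, the `find_<label>` naming scheme, and the function name are hypothetical; the actual generator may work quite differently.

```python
import json


def tools_from_schema(schema: dict[str, dict[str, str]]) -> list[dict]:
    """Build one lookup-tool descriptor per node label (illustrative only)."""
    tools = []
    for label, props in schema.items():
        tools.append({
            "name": f"find_{label.lower()}",
            "description": f"Look up {label} nodes by property value.",
            # JSON Schema describing the arguments the tool accepts.
            "inputSchema": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in props.items()},
            },
        })
    return tools


# Hypothetical schema fragment; labels and properties are assumptions.
schema = {
    "Pathway": {"name": "string"},
    "Protein": {"uniprot_id": "string", "name": "string"},
}

tools = tools_from_schema(schema)
print(json.dumps(tools[0], indent=2))
```

Because the descriptors are derived from the schema, adding a label to the graph would add a tool without any hand-written definition, which is the property the contribution describes.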


Key Results

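| Metric | Pathways KG | Clinical Trials KG | Combined |
|---|---|---|---|
| Nodes | 118,686 | 7,711,965 | 7,830,651 |
| Edges | 834,785 | 27,069,085 | 27,903,870 |
| Labels | 5 | 15 | 20 |
| Edge types | 9 | 25 | 34 |
| Data sources | 5 | 5 | 10 |
| Snapshot size | 9 MB | 711 MB | 720 MB |
| Import time | < 5 s | ~90 s | ~95 s |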

Cross-KG Federation Query Patterns

| Pattern | Traversal | Latency |
|---|---|---|
| Drug → Pathway | Trial → Drug → Protein → Pathway | 2.5 s |
| Drug → GO Process | Trial → Drug → Protein → GOTerm | 1.8 s |
| Drug → PPI Network | Drug → Protein target → INTERACTS_WITH | 1.2 s |
| Disease → Pathway | Gene → Disease + Gene → Protein → Pathway | 1.8 s |
| Adverse Event → Pathway | Trial → AE → Drug → Protein → Pathway | 3.2 s |
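
To make the first pattern (Trial → Drug → Protein → Pathway) concrete, here is a toy in-memory sketch of a property-based join. It assumes the two graphs share a protein identifier property. All identifiers and mappings below are made-up placeholders, and plain dictionaries stand in for what would be a Cypher traversal in the database.

```python
# Clinical Trials KG side (toy data): trial -> drugs, drug -> target protein IDs.
trial_drugs = {"TRIAL-1": ["drug-a"], "TRIAL-2": ["drug-b"]}
drug_targets = {"drug-a": ["PROT-1", "PROT-2"], "drug-b": ["PROT-3"]}

# Pathways KG side (toy data): protein ID -> pathways.
protein_pathways = {
    "PROT-1": ["pathway-x"],
    "PROT-2": ["pathway-x", "pathway-y"],
}


def pathways_for_trial(trial: str) -> set[str]:
    """Follow Trial -> Drug -> Protein -> Pathway across both datasets."""
    found: set[str] = set()
    for drug in trial_drugs.get(trial, []):
        for protein in drug_targets.get(drug, []):
            # The property-based join: the protein ID is the shared key.
            found.update(protein_pathways.get(protein, []))
    return found


result = pathways_for_trial("TRIAL-1")
```

A trial whose drug targets are absent from the pathways side, such as `TRIAL-2` here, yields an empty set. The join only finds matches where both datasets use the same identifier values.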