
# Performance & Benchmarks

Samyama is designed for “Mechanical Sympathy”—aligning software data structures with the physical reality of modern CPU caches and high-speed NVMe storage.

## Recent Benchmark Results (Mac Mini M4, 2026-02-26)

All benchmarks were run on a Mac Mini M4 (16GB RAM, macOS), comparing the Community (CPU-only) and Enterprise (GPU-accelerated via `wgpu`) builds.

### Ingestion Throughput

Samyama achieves industry-leading ingestion rates on commodity hardware:

| Operation | CPU-Only (ops/sec) | GPU-Enabled (ops/sec) |
|---|---|---|
| Node Ingestion | 255,120 | 412,036 |
| Edge Ingestion | 4,211,342 | 5,242,096 |

Note: Edge ingestion is significantly faster because it primarily involves appending to adjacency lists and updating the WAL.
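The append-only shape of edge ingestion can be sketched as follows; `Graph`, its fields, and `add_edge` are illustrative stand-ins, not Samyama's actual types:

```rust
use std::collections::HashMap;

// Hypothetical sketch (not Samyama's real API): an edge insert reduces to two
// appends, a WAL record for durability and a push onto the source node's
// adjacency list. Neither touches an index or allocates a node.
struct Graph {
    adjacency: HashMap<u64, Vec<u64>>, // node id -> outgoing neighbor ids
    wal: Vec<(u64, u64)>,              // append-only write-ahead log of (src, dst)
}

impl Graph {
    fn new() -> Self {
        Graph { adjacency: HashMap::new(), wal: Vec::new() }
    }

    // Amortized O(1): log the edge, then append to the adjacency list.
    fn add_edge(&mut self, src: u64, dst: u64) {
        self.wal.push((src, dst));
        self.adjacency.entry(src).or_default().push(dst);
    }
}

fn main() {
    let mut g = Graph::new();
    g.add_edge(1, 2);
    g.add_edge(1, 3);
    assert_eq!(g.adjacency[&1], vec![2, 3]);
    assert_eq!(g.wal.len(), 2);
}
```

Node ingestion, by contrast, must allocate storage and update property indexes, which accounts for the roughly 15x throughput gap.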

### Cypher Query Throughput (OLTP)

For transactional workloads, Samyama’s index-driven execution delivers consistent sub-millisecond latencies:

| Graph Scale | Queries/sec | Avg Latency |
|---|---|---|
| 10,000 nodes | 35,360 QPS | 0.028 ms |
| 100,000 nodes | 116,373 QPS | 0.008 ms |
| 1,000,000 nodes | 115,320 QPS | 0.008 ms |

Index-driven lookups achieve O(1) or O(log n) access. QPS is measured with simple `MATCH ... WHERE ... RETURN` queries on indexed properties.

These numbers show that throughput stays essentially flat with scale: QPS at 1M nodes matches 100K because index-based access eliminates full scans.
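A minimal sketch of why indexed lookups stay flat with scale, assuming a hash-based property index; `PropertyIndex` and its methods are hypothetical names, not Samyama's API:

```rust
use std::collections::HashMap;

// Illustrative hash-based property index: lookup cost depends on the number
// of matching nodes, not on total graph size, which is why QPS holds steady
// from 100K to 1M nodes.
#[derive(Default)]
struct PropertyIndex {
    by_value: HashMap<String, Vec<u64>>, // property value -> matching node ids
}

impl PropertyIndex {
    fn insert(&mut self, value: &str, node_id: u64) {
        self.by_value.entry(value.to_string()).or_default().push(node_id);
    }

    // O(1) average-case hash lookup; an unindexed scan would be O(n) in graph size.
    fn lookup(&self, value: &str) -> &[u64] {
        self.by_value.get(value).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut idx = PropertyIndex::default();
    idx.insert("Alice", 7);
    idx.insert("Alice", 42);
    assert_eq!(idx.lookup("Alice"), &[7, 42]);
    assert!(idx.lookup("Bob").is_empty());
}
```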

## GPU Acceleration: The Crossover Point

A key finding in the v0.5.12 benchmarks is the impact of memory transfer overhead on GPU acceleration.

| Algorithm | Scale (Nodes) | CPU Compute | GPU (inc. Transfer) | Speedup |
|---|---|---|---|---|
| PageRank | 10,000 | 0.6 ms | 9.3 ms | 0.06x (slowdown) |
| PageRank | 100,000 | 8.2 ms | 3.1 ms | 2.6x |
| PageRank | 1,000,000 | 92.4 ms | 11.2 ms | 8.2x |

Conclusion: For subgraphs smaller than 100,000 nodes, the CPU remains faster. Once the scale exceeds this “crossover point,” the GPU parallelism overcomes the memory transfer cost, leading to massive speedups.
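The crossover rule can be expressed as a small dispatch heuristic; `Backend`, `select_backend`, and the threshold handling are assumptions for illustration, with the 100,000-node figure taken from the table above:

```rust
// Hedged sketch of a backend-selection heuristic, not Samyama's internals.
#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

// Empirical crossover from the PageRank table above.
const GPU_CROSSOVER_NODES: usize = 100_000;

fn select_backend(node_count: usize, gpu_available: bool) -> Backend {
    // Below the crossover, host-to-device transfer dominates compute time,
    // so the CPU path wins even when a GPU is present.
    if gpu_available && node_count >= GPU_CROSSOVER_NODES {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}

fn main() {
    assert_eq!(select_backend(10_000, true), Backend::Cpu);
    assert_eq!(select_backend(1_000_000, true), Backend::Gpu);
    assert_eq!(select_backend(1_000_000, false), Backend::Cpu);
}
```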

### Vector Search (HNSW, k=10)

Vector search uses `hnsw_rs` (CPU) for graph traversal. In the Enterprise build, GPU acceleration is applied to batch re-ranking after retrieval.

| Metric (10K vectors, 128-dim) | CPU-Only | GPU Build |
|---|---|---|
| Cosine distance QPS | 15,872/s | 11,311/s |
| L2 distance QPS | 15,014/s | 10,429/s |
| Search 50K vectors | 10,446 QPS | 9,428 QPS |

Note: The slight slowdown in the GPU build for small vector searches is due to the initialization overhead of the GPU context.

### GPU at Scale: S-Size Datasets

On LDBC Graphalytics S-size datasets (millions of vertices), the GPU crossover becomes significant:

| Algorithm | Dataset | Vertices | Edges | CPU | GPU | Speedup |
|---|---|---|---|---|---|---|
| LCC | cit-Patents | 3.8M | 16.5M | 9.6s | 4.7s | 2.0x |
| CDLP | cit-Patents | 3.8M | 16.5M | 9.5s | 11.1s | 0.85x |
| PageRank | datagen-7_5-fb | 633K | 68.4M | CPU fallback | — | — |

Note: Extremely dense graphs (e.g., the 68M edges of `datagen-7_5-fb`) trigger CPU fallback due to the 256MB GPU buffer limit on Apple Silicon. Dedicated GPUs with larger VRAM can handle these datasets.
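The fallback decision described in the note can be sketched as a simple capacity check; the per-edge byte estimate and function name are assumptions, while the 256MB limit comes from the note above:

```rust
// Hypothetical capacity guard, not Samyama's actual dispatch code.
const MAX_GPU_BUFFER_BYTES: u64 = 256 * 1024 * 1024; // Apple Silicon buffer limit
const BYTES_PER_EDGE: u64 = 8; // assumed: packed u32 src + u32 dst per edge

fn fits_on_gpu(edge_count: u64) -> bool {
    match edge_count.checked_mul(BYTES_PER_EDGE) {
        Some(bytes) => bytes <= MAX_GPU_BUFFER_BYTES,
        None => false, // multiplication overflow: certainly too large
    }
}

fn main() {
    // cit-Patents (~16.5M edges): ~132MB, runs on the GPU.
    assert!(fits_on_gpu(16_500_000));
    // datagen-7_5-fb (~68.4M edges): ~547MB, falls back to CPU.
    assert!(!fits_on_gpu(68_400_000));
}
```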

## LDBC Graphalytics Validation

Samyama has achieved 100% validation against the LDBC Graphalytics benchmark suite—the industry standard for graph analytics correctness:

| Algorithm | XS Datasets (2) | S Datasets (3) | Total |
|---|---|---|---|
| BFS | ✅ 2/2 | ✅ 3/3 | 5/5 |
| PageRank | ✅ 2/2 | ✅ 3/3 | 5/5 |
| WCC | ✅ 2/2 | ✅ 3/3 | 5/5 |
| CDLP | ✅ 2/2 | ✅ 3/3 | 5/5 |
| LCC | ✅ 2/2 | ✅ 3/3 | 5/5 |
| SSSP | ✅ 2/2 | ✅ 1/1 | 3/3 |
| **Total** | 12/12 | 16/16 | 28/28 |

S-size datasets include cit-Patents (3.8M vertices), datagen-7_5-fb (633K vertices, 68M edges), and wiki-Talk (2.4M vertices). All results match LDBC reference outputs exactly.

Developer Tip: Run the validation yourself with `cargo bench --bench graphalytics_benchmark`. LDBC datasets are available in `data/graphalytics/`.

## LDBC SNB Interactive & BI Workloads

Beyond Graphalytics (which validates algorithm correctness), Samyama includes benchmark harnesses for the LDBC Social Network Benchmark (SNB) — the industry-standard workload for graph database query performance.

### SNB Interactive Workload

21 queries adapted for Samyama’s OpenCypher engine, plus 8 update operations:

| Category | Queries | Description |
|---|---|---|
| Interactive Short | IS1–IS7 | Point lookups: person profile, posts, friends |
| Interactive Complex | IC1–IC14 | Multi-hop traversals: friend-of-friend, common interests, shortest paths |
| Insert Operations | INS1–INS8 | Concurrent writes: new persons, posts, comments, friendships |

```sh
cargo bench --bench ldbc_benchmark                    # All 21 queries
cargo bench --bench ldbc_benchmark -- --query IC6     # Single query
cargo bench --bench ldbc_benchmark -- --updates       # Include writes
```

### SNB Business Intelligence (BI) Workload

20 complex analytical queries testing OLAP-style aggregation over the social network graph:

| Category | Queries | Description |
|---|---|---|
| BI Queries | BI-1 to BI-20 | Heavy aggregation, multi-hop analytics, temporal filtering |

Note: Several BI queries require features beyond current OpenCypher coverage (APOC, CASE, list comprehensions). These are adapted to simplified Cypher that captures the analytical intent using supported constructs.

```sh
cargo bench --bench ldbc_bi_benchmark
cargo bench --bench ldbc_bi_benchmark -- --query BI-1
```

Both workloads operate on the LDBC SF1 dataset loaded via `cargo run --example ldbc_loader`.

## LDBC FinBench Workload

Samyama also includes a harness for the LDBC Financial Benchmark (FinBench) — modeling financial transaction networks with accounts, persons, companies, loans, and mediums.

| Category | Queries | Description |
|---|---|---|
| Complex Reads | CR1–CR12 | Multi-hop fund transfers, blocked account detection, loan chains |
| Simple Reads | SR1–SR6 | Account lookups, transfer history, sign-in records |
| Read-Writes | RW1–RW3 | Mixed read-write transactions |
| Writes | W1–W19 | Account creation, transfers, loan operations |

40+ queries total, covering both OLTP and analytical patterns for financial graph workloads.

```sh
cargo bench --bench finbench_benchmark
cargo bench --bench finbench_benchmark -- --query CR-1
cargo bench --bench finbench_benchmark -- --writes    # Include write operations
```

Data is loaded via `cargo run --example finbench_loader`, which can generate synthetic FinBench-compatible datasets.

## The Power of Late Materialization

One of our most impactful architectural choices remains Late Materialization.

### Latency Impact (1M nodes)

| Query Type | Latency (Before) | Latency (After) | Improvement |
|---|---|---|---|
| 1-Hop Traversal | 164.11 ms | 41.00 ms | 4.0x |
| 2-Hop Traversal | 1,220.00 ms | 259.00 ms | 4.7x |
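The idea behind late materialization can be sketched as a simple id-based traversal; the function and data layout here are illustrative, not Samyama's actual operators:

```rust
use std::collections::HashMap;

// Illustrative sketch: the traversal moves integer ids through the pipeline,
// and property values are fetched ("materialized") only for rows that survive
// to RETURN. Early materialization would clone properties for every visited
// node instead, paying the copy cost for rows that are later discarded.
fn traverse_then_materialize(
    neighbors: &HashMap<u64, Vec<u64>>, // adjacency: node id -> neighbor ids
    names: &HashMap<u64, String>,       // property store, touched last
    start: u64,
    limit: usize,
) -> Vec<String> {
    neighbors
        .get(&start)
        .into_iter()
        .flatten()
        .take(limit) // prune on ids before touching any properties
        .filter_map(|id| names.get(id).cloned()) // materialize survivors only
        .collect()
}

fn main() {
    let neighbors = HashMap::from([(1u64, vec![2u64, 3, 4])]);
    let names = HashMap::from([
        (2u64, "b".to_string()),
        (3, "c".to_string()),
        (4, "d".to_string()),
    ]);
    // Only two property fetches happen, even though three neighbors exist.
    assert_eq!(traverse_then_materialize(&neighbors, &names, 1, 2), vec!["b", "c"]);
}
```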

### Bottleneck Analysis

Profiling our query engine reveals a shift in where time is spent:

| Component | Time | % of 1-Hop |
|---|---|---|
| Parse (Pest grammar) | ~22 ms | 54% |
| Plan (AST → Operators) | ~18 ms | 44% |
| Execute (Iteration) | <1 ms | 2% |

Conclusion: The actual execution of the graph traversal is sub-millisecond. The remaining overhead is in the language frontend (parsing and planning). Our roadmap includes AST Caching and Plan Memoization to bring warm-query latency down to the ~10ms range.

Note: These timings reflect cold-start conditions (first query execution). Subsequent queries benefit from OS-level page cache and instruction cache warmth, reducing total latency significantly.
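The planned AST cache might look roughly like this; `Ast`, `parse`, and `QueryCache` are stand-in names for illustration, not the actual frontend:

```rust
use std::collections::HashMap;

// Hedged sketch of AST caching: key the parsed tree on the query string so a
// warm query skips the Parse (and, by extension, Plan) cost entirely.
#[derive(Clone, PartialEq, Debug)]
struct Ast(String); // placeholder for the real parsed query tree

fn parse(query: &str) -> Ast {
    // stand-in for the ~22 ms Pest parse
    Ast(format!("parsed[{query}]"))
}

struct QueryCache {
    asts: HashMap<String, Ast>,
}

impl QueryCache {
    fn get_or_parse(&mut self, query: &str) -> Ast {
        self.asts
            .entry(query.to_string())
            .or_insert_with(|| parse(query)) // cold path runs the parser once
            .clone()
    }
}

fn main() {
    let mut cache = QueryCache { asts: HashMap::new() };
    let q = "MATCH (n) RETURN n";
    let cold = cache.get_or_parse(q);
    let warm = cache.get_or_parse(q); // served from cache, no re-parse
    assert_eq!(cold, warm);
    assert_eq!(cache.asts.len(), 1);
}
```

Plan memoization would extend the same idea one stage further, caching the operator tree keyed on the AST.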