Performance & Benchmarks
Samyama is designed for “Mechanical Sympathy”—aligning software data structures with the physical reality of modern CPU caches and high-speed NVMe storage.
Recent Benchmark Results (Mac Mini M4, 2026-02-26)
All benchmarks were run on a Mac Mini M4 (16 GB RAM, macOS). Results compare the Community (CPU-only) and Enterprise (GPU-accelerated via wgpu) builds.
Ingestion Throughput
Samyama achieves industry-leading ingestion rates on commodity hardware:
| Operation | CPU-Only (ops/sec) | GPU-Enabled (ops/sec) |
|---|---|---|
| Node Ingestion | 255,120 | 412,036 |
| Edge Ingestion | 4,211,342 | 5,242,096 |
Note: Edge ingestion is significantly faster because it primarily involves appending to adjacency lists and updating the WAL.
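The asymmetry in the note above can be sketched in a few lines. This is a minimal illustration, not Samyama's actual storage layer: edge ingestion is an amortized O(1) append to an adjacency list, with no per-edge property or index allocation.

```rust
use std::collections::HashMap;

// Minimal sketch (not Samyama's storage engine): edge ingestion as an
// amortized O(1) append to an out-adjacency list keyed by source node ID.
#[derive(Default)]
struct AdjacencyStore {
    out_edges: HashMap<u64, Vec<u64>>,
}

impl AdjacencyStore {
    fn add_edge(&mut self, src: u64, dst: u64) {
        // The common case is a plain Vec push; no per-edge index or
        // property allocation is required, which is why edge throughput
        // is an order of magnitude above node throughput.
        self.out_edges.entry(src).or_default().push(dst);
    }

    fn degree(&self, src: u64) -> usize {
        self.out_edges.get(&src).map_or(0, Vec::len)
    }
}

fn main() {
    let mut store = AdjacencyStore::default();
    for dst in 0..1_000 {
        store.add_edge(42, dst);
    }
    println!("out-degree of node 42: {}", store.degree(42));
}
```

Node ingestion, by contrast, must allocate property storage and update secondary indexes per node, which accounts for the lower (but still six-figure) rate.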
Cypher Query Throughput (OLTP)
For transactional workloads, Samyama’s index-driven execution delivers consistent sub-millisecond latencies:
| Graph Scale | Queries/sec | Avg Latency |
|---|---|---|
| 10,000 nodes | 35,360 QPS | 0.028 ms |
| 100,000 nodes | 116,373 QPS | 0.008 ms |
| 1,000,000 nodes | 115,320 QPS | 0.008 ms |
Index-driven lookups achieve O(1) or O(log n) access. QPS is measured with simple `MATCH ... WHERE ... RETURN` queries on indexed properties.
These numbers demonstrate that per-query cost is effectively independent of graph size: throughput at 1M nodes matches 100K because index-based access eliminates full scans.
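The flat scaling follows from the index data structures involved. As a generic sketch (these are standard-library stand-ins, not Samyama's index implementation), a hash index gives O(1) exact-match lookups, while an ordered index gives O(log n) lookups plus range predicates:

```rust
use std::collections::{BTreeMap, HashMap};

// Sketch of why indexed lookups stay flat as the graph grows
// (std collections as stand-ins, not Samyama's internals).
fn main() {
    let n = 1_000_000u64;

    // Hash index: property value -> node ID (exact match, O(1)).
    let hash_index: HashMap<u64, u64> = (0..n).map(|id| (id * 2, id)).collect();

    // Ordered index: supports `WHERE p > x` style range predicates, O(log n).
    let btree_index: BTreeMap<u64, u64> = (0..n).map(|id| (id * 2, id)).collect();

    let node = hash_index.get(&1_000).copied();
    let in_range = btree_index.range(10..=20).count();
    println!("exact match: {:?}, keys in [10, 20]: {}", node, in_range);
}
```

Neither lookup cost depends on total graph size in any meaningful way, which is exactly the behavior the 100K-vs-1M rows show.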
GPU Acceleration: The Crossover Point
A key finding in the v0.5.12 benchmarks is the impact of memory transfer overhead on GPU acceleration.
| Algorithm | Scale (Nodes) | CPU Compute | GPU (inc. Transfer) | Speedup |
|---|---|---|---|---|
| PageRank | 10,000 | 0.6 ms | 9.3 ms | 0.06x (Slowdown) |
| PageRank | 100,000 | 8.2 ms | 3.1 ms | 2.6x |
| PageRank | 1,000,000 | 92.4 ms | 11.2 ms | 8.2x |
Conclusion: For subgraphs smaller than 100,000 nodes, the CPU remains faster. Once the scale exceeds this “crossover point,” the GPU parallelism overcomes the memory transfer cost, leading to massive speedups.
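The crossover point can be estimated with a simple linear cost model. The coefficients below are illustrative, not measured: assume the CPU costs `a` seconds per node, the GPU costs `b` seconds per node (with `b < a`) plus a fixed transfer/dispatch cost `c`; the GPU wins once n exceeds c / (a − b).

```rust
// Back-of-the-envelope crossover model (coefficients are illustrative,
// not measured): CPU time = a*n, GPU time = c + b*n with b < a.
// Setting a*n = c + b*n and solving gives the crossover n* = c / (a - b).
fn crossover_nodes(a: f64, b: f64, c: f64) -> f64 {
    assert!(a > b, "GPU must have a lower per-node cost for a crossover to exist");
    c / (a - b)
}

fn main() {
    // Hypothetical costs: 100 ns/node on CPU, 12 ns/node on GPU,
    // and 8 ms of fixed transfer/dispatch overhead.
    let (a, b, c) = (100e-9, 12e-9, 8e-3);
    let n_star = crossover_nodes(a, b, c);
    println!("estimated crossover: ~{:.0} nodes", n_star);
}
```

With these assumed coefficients the model lands near 91K nodes, consistent with the observed crossover around 100,000 nodes.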
Vector Search (HNSW, k=10)
Vector search utilizes `hnsw_rs` (CPU) for graph traversal. GPU acceleration in Enterprise is used for batch re-ranking after retrieval.
| Metric (128-dim vectors) | CPU-Only | GPU Build |
|---|---|---|
| Cosine distance (10K vectors) | 15,872 QPS | 11,311 QPS |
| L2 distance (10K vectors) | 15,014 QPS | 10,429 QPS |
| Search (50K vectors) | 10,446 QPS | 9,428 QPS |
Note: The slight slowdown in the GPU build for small vector searches is due to the initialization overhead of the GPU context.
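For reference, the two distance metrics benchmarked above look like this on the CPU. This is plain scalar code for illustration, not the `hnsw_rs` internals (which are SIMD-optimized):

```rust
// The two distance metrics from the vector-search benchmark, written as
// plain scalar CPU code for illustration (hnsw_rs uses optimized kernels).

/// Cosine distance: 1 - cos(theta). 0 for parallel vectors, 1 for orthogonal.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

/// Squared L2 (Euclidean) distance; the sqrt is skipped since it
/// preserves ranking order for nearest-neighbor search.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn main() {
    let a = vec![1.0, 0.0, 0.0];
    let b = vec![0.0, 1.0, 0.0];
    println!("cosine: {}, l2^2: {}", cosine_distance(&a, &b), l2_squared(&a, &b));
}
```

At 128 dimensions each distance call is a few hundred floating-point operations, cheap enough that a one-time GPU context setup dominates at small scale, matching the note above.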
GPU at Scale: S-Size Datasets
On LDBC Graphalytics S-size datasets (millions of vertices), the GPU crossover becomes significant:
| Algorithm | Dataset | Vertices | Edges | CPU | GPU | Speedup |
|---|---|---|---|---|---|---|
| LCC | cit-Patents | 3.8M | 16.5M | 9.6s | 4.7s | 2.0x |
| CDLP | cit-Patents | 3.8M | 16.5M | 9.5s | 11.1s | 0.85x |
| PageRank | datagen-7_5-fb | 633K | 68.4M | — | CPU fallback | — |
Note: Extremely dense graphs (e.g., 68M edges on datagen-7_5-fb) trigger CPU fallback due to the 256MB GPU buffer limit on Apple Silicon. Dedicated GPUs with larger VRAM can handle these datasets.
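A rough sizing check makes the fallback concrete. Assuming a CSR-style layout with 4-byte edge targets (an assumption for illustration; the actual buffer layout may differ), the dense dataset overshoots the 256 MB limit while cit-Patents fits comfortably:

```rust
// Rough sizing check for the 256 MB GPU buffer limit mentioned above.
// Assumes a CSR-style layout at 4 bytes per edge target; the real
// layout (weights, offsets) may need more.
fn edge_buffer_bytes(edges: u64, bytes_per_edge: u64) -> u64 {
    edges * bytes_per_edge
}

fn main() {
    const LIMIT: u64 = 256 * 1024 * 1024; // 256 MB buffer limit

    let cit_patents = edge_buffer_bytes(16_500_000, 4); // ~66 MB: fits
    let datagen = edge_buffer_bytes(68_400_000, 4);     // ~273 MB: does not fit

    println!(
        "cit-Patents: {} MiB (fits: {}), datagen-7_5-fb: {} MiB (fits: {})",
        cit_patents / (1024 * 1024),
        cit_patents <= LIMIT,
        datagen / (1024 * 1024),
        datagen <= LIMIT,
    );
}
```

Even under this conservative 4-bytes-per-edge assumption, 68.4M edges exceed the limit, so the CPU fallback triggers before any real payload (weights, offsets) is accounted for.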
LDBC Graphalytics Validation
Samyama has achieved 100% validation against the LDBC Graphalytics benchmark suite—the industry standard for graph analytics correctness:
| Algorithm | XS Datasets (2) | S Datasets (3) | Total |
|---|---|---|---|
| BFS | ✅ 2/2 | ✅ 3/3 | 5/5 |
| PageRank | ✅ 2/2 | ✅ 3/3 | 5/5 |
| WCC | ✅ 2/2 | ✅ 3/3 | 5/5 |
| CDLP | ✅ 2/2 | ✅ 3/3 | 5/5 |
| LCC | ✅ 2/2 | ✅ 3/3 | 5/5 |
| SSSP | ✅ 2/2 | ✅ 1/1 | 3/3 |
| Total | 12/12 | 16/16 | 28/28 |
S-size datasets include cit-Patents (3.8M vertices), datagen-7_5-fb (633K vertices, 68M edges), and wiki-Talk (2.4M vertices). All results match LDBC reference outputs exactly.
Developer Tip: Run the validation yourself with:

```shell
cargo bench --bench graphalytics_benchmark
```

LDBC datasets are available in `data/graphalytics/`.
LDBC SNB Interactive & BI Workloads
Beyond Graphalytics (which validates algorithm correctness), Samyama includes benchmark harnesses for the LDBC Social Network Benchmark (SNB) — the industry-standard workload for graph database query performance.
SNB Interactive Workload
21 queries adapted for Samyama’s OpenCypher engine, plus 8 update operations:
| Category | Queries | Description |
|---|---|---|
| Interactive Short | IS1–IS7 | Point lookups: person profile, posts, friends |
| Interactive Complex | IC1–IC14 | Multi-hop traversals: friend-of-friend, common interests, shortest paths |
| Insert Operations | INS1–INS8 | Concurrent writes: new persons, posts, comments, friendships |
```shell
cargo bench --bench ldbc_benchmark                 # All 21 queries
cargo bench --bench ldbc_benchmark -- --query IC6  # Single query
cargo bench --bench ldbc_benchmark -- --updates    # Include writes
```
SNB Business Intelligence (BI) Workload
20 complex analytical queries testing OLAP-style aggregation over the social network graph:
| Category | Queries | Description |
|---|---|---|
| BI Queries | BI-1 to BI-20 | Heavy aggregation, multi-hop analytics, temporal filtering |
Note: Several BI queries require features beyond current OpenCypher coverage (APOC, CASE, list comprehensions). These are adapted to simplified Cypher that captures the analytical intent using supported constructs.
```shell
cargo bench --bench ldbc_bi_benchmark                  # All 20 BI queries
cargo bench --bench ldbc_bi_benchmark -- --query BI-1  # Single query
```
Both workloads operate on the LDBC SF1 dataset loaded via `cargo run --example ldbc_loader`.
LDBC FinBench Workload
Samyama also includes a harness for the LDBC Financial Benchmark (FinBench) — modeling financial transaction networks with accounts, persons, companies, loans, and mediums.
| Category | Queries | Description |
|---|---|---|
| Complex Reads | CR1–CR12 | Multi-hop fund transfers, blocked account detection, loan chains |
| Simple Reads | SR1–SR6 | Account lookups, transfer history, sign-in records |
| Read-Writes | RW1–RW3 | Mixed read-write transactions |
| Writes | W1–W19 | Account creation, transfers, loan operations |
40+ queries total, covering both OLTP and analytical patterns for financial graph workloads.
```shell
cargo bench --bench finbench_benchmark                 # All queries
cargo bench --bench finbench_benchmark -- --query CR-1 # Single query
cargo bench --bench finbench_benchmark -- --writes     # Include write operations
```
Data is loaded via `cargo run --example finbench_loader`, which can generate synthetic FinBench-compatible datasets.
The Power of Late Materialization
One of our most impactful architectural choices remains Late Materialization.
Latency Impact (1M nodes)
| Query Type | Before (eager materialization) | After (late materialization) | Improvement |
|---|---|---|---|
| 1-Hop Traversal | 164.11 ms | 41.00 ms | 4.0x |
| 2-Hop Traversal | 1,220.00 ms | 259.00 ms | 4.7x |
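The idea behind those numbers can be sketched simply. This is a simplified illustration, not Samyama's operator tree: the traversal pipeline carries only node IDs, and full property data is fetched from storage for the final result rows alone.

```rust
use std::cell::Cell;
use std::collections::HashMap;

// Sketch of late materialization (simplified, not Samyama's operators):
// intermediate traversal state is just u64 IDs; properties are fetched
// only for the rows that survive to the end of the pipeline.
struct PropertyStore {
    props: HashMap<u64, String>,
    fetches: Cell<usize>, // counts materialization calls
}

impl PropertyStore {
    fn materialize(&self, id: u64) -> Option<&String> {
        self.fetches.set(self.fetches.get() + 1);
        self.props.get(&id)
    }
}

fn main() {
    let store = PropertyStore {
        props: (0..1_000u64).map(|id| (id, format!("node-{id}"))).collect(),
        fetches: Cell::new(0),
    };

    // A 1-hop traversal yields 1,000 candidate IDs, but a LIMIT 3 runs
    // *before* materialization, so only 3 property fetches ever happen.
    let neighbors: Vec<u64> = (0..1_000).collect();
    let results: Vec<&String> = neighbors
        .iter()
        .take(3)
        .filter_map(|&id| store.materialize(id))
        .collect();

    println!("rows: {}, property fetches: {}", results.len(), store.fetches.get());
}
```

Eager materialization would have paid 1,000 property fetches here; late materialization pays 3, which is the same effect driving the 4-5x latency wins above.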
Bottleneck Analysis
Profiling our query engine reveals a shift in where time is spent:
| Component | Time | % of 1-Hop |
|---|---|---|
| Parse (Pest grammar) | ~22ms | 54% |
| Plan (AST → Operators) | ~18ms | 44% |
| Execute (Iteration) | <1ms | 2% |
Conclusion: The actual execution of the graph traversal is sub-millisecond. The remaining overhead is in the language frontend (parsing and planning). Our roadmap includes AST Caching and Plan Memoization to bring warm-query latency down to the ~10ms range.
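The planned cache is straightforward in outline. The sketch below is a hypothetical API, not a shipped feature: keying plans by raw query text means repeated executions skip the ~40 ms parse-and-plan cost entirely.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the planned AST/plan cache (not a shipped
// feature): plans are keyed by raw query text, so only the first
// execution of a given query pays the parse + plan cost.
#[derive(Clone)]
struct Plan {
    description: String, // stand-in for a real operator tree
}

#[derive(Default)]
struct PlanCache {
    cache: HashMap<String, Plan>,
    misses: usize,
}

impl PlanCache {
    fn get_or_plan(&mut self, query: &str) -> Plan {
        if let Some(plan) = self.cache.get(query) {
            return plan.clone(); // warm path: no parse, no plan
        }
        self.misses += 1; // cold path: full parse + plan (~40 ms)
        let plan = Plan { description: format!("plan for: {query}") };
        self.cache.insert(query.to_string(), plan.clone());
        plan
    }
}

fn main() {
    let mut cache = PlanCache::default();
    let q = "MATCH (n:Person) WHERE n.id = $id RETURN n";
    for _ in 0..100 {
        cache.get_or_plan(q);
    }
    println!("parse/plan invocations for 100 executions: {}", cache.misses);
}
```

One design note: caching by exact query text works best with parameterized queries (`$id` above), since inlined literal values would produce a new cache key per call.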
Note: These timings reflect cold-start conditions (first query execution). Subsequent queries benefit from OS-level page cache and instruction cache warmth, reducing total latency significantly.