Research Paper: Knowledge Graphs for Industrial Operations
We have published a research paper evaluating knowledge graphs as the data layer for LLM-based industrial asset operations, building on the AssetOpsBench benchmark.
Title: Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations
Authors: Madhulatha Mandarapu (madhulatha@samyama.ai), Sandeep Kunkunuru (sandeep@samyama.ai)
March 2026 | GitHub (assetops-kg) | IBM AssetOpsBench
Keywords: Knowledge Graphs, Large Language Models, Industrial Asset Operations, Benchmark, OpenCypher, Vector Search, Graph Algorithms.
Download PDF
- Paper PDF — arXiv-ready LaTeX version (12 pages)
- arXiv upload bundle — .tex + .bib files for arXiv submission
Abstract
LLM-based agents for industrial asset operations show promise but achieve limited accuracy when reasoning over flat document stores. The AssetOpsBench benchmark establishes that GPT-4 agents achieve 65% success on 139 industrial maintenance scenarios backed by CouchDB, YAML, and CSV data sources. AssetOpsBench evaluates LLM agent autonomy; we ask a complementary question: how much does the data model behind the tools affect agent performance?
Building on the same benchmark data and scenarios, we introduce a knowledge graph layer (781 nodes, 955 edges, 16 relationship types) and evaluate three architectures of increasing LLM involvement:
| Architecture | LLM Role | Pass Rate | Avg Latency |
|---|---|---|---|
| Deterministic + graph | None (pre-coded) | 99% (137/139) | 63 ms |
| LLM + graph via NLQ | Generates Cypher | 83% (115/139) | 5,874 ms |
| Baseline (tool-augmented LLM) | Does everything | ~65% (91/139) | not reported |
Our key finding is inverted LLM usage: instead of asking the LLM to reason over raw data (a broad, error-prone task), we ask it to generate structured queries from a typed schema — a narrow problem that plays to LLM strengths. The graph then executes deterministically.
Thesis
For structured operational domains, the data model is the primary bottleneck. A knowledge graph with typed relationships enables both deterministic queries (for known patterns) and LLM-assisted queries (for novel questions), while document stores place the full data-reasoning burden on the LLM — a task where LLMs consistently struggle.
Three Architectures
Baseline: Tool-Augmented LLM (65%)
User question
→ LLM parses intent → LLM selects tool → Tool queries document store
→ LLM interprets raw results → LLM synthesizes answer
The LLM handles intent parsing, tool selection, argument crafting, data interpretation, and answer synthesis. GPT-4 achieves 65% under this architecture. Failures cluster around counting, cross-document correlation, and relationship traversal — that is, failures of data operations rather than of reasoning.
NLQ: LLM Generates Queries (83%)
User question
→ LLM generates Cypher (given schema)
→ Graph executes deterministically
→ LLM synthesizes answer from structured results
We invert the LLM’s role: instead of broad data reasoning, ask it to generate a Cypher query from a typed schema. This is code generation — a task LLMs excel at. The graph handles traversal, counting, and algorithms deterministically.
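The NLQ flow can be sketched as follows. The LLM never sees the data — only the typed schema and the question — and is asked for a single Cypher query. All names here (`SCHEMA`, `build_cypher_prompt`) are illustrative, not the paper's actual code; the prompt would go to any chat-completion API.

```python
# Sketch of the NLQ architecture: the LLM's task is narrowed to
# schema-grounded code generation. Names are illustrative.

SCHEMA = """\
(:Site)-[:CONTAINS_LOCATION]->(:Location)-[:CONTAINS_EQUIPMENT]->(:Equipment)
(:Equipment)-[:HAS_SENSOR]->(:Sensor)
(:Equipment)-[:DEPENDS_ON]->(:Equipment)
(:Equipment)-[:EXPERIENCED]->(:FailureMode)
(:WorkOrder)-[:FOR_EQUIPMENT]->(:Equipment)
(:WorkOrder)-[:ADDRESSES]->(:FailureMode)
"""

def build_cypher_prompt(question: str) -> str:
    """Narrow task for the LLM: write one query against a typed schema."""
    return (
        "You are given this property-graph schema:\n"
        f"{SCHEMA}\n"
        "Write ONE OpenCypher query that answers the question. "
        "Return only the query, no explanation.\n"
        f"Question: {question}"
    )

prompt = build_cypher_prompt("How many work orders address failure mode FM-12?")
# The returned Cypher is then executed deterministically by the graph engine,
# and the structured result goes back to the LLM for answer synthesis.
```

The schema in the prompt is the contract: the LLM can only emit labels and edge types that actually exist, which is what makes the generated queries executable.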
Deterministic: No LLM (99%)
User question
→ Keyword routing → Cypher query → Structured response
Pre-coded handlers for known patterns. A software-engineering solution that demonstrates the ceiling achievable with the right data model: 63 ms average latency, zero token cost.
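A minimal sketch of such a pre-coded handler layer, assuming keyword routing onto parameterized Cypher templates. The patterns and templates below are invented for illustration; the real handlers cover the benchmark's known scenario shapes.

```python
# Deterministic routing: known question patterns map straight to
# parameterized Cypher templates. No LLM anywhere on this path.
import re

ROUTES = [
    (re.compile(r"sensors?\s+on\s+(\S+)", re.I),
     "MATCH (e:Equipment {id:$id})-[:HAS_SENSOR]->(s:Sensor) RETURN s.id"),
    (re.compile(r"work\s*orders?\s+for\s+(\S+)", re.I),
     "MATCH (w:WorkOrder)-[:FOR_EQUIPMENT]->(e:Equipment {id:$id}) RETURN w.id"),
]

def route(question: str):
    """Return (cypher, params) for a recognized pattern, else None."""
    for pattern, cypher in ROUTES:
        m = pattern.search(question)
        if m:
            return cypher, {"id": m.group(1)}
    return None  # unknown pattern: fall back to NLQ or refuse

q, params = route("List the work orders for CHILLER-3")
```

Millisecond latency follows directly: the hot path is a regex scan plus one indexed graph query.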
The Inverted LLM Pattern
The key insight: schema-aware query generation outperforms free-form data reasoning for any structured domain.
- Architecture A asks: “LLM, answer this question from this data” (broad, error-prone)
- Architecture B asks: “LLM, given this schema, write a Cypher query” (narrow, plays to strengths)
The same LLM, given a sharper problem scoped to its strengths, produces dramatically better results. Code generation is an LLM strength; data traversal, counting, and relationship reasoning are graph strengths. Each system does what it’s good at.
Knowledge Graph Schema
781 nodes, 955 edges, 11 labels, 16 edge types
Built from the AssetOpsBench data sources via an 8-step ETL pipeline:
Site ─[CONTAINS_LOCATION]→ Location ─[CONTAINS_EQUIPMENT]→ Equipment ─[HAS_SENSOR]→ Sensor
│
DEPENDS_ON / SHARES_SYSTEM_WITH
│
FailureMode ─[MONITORS]→ Equipment ─[EXPERIENCED]→ FailureMode
WorkOrder ─[FOR_EQUIPMENT]→ Equipment
WorkOrder ─[ADDRESSES]→ FailureMode
Anomaly ─[TRIGGERED]→ WorkOrder
Event ─[FOR_EQUIPMENT]→ Equipment
Key additions over the baseline document model:
- Equipment dependencies: `DEPENDS_ON` and `SHARES_SYSTEM_WITH` edges enable cascade analysis
- Failure mode embeddings: 384-dim Sentence-BERT vectors in an HNSW index enable similarity search
- Unified event timeline: 6,256 events with ISO timestamps enable temporal queries
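The similarity-search addition reduces to nearest-neighbor lookup over 384-dim vectors. A minimal sketch, with random toy vectors standing in for the Sentence-BERT embeddings and a brute-force cosine scan standing in for the HNSW index:

```python
# Failure-mode similarity as cosine nearest neighbors. The real system
# uses Sentence-BERT embeddings in an HNSW index; this brute-force scan
# over toy 384-dim vectors shows the operation the index accelerates.
import numpy as np

rng = np.random.default_rng(0)
embeddings = {f"FM-{i}": rng.standard_normal(384) for i in range(5)}

def most_similar(query_id: str, k: int = 2):
    """Top-k failure modes by cosine similarity, excluding the query itself."""
    q = embeddings[query_id]
    q = q / np.linalg.norm(q)
    scored = []
    for fm, v in embeddings.items():
        if fm == query_id:
            continue
        scored.append((float(q @ (v / np.linalg.norm(v))), fm))
    return [fm for _, fm in sorted(scored, reverse=True)[:k]]

neighbors = most_similar("FM-0")
```

HNSW replaces the linear scan with an approximate graph search, which is what keeps similarity queries fast as the failure-mode catalog grows.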
AssetOpsBench 139 Scenarios — Per-Type Results
| Type | Count | Deterministic | NLQ (GPT-4o) | Baseline (GPT-4) |
|---|---|---|---|---|
| IoT | 20 | 20/20 (100%) | 17/20 (85%) | — |
| FMSR | 40 | 40/40 (100%) | 37/40 (93%) | — |
| TSFM | 23 | 23/23 (100%) | 21/23 (91%) | — |
| Multi | 20 | 20/20 (100%) | 8/20 (40%) | — |
| WO | 36 | 34/36 (94%) | 32/36 (89%) | — |
| Total | 139 | 137/139 (99%) | 115/139 (83%) | ~91/139 (65%) |
NLQ accuracy on Multi scenarios is capped at 40% because 12 of the 20 scenarios require TSFM pipeline execution (forecasting, anomaly detection) that cannot be expressed as a Cypher query — a structural limitation.
Custom 40 Scenarios — Graph-Native Capabilities
40 new scenarios extending the benchmark with graph-native capabilities:
| Category | Count | GPT-4o Avg | Samyama Avg | Delta |
|---|---|---|---|---|
| Failure similarity | 6 | 0.501 | 0.902 | +0.401 |
| Criticality analysis | 5 | 0.566 | 0.938 | +0.372 |
| Root cause analysis | 5 | 0.580 | 0.934 | +0.354 |
| Multi-hop dependency | 8 | 0.618 | 0.934 | +0.316 |
| Maintenance optimization | 5 | 0.634 | 0.931 | +0.297 |
| Cross-asset correlation | 6 | 0.638 | 0.929 | +0.291 |
| Temporal pattern | 5 | 0.679 | 0.923 | +0.244 |
The largest gains are on failure similarity (+0.401) and criticality analysis (+0.372) — exactly where graph structure and vector search provide the most value. GPT-4o's 6 failures all require graph traversal, PageRank, or vector search, operations that LLMs cannot perform from parametric knowledge alone.
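PageRank-style criticality is a good example of an operation that must be executed, not recalled. A toy sketch over an invented equipment-dependency graph (edges point from dependent to dependency, so heavily depended-on assets accumulate rank):

```python
# Toy PageRank over DEPENDS_ON edges: the kind of graph algorithm an
# LLM cannot run from parametric knowledge. Edges are invented.
def pagerank(edges, damping=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            targets = out[u] or list(nodes)  # dangling node: spread evenly
            share = damping * rank[u] / len(targets)
            for v in targets:
                nxt[v] += share
        rank = nxt
    return rank

edges = [("Pump-1", "Chiller-1"), ("Pump-2", "Chiller-1"),
         ("AHU-1", "Chiller-1"), ("Chiller-1", "Transformer-1")]
ranks = pagerank(edges)
# Chiller-1 and Transformer-1 accumulate the most rank: many assets
# depend on them, so they are the criticality hot spots.
```

In the graph architecture this runs as a native algorithm over the stored edges; the LLM only interprets the ranked result.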
The Full Pipeline: LLMs at the Edges, Graph in the Middle
The query layer comparison above is only part of the story. The full industrial data pipeline has three layers:
- Data Ingestion (software engineering): Structured data (90%+) → deterministic ETL. Unstructured data (maintenance logs, PDFs) → LLM-assisted entity extraction, resolution, classification.
- Data Model (architecture decision): One-time choice between flat documents and knowledge graph.
- Query (LLM optional): Deterministic handlers for known patterns; LLM-generated Cypher for novel questions.
LLMs appear at both edges — data preparation (unstructured → structured) and query generation (natural language → Cypher). The graph is the stable center that receives data from both deterministic and LLM-assisted ingestion, and serves both deterministic and LLM-generated queries.
In both cases, the LLM performs a generation task (structured output from unstructured input) — its strength. The graph handles data operations (storage, traversal, algorithms) — its strength. Neither component is asked to do what it’s bad at.
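The query-side layering can be sketched as a dispatcher: known patterns resolve deterministically, and only unrecognized questions pay the token cost of LLM Cypher generation. The handler table, `classify`, and `nlq_fallback` below are illustrative stubs, not the project's API.

```python
# Query layer as described: deterministic handlers first, LLM-generated
# Cypher only as a fallback for novel questions. All names illustrative.

HANDLERS = {
    "sensor_count": "MATCH (:Equipment {id:$id})-[:HAS_SENSOR]->(s) RETURN count(s)",
}

def answer(question: str, classify, nlq_fallback):
    """classify: question -> (handler_key, params) or None."""
    routed = classify(question)
    if routed is not None:
        key, params = routed
        return ("deterministic", HANDLERS[key], params)
    # Novel question: hand the schema + question to the LLM instead.
    return ("nlq", nlq_fallback(question), {})

result = answer(
    "How many sensors on CHILLER-3?",
    classify=lambda q: ("sensor_count", {"id": "CHILLER-3"}) if "sensors" in q else None,
    nlq_fallback=lambda q: "MATCH ...  // would be LLM-generated",
)
```

Either way the graph executes the final query, which is what keeps both paths converging on the same stable data model.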
Scalability
| Dimension | Arch. A (LLM + docs) | Arch. B/C (graph ± LLM) |
|---|---|---|
| 10K queries/day | $300–500 (tokens) | $0 (deterministic) or ~$30 (NLQ) |
| Real-time streaming | Not supported | Graph updates + continuous queries |
| Multi-hop at 10K assets | LLM reasons across 10K docs | BFS traversal, O(|E|) |
| Latency per query | 5–11 seconds | 63 ms (det.) / ~6 s (NLQ) |
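The multi-hop row comes down to ordinary breadth-first search. A minimal sketch with a toy adjacency list standing in for the `DEPENDS_ON` edges:

```python
# Multi-hop dependency traversal as plain BFS: cost is linear in the
# number of edges regardless of fleet size, versus asking an LLM to
# correlate thousands of documents. Toy adjacency, invented asset ids.
from collections import deque

DEPENDS_ON = {
    "AHU-1": ["Chiller-1"],
    "Chiller-1": ["Pump-1", "Transformer-1"],
    "Pump-1": ["Transformer-1"],
    "Transformer-1": [],
}

def downstream(start: str):
    """Everything `start` transitively depends on (cascade analysis)."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for dep in DEPENDS_ON.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

deps = downstream("AHU-1")  # one hop to Chiller-1, then its dependencies
```

At 10K assets this stays a single sub-second graph operation; no token cost scales with fleet size.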
Honest Caveats
- Deterministic vs. autonomous: The 99% result compares pre-coded answers against an autonomous agent — fundamentally different tasks. The comparison illustrates the ceiling achievable with the right data model, not a claim of superior agent intelligence.
- Model mismatch: The baseline used GPT-4; NLQ used GPT-4o. The +18pp gap is an upper bound. Same-model comparison pending.
- Clean data: AssetOpsBench provides clean, structured data. Real-world messy data needs LLM-assisted preparation.
- Custom scenarios: Designed to extend the benchmark with graph-native capabilities, not replace the original scenarios.
- Complementary research questions: AssetOpsBench evaluates LLM agent autonomy. We evaluate data model impact. Both are valid; our results do not diminish the value of the original benchmark.
Conclusion
Building on AssetOpsBench, we show that introducing a knowledge graph as the data layer improves LLM-based industrial operations at every level of LLM involvement. For structured operational domains, the data model is the primary bottleneck. The inverted LLM pattern (schema-aware query generation instead of free-form data reasoning) is generalizable to any structured domain.
Implementation
- Benchmark code: samyama-ai/assetops-kg
- Graph database: samyama-ai/samyama-graph
- Rust demo: `cargo run --example industrial_kg_demo` (871 lines)
- Python SDK: `pip install samyama` (PyPI)
- Community PR: AssetOpsBench PR #203 — 40 new graph-native scenarios contributed back to the benchmark