SysML v2 as a Composable Knowledge Graph

Andrew Dunn  ·  Nomograph Labs  ·  March 2026

The Short Version

Three artifacts, MIT-licensed:

tree-sitter-sysml
Parser. 192 tests. 89% external file coverage.

sysml
CLI tool (MCP server built in).
14 commands, 10 MCP tools.
9-signal hybrid index.

sysml-bench
132 tasks, 4 models,
40+ conditions, N=3–5.

We think we built the SysML v2 equivalent of what GKG is building for code: a tree-sitter parser, a knowledge graph, and a CLI tool with an MCP server built in. Then we ran a benchmark to measure what actually helps LLMs comprehend structured engineering models.

The short answer: how you present information to the model matters more than how you retrieve it. Most of the energy in the tool-augmented LLM space right now is going into retrieval infrastructure: vector databases, graph traversal, multi-hop reasoning chains. On our benchmark, none of those interventions moved the needle. What did move it was representation: pre-rendered views, tool selection guidance, and letting the agent retrieve information in the right form, step by step, through sequential tool calls.

This is an exploratory study on a single corpus. We are sharing it because the architectural parallel to GKG is direct, the findings are relevant to decisions your team is making now, and we would genuinely like to work on this problem alongside you.


What We Built

tree-sitter-sysml — Tree-sitter grammar for SysML v2 (OMG-adopted June 2025). Built by curating a corpus of real-world SysML v2 models and iterating the grammar against it: run the parser, find failures, fix the grammar, repeat. 89% coverage on external files. 192 tests passing. Bindings for Rust, C, Node.js, Python, Go, and Swift. MIT.

sysml — Rust CLI tool with an MCP server built in. Indexes .sysml repositories into a persistent knowledge graph and exposes it through 14 CLI commands and 10 MCP tools. 9-signal hybrid index: keyword scoring (8 signals including exact match, prefix, containment, vocabulary expansion, relationship adjacency) plus fastembed all-MiniLM-L6-v2 vector search (384-dim, HNSW) and 27 SysML v2 structural relationship types. 94% average token reduction vs raw file injection. 123 tests.
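To make the hybrid index concrete, here is a minimal Python sketch of blending keyword and vector signals. The signal names, weights, and the blending parameter are illustrative assumptions; the real index uses 8 keyword signals (including vocabulary expansion and relationship adjacency) plus HNSW vector search, all omitted or simplified here.

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative weights; the real index scores 8 keyword signals.
WEIGHTS = {"exact": 3.0, "prefix": 1.5, "contains": 1.0}

def keyword_score(query, name):
    q, n = query.lower(), name.lower()
    score = 0.0
    if q == n:
        score += WEIGHTS["exact"]
    if n.startswith(q):
        score += WEIGHTS["prefix"]
    if q in n:
        score += WEIGHTS["contains"]
    return score

def hybrid_score(query, element, query_vec, alpha=0.5):
    # blend keyword and vector signals; alpha is a tuning assumption
    kw = keyword_score(query, element["name"])
    vec = cosine(query_vec, element["embedding"])
    return alpha * kw + (1 - alpha) * vec
```

The design point is that keyword signals dominate when the query names an element directly, while the vector term recovers semantically related elements when no lexical signal fires.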

sysml-bench — Python evaluation harness. 132 tasks across 8 categories: discovery, reasoning, explanation, layer, boundary, vector-sensitive, structural trace, and corpus scaling. Per-field structured scoring (Bool, Float, Str, ListStr F1 with threshold). Corpus: Eve Online Mining Frigate SysML v2 model, 19 files, 798 elements, 1,515 relationships.
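A minimal sketch of the per-field scoring scheme, assuming exact match for Bool, tolerance match for Float, normalized match for Str, and set-based F1 with a cutoff for ListStr. The exact normalization and tolerance rules here are our illustrative assumptions, not the harness's literal implementation.

```python
def list_f1(pred, gold):
    # set-based F1 between predicted and gold string lists
    p, g = set(pred), set(gold)
    if not p and not g:
        return 1.0
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def score_field(kind, pred, gold, f1_threshold=0.5, float_tol=1e-6):
    # Per-field scoring sketch; normalization rules are assumptions.
    if kind == "Bool":
        return 1.0 if pred == gold else 0.0
    if kind == "Float":
        return 1.0 if abs(pred - gold) <= float_tol else 0.0
    if kind == "Str":
        return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    if kind == "ListStr":
        score = list_f1(pred, gold)
        return score if score >= f1_threshold else 0.0
    raise ValueError(f"unknown field kind: {kind}")
```

Thresholded F1 on list fields keeps partial credit for mostly-right answers while zeroing out answers that only graze the gold set.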

Architectural parallel with GKG (Orbit)

Layer | GitLab GKG | Nomograph SysML
Parsing | Rust + tree-sitter | Rust + tree-sitter-sysml
Data model | KuzuDB (multi-hop) | 27-type SysML index
Retrieval | Hybrid (TERAG) | 9-signal hybrid
Token efficiency | 80–97% reduction | 94% average
Interface | MCP server | CLI + MCP server

Different domain (Orbit indexes software repositories; we index SysML v2 models), same retrieval architecture. The benchmark observations translate directly to Orbit's design space. We think four of them are worth your team's attention.


Key Findings

Four observations from our benchmark that we think map directly onto GKG's design space. All are exploratory: single corpus, small sample sizes, and the statistical significance does not hold up when you account for running 14 tests at once. We present them as hypotheses worth testing on code, not as confirmed results.

O12 — Guided tool selection
GKG context: If Orbit agents underperform with the full tool set, the fix may be a sentence in the system prompt, not restricting tools.
One sentence of selection guidance eliminated a 13-point over-tooling penalty (0.887 vs 0.750). Strongest statistical signal in the study (p=0.009, large effect).

O8 — CLI search vs RAG by task type
GKG context: Validates the hybrid architecture. Neither retrieval strategy is universally better. Task type should inform which path.
CLI tool-based search outperformed RAG on discovery tasks by 29 points (p=0.021). RAG edged ahead on reasoning tasks by 14 points (not statistically significant).

O1 — Aggregate metrics hide task-level structure
GKG context: If GKG evaluates with a single accuracy number across task types, it is probably hiding the same structure we found. You have to look at each task type separately.
Overall tool comparison: no significant difference. But individual tasks ranged from −0.400 to +0.800, meaning tools helped enormously on some tasks and hurt on others.

O10 — Corpus scale collapses performance
GKG context: This is the regime GKG operates in. Real codebases are hundreds or thousands of files. Our small-corpus results may not transfer.
0.880 at 19 files → 0.423 at 95 files. In 55% of failures the agent ran out of turns before finishing, not because it could not find the right information.

Supporting: O4 (pre-rendered views outscored letting the agent assemble its own context on explanation tasks; largest measured effect but specific to explanation tasks and less directly relevant to code comprehension). Full details below.


The Larger Opportunity

Engineering artifact types that share the same retrieval problem:

Domain | Language
Systems | SysML v2
Embedded | AADL
Safety | OSCAL, GSN
Electronics | KiCad, SystemVerilog
Supply chain | CycloneDX

Software repositories are one slice of what engineers build and version. Physical systems are designed in interconnected artifact types: SysML v2 for systems architecture, AADL for embedded software-hardware binding, OSCAL for security controls, KiCad for electronics, CycloneDX for supply chain provenance. Each has structured relationships, qualified names, and cross-file reference semantics. Each has the same retrieval problem that motivated Orbit for code.

GitLab is already the system of record for source code, CI pipelines, issues, and merge requests. Engineers who build physical systems use those same workflows, but their design artifacts (SysML models, AADL specifications, safety cases) live outside the platform in specialized tools with no AI integration path. The tree-sitter grammar approach (curated corpus, iterative coverage measurement, structural query layer) is directly reusable for any language with a formal grammar. Each new grammar brings another engineering artifact type into the same knowledge graph that Orbit already provides for code.

The practical vision: a GitLab instance where a systems engineer pushes a SysML v2 model, a safety engineer pushes an OSCAL control set, and a hardware engineer pushes a KiCad schematic, and all three artifacts are indexed into the same knowledge graph that already understands the C++ and Python in the repo. An agent working in that environment could answer questions that span domains: "if I change the mass budget on this part, what requirements does it affect, and which tests need to rerun?" That is not possible today because the engineering artifacts are opaque to the platform. Tooling that makes them legible, to both humans browsing a merge request and agents reasoning about a change, reduces friction for the people doing the work and increases the leverage of GitLab as the place where complex systems get built.

SysML v2 is the natural starting point because it explicitly models interfaces between Requirements, Functional, Logical, and Physical domains (RFLP). It is the connective tissue between the other artifact types. A knowledge graph that understands SysML v2 structure has a schema for connecting everything else.


Composable Systems
Method | Tokens/task | Score
CLI search | 44,024 | 0.880
MCP both | 55,677 | 0.859

Same tools, same model, same tasks. CLI used 21% fewer tokens.

We spent the first two weeks of this project building an MCP server. That was the obvious approach: the protocol is well-supported, the ecosystem is growing, and it felt like the right abstraction. Then we talked to Angelo and pivoted to a CLI tool within days. It matched what we were already seeing in our own LLM experimentation: composable CLI commands, piped together in a shell, consistently outperformed the MCP transport in practice. The MCP server is still there (built into the CLI binary), but the CLI is the primary interface and the one we benchmark against.

The benchmark confirms the intuition. With MCP (JSON-RPC transport) and CLI (direct function calls) delivering the same tools to the same model on the same tasks, the CLI approach used 21% fewer tokens per task while scoring slightly higher (0.880 vs 0.859 on discovery). This is a single comparison on a single corpus and should not be over-interpreted. But it raises a question worth investigating: whether lightweight, composable tool interfaces are more token-efficient than heavier protocol abstractions for structured retrieval tasks.

We are curious whether embedding structured tool access inside an existing developer workflow tool like glab could compound these efficiency gains. A CLI that inherits GitLab as a system of record, with issue state, merge request context, and review history already available, would not need to retrieve that context through tool calls at all. The model would start each interaction with relevant project state already in scope.


What's Next

Releasing

All three repos are live at gitlab.com/nomograph. MIT licenses, full documentation, reproducible experiment instructions. nomograph.ai carries project pages and benchmark results.

Iterating

Now that this work is public, we have a place to do it in the open. Sharing these repos was a major milestone for the initiative. We have a long list of enhancements: better scoring, more task categories, additional corpora, and deeper analysis of the observations that showed the largest effects.

Publishing

Two arXiv preprints in preparation. The first argues that representation matters more than retrieval (O4, O12 as the designated primary hypothesis pair). The second argues that aggregate benchmarks hide task-level structure (O1, O10, O8). Both frame the current study as exploratory with a confirmatory design for follow-up on a second corpus.

The immediate next research direction is the O10 scaling problem: the unsolved challenge for any system operating on real-sized repositories. Collaboration welcome.


Observation Details

Expand any observation for the full analysis. The p-values below are from individual tests and have not been corrected for running 14 tests simultaneously. When corrected, none remain significant across the full set. O4 and O12 remain significant if treated as the only two hypotheses under test, which is how we plan to structure the confirmatory follow-up.

O12 — Context engineering outperforms tool restriction

The naive response to "too many tools hurt performance" is to restrict the tool set. The better response is a sentence in the system prompt. When agents are instructed to start with search and read_file, escalating to graph tools only when search is insufficient, the 13-point discovery penalty from over-tooling disappears entirely. Performance with 6 tools matches and marginally exceeds the 2-tool baseline (0.887 vs 0.880).
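The intervention really is that small. A hypothetical sketch of how the guided condition could differ from the baseline (the exact benchmark wording is ours, paraphrasing the description above):

```python
# Paraphrase of the guidance described above; the exact benchmark
# wording is an assumption.
TOOL_GUIDANCE = (
    "Start with the search and read_file tools. Escalate to the graph "
    "tools only when search does not return what you need."
)

def build_system_prompt(base: str, guided: bool = True) -> str:
    # the guided condition differs from the baseline by one sentence
    return f"{base}\n\n{TOOL_GUIDANCE}" if guided else base
```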

Guided render: paired t p=0.009 (uncorrected), d=0.75, N=16 tasks. The affected tasks (D11, D12, D16, D6) are those where unguided agents select structurally complex tools for attribute-lookup tasks that search handles trivially. Power: 0.80 (the only adequately powered observation in the study).

GKG implication: Tool selection guidance is the highest-leverage intervention we measured. If Orbit agents have access to graph traversal, vector search, and keyword search simultaneously, a system prompt that tells the agent when to use which tool may matter more than the tools themselves.

O8 — Retrieval strategy interacts with task type

CLI knowledge graph dominated structured lookup (+29pp over RAG, paired t p=0.021 uncorrected, d=0.64, N=16 tasks). RAG edged ahead on cross-file reasoning (+14pp, p=0.403, not significant), likely because it injects all relevant context at once, avoiding the problem where the agent runs out of turns before it can chain together enough tool calls to answer multi-step questions.

The CLI advantage on discovery is driven by 5 tasks where RAG scores 0.000: tasks requiring iterative tool-mediated retrieval that single-shot context injection cannot perform.

GKG implication: Neither retrieval architecture is universally better. This validates a hybrid approach. The question for GKG is whether task classification can be done cheaply enough to route queries to the right retrieval strategy at runtime.
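One cheap classification candidate is a lexical router. The cue list below is a hypothetical illustration, not something we benchmarked; a production router would be tuned on labeled queries.

```python
# Hypothetical cue phrases suggesting a structured-lookup (discovery) query.
DISCOVERY_CUES = ("list", "find", "which parts", "how many", "name of")

def route(query: str) -> str:
    """Cheap lexical routing between retrieval strategies."""
    q = query.lower()
    if any(cue in q for cue in DISCOVERY_CUES):
        return "cli_search"  # structured lookup: CLI led by 29 points here
    return "rag"             # cross-file reasoning: RAG edged ahead
```

Even a heuristic this crude costs no extra model call, which is what makes runtime routing plausible to test.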

O1 — Tool-task interaction is heterogeneous

Graph tools hurt discovery tasks, help layer tasks, and are near-neutral on reasoning. The aggregate difference is not statistically significant (paired t-test p=0.391, N=16) because the effect is task-dependent: graph tools help on tasks requiring structural completeness checking (D10, D13: +0.300 to +0.400) and hurt on tasks where search retrieves the answer directly (D11, D6: −0.600 to −0.800).

The pattern holds across all four models tested, making it one of the most robust qualitative observations in the benchmark despite the null aggregate test.

GKG implication: If GKG evaluates Orbit with aggregate accuracy across task types, it is probably hiding this same structure. Per-task analysis with paired effect sizes is necessary to surface real patterns. The aggregate null is not "no effect." It is "large effects in both directions, hidden by averaging."

O10 — Corpus scale is the dominant difficulty factor

Performance roughly halves from 19 to 95 files (0.880 → 0.423). Graph tools and vector search make things worse at scale: schema overhead and retrieval noise compound without compensating signal. The bottleneck is not retrieval quality; it is reasoning depth and turn budget. 11 of 20 scaling tasks fall below 0.333. The distribution is bimodal: easy tasks remain easy, hard tasks become impossible.

Failure modes at 95 files: 55% budget exhaustion, 27% reasoning errors, 0% search failure.

GKG implication: This is the regime GKG operates in. Real codebases are hundreds or thousands of files. Small-corpus benchmarks produce optimistic estimates that do not transfer. The failure mode is reasoning depth, not retrieval, which suggests the path forward may be better orchestration and representation, not better search.

O4 — Pre-rendered views outperform agentic assembly

On explanation tasks, pre-rendered model views scored 0.873 vs 0.490 for agentic assembly, a 38pp gap (Wilcoxon p=0.047 uncorrected, d=0.83, N=8 tasks). Two tasks collapsed entirely without rendering: 1.000 with a pre-rendered view, 0.000 with agentic assembly. The advantage is explanation-specific. On discovery tasks, pre-rendering scored 0.719, worse than search (0.880).

GKG implication: This observation is specific to explanation tasks on SysML models and may not transfer directly to code comprehension. But the principle (pre-computation at index time outperforms traversal capability at query time) is worth testing. If Orbit pre-renders dependency summaries, call graphs, or module overviews at index time, it may outperform giving the agent traversal tools to assemble the same information at query time.


Limitations

N=3–5 from a single researcher. No independent replication yet.

Largest corpus tested: 95 files. Production repos: thousands.

Nights and weekends. This work was done outside of working hours alongside a full-time graduate capstone and two young children. We are genuine believers in context-driven development: spec-first workflows with AI assistance let us do more rigorous exploration in less time than we have ever managed before. We are sharing this because we are excited about the problem area and want to work on it seriously. We would have loved more time to strengthen the statistics and expand the corpus. The benchmark methodology is sound; the sample sizes reflect the constraints of a solo researcher, not a lack of rigor. Everything here already exceeds the scope of the original graduate capstone that started it.

Single domain and corpus. All results are from SysML v2 models. The Eve corpus is purpose-built; production models from defense or industrial programs may differ. Whether patterns generalize to other engineering languages (AADL, OSCAL, SystemVerilog) or to code is untested.

Exploratory design. This study had no pre-registered hypotheses (we did not declare what we expected to find before running the experiments). 14 observations were tested; when we correct for running that many tests simultaneously, none of the results remain statistically significant. O4 and O12 survive correction if we treat them as the only two hypotheses being tested, but that designation was made after seeing the data. A properly powered confirmatory study on a second corpus is the next step.
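The correction logic is simple enough to sketch. Below is a Holm step-down correction in plain Python, applied to the only p-values reported in this document; it reproduces both claims: with 14 tests even p=0.009 fails the first threshold, while O12 and O4 alone both survive.

```python
def holm_reject(pvals, alpha=0.05):
    # Holm step-down: compare the i-th smallest p-value to alpha / (m - i)
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# With 14 tests, even the strongest result (p=0.009) misses the first
# Holm threshold of 0.05 / 14 ≈ 0.0036, so nothing survives.
print(0.009 <= 0.05 / 14)           # False

# Treating O12 (p=0.009) and O4 (p=0.047) as the only two hypotheses,
# both survive: 0.009 <= 0.025, then 0.047 <= 0.05.
print(holm_reject([0.009, 0.047]))  # [True, True]
```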

Scale ceiling. O10 shows the scaling problem is real; behavior at thousands of files is unknown. The failure mode is that the agent runs out of turns before it finishes reasoning, not that it fails to find the right information. That suggests the path forward is better orchestration, not better search.

Circular validation. Task selection, ground truth annotation, and scoring rubrics were authored by the same researcher who built the tools under evaluation. No independent annotators reviewed ground truth labels. Independent replication is needed before treating these results as externally valid.


Statistical Context

Power analysis: how many tasks we would need to detect each effect reliably (80% of the time).

Only one observation (O12, d=0.75, N=16 tasks) has enough statistical power to reliably detect its effect. O8 (d=0.64, power=0.70) and O4 (d=0.83, power=0.53) are close. Everything else would need substantially more tasks to confirm. The power analysis itself is useful: it tells us exactly how large a follow-up study needs to be, which makes the confirmatory work designable rather than speculative.

Observation | Effect size (d) | Current tasks | Power | Tasks needed for 80%
O12 (guided render) | 0.75 | 16 | 0.80 | 17
O8 (CLI vs RAG, discovery) | 0.64 | 16 | 0.70 | 20
O4 (render vs assembly) | 0.83 | 8 | 0.53 | 14
O1 (heterogeneity) | 0.22 | 16 | 0.13 | 163
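The sample sizes in the table above can be approximated with the standard normal-approximation power formula, n = ((z_{1-α/2} + z_{power}) / d)², using only the Python standard library. This is a sketch of the calculation, not the harness's code; the normal approximation slightly underestimates small-N requirements versus an exact noncentral-t calculation, which accounts for the small gaps on O12 and O4.

```python
from math import ceil
from statistics import NormalDist

def tasks_needed(d, alpha=0.05, power=0.80):
    """Approximate N for a paired test to detect effect size d."""
    z = NormalDist()
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2
    return ceil(n)

print(tasks_needed(0.64))  # 20, matching the O8 row
print(tasks_needed(0.22))  # 163, matching the O1 row
```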

We should also be transparent about compute constraints. Our professional role at GitLab is in revenue, not engineering, and we have been cautious about token consumption through Duo, not wanting to be an outlier user of a resource we are not directly building. That has limited how many experiments we run and how deeply we iterate on the benchmark. We would like to spend more time and more tokens looking deeper: running the experiments that would strengthen these observations, and exploring more formal research approaches, such as model explanations and LangGraph-based analysis, to understand why these effects occur, not just that they occur.