SysML v2 as a Composable Knowledge Graph

Andrew Dunn  ·  Nomograph Labs  ·  March 2026

The Short Version

Three artifacts, MIT-licensed:

tree-sitter-sysml
Parser. 192 tests. 89% external file coverage.

sysml
CLI tool (MCP server built in).
14 commands, 10 MCP tools.
9-signal hybrid index.

sysml-bench
132 tasks, 4 models,
40+ conditions, N=3–5.

We think we built the SysML v2 equivalent of what GKG is building for code: a tree-sitter parser, a knowledge graph, and a CLI tool with an MCP server built in. Then we ran a benchmark to measure what actually helps LLMs comprehend structured engineering models.

The short answer: how you present information to the model matters more than how you retrieve it. Most of the energy in the tool-augmented LLM space right now is going into retrieval infrastructure: vector databases, graph traversal, multi-hop reasoning chains. On our benchmark, none of those interventions moved the needle. What did move it was representation: pre-rendered views, tool selection guidance, and letting the agent retrieve information in the right form, step by step, through sequential tool calls.

This is an exploratory study on a single corpus. We are sharing it because the architectural parallel to GKG is direct, the findings are relevant to decisions your team is making now, and we would genuinely like to work on this problem alongside you.


What We Built

tree-sitter-sysml — Tree-sitter grammar for SysML v2 (OMG-adopted June 2025). Built by curating a corpus of real-world SysML v2 models and iterating the grammar against it: run the parser, find failures, fix the grammar, repeat. 89% coverage on external files. 192 tests passing. Bindings for Rust, C, Node.js, Python, Go, and Swift. MIT.

sysml — Rust CLI tool with an MCP server built in. Indexes .sysml repositories into a persistent knowledge graph and exposes it through 14 CLI commands and 10 MCP tools. 9-signal hybrid index: keyword scoring (8 signals including exact match, prefix, containment, vocabulary expansion, relationship adjacency) plus fastembed all-MiniLM-L6-v2 vector search (384-dim, HNSW) and 27 SysML v2 structural relationship types. 94% average token reduction vs raw file injection. 123 tests.
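To make the hybrid index concrete, here is a minimal Python sketch of blending keyword and vector signals. The signal names, weights, and the blending parameter are illustrative assumptions; the real index uses 8 keyword signals (including vocabulary expansion and relationship adjacency) plus HNSW vector search, all omitted or simplified here.

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative weights; the real index scores 8 keyword signals.
WEIGHTS = {"exact": 3.0, "prefix": 1.5, "contains": 1.0}

def keyword_score(query, name):
    q, n = query.lower(), name.lower()
    score = 0.0
    if q == n:
        score += WEIGHTS["exact"]
    if n.startswith(q):
        score += WEIGHTS["prefix"]
    if q in n:
        score += WEIGHTS["contains"]
    return score

def hybrid_score(query, element, query_vec, alpha=0.5):
    # blend keyword and vector signals; alpha is a tuning assumption
    kw = keyword_score(query, element["name"])
    vec = cosine(query_vec, element["embedding"])
    return alpha * kw + (1 - alpha) * vec
```

The design point is that keyword signals dominate when the query names an element directly, while the vector term recovers semantically related elements when no lexical signal fires.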

sysml-bench — Python evaluation harness. 132 tasks across 8 categories: discovery, reasoning, explanation, layer, boundary, vector-sensitive, structural trace, and corpus scaling. Per-field structured scoring (Bool, Float, Str, ListStr F1 with threshold). Corpus: Eve Online Mining Frigate SysML v2 model, 19 files, 798 elements, 1,515 relationships.
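A minimal sketch of the per-field scoring scheme, assuming exact match for Bool, tolerance match for Float, normalized match for Str, and set-based F1 with a cutoff for ListStr. The exact normalization and tolerance rules here are our illustrative assumptions, not the harness's literal implementation.

```python
def list_f1(pred, gold):
    # set-based F1 between predicted and gold string lists
    p, g = set(pred), set(gold)
    if not p and not g:
        return 1.0
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def score_field(kind, pred, gold, f1_threshold=0.5, float_tol=1e-6):
    # Per-field scoring sketch; normalization rules are assumptions.
    if kind == "Bool":
        return 1.0 if pred == gold else 0.0
    if kind == "Float":
        return 1.0 if abs(pred - gold) <= float_tol else 0.0
    if kind == "Str":
        return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    if kind == "ListStr":
        score = list_f1(pred, gold)
        return score if score >= f1_threshold else 0.0
    raise ValueError(f"unknown field kind: {kind}")
```

Thresholded F1 on list fields keeps partial credit for mostly-right answers while zeroing out answers that only graze the gold set.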

Architectural parallel with GKG (Orbit)

Layer | GitLab GKG | Nomograph SysML
Parsing | Rust + tree-sitter | Rust + tree-sitter-sysml
Data model | KuzuDB (multi-hop) | 27-type SysML index
Retrieval | Hybrid (TERAG) | 9-signal hybrid
Token efficiency | 80–97% reduction | 94% average
Interface | MCP server | CLI + MCP server

Different domain (Orbit indexes software repositories; we index SysML v2 models), same retrieval architecture. The benchmark observations translate directly to Orbit's design space. We think four of them are worth your team's attention.


Key Findings

Four observations from our benchmark that we think map directly onto GKG's design space. All are exploratory: single corpus, small sample sizes, and the statistical significance does not hold up when you account for running 14 tests at once. We present them as hypotheses worth testing on code, not as confirmed results.

O12 — Guided tool selection
GKG context: If Orbit agents underperform with the full tool set, the fix may be a sentence in the system prompt, not restricting tools.
One sentence of selection guidance eliminated a 13-point over-tooling penalty (0.887 vs 0.750). Strongest statistical signal in the study (p=0.009, large effect).

O8 — CLI search vs RAG by task type
GKG context: Validates the hybrid architecture. Neither retrieval strategy is universally better. Task type should inform which path.
CLI tool-based search outperformed RAG on discovery tasks by 29 points (p=0.021). RAG edged ahead on reasoning tasks by 14 points (not statistically significant).

O1 — Aggregate metrics hide task-level structure
GKG context: If GKG evaluates with a single accuracy number across task types, it is probably hiding the same structure we found. You have to look at each task type separately.
Overall tool comparison: no significant difference. But individual tasks ranged from −0.400 to +0.800, meaning tools helped enormously on some tasks and hurt on others.

O10 — Corpus scale collapses performance
GKG context: This is the regime GKG operates in. Real codebases are hundreds or thousands of files. Our small-corpus results may not transfer.
0.880 at 19 files → 0.423 at 95 files. In 55% of failures the agent ran out of turns before finishing, not because it could not find the right information.

Supporting: O4 (pre-rendered views outscored letting the agent assemble its own context on explanation tasks; largest measured effect but specific to explanation tasks and less directly relevant to code comprehension). Full details below.


The Larger Opportunity

Engineering artifact types that share the same retrieval problem:

Domain | Language
Systems | SysML v2
Embedded | AADL
Safety | OSCAL, GSN
Electronics | KiCad, SystemVerilog
Supply chain | CycloneDX

Software repositories are one slice of what engineers build and version. Physical systems are designed in interconnected artifact types: SysML v2 for systems architecture, AADL for embedded software-hardware binding, OSCAL for security controls, KiCad for electronics, CycloneDX for supply chain provenance. Each has structured relationships, qualified names, and cross-file reference semantics. Each has the same retrieval problem that motivated Orbit for code.

GitLab is already the system of record for source code, CI pipelines, issues, and merge requests. Engineers who build physical systems use those same workflows, but their design artifacts (SysML models, AADL specifications, safety cases) live outside the platform in specialized tools with no AI integration path. The tree-sitter grammar approach (curated corpus, iterative coverage measurement, structural query layer) is directly reusable for any language with a formal grammar. Each new grammar brings another engineering artifact type into the same knowledge graph that Orbit already provides for code.

The practical vision: a GitLab instance where a systems engineer pushes a SysML v2 model, a safety engineer pushes an OSCAL control set, and a hardware engineer pushes a KiCad schematic, and all three artifacts are indexed into the same knowledge graph that already understands the C++ and Python in the repo. An agent working in that environment could answer questions that span domains: "if I change the mass budget on this part, what requirements does it affect, and which tests need to rerun?" That is not possible today because the engineering artifacts are opaque to the platform. Tooling that makes them legible, to both humans browsing a merge request and agents reasoning about a change, reduces friction for the people doing the work and increases the leverage of GitLab as the place where complex systems get built.

SysML v2 is the natural starting point because it explicitly models interfaces between Requirements, Functional, Logical, and Physical domains (RFLP). It is the connective tissue between the other artifact types. A knowledge graph that understands SysML v2 structure has a schema for connecting everything else.


Composable Systems
Method | Tokens/task | Score
CLI search | 44,024 | 0.880
MCP both | 55,677 | 0.859

Same tools, same model, same tasks. CLI used 21% fewer tokens.

We spent the first two weeks of this project building an MCP server. That was the obvious approach: the protocol is well-supported, the ecosystem is growing, and it felt like the right abstraction. Then we talked to Angelo and pivoted to a CLI tool within days. It matched what we were already seeing in our own LLM experimentation: composable CLI commands, piped together in a shell, consistently outperformed the MCP transport in practice. The MCP server is still there (built into the CLI binary), but the CLI is the primary interface and the one we benchmark against.

The benchmark confirms the intuition. With MCP (JSON-RPC transport) and CLI (direct function calls) delivering the same tools to the same model on the same tasks, the CLI approach used 21% fewer tokens per task while scoring slightly higher (0.880 vs 0.859 on discovery). This is a single comparison on a single corpus and should not be over-interpreted. But it raises a question worth investigating: whether lightweight, composable tool interfaces are more token-efficient than heavier protocol abstractions for structured retrieval tasks.

We are curious whether embedding structured tool access inside an existing developer workflow tool like glab could compound these efficiency gains. A CLI that inherits GitLab as a system of record, with issue state, merge request context, and review history already available, would not need to retrieve that context through tool calls at all. The model would start each interaction with relevant project state already in scope.


What's Next

Releasing

All three repos are live at gitlab.com/nomograph. MIT licenses, full documentation, reproducible experiment instructions. nomograph.ai carries project pages and benchmark results.

Iterating

Now that this work is public, we have a place to do it in the open. Sharing these repos was a major milestone for the initiative. We have a long list of enhancements: better scoring, more task categories, additional corpora, and deeper analysis of the observations that showed the largest effects.

Publishing

Two arXiv preprints in preparation. The first argues that representation matters more than retrieval (O4, O12 as the designated primary hypothesis pair). The second argues that aggregate benchmarks hide task-level structure (O1, O10, O8). Both frame the current study as exploratory with a confirmatory design for follow-up on a second corpus.

The immediate next research direction is the O10 scaling problem: the unsolved challenge for any system operating on real-sized repositories. Collaboration welcome.


Observation Details

Expand any observation for the full analysis. The p-values below are from individual tests and have not been corrected for running 14 tests simultaneously. When corrected, none remain significant across the full set. O4 and O12 remain significant if treated as the only two hypotheses under test, which is how we plan to structure the confirmatory follow-up.

O12 — Context engineering outperforms tool restriction

The naive response to "too many tools hurt performance" is to restrict the tool set. The better response is a sentence in the system prompt. When agents are instructed to start with search and read_file, escalating to graph tools only when search is insufficient, the 13-point discovery penalty from over-tooling disappears entirely. Performance with 6 tools matches and marginally exceeds the 2-tool baseline (0.887 vs 0.880).
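The intervention really is that small. A hypothetical sketch of how the guided condition could differ from the baseline (the exact benchmark wording is ours, paraphrasing the description above):

```python
# Paraphrase of the guidance described above; the exact benchmark
# wording is an assumption.
TOOL_GUIDANCE = (
    "Start with the search and read_file tools. Escalate to the graph "
    "tools only when search does not return what you need."
)

def build_system_prompt(base: str, guided: bool = True) -> str:
    # the guided condition differs from the baseline by one sentence
    return f"{base}\n\n{TOOL_GUIDANCE}" if guided else base
```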

Guided render: paired t p=0.009 (uncorrected), d=0.75, N=16 tasks. The affected tasks (D11, D12, D16, D6) are those where unguided agents select structurally complex tools for attribute-lookup tasks that search handles trivially. Power: 0.80 (the only adequately powered observation in the study).

GKG implication: Tool selection guidance is the highest-leverage intervention we measured. If Orbit agents have access to graph traversal, vector search, and keyword search simultaneously, a system prompt that tells the agent when to use which tool may matter more than the tools themselves.

O8 — Retrieval strategy interacts with task type

CLI knowledge graph dominated structured lookup (+29pp over RAG, paired t p=0.021 uncorrected, d=0.64, N=16 tasks). RAG edged ahead on cross-file reasoning (+14pp, p=0.403, not significant), likely because it injects all relevant context at once, avoiding the problem where the agent runs out of turns before it can chain together enough tool calls to answer multi-step questions.

The CLI advantage on discovery is driven by 5 tasks where RAG scores 0.000: tasks requiring iterative tool-mediated retrieval that single-shot context injection cannot perform.

GKG implication: Neither retrieval architecture is universally better. This validates a hybrid approach. The question for GKG is whether task classification can be done cheaply enough to route queries to the right retrieval strategy at runtime.
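One cheap classification candidate is a lexical router. The cue list below is a hypothetical illustration, not something we benchmarked; a production router would be tuned on labeled queries.

```python
# Hypothetical cue phrases suggesting a structured-lookup (discovery) query.
DISCOVERY_CUES = ("list", "find", "which parts", "how many", "name of")

def route(query: str) -> str:
    """Cheap lexical routing between retrieval strategies."""
    q = query.lower()
    if any(cue in q for cue in DISCOVERY_CUES):
        return "cli_search"  # structured lookup: CLI led by 29 points here
    return "rag"             # cross-file reasoning: RAG edged ahead
```

Even a heuristic this crude costs no extra model call, which is what makes runtime routing plausible to test.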

O1 — Tool-task interaction is heterogeneous

Graph tools hurt discovery tasks, help layer tasks, and are near-neutral on reasoning. The aggregate difference is not statistically significant (paired t-test p=0.391, N=16) because the effect is task-dependent: graph tools help on tasks requiring structural completeness checking (D10, D13: +0.300 to +0.400) and hurt on tasks where search retrieves the answer directly (D11, D6: −0.600 to −0.800).

The pattern holds across all four models tested, making it one of the most robust qualitative observations in the benchmark despite the null aggregate test.

GKG implication: If GKG evaluates Orbit with aggregate accuracy across task types, it is probably hiding this same structure. Per-task analysis with paired effect sizes is necessary to surface real patterns. The aggregate null is not "no effect." It is "large effects in both directions, hidden by averaging."

O10 — Corpus scale is the dominant difficulty factor

Performance roughly halves from 19 to 95 files (0.880 → 0.423). Graph tools and vector search make things worse at scale: schema overhead and retrieval noise compound without compensating signal. The bottleneck is not retrieval quality; it is reasoning depth and turn budget. 11 of 20 scaling tasks fall below 0.333. The distribution is bimodal: easy tasks remain easy, hard tasks become impossible.

Failure modes at 95 files: 55% budget exhaustion, 27% reasoning errors, 0% search failure.

GKG implication: This is the regime GKG operates in. Real codebases are hundreds or thousands of files. Small-corpus benchmarks produce optimistic estimates that do not transfer. The failure mode is reasoning depth, not retrieval, which suggests the path forward may be better orchestration and representation, not better search.

O4 — Pre-rendered views outperform agentic assembly

On explanation tasks, pre-rendered model views scored 0.873 vs 0.490 for agentic assembly, a 38pp gap (Wilcoxon p=0.047 uncorrected, d=0.83, N=8 tasks). Two tasks collapsed entirely without rendering: 1.000 with a pre-rendered view, 0.000 with agentic assembly. The advantage is explanation-specific. On discovery tasks, pre-rendering scored 0.719, worse than search (0.880).

GKG implication: This observation is specific to explanation tasks on SysML models and may not transfer directly to code comprehension. But the principle (pre-computation at index time outperforms traversal capability at query time) is worth testing. If Orbit pre-renders dependency summaries, call graphs, or module overviews at index time, it may outperform giving the agent traversal tools to assemble the same information at query time.


Limitations

N=3–5 from a single researcher. No independent replication yet.

Largest corpus tested: 95 files. Production repos: thousands.

Nights and weekends. This work was done outside of working hours alongside a full-time graduate capstone and two young children. We are genuine believers in context-driven development: spec-first workflows with AI assistance let us do more rigorous exploration in less time than we have ever managed before. We are sharing this because we are excited about the problem area and want to work on it seriously. We would have loved more time to strengthen the statistics and expand the corpus. The benchmark methodology is sound; the sample sizes reflect the constraints of a solo researcher, not a lack of rigor. Everything here already exceeds the scope of the original graduate capstone that started it.

Single domain and corpus. All results are from SysML v2 models. The Eve corpus is purpose-built; production models from defense or industrial programs may differ. Whether patterns generalize to other engineering languages (AADL, OSCAL, SystemVerilog) or to code is untested.

Exploratory design. This study had no pre-registered hypotheses (we did not declare what we expected to find before running the experiments). 14 observations were tested; when we correct for running that many tests simultaneously, none of the results remain statistically significant. O4 and O12 survive correction if we treat them as the only two hypotheses being tested, but that designation was made after seeing the data. A properly powered confirmatory study on a second corpus is the next step.
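The correction logic is simple enough to sketch. Below is a Holm step-down correction in plain Python, applied to the only p-values reported in this document; it reproduces both claims: with 14 tests even p=0.009 fails the first threshold, while O12 and O4 alone both survive.

```python
def holm_reject(pvals, alpha=0.05):
    # Holm step-down: compare the i-th smallest p-value to alpha / (m - i)
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# With 14 tests, even the strongest result (p=0.009) misses the first
# Holm threshold of 0.05 / 14 ≈ 0.0036, so nothing survives.
print(0.009 <= 0.05 / 14)           # False

# Treating O12 (p=0.009) and O4 (p=0.047) as the only two hypotheses,
# both survive: 0.009 <= 0.025, then 0.047 <= 0.05.
print(holm_reject([0.009, 0.047]))  # [True, True]
```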

Scale ceiling. O10 shows the scaling problem is real; behavior at thousands of files is unknown. The failure mode is that the agent runs out of turns before it finishes reasoning, not that it fails to find the right information. That suggests the path forward is better orchestration, not better search.

Circular validation. Task selection, ground truth annotation, and scoring rubrics were authored by the same researcher who built the tools under evaluation. No independent annotators reviewed ground truth labels. Independent replication is needed before treating these results as externally valid.


Statistical Context

Power analysis: how many tasks we would need to detect each effect reliably (80% of the time).

Only one observation (O12, d=0.75, N=16 tasks) has enough statistical power to reliably detect its effect. O8 (d=0.64, power=0.70) and O4 (d=0.83, power=0.53) are close. Everything else would need substantially more tasks to confirm. The power analysis itself is useful: it tells us exactly how large a follow-up study needs to be, which makes the confirmatory work designable rather than speculative.

Observation | Effect size (d) | Current tasks | Power | Tasks needed for 80%
O12 (guided render) | 0.75 | 16 | 0.80 | 17
O8 (CLI vs RAG, discovery) | 0.64 | 16 | 0.70 | 20
O4 (render vs assembly) | 0.83 | 8 | 0.53 | 14
O1 (heterogeneity) | 0.22 | 16 | 0.13 | 163
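The sample sizes in the table above can be approximated with the standard normal-approximation power formula, n = ((z_{1-α/2} + z_{power}) / d)², using only the Python standard library. This is a sketch of the calculation, not the harness's code; the normal approximation slightly underestimates small-N requirements versus an exact noncentral-t calculation, which accounts for the small gaps on O12 and O4.

```python
from math import ceil
from statistics import NormalDist

def tasks_needed(d, alpha=0.05, power=0.80):
    """Approximate N for a paired test to detect effect size d."""
    z = NormalDist()
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2
    return ceil(n)

print(tasks_needed(0.64))  # 20, matching the O8 row
print(tasks_needed(0.22))  # 163, matching the O1 row
```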

We should also be transparent about compute constraints. Our professional role at GitLab is in revenue, not engineering, and we have been cautious about token consumption through Duo, not wanting to be an outlier user of a resource we are not directly building. That has limited how many experiments we run and how deeply we iterate on the benchmark. We would like to spend more time and more tokens looking deeper: running the experiments that would strengthen these observations, and exploring more formal research approaches, such as model explanations and LangGraph-based analysis, to understand why these effects occur, not just that they occur.