Structural AI: A Network-Based Architecture Beyond Sequence Transformers

Zhiqi Liu

Pacific School of Religion & Independent Researcher, Bay Area, California

May 2026

Abstract

Transformer-based Large Language Models (LLMs) have achieved remarkable empirical success yet remain constrained by a token-centric, sequential modeling paradigm. The quadratic complexity of self-attention limits global relational modeling, and vector-similarity representations conflate structural role with embedding proximity. This paper introduces Structural AI (StrucAI), an architecture that formalizes meaning as an emergent property of directed relational graphs rather than positional vector similarities. StrucAI comprises three components: a Structural Representation Layer (SRL) that converts input data into a dynamic knowledge graph via lightweight relational inference; a Structural Reconstructor (SR) trained via Topology Reconstruction Loss (TRL) to perform topological infilling; and an Emergence Engine (EE) that executes multi-hop reasoning through graph traversal and path integration. We present a formal specification of each component, a tractable instantiation using Message Passing Neural Networks (MPNNs) and Graph Transformers as backbone encoders, and a Minimal Viable Prototype (MVP) evaluated on multi-hop question answering (HotpotQA) and knowledge graph completion (FB15k-237). Preliminary results indicate sublinear reasoning-time scaling and improved structural generalization over transformer baselines. We position StrucAI as a complementary "reasoning head" for existing LLMs rather than a wholesale replacement, and discuss open challenges including dynamic graph construction and scalable topological regularization.

1  Introduction

The Transformer architecture [1] has become the dominant paradigm in natural language processing, computer vision, and scientific computing. By scaling model parameters and training data, successive generations of models—GPT-4, Gemini, and Claude—have demonstrated surprising emergent capabilities. Yet two structural limitations persist regardless of scale.

First, the self-attention mechanism computes pairwise relationships in O(n²) time and memory with respect to sequence length n, making the explicit modeling of long-range global structure computationally prohibitive. Sparse and linear-complexity attention variants [2, 3] alleviate but do not eliminate this bottleneck. Second, Transformers represent meaning through dense vector embeddings whose geometry is governed by cosine similarity. This conflates distributional co-occurrence with structural relational role: two concepts may occupy similar embedding regions while playing entirely different roles in a causal or hierarchical knowledge structure [4].

These limitations have motivated a growing body of work on graph-structured learning. Graph Neural Networks (GNNs) [5], Graph Transformers [6, 7], and hybrid LLM-graph architectures [8, 9] have demonstrated that relational topology is a powerful inductive bias for reasoning tasks. However, existing graph-augmented approaches typically treat the graph as a fixed, externally provided scaffold rather than as a dynamically constructed representation learned from raw input.

This paper introduces Structural AI (StrucAI), an architecture in which the construction, refinement, and traversal of a relational graph are first-class computational operations. The core claim is that meaning is better formalized as an emergent property of informational network topology than as a position in a vector space. We do not propose to discard Transformers; rather, we argue that equipping language models with a structural reasoning module enables qualitatively different capabilities in multi-hop inference, causal attribution, and compositional generalization.

Our contributions are as follows:

•       A formal specification of three architectural components—SRL, SR, and EE—with concrete, differentiable instantiations based on MPNNs and Graph Transformers.

•       A novel training objective, Topology Reconstruction Loss (TRL), that supervises graph structure rather than token sequences.

•       An MVP evaluation on HotpotQA [10] and FB15k-237 [11] demonstrating improved reasoning accuracy and interpretability relative to transformer baselines.

•       A discussion of the complementary deployment scenario in which StrucAI serves as a reasoning head supervising the outputs of an autoregressive LLM.

 

2  Related Work

2.1  Graph Neural Networks and Relational Reasoning

Graph Neural Networks [5] generalize convolutions to irregular graph-structured data through iterative neighborhood aggregation. Gilmer et al. [12] unified diverse GNN variants under the Message Passing Neural Network (MPNN) framework, in which node representations are updated by aggregating messages from their local neighborhoods. While MPNNs are powerful for local structural pattern recognition, they are limited by their fixed receptive fields and struggle with long-range dependencies [13].

Neural Algorithmic Reasoning [14] demonstrated that GNNs can learn to simulate classical algorithms—sorting, shortest paths, dynamic programming—when trained with appropriate structural supervision. This work grounds a key design choice in StrucAI: the SR module is trained to perform topological infilling analogous to the graph-completion tasks studied in this line of work.

2.2  Graph Transformers

Graph Transformers [6, 7, 15] augment transformer self-attention with graph structural information, either by modifying positional encodings to reflect graph distances or by restricting attention to graph neighborhoods. Dwivedi and Bresson [6] showed that Laplacian eigenvectors provide a principled positional encoding for graphs. Ying et al. [7] introduced the Graphormer, which encodes spatial relationships and edge features directly into the attention bias, achieving state-of-the-art results on molecular property prediction. Crucially, these architectures still treat the graph as a fixed input; StrucAI differs in that the SRL constructs the graph dynamically from raw input data.

2.3  LLM-Graph Integration

Recent work on GraphGPT [8] and Graph-LLM [9] explores how LLMs can reason over knowledge graphs, either by encoding graph-structured context into the LLM's prompt or by jointly training LLMs with GNN encoders. Pan et al. [16] provide a comprehensive survey of unifying LLMs and knowledge graphs, identifying three paradigms: KG-enhanced LLMs, LLM-augmented KG reasoning, and synergistic integration. StrucAI's EE module is most closely related to the third paradigm, but places structural graph reasoning as the primary computational substrate rather than as an auxiliary retrieval mechanism.

2.4  Knowledge Graph Completion and Multi-Hop QA

Knowledge graph completion [11, 17] and multi-hop question answering [10, 18] serve as our primary evaluation benchmarks because they jointly require the system to (a) infer missing relational edges and (b) trace multi-step reasoning paths. These tasks expose the limitations of purely parametric (i.e., non-structured) retrieval and provide clean metrics for structural generalization.

 

3  Theoretical Foundation

3.1  Formalizing Structural Meaning

Let a knowledge system be represented as a directed, typed graph G = (V, E, R), where V is a set of concept nodes, E ⊆ V × R × V is a set of typed edges, and R is a finite set of relation types. For a node v_i ∈ V, we define its structural meaning M(v_i) as a function of three complementary information sources:

M(v_i) = φ( N_k(v_i),  Pos(v_i, G),  Ω_d(v_i, G) )     (1)

where N_k(v_i) denotes the k-hop neighborhood of v_i; Pos(v_i, G) encodes global centrality through Personalized PageRank [19] or Laplacian eigenvector decomposition; and Ω_d(v_i, G) captures high-order structural motifs (e.g., causal chains, hierarchical trees, triadic closures) up to depth d via subgraph isomorphism counting. Crucially, φ is a learned, differentiable function instantiated by the SR module, not a fixed descriptor.

This formalization differs from standard GNN representations in that it makes the dependence on global positional encoding and motif structure explicit and separately parameterized, allowing the model to disentangle local semantic context from global structural role.
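As a minimal illustration of two of the three inputs to Equation (1), the sketch below computes the k-hop neighborhood N_k(v_i) by breadth-first search and approximates Pos(v_i, G) with Personalized PageRank by power iteration. The function names, the adjacency-dictionary format, the restart probability alpha = 0.15, and the handling of dangling nodes are assumptions made for illustration; the motif term Ω_d and the learned function φ are not covered.

```python
from collections import deque


def k_hop_neighborhood(adj, source, k):
    """Nodes reachable from `source` in at most k directed hops (source excluded)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:          # do not expand beyond depth k
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v in dist if v != source}


def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration for PageRank with all restart mass placed on `source`."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    p = {u: 0.0 for u in nodes}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            out = adj.get(u, ())
            if out:
                share = (1 - alpha) * p[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Assumption: a dangling node sends its mass back to the source.
                nxt[source] += (1 - alpha) * p[u]
        nxt[source] += alpha      # restart mass, so the scores keep summing to 1
        p = nxt
    return p


# Hypothetical toy graph: a directed chain a -> b -> c -> d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(k_hop_neighborhood(chain, "a", 2))
print(personalized_pagerank(chain, "a"))
```

On the toy chain, nodes closer to the source receive higher scores, which is the sense in which the score acts as a source-relative positional signal.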

3.2  Intelligence as Topological Emergence

We define a system's structural reasoning capability I in terms of its ability to reduce relational entropy while preserving connectivity:

I  ∝  ΔC(G) / ΔH(G)     (2)

where ΔC(G) = C(G_reconstructed) − C(G_partial) measures the increase in algebraic connectivity (the second smallest eigenvalue of the graph Laplacian, also known as the Fiedler value) achieved by the SR module when completing a partial graph, and ΔH(G) is the reduction in structural entropy as measured by the Von Neumann entropy of the normalized graph Laplacian [20]. A system with high I efficiently identifies the minimal set of relational edges that maximally increase global coherence.

Equation (2) is intended not as a universal definition of intelligence but as a task-specific optimization objective that formalizes the notion that meaningful reasoning reduces ambiguity in a relational structure. We discuss its limitations in Section 7.

 

4  Proposed Architecture

StrucAI replaces the stacked attention blocks of a standard Transformer with three distinct, sequentially coupled modules. Figure 1 (conceptual) illustrates the overall data flow.

4.1  Structural Representation Layer (SRL)

The SRL receives a raw input sequence X = (x_1, ..., x_n) and outputs a directed graph G_0 = (V_0, E_0). Node construction proceeds in two stages. In the first stage, a lightweight token encoder (e.g., a two-layer BERT-style encoder or a frozen LLM embedding layer) maps X to contextual representations H ∈ ℝ^{n×d}. In the second stage, a relation classifier f_θ: ℝ^{d} × ℝ^{d} → R ∪ {∅} is applied to all pairs (h_i, h_j) whose cosine similarity exceeds a learned threshold τ_θ. Pairs classified as ∅ are excluded from E_0.

The use of a threshold τ_θ ensures that the resulting graph is sparse. Empirically, we target an average degree of O(log n), which has been shown to be sufficient for efficient message passing while avoiding the dense-graph pathologies that afflict O(n²) attention [21]. The SRL is trained end-to-end with the SR via the TRL objective (Section 5).
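The two-stage construction can be sketched as follows, assuming the contextual representations H are already available as plain lists of floats. The names build_graph and toy_classifier are hypothetical, and toy_classifier is only a stand-in for the learned relation classifier f_θ. The exhaustive pair loop is itself quadratic in n; a practical implementation would presumably need an approximate nearest-neighbor search to apply the similarity gate.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_graph(H, classify, tau):
    """Return typed directed edges (i, r, j) for ordered pairs that pass the gate."""
    edges = []
    for i in range(len(H)):
        for j in range(len(H)):
            if i == j or cosine(H[i], H[j]) <= tau:
                continue                      # pair filtered out by the threshold
            r = classify(H[i], H[j])
            if r is not None:                 # None plays the role of the null relation
                edges.append((i, r, j))
    return edges


def toy_classifier(h_i, h_j):
    # Hypothetical stand-in for f_θ: emits one relation type or None.
    return "precedes" if h_i[0] > h_j[0] else None


# Hypothetical 2-dimensional representations for three tokens.
H = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(build_graph(H, toy_classifier, tau=0.8))
```

Only the first two vectors are similar enough to pass the gate, and the classifier keeps one of the two ordered pairs, so the resulting graph has a single directed edge.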

4.2  Structural Reconstructor (SR)

The SR is the primary learning module. It receives a partial graph g ⊂ G* and is trained to predict the missing edges required to recover the full target graph G*. This is the structural analogue of masked language modeling [22]: rather than predicting masked tokens, the SR predicts masked relational edges.

We instantiate the SR as a Graph Transformer [7] with K = 6 attention layers. Each layer performs:

h_i^(l+1)  =  h_i^(l)  +  Σ_{j∈N(v_i)}  α_{ij}^(l) · W^(l) h_j^(l)     (3)

where α_{ij}^(l) is the attention coefficient modulated by the spatial encoding of (v_i, v_j) following Ying et al. [7]. Edge predictions are produced by a bilinear decoder:

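p(r | v_i, v_j)  =  softmax_r( h_i^(K)ᵀ · W_r · h_j^(K) )     (4)

where the softmax is taken over r ∈ R ∪ {∅} and W_r ∈ ℝ^{d×d} is a learned matrix for relation type r.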
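The bilinear decoder of Equation (4) can be illustrated with plain Python lists. This is a sketch under assumptions: the relation names, the explicit "none" class standing in for ∅, and the hand-set matrices are hypothetical, and the softmax is taken over relation types for a fixed node pair.

```python
import math


def bilinear_scores(h_i, h_j, W):
    """W maps relation name -> d x d matrix; returns {r: h_i^T W_r h_j}."""
    d = len(h_i)
    scores = {}
    for r, W_r in W.items():
        Wh = [sum(W_r[a][b] * h_j[b] for b in range(d)) for a in range(d)]
        scores[r] = sum(h_i[a] * Wh[a] for a in range(d))
    return scores


def softmax(scores):
    """Numerically stable softmax over the values of a dict."""
    m = max(scores.values())
    exps = {r: math.exp(s - m) for r, s in scores.items()}
    z = sum(exps.values())
    return {r: e / z for r, e in exps.items()}


# Hypothetical final-layer node states and relation matrices (d = 2).
h_i, h_j = [1.0, 0.0], [0.0, 1.0]
W = {
    "causes":  [[0.0, 2.0], [0.0, 0.0]],
    "part_of": [[0.0, 0.0], [0.0, 0.0]],
    "none":    [[0.0, 0.0], [0.0, 0.0]],
}
probs = softmax(bilinear_scores(h_i, h_j, W))
print(max(probs, key=probs.get), probs)
```

Because each relation has its own matrix W_r, the score is not symmetric in (v_i, v_j), which is what allows the decoder to predict directed edges.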

4.3  Emergence Engine (EE)

The EE performs multi-hop reasoning by treating inference as a graph traversal problem. Given a query q = (v_s, r_q, ?) asking for the tail entity of a relation r_q starting from source node v_s, the EE executes a Neural Breadth-First Search (NBFS) [14]:

S^(t+1)  =  {v_j : ∃ v_i ∈ S^(t),  p(r_q | v_i, v_j) > θ}     (5)

where S^(0) = {v_s} and θ is a learned threshold. The traversal terminates when |S^(t)| = 1 for some t ≥ 1 (unique answer found; the check is skipped at t = 0, where S^(0) trivially contains one node) or when a maximum depth T_max is reached. In the latter case, the node with the highest aggregate path-integration score is returned.

Critically, the traversal path is fully explicit and auditable: each reasoning step corresponds to a specific edge activation in G, providing inherent interpretability that post-hoc attention visualization cannot guarantee [23].
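The traversal rule of Equation (5) can be sketched in a few lines. Several details below are assumptions, since the text does not fix them: edge probabilities are supplied as a precomputed dictionary, the path-integration score is taken to be the product of edge probabilities along the best path, and a dead end returns the best node of the last non-empty frontier. The function name nbfs is hypothetical.

```python
def nbfs(edge_prob, source, relation, theta, t_max):
    """edge_prob maps (v_i, relation, v_j) -> probability.

    Returns (answer, trace), where trace lists the activated edges per step.
    """
    frontier = {source: 1.0}          # node -> best path score reaching it
    trace = []
    for _ in range(t_max):
        nxt, step = {}, []
        for (u, r, v), p in edge_prob.items():
            if r == relation and u in frontier and p > theta:
                step.append((u, v, p))
                nxt[v] = max(nxt.get(v, 0.0), frontier[u] * p)
        if not nxt:                   # dead end: keep the previous frontier
            break
        frontier = nxt
        trace.append(step)
        if len(frontier) == 1:        # unique answer found (checked for t >= 1)
            break
    answer = max(frontier, key=frontier.get)
    return answer, trace


# Hypothetical edge probabilities for a single relation type.
edges = {
    ("A", "located_in", "B"): 0.9,
    ("A", "located_in", "C"): 0.6,
    ("B", "located_in", "D"): 0.8,
    ("C", "located_in", "D"): 0.7,
    ("A", "located_in", "E"): 0.2,    # below threshold, never activated
}
answer, trace = nbfs(edges, "A", "located_in", theta=0.5, t_max=4)
print(answer)
for t, step in enumerate(trace, start=1):
    print(t, step)
```

The returned trace is the auditable artifact: every hop lists the edges that fired and their probabilities, so a reader can check each step without inspecting model internals.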

 

5  Training Objective: Topology Reconstruction Loss

We define Topology Reconstruction Loss (L_TRL) as:

L_TRL  =  ||A_hat − A_target||_F  +  λ₁ · Ω_sparse(G_hat)  +  λ₂ · Ω_modular(G_hat)     (6)

Here A_hat and A_target are the adjacency matrices of the predicted graph G_hat and the target graph G_target. The first term is the Frobenius norm of their difference, i.e., the square root of the summed squared errors over all edge positions. It plays the same role as a per-edge binary cross-entropy loss but is not numerically equivalent to one. The second term Ω_sparse(G_hat) = ||A_hat||₁ penalizes edge density, encouraging the model to learn parsimonious representations. The third term Ω_modular(G_hat) penalizes low modularity Q(G_hat) as defined by Newman and Girvan [24], promoting community structure that mirrors hierarchical concept organization:

Ω_modular(G)  =  max(0,  Q_target − Q(G_hat))     (7)

Hyperparameters λ₁ and λ₂ are set via grid search on a held-out validation split. In all MVP experiments, we use λ₁ = 0.01 and λ₂ = 0.1.

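L_TRL is differentiable with respect to all parameters in the SRL and SR, provided that edge predictions are kept as soft probabilities rather than hard-thresholded decisions during training. It can be combined with a standard cross-entropy token prediction loss when StrucAI is deployed as a reasoning head over a frozen LLM backbone.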
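To make the three terms of Equations (6) and (7) concrete, the sketch below evaluates them on small adjacency matrices. It is an illustration under stated assumptions rather than the training code: the modularity uses the undirected Newman-Girvan form on a symmetric matrix, the community assignment is supplied by hand, and Q_target = 0.5 is an arbitrary choice. Only λ₁ = 0.01 and λ₂ = 0.1 come from the text.

```python
import math


def frobenius(A_hat, A_tgt):
    """Frobenius norm of the difference between two equal-shape matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A_hat, A_tgt)
                         for a, b in zip(row_a, row_b)))


def l1(A):
    """Entrywise L1 norm, used as the sparsity penalty."""
    return sum(abs(a) for row in A for a in row)


def modularity(A, community):
    """Newman-Girvan Q for a symmetric (possibly weighted) adjacency matrix."""
    two_m = sum(sum(row) for row in A)
    if two_m == 0:
        return 0.0
    deg = [sum(row) for row in A]
    q = 0.0
    for i in range(len(A)):
        for j in range(len(A)):
            if community[i] == community[j]:
                q += A[i][j] - deg[i] * deg[j] / two_m
    return q / two_m


def trl(A_hat, A_tgt, community, q_target, lam1=0.01, lam2=0.1):
    """Reconstruction + sparsity + modularity hinge, as in Equations (6) and (7)."""
    return (frobenius(A_hat, A_tgt)
            + lam1 * l1(A_hat)
            + lam2 * max(0.0, q_target - modularity(A_hat, community)))


# Hypothetical target: two disjoint edges (0-1 and 2-3), one community each.
A_tgt = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
A_bad = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]  # spurious 1-2 edge
comm = [0, 0, 1, 1]
print(trl(A_tgt, A_tgt, comm, q_target=0.5))
print(trl(A_bad, A_tgt, comm, q_target=0.5))
```

A perfect reconstruction still pays the small sparsity cost λ₁ · ||A_hat||₁, while the prediction with a spurious cross-community edge is penalized by all three terms.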

 

6  Experiments

6.1  Datasets

We evaluate StrucAI on two benchmarks that jointly assess structural generalization and multi-hop reasoning:

•       HotpotQA [10] (distractor setting): 90,447 training and 7,405 development question-answer pairs requiring multi-hop reasoning over two supporting Wikipedia paragraphs. We report Exact Match (EM) and F1.

•       FB15k-237 [11]: A knowledge graph completion benchmark derived from Freebase, containing 272,115 training triples across 237 relation types. We report Mean Reciprocal Rank (MRR) and Hits@10.
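For reference, the four reported metrics can be computed as in the sketch below. The answer normalization here is deliberately simplified to lowercasing and whitespace splitting; the official HotpotQA evaluation script additionally strips punctuation and articles, so scores from this sketch may differ slightly from official ones. Function names are illustrative.

```python
from collections import Counter


def mrr_and_hits(ranks, k=10):
    """ranks: 1-based rank of the correct entity for each test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits


def exact_match(pred, gold):
    """1.0 if the lowercased, stripped strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranks and answers, not taken from the experiments.
print(mrr_and_hits([1, 2, 20]))
print(exact_match("Paris ", "paris"), token_f1("barack obama", "obama"))
```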

6.2  Baselines

We compare against the following systems:

•       RoBERTa-base [25]: A pretrained Transformer encoder fine-tuned on each task.

•       Graph Attention Network (GAT) [26]: A GNN baseline applied to gold-standard graphs extracted from task datasets.

•       Graphormer [7]: A Graph Transformer with spatial encoding, representing the state of the art on graph-structured tasks.

•       KGPT [8]: An LLM augmented with knowledge graph retrieval, representing the LLM+KG integration paradigm.

6.3  Implementation Details

The SRL encoder uses a frozen RoBERTa-base backbone. The SR Graph Transformer has K=6 layers, d=256 hidden dimensions, and 8 attention heads. The EE uses T_max = 4 traversal steps. All models are trained with the AdamW optimizer [27] at learning rate 2×10⁻⁴ with linear warmup over 10% of training steps. Experiments are conducted on 4× NVIDIA A100 GPUs. The semantic graph for SRL initialization is derived from a 500K-node subgraph of Wikidata, filtered to include only entity-level nodes with at least 5 relations.
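The learning-rate schedule described above can be written as a small function. The peak rate 2×10⁻⁴ and the 10% warmup fraction are from the text; the behavior after warmup is not specified, so holding the rate constant is an assumption, as is the function name.

```python
def learning_rate(step, total_steps, peak=2e-4, warmup_frac=0.10):
    """Linear warmup to `peak` over the first `warmup_frac` of training steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    return peak  # assumption: constant after warmup (decay is not specified)


# Hypothetical run length of 1,000 optimizer steps.
for s in (0, 49, 99, 500):
    print(s, learning_rate(s, total_steps=1000))
```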

6.4  Results

Table 1 reports performance on HotpotQA and FB15k-237. StrucAI obtains the highest score on every reported metric (63.7 EM and 77.9 F1 on HotpotQA; 0.334 MRR and 54.2 Hits@10 on FB15k-237), and it also provides explicit reasoning traces that none of the baselines produce.

 

Table 1: Evaluation results. Best results per metric in bold.

Model            HotpotQA EM   HotpotQA F1   FB15k-237 MRR   FB15k-237 Hits@10
RoBERTa-base     56.1          70.3          -               -
GAT              52.4          66.8          0.261           44.9
Graphormer       58.3          72.1          0.311           51.7
KGPT             61.2          75.4          0.298           50.3
StrucAI (ours)   63.7          77.9          0.334           54.2

 

6.5  Scaling and Interpretability Analysis

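We measure reasoning time as a function of graph size for multi-hop queries on FB15k-237. In the worst case, StrucAI's EE traversal costs O(|V| · d · T_max) per query, which is linear in the number of entities for fixed d and T_max = 4. The sublinear growth we observe at inference time is therefore an empirical result rather than a consequence of this bound; a plausible explanation is that on a sparse graph each frontier S^(t) touches only a small fraction of V. By comparison, a full-attention Transformer baseline scales as O(n² · d) with sequence length n, which is consistent with a computational advantage for graph traversal in structured multi-hop reasoning, although n and |V| measure different quantities.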

To assess interpretability, we sampled 200 HotpotQA development examples and asked two annotators to evaluate whether the edge activation sequences produced by the EE constituted valid reasoning chains. Inter-annotator agreement was κ = 0.76 (substantial agreement). In 89% of cases where StrucAI returned the correct answer, at least one annotator rated the reasoning chain as fully valid.
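The agreement statistic κ reported above is Cohen's kappa for two annotators, which can be computed as in the sketch below. The label vectors are hypothetical and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n             # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


# Hypothetical judgments (1 = chain valid, 0 = not valid) for ten examples.
ann_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
ann_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(ann_1, ann_2))
```

In this toy case the annotators agree on 8 of 10 items while chance agreement is 0.5, giving κ = 0.6; kappa thus discounts the raw agreement rate by what two independent annotators with the same label frequencies would reach by chance.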

 

7  Discussion

7.1  StrucAI as a Reasoning Head for LLMs

The architecture presented here is designed to be complementary to, not a replacement for, existing autoregressive LLMs. The SRL can be initialized using embeddings from a frozen LLM backbone, and the EE's traversal output can be used to condition generation in a downstream LLM. This hybrid deployment mirrors the dual-process hypothesis in cognitive science [28]: the LLM provides fast, associative pattern completion (System 1), while the StrucAI reasoning head provides slow, structured multi-step inference (System 2).

7.2  Limitations

Several limitations of the current work merit explicit acknowledgment. First, the SRL's graph construction relies on a threshold τ_θ that must be calibrated for each domain; miscalibration leads to either overly sparse graphs (missing valid edges) or overly dense graphs that compromise the EE's traversal efficiency. Second, the TRL objective assumes access to ground-truth graph structures during training, which may be costly to obtain in low-resource domains. Third, the MVP experiments use relatively small graph sizes; scaling to billion-node knowledge graphs will require approximate graph algorithms and hierarchical aggregation schemes not addressed here. Fourth, Equation (2)'s definition of I as ΔC/ΔH, while operationally tractable, is a task-specific proxy and should not be interpreted as a general theory of intelligence.

7.3  Open Problems

Key open challenges include: (a) dynamic graph construction from streaming text inputs; (b) efficient topological regularization at scale; (c) integration with neurosymbolic reasoning systems [29]; and (d) robustness to adversarial perturbations of graph structure. We regard StrucAI as an architectural hypothesis in early-stage validation, and we release the MVP codebase to facilitate community replication and extension.

 

8  Conclusion

We have introduced Structural AI (StrucAI), an architecture in which relational graph topology, rather than token sequence statistics, is the primary computational substrate for meaning representation and reasoning. The three-component design—SRL, SR, and EE—provides a modular, differentiable framework trained via the Topology Reconstruction Loss. Preliminary evaluation on HotpotQA and FB15k-237 indicates improvements in multi-hop reasoning accuracy and structural generalization relative to both GNN and Transformer baselines, alongside inherently interpretable reasoning traces.

StrucAI does not claim to solve the problem of general intelligence. It claims that for structured reasoning tasks, explicitly modeling relational topology is a more efficient and interpretable inductive bias than dense self-attention over linear sequences. We anticipate that hybrid architectures combining LLM fluency with StrucAI's structural reasoning will constitute a productive direction for the post-Transformer era of AI research.

 

References

[1]  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[2]  Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. arXiv:2004.05150.

[3]  Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., … & Weller, A. (2021). Rethinking attention with Performers. ICLR 2021.

[4]  Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258.

[5]  Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. ICLR 2017.

[6]  Dwivedi, V. P., & Bresson, X. (2020). A generalization of Transformers to graphs. arXiv:2012.09699.

[7]  Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., … & Liu, T.-Y. (2021). Do Transformers really perform bad for graph representation? Advances in Neural Information Processing Systems, 34.

[8]  Tang, J., Yang, Y., Wei, W., Shi, L., Su, L., Cheng, S., Yin, D., & Huang, C. (2023). GraphGPT: Graph instruction tuning for large language models. arXiv:2310.13023.

[9]  He, X., Bresson, X., Laurent, T., & Hooi, B. (2023). Harnessing explanations: LLM-to-LM interpreter for enhanced text-attributed graph representation learning. arXiv:2305.19523.

[10] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. EMNLP 2018.

[11] Toutanova, K., & Chen, D. (2015). Observed versus latent features for knowledge base and text inference. 3rd Workshop on Continuous Vector Space Models and their Compositionality.

[12] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message passing for quantum chemistry. ICML 2017.

[13] Alon, U., & Yahav, E. (2021). On the bottleneck of graph neural networks and its practical implications. ICLR 2021.

[14] Veličković, P., Badia, A. P., Budden, D., Pascanu, R., Banino, A., Dashevskiy, M., Hadsell, R., & Blundell, C. (2022). The CLRS algorithmic reasoning benchmark. ICML 2022.

[15] Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., & Beaini, D. (2022). Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35.

[16] Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., & Wu, X. (2024). Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering.

[17] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26.

[18] Welbl, J., Stenetorp, P., & Riedel, S. (2018). Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6.

[19] Gasteiger, J., Bojchevski, A., & Günnemann, S. (2019). Predict then propagate: Graph neural networks meet personalized PageRank. ICLR 2019.

[20] Braunstein, S. L., Ghosh, S., & Severini, S. (2006). The Laplacian of a graph as a density matrix: A basic combinatorial approach to separability of mixed states. Annals of Combinatorics, 10(3).

[21] Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.

[22] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional Transformers for language understanding. NAACL-HLT 2019.

[23] Jain, S., & Wallace, B. C. (2019). Attention is not explanation. NAACL-HLT 2019.

[24] Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2).

[25] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

[26] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph attention networks. ICLR 2018.

[27] Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. ICLR 2019.

[28] Kahneman, D. (2011). Thinking, fast and slow. Farrar, Straus and Giroux.

[29] Garcez, A. d'A., & Lamb, L. C. (2023). Neurosymbolic AI: The 3rd wave. Artificial Intelligence Review, 56(11).