
The Mathematics We Use vs. the Mathematics We Ignore

Artificial intelligence is almost universally described through optimization. Ask why Layer Normalization exists and you'll hear that it stabilizes gradients. Ask why Batch Normalization exists and you'll hear that it improves convergence. Residual connections help gradients flow, attention mechanisms compute weighted relationships between tokens, and feed-forward networks transform latent representations. None of these explanations is incorrect; they explain how neural networks train. What they rarely attempt to explain is what these operations represent.

I think modal logic provides an interesting framework for studying neural networks because it shifts the discussion away from optimization and toward representation itself. Rather than asking how tensors change numerically, modal logic asks what information remains necessarily true as those tensors change.

Suppose a layer produces \(x = [2, 4, 6]\). The mean is \(\mu = 4\), the standard deviation is \(\sigma \approx 1.633\), and the normalized output becomes \([-1.225,\; 0,\; 1.225]\). Now suppose another layer instead produced \(x = [20, 40, 60]\). Although every activation is ten times larger, the normalized result is identical: \([-1.225,\; 0,\; 1.225]\).

This is where the modal interpretation becomes interesting. Traditional deep learning simply says the activations have been normalized. Modal logic instead asks why two numerically different states are treated as representing the same thing. The answer is that Layer Normalization has discarded information the network considers contingent (absolute magnitude) while preserving the relational structure between activations. The numerical values changed, but the underlying semantic relationship remained invariant.

The modal version reaches the same conclusion from a different starting point.
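The arithmetic in the Layer Normalization example above can be checked in a few lines. This is a minimal sketch using only the Python standard library; the function name `layer_norm` and the `eps` parameter are illustrative assumptions, and the learned scale and shift that real implementations add are left out.

```python
import math

def layer_norm(values, eps=0.0):
    """Normalize a list of activations to zero mean and unit variance.

    eps stands in for the small stabilizing constant real implementations
    add to the variance; it defaults to 0.0 here (an assumption) so the
    numbers match the worked example in the text.
    """
    mean = sum(values) / len(values)
    # Population variance: divide by n, not n - 1.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)
    return [(v - mean) / std for v in values]

small = layer_norm([2.0, 4.0, 6.0])
large = layer_norm([20.0, 40.0, 60.0])

# Both print values that round to -1.225, 0.0, 1.225.
print([round(v, 3) for v in small])
print([round(v, 3) for v in large])
```

With a nonzero `eps`, the two outputs differ very slightly, which is why "identical" holds exactly only in the idealized arithmetic. The modal reading that follows starts from the same two vectors.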
Instead of beginning with the tensor values themselves, the modal version begins by asking which properties are necessary and which are merely possible. Take the two activation states:

\[ x = [2, 4, 6] \]

and

\[ z = [20, 40, 60]. \]

In ordinary tensor space, these are not the same object. The second vector is ten times larger than the first. But modal logic does not ask whether the numbers are identical. It asks whether the same structural truth holds across both possible states. In the first state, the middle value is centered between the lower and upper values:

\[ 4 - 2 = 6 - 4. \]

In the second state, the same relationship remains true:

\[ 40 - 20 = 60 - 40. \]

So the absolute magnitude changed, but the relational structure did not. Modal logic would treat these as two possible worlds in which the same necessary relation is preserved. Put simply, \([2, 4, 6]\) and \([20, 40, 60]\) are different numerical worlds, but they preserve the same necessary structure:

\[ \text{low} < \text{middle} < \text{high} \]

with the middle value equally spaced between the other two.

Layer Normalization reaches this same conclusion computationally by removing scale and centering the values around zero. Modal logic reaches it conceptually by identifying the invariant relationship that survives across both possible representations. The tensor approach says, “these normalize to the same vector.” The modal approach says, “these are different possible states of the same underlying structure.”

Tensors Describe Values, Not Meaning

Every neural network ultimately operates on tensors. Whether those tensors represent pixels, words, audio, or three-dimensional geometry, they are simply multidimensional arrays of numerical values. Linear algebra describes how those values are transformed as information moves through the network. However, anyone working with embeddings quickly discovers that identical concepts rarely correspond to identical tensors.
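A small sketch may make the gap between numerical value and direction concrete. The three vectors below are hypothetical toy "embeddings" chosen for illustration, not outputs of any real model, and the helper names are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
base = [2.0, 4.0, 6.0]
scaled = [20.0, 40.0, 60.0]  # same direction, ten times the magnitude
nearby = [2.0, 6.0, 4.0]     # numerically close to base, different direction

print(cosine_similarity(base, scaled))   # approximately 1.0
print(euclidean_distance(base, scaled))  # approximately 67.35
print(cosine_similarity(base, nearby))   # approximately 0.929
print(euclidean_distance(base, nearby))  # approximately 2.83
```

Distance and direction disagree here: `scaled` is far from `base` yet points the same way, while `nearby` is close yet points somewhere else. Which of the two measures tracks meaning in a given model is an empirical question this toy example cannot settle.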
Two vectors may differ significantly in magnitude while still representing almost the same semantic idea. Conversely, tensors that appear numerically similar can represent entirely different concepts depending on their surrounding context. The mathematics faithfully describes the transformation, but it says comparatively little about why different numerical states often preserve identical meaning.

Layer Normalization Removes Numerical Possibilities

Layer Normalization provides one of the clearest examples of this distinction. From an engineering perspective, LayerNorm subtracts the mean, divides by the standard deviation, and stabilizes optimization. Yet something more interesting is happening. Consider three tensors containing identical proportional relationships but completely different magnitudes. Before normalization they occupy three distinct numerical states. After normalization they become nearly identical representations. The absolute scale disappears while the relational structure remains intact.

Through a modal lens, those different numerical configurations can be viewed as merely possible representations of the same underlying concept, while the normalized relationship becomes the necessary representation that survives regardless of numerical scale. LayerNorm therefore does more than stabilize gradients: it collapses many possible numerical worlds into one semantic world.

Batch Processing Defines Context Rather Than Individual Samples

Batch Normalization reveals another interesting perspective. Engineers typically describe BatchNorm as a statistical operation performed across an entire batch. While technically correct, this explanation understates its conceptual role. Every sample inside a batch derives part of its representation relative to every other sample present. A single image is no longer interpreted in isolation but as one member of a larger statistical universe.
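The batch dependence described above can be sketched with a single feature. The function name `batch_norm_feature` and the activation values are hypothetical; learned scale and shift parameters and running averages are omitted, so this reflects training-mode statistics only.

```python
import math

def batch_norm_feature(batch, eps=1e-5):
    """Normalize one feature across a batch using that batch's statistics.

    A simplified, training-mode sketch: no learned scale or shift and no
    running averages.
    """
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(variance + eps) for v in batch]

# Hypothetical activations. The first sample (3.0) is the same in both
# batches; only its companions differ.
batch_a = [3.0, 1.0, 2.0, 2.0]
batch_b = [3.0, 7.0, 8.0, 6.0]

in_a = batch_norm_feature(batch_a)[0]
in_b = batch_norm_feature(batch_b)[0]

# The same raw value lands above the mean in one batch and below it in
# the other.
print(round(in_a, 3), round(in_b, 3))
```

One caveat worth keeping in mind: standard BatchNorm implementations switch to stored running statistics at inference time, so this sample-to-sample dependence is a property of training mode.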
A photograph of a cat surrounded by hundreds of dog images acquires meaning partly through its difference from the surrounding examples. Batch processing therefore behaves less like independent evaluation and more like contextual reasoning, where every representation is influenced by the possible states defined by the remainder of the batch.

Embeddings Behave More Like Concepts Than Coordinates

The same pattern appears throughout embedding spaces. Engineers often describe embeddings as vectors occupying high-dimensional coordinate systems. While mathematically accurate, their practical behavior resembles conceptual neighborhoods rather than isolated numerical points. Multiple vectors may represent the same sentence despite substantial numerical variation, while semantically unrelated ideas become separated by surprisingly small geometric distances. What remains stable across these representations is not the tensor itself but the concept encoded within it.

Modal logic naturally distinguishes between these two levels of description by separating representation from necessity. The tensor is one possible realization of an idea, while the semantic relationship is the property that persists regardless of which numerical realization produced it.

Residual Connections Preserve Necessary Information

Residual connections are almost always explained through gradient flow. They allow deeper networks to train by preserving information from earlier layers. From another perspective, however, they also preserve identity throughout the network. Each transformation proposes a new representation while the residual pathway carries forward information that has already proven useful. Rather than completely replacing one state with another, the network accumulates knowledge without discarding previously established structure. In modal terms, each layer explores new possibilities while the residual pathway preserves information that remains necessary throughout successive transformations.
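The difference between replacing a state and accumulating onto it can be sketched as follows. The block functions and the two toy transforms are hypothetical stand-ins for learned layers, chosen only to show what the skip path carries forward.

```python
def plain_block(x, transform):
    """A layer without a skip path: the output replaces the input."""
    return transform(x)

def residual_block(x, transform):
    """A residual connection: the output is the input plus an update."""
    update = transform(x)
    return [xi + ui for xi, ui in zip(x, update)]

def zero_update(x):
    """Hypothetical layer that has learned to contribute nothing."""
    return [0.0 for _ in x]

def small_update(x):
    """Hypothetical layer that proposes a small, fixed adjustment."""
    return [0.5, -0.5, 0.0]

x = [2.0, 4.0, 6.0]

print(plain_block(x, zero_update))      # the input is lost entirely
print(residual_block(x, zero_update))   # the input passes through intact
print(residual_block(x, small_update))  # the input plus the proposed change
```

In this reading, `transform` plays the role of the "new possibility" a layer proposes, and the addition is what keeps the earlier state available to later layers.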
Hidden Representations as Possible Worlds

This perspective becomes even more compelling when applied to hidden representations within modern neural networks. Consider a transformer layer containing a 512-dimensional latent representation. From a conventional engineering perspective, this activation vector is simply another point in a high-dimensional embedding space, a collection of numerical values that will be transformed by the next layer. A modal interpretation instead views this latent state as one possible representation, or "possible world," describing the model's current understanding of the input. As information propagates through successive layers, each transformation produces another possible world, refining, expanding, or constraining the previous interpretation rather than simply replacing it.

Viewed through this lens, attention mechanisms become more than weighted matrix operations. Rather than merely computing similarities between queries, keys, and values, attention can be interpreted as evaluating multiple competing semantic possibilities before emphasizing the representations that best explain the surrounding context. Residual connections preserve information that remains invariant across these successive worlds, ensuring that useful structure is not discarded as new interpretations emerge. Feed-forward layers then generate increasingly abstract representations by constructing new latent possibilities while retaining the semantic relationships established earlier in the network.

The result is a progression that resembles iterative reasoning over a space of possible representations rather than a simple sequence of numerical transformations. Linear algebra faithfully describes how each latent vector changes from one layer to the next, while a modal interpretation attempts to explain why many of those transformations preserve semantic identity despite substantial numerical change.
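One hedged way to make "possible worlds" concrete is to treat each hidden state as a world and test which properties hold in all of them. The sketch below makes the simplifying assumption that every world is accessible from every other, so "necessarily" reduces to "true in every world". The states and property names are hypothetical; real states would come from a trained model.

```python
def necessarily(prop, worlds):
    """True when prop holds in every world (the modal 'box')."""
    return all(prop(w) for w in worlds)

def possibly(prop, worlds):
    """True when prop holds in at least one world (the modal 'diamond')."""
    return any(prop(w) for w in worlds)

def is_increasing(state):
    """low < middle < high, generalized to any length."""
    return all(a < b for a, b in zip(state, state[1:]))

def is_evenly_spaced(state, tol=1e-9):
    """Every gap between neighbouring values is the same."""
    gaps = [b - a for a, b in zip(state, state[1:])]
    return all(abs(g - gaps[0]) <= tol for g in gaps)

def has_small_magnitude(state):
    """A scale-dependent property: every value is below 10 in size."""
    return max(abs(v) for v in state) < 10.0

# Hypothetical hidden states standing in for one input at three stages.
worlds = [
    [2.0, 4.0, 6.0],
    [20.0, 40.0, 60.0],
    [-1.5, 0.0, 1.5],
]

print(necessarily(is_increasing, worlds))        # True: holds everywhere
print(necessarily(is_evenly_spaced, worlds))     # True: holds everywhere
print(necessarily(has_small_magnitude, worlds))  # False: fails in one world
print(possibly(has_small_magnitude, worlds))     # True: holds in some world
```

Ordering and spacing come out as necessary, while magnitude is merely possible, which mirrors the distinction the article draws between relational structure and absolute scale.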
Under this framework, the hidden layers of a transformer can be viewed not merely as successive tensor operations, but as a sequence of evolving representational worlds in which invariant concepts are progressively refined as the network converges toward its final prediction.

Attention Explores Competing Semantic Worlds

Attention mechanisms can also be viewed through this framework. Conventionally, attention computes relationships between queries, keys, and values before generating weighted combinations of information. Conceptually, however, every attention head evaluates multiple competing interpretations simultaneously. Words rarely possess fixed meaning independent of context. Instead, the model continually evaluates which interpretation best fits the surrounding sentence. Attention therefore resembles navigation through a landscape of possible semantic worlds. Each head explores different contextual possibilities before the network converges on the interpretation that best satisfies the available evidence.

Feed-Forward Networks Continuously Refine Representation

Feed-forward layers are often described as simple nonlinear transformations positioned after attention blocks. Yet these layers repeatedly reconstruct latent representations while preserving conceptual continuity. The tensor changes dramatically from one layer to the next, but the underlying idea frequently remains recognizable throughout the network. This observation helps explain why cosine similarity remains meaningful across different stages of representation despite substantial numerical differences. The network is not simply transforming tensors; it is repeatedly refining possible representations of the same concept until increasingly abstract semantic structure emerges.

Why This Perspective Is Uncommon

The reason modal logic rarely appears in AI research is largely historical rather than mathematical.
Deep learning evolved primarily from statistics, optimization theory, signal processing, and numerical linear algebra. Modal logic developed within philosophy, formal reasoning, and model theory. These disciplines evolved independently and therefore ask fundamentally different questions. Deep learning asks how systems learn. Modal logic asks what must remain true across multiple possible representations. Neural networks increasingly appear to perform both tasks simultaneously, yet our theoretical vocabulary overwhelmingly emphasizes optimization while largely ignoring representation.

Historical Context: When Two Worlds Meet Again

Perhaps the greatest irony of this perspective is that it reunites two schools of artificial intelligence that spent decades arguing with one another. Throughout much of the twentieth century, AI research was divided between two fundamentally different philosophies.

Symbolic AI, often referred to as Good Old-Fashioned AI (GOFAI), attempted to model intelligence through formal logic, explicit rules, semantic relationships, and symbolic reasoning. Knowledge was represented directly, and intelligence emerged from manipulating those symbolic structures. While elegant and mathematically rigorous, symbolic systems struggled to cope with ambiguity, uncertainty, and the enormous complexity of the real world. They were exceptionally good at reasoning once the rules were known, but remarkably poor at learning those rules from experience.

Connectionist AI pursued the opposite philosophy. Rather than explicitly encoding knowledge, neural networks learned statistical representations directly from data. Intelligence emerged from millions or billions of numerical parameters optimized through gradient descent rather than handcrafted logical rules. This statistical approach ultimately gave rise to modern computer vision systems, transformers, and large language models, achieving levels of performance that symbolic AI never reached.
The irony is that these two traditions may not be as incompatible as history suggests. This article proposes using modal logic, a branch of formal logic traditionally associated with symbolic reasoning, not to replace neural networks, but to help explain their internal representations. Rather than viewing tensors as nothing more than numerical objects, a modal interpretation treats them as successive possible representations whose invariant semantic relationships persist despite continual numerical transformation.

In that sense, the tools developed within the symbolic tradition become useful for interpreting the successes of the connectionist tradition. Neural networks were trained through optimization, statistics, and linear algebra, yet many of their learned representations appear to preserve stable conceptual relationships remarkably similar to the kinds of structures symbolic AI sought to model explicitly. Whether this reflects a genuine convergence between the two paradigms or simply a useful interpretive framework remains an open research question. Either way, it suggests that the long-standing divide between symbolic reasoning and neural computation may be smaller than previously believed.

If this perspective proves useful, the future of AI may not lie in choosing between symbolic and connectionist approaches, but in understanding how they complement one another. Statistics may explain how representations are learned, while formal logic provides a language for describing what those learned representations mean. Rather than competing theories of intelligence, they may represent two complementary mathematical perspectives on the same underlying phenomenon.

https://arxiv.org/html/2407.08516v1

Implications for Future AI Architectures

Viewing neural networks through modal logic does not replace linear algebra, nor does it invalidate existing optimization theory. Instead, it introduces another layer of interpretation.
Tensor mathematics explains how representations evolve numerically. Modal logic may help explain why semantic structure remains remarkably stable despite enormous numerical variation. This distinction becomes increasingly relevant as researchers pursue more interpretable, context-aware, and memory-preserving AI systems. Just as my previous work on LLM Data Drift argued that language models suffer from prediction without persistent grounding, modal reasoning suggests that neural networks may already preserve semantic invariants implicitly without explicitly modeling them. If those invariants can be formalized, they may provide new ways to study representation learning, normalization, attention, and eventually entirely new neural architectures.

Conclusion

Modal logic should not be viewed as a replacement for linear algebra, probability, or optimization theory. Those mathematical foundations remain essential to understanding how neural networks compute, train, and converge. Instead, modal logic offers a complementary framework for interpreting what those computations represent. Rather than describing the numerical evolution of tensors alone, it provides a language for reasoning about invariants, possible representations, and the semantic relationships that persist as activations move through increasingly abstract layers of a network.

One of the more interesting implications is that adopting a modal perspective does not necessarily require entirely new neural architectures. Existing models already perform operations that preserve some properties while discarding others. Layer Normalization removes absolute scale while preserving relational structure. Residual connections preserve information across successive transformations. Attention evaluates competing contextual interpretations before converging on a representation. Modal logic simply provides another mathematical vocabulary for describing these behaviors.
As such, it may augment existing architectures rather than replace them, allowing researchers to analyze familiar components through an additional representational lens.

This perspective may also encourage more diverse approaches to neural network design. Contemporary AI research is heavily grounded in statistics, optimization, and numerical methods, all of which have proven extraordinarily successful. However, formal logic has long been used to describe reasoning systems and underpins much of theoretical computer science. Exploring modal logic alongside existing statistical frameworks may therefore provide additional insight into why neural networks behave as they do, particularly as representations become increasingly abstract.

Viewed this way, many phenomena that currently appear empirical may become easier to reason about conceptually. Fine-tuning can dramatically alter the behavior of computer vision models, large language models, and multimodal systems despite relatively modest parameter updates. Traditionally, these changes are measured statistically through shifts in weights and activations. A modal interpretation instead encourages us to ask how those updates reshape the space of possible representations, which semantic invariants are preserved, and which are discarded. That does not replace the underlying mathematics, but it may provide a more intuitive explanation for why comparatively small changes can produce disproportionately large behavioral differences.

Whether modal logic ultimately proves useful as a practical design tool remains an open research question. At the very least, it offers an alternative way of thinking about representation learning, one that complements existing mathematical foundations rather than competing with them.
If nothing else, it may help bridge the gap between the numerical mechanics of neural networks and the semantic structures they appear to learn, providing engineers with another framework and tool for understanding increasingly complex AI systems.
View original source — Hacker Noon ↗


