Beyond the Monolith: A Case for Constructive, Layer-wise Learning in LLMs
How does an LLM understand the meaning of 'wRiTe' when its building blocks, the individual character tokens 'w', 'R', 'i', carry no semantic content of their own? This simple question challenges the very foundation of modern AI. It became the starting point for two research papers (https://arxiv.org/abs/2507.04886, https://arxiv.org/abs/2507.07129) and a proposal for a more structured, efficient, and natural way to build these complex systems.
This work argues that high-level meaning is not contained in token embeddings but is constructed by the Transformer architecture. To test this, we introduced a radical constraint: replacing standard trainable embeddings with a completely frozen layer derived from the raw visual structure of Unicode glyphs. The results were intriguing: not only did the models converge, they also showed a surprising advantage on reasoning benchmarks over architecturally identical models.
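To make the idea concrete, here is a minimal sketch of such a frozen, glyph-derived embedding layer. It is an illustration under assumptions rather than the papers' exact pipeline: glyphs are rendered with Pillow's default bitmap font, and the flattened pixels are mapped to the model dimension with a fixed random projection before being frozen.

```python
# A minimal sketch (assumptions, not the papers' exact pipeline) of a frozen
# embedding layer built from the visual structure of Unicode glyphs.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

def render_glyph(ch: str, size: int = 16) -> torch.Tensor:
    """Render a single character to a size x size grayscale patch in [0, 1]."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), ch, fill=255, font=ImageFont.load_default())
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

def build_frozen_glyph_embedding(vocab: list, d_model: int, size: int = 16) -> nn.Embedding:
    """Stack glyph bitmaps and project them to d_model with a fixed, frozen map."""
    pixels = torch.stack([render_glyph(ch, size) for ch in vocab])       # (V, size*size)
    torch.manual_seed(0)                                                 # reproducible projection
    projection = torch.randn(size * size, d_model) / (size * size) ** 0.5
    return nn.Embedding.from_pretrained(pixels @ projection, freeze=True)

# Usage: visually similar glyphs ('w' vs 'W') get correlated vectors,
# while any semantics must be constructed by the layers above.
vocab = sorted(set("wRiTe WRITE write"))
emb = build_frozen_glyph_embedding(vocab, d_model=64)
print(emb.weight.requires_grad)   # False: the substrate never trains
```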
A Note on Scale and Intent
Before we go further, a critical clarification is needed. These experiments were conducted at a small scale (a tiny dataset of ~9B tokens), primarily on a single GPU (with occasional boosts for data parallelism), and were never intended to beat the SOTA performance of giant, monolithically trained models. The observed two-fold advantage on MMLU, at low absolute scores (~25%), is likely a local effect, and the primary value is not in the specific numbers.
The true goal was to use resource constraints as a scientific tool. By removing the brute-force option of massive scale, we were forced to focus on architectural first principles. The key finding is not that a small model can beat a large one, but that a more intelligent learning paradigm shows a clear, positive signal even with its hands tied.
A Blueprint for Constructive AI
This "constructive" paradigm, built on a frozen substrate, offers practical solutions to some of the most pressing problems in AI today:
A Universal 'Docking Port' for LEGO-like Models: A frozen, Unicode-based embedding layer acts as a universal standard. This can solve the cold-start problem and, more powerfully, enable us to merge specialist models post-training like LEGO bricks (see the first sketch after this list).
'Growing' Knowledge to Tame Complexity: The current paradigm of training trillion-parameter models is an optimization nightmare, requiring datasets of unimaginable size to constrain an exponentially growing space of solutions. We demonstrate a more manageable alternative: growing a model layer by layer (or N layers at a time). This provides a structured curriculum in which each new layer builds upon a stable, competent foundation (see the second sketch after this list).
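First, a deliberately rough sketch of the 'docking port' idea. The merging recipe it shows (naively stacking the blocks of two specialists behind one shared frozen embedding) is an assumption for illustration, not the papers' method; the point is only that a shared, frozen input interface is what makes post-training composition possible without retraining the input side.

```python
# Rough illustration only: plain stacking is an assumed merge recipe, not the
# papers' method. What matters is that both specialists were trained on the
# SAME frozen embedding, so the shared input interface needs no retraining.
import torch.nn as nn

def merge_specialists(frozen_embedding: nn.Embedding,
                      blocks_a: nn.ModuleList,
                      blocks_b: nn.ModuleList) -> nn.Sequential:
    """Compose two specialists' blocks behind one shared, frozen embedding."""
    assert not frozen_embedding.weight.requires_grad, "the docking port must stay frozen"
    return nn.Sequential(frozen_embedding, *blocks_a, *blocks_b)
```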
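Second, a minimal sketch of layer-wise growth, again under assumptions rather than as the exact training recipe from the papers: the model starts shallow, and each growth step appends a fresh block while the previously trained blocks (and the frozen embedding) are kept fixed, so every new layer learns on top of a stable foundation.

```python
# Minimal sketch of layer-wise growth (assumed recipe, for illustration):
# train a shallow stack, freeze it, append a new block, train again, repeat.
import torch.nn as nn

class GrowingTransformer(nn.Module):
    def __init__(self, frozen_embedding: nn.Embedding, d_model: int, n_heads: int, vocab_size: int):
        super().__init__()
        self.embedding = frozen_embedding        # the frozen "docking port"
        self.blocks = nn.ModuleList()            # grows over the course of training
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.d_model, self.n_heads = d_model, n_heads

    def grow(self, freeze_existing: bool = True) -> None:
        """Append one fresh block; optionally freeze everything trained so far."""
        if freeze_existing:
            for p in self.blocks.parameters():
                p.requires_grad = False
        # An encoder layer keeps the sketch short; a real LM would use a causal decoder block.
        self.blocks.append(nn.TransformerEncoderLayer(self.d_model, self.n_heads, batch_first=True))

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)

# Usage: a structured curriculum instead of optimizing a deep random stack all at once.
frozen = nn.Embedding(256, 64)
frozen.weight.requires_grad = False              # stand-in for a glyph-derived frozen layer
model = GrowingTransformer(frozen, d_model=64, n_heads=4, vocab_size=256)
model.grow()   # stage 1: train a one-block model
model.grow()   # stage 2: freeze stage 1, train the new block on top
```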
Learning from Nature’s Playbook
Nature, through billions of years of evolution, has never produced a fully-formed complex brain out of nothing. Intelligence grows. Neurogenesis proceeds step-by-step, building complexity upon existing structures. There are no known biological analogues to our current method of training LLMs: initializing a trillion-parameter random network and hoping it converges correctly in one monolithic process.
Our work suggests we should take a cue from nature. Instead of trying to freeze an entire lake at once—a chaotic process requiring immense energy—we should let a solid crust of ice form first. The deep freeze can then proceed layer by layer, building a stable, monolithic structure from a simple foundation. This is how deep learning can tame complexity.
A Broader Shift in Thinking
This line of thinking doesn't exist in a vacuum. A growing consensus in the research community suggests the future of AI scaling cannot be monolithic. Recent explorations into efficiently nested representations, and the broader move towards compositional architectures, show this clearly. These approaches aim to build intelligence from smaller, reusable, and more efficient components. Our work offers a practical, tested methodology, with a universal "docking port" in the form of the frozen embedding layer, to contribute directly to this emerging and vital trend.
The Path Forward
The current race for scale has a foreseeable end. Data is finite. Power is finite. The monolithic approach will hit a hard ceiling.
We believe the path forward must be constructive. Even if our specific implementation of frozen visual embeddings is not the final answer, it points towards a necessary shift in thinking. The future of AI lies in modularity, in compositional systems, in methods that allow us to build complexity intelligently, not just with brute force. It is the principle of the locomotive: you don't need infinite power to move an impossibly heavy train. You just need to conquer the inertia one car at a time.
This work is a proof-of-concept, a blueprint, and an invitation to the community to explore this path with us. Let's start building AI more constructively.