# BriLLM: Brain-inspired Large Language Model

We release BriLLM-Chinese and BriLLM-English.

Our paper: https://arxiv.org/pdf/2503.11299

Our GitHub: https://github.com/brillm05/BriLLM0.5

Our Hugging Face: https://huggingface.co/BriLLM/BriLLM0.5


## Overview

BriLLM redefines the foundations of generative language modeling by departing from Transformer architectures, GPT frameworks, and traditional input-output constrained paradigms. Built on the Signal Fully-connected Flowing (SiFu) mechanism, a directed graph-based neural network design, BriLLM enables full interpretability across all nodes, in contrast to conventional models that are interpretable only at their input and output interfaces.

In this framework, tokens are represented as graph nodes, with signal flows—either randomly initialized or user-defined—propagating along paths following a "least resistance" principle. The next token to be generated emerges as the target of this signal flow. Theoretically, BriLLM supports infinitely long n-gram modeling, with model size decoupled from input and prediction length. Its signal propagation dynamics mimic human-like cognitive patterns, enabling recall activation and inherent multi-modal compatibility.

![](./figs/tab1.png)


## SiFu Mechanism

![](./figs/fig1.png)

The SiFu (Signal Fully-connected Flowing) mechanism addresses fundamental limitations of current machine learning frameworks. Unlike traditional models that process discrete input streams through opaque computations, SiFu operates on a fully connected directed graph where:

- Each node represents an interpretable unit (token, concept, etc.)
- Signal tensors propagate through the graph following energy dynamics
- The next token is determined by maximizing signal energy
- All nodes can serve as both input and output interfaces

![](./figs/fig2.png)

Signal propagation follows the principle:
$v_i = \arg\max_{v'} \left\| r \oplus v_1 \otimes e_{1,2} \oplus v_2 \oplus \cdots \oplus v' \right\|$

where $\oplus$ and $\otimes$ denote tensor operations for node and edge interactions, and $\|\cdot\|$ represents signal energy.
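
A minimal sketch of this decoding rule in Python, with `propagate` as a hypothetical stand-in for the $\oplus$/$\otimes$ edge computation (not the released API):

```python
def greedy_signal_walk(signal, path, candidates, propagate, steps):
    """Sketch only: `propagate(signal, u, v)` stands in for the edge
    computation (signal ⊕ u ⊗ e_uv); energy is the norm of the result."""
    for _ in range(steps):
        u = path[-1]
        # Score every candidate target by the energy of the signal it would
        # carry, then follow the maximal-energy ("least resistance") edge.
        energies = {v: propagate(signal, u, v).norm(p=2) for v in candidates}
        v_next = max(energies, key=energies.get)
        signal = propagate(signal, u, v_next)
        path.append(v_next)
    return path
```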

Overall, SiFu's design as a directed fully connected graph with signal propagation confers two key advantages:
1. **Inherent full interpretability**: User-defined entities (concepts, tokens, or interpretable units) map directly to specific graph nodes;
2. **Unbounded contextual capacity**: Prediction is framed as signal propagation through node activations. Because signals propagate freely across nodes, sequence prediction naturally supports arbitrarily long contexts without increasing model size. 


## Architecture

![](./figs/fig3.png)

BriLLM implements the SiFu mechanism where each vocabulary token corresponds to a node defined by a GeLU-activated neuron layer with bias $b \in \mathbb{R}^{d_{node}}$. Edges between nodes are modeled as fully connected matrices $W_{u,v} \in \mathbb{R}^{d_{node} \times d_{node}}$, enabling bidirectional signaling.

Signal propagation begins with the initial all-ones tensor $e_0 = [1, 1, \ldots, 1]^T \in \mathbb{R}^{d_{node}}$ and follows:

$e_{i+1} = \text{GeLU}(W_{u_i,u_{i+1}} e_i + b_{u_i,u_{i+1}} + PE_i)$

The final prediction maximizes the L2 norm: $v_{predict} = \arg\max_v \|E_{u,v}\|_2$
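
This step can be sketched directly in PyTorch. The toy sizes, random initialization, and zeroed positional encoding below are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

d_node = 32
vocab_size = 8  # toy vocabulary; the real model uses ~4,000 nodes

# One weight matrix and bias per directed edge (hypothetical dense layout).
W = 0.02 * torch.randn(vocab_size, vocab_size, d_node, d_node)
b = 0.02 * torch.randn(vocab_size, vocab_size, d_node)

def step(e, u, v, pe):
    # e_{i+1} = GeLU(W_{u,v} e_i + b_{u,v} + PE_i)
    return F.gelu(W[u, v] @ e + b[u, v] + pe)

e = torch.ones(d_node)    # e_0 = [1, 1, ..., 1]^T
pe = torch.zeros(d_node)  # placeholder for the sinusoidal positional encoding
for u, v in [(0, 3), (3, 5)]:  # flow the signal along a given path
    e = step(e, u, v, pe)

# Predict the next token: the edge from the current node u whose
# output signal has the largest L2 norm.
u = 5
energies = torch.stack([step(e, u, v, pe).norm(p=2) for v in range(vocab_size)])
v_predict = int(energies.argmax())
```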


## Training Network

![](./figs/fig4.png)

Training BriLLM involves constructing a dedicated neural network for each sequence sample. The network connects input nodes sequentially, with all potential paths integrated into a final softmax layer that identifies the correct path via cross-entropy loss optimization.
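
A sketch of the per-position objective under the same assumptions (reusing the hypothetical `step` function from the architecture sketch above): the energies of all candidate edges act as logits, and cross-entropy pushes the gold edge's energy above every alternative.

```python
import torch
import torch.nn.functional as F

def position_loss(e, u, gold_v, pe, vocab_size):
    # Energies of all candidate edges u -> v act as logits over the vocabulary.
    logits = torch.stack([step(e, u, v, pe).norm(p=2) for v in range(vocab_size)])
    # Cross-entropy identifies the correct path among all potential ones.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_v]))
```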


## Implementation Details

BriLLM is implemented in PyTorch. It uses sinusoidal positional encoding, GeLU as the activation function, cross-entropy loss for next-token prediction, and an embedding size of $d_{model} = 32$. We use the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The model size is about $512 + 4000 \times 4000 \times (32 \times 32 + 32) \approx 16.9\text{B}$ parameters. We trained our models on one machine with 8 NVIDIA A800 GPUs for 1.5k steps.
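
The headline figure follows from one $32 \times 32$ weight matrix plus a 32-dimensional bias per directed edge over a 4,000-token vocabulary:

```python
edges = 4000 * 4000       # one weight matrix and bias per directed edge
per_edge = 32 * 32 + 32   # W (32x32) plus bias (32)
total = 512 + edges * per_edge
print(f"{total:,}")       # 16,896,000,512, i.e. ~16.9B parameters
```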

![](./figs/fig5.png)


BriLLM leverages sparse token co-occurrence: most bigrams are low-frequency or absent, so their edges can share parameters. Low-frequency bigrams use a single fixed, non-updated matrix, reducing the model size to 2B (Chinese) and 1B (English), i.e. 13.0% and 5.7% of the original size, respectively. This cuts the parameter count by roughly 90% while also accelerating training.
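
A minimal sketch of such parameter sharing (names hypothetical; the released code marks shared edges via the `zero_freq_edges` map used in the inference examples below):

```python
import torch

d_node = 32
shared_W = torch.zeros(d_node, d_node, requires_grad=False)  # fixed, never updated
trained_edges = {}  # only frequent bigrams get their own trainable matrix

def edge_weight(u, v, is_frequent):
    if not is_frequent:
        return shared_W  # all low-frequency bigrams share one frozen matrix
    if (u, v) not in trained_edges:
        trained_edges[(u, v)] = torch.randn(d_node, d_node, requires_grad=True)
    return trained_edges[(u, v)]
```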

![](./figs/tab2.png)


## Case Study

### Chinese Examples
![](./figs/tab3.png)


### English Examples  
![](./figs/tab4.png)


## Comparison: Traditional LLMs vs BriLLM

![](./figs/tab5.png)


## Installation

```bash
pip install torch
```


## Model Checkpoints

[BriLLM0.5](https://huggingface.co/BriLLM/BriLLM0.5)


## Training

### BriLLM-Chinese
```bash
bash run_zh.sh
```

### BriLLM-English
```bash
bash run_en.sh
```


## Inference

### BriLLM-Chinese
```python
import json
import torch
from model import BraLM, Vocab

with open("vocab_wiki_4k.json") as f:
     node_dict = json.load(f)
vocab = Vocab.from_node_dict(node_dict)

with open('word_frequency.json', 'r') as f:
    freq_dict = json.load(f)

# Collect zero-frequency bigrams; their edges fall back to the shared fixed matrix.
zero_freq_edges = {}
for s in freq_dict:
    zero_freq_edges[s] = []
    for t in freq_dict[s]:
        if freq_dict[s][t] == 0:
            zero_freq_edges[s].append(t)

model = BraLM(hidden_size=32, zero_freq_edges=zero_freq_edges, vocab=vocab)
model.prepare_network(vocab)

state_dict = torch.load("model_zh.bin", weights_only=True)
model.load_state_dict(state_dict)
model.to_device("cuda:6")  # adjust to an available device

head = "《罗马》描述了"
max_token = 32 - len(head)  # generation budget within the 32-position window

# Encode the prompt as a sequence of directed edges ("x->y") between nodes.
start = [vocab(head[i] + '->' + head[i+1]) for i in range(len(head)-1)]
ret = model.decode(start, vocab, max_token)

# Each decoded step is "x->y"; stitch the target characters onto the first one.
decode_tuple_list = [vocab.decode(p) for p in ret]
decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])

print(decode_sentence)
```

### BriLLM-English
```python
import json
import torch
from model import BraLM, Vocab
from tokenizers import Tokenizer

bpe_tokenizer = Tokenizer.from_file("wiki_bpe_tokenizer_4000_bytelevel.json")

def decode_en_sentence(head, max_token=32, do_sample=False):
    # Uses the module-level `vocab` and `model` defined below.
    bpe_tokens = bpe_tokenizer.encode(head).tokens
    if len(bpe_tokens) < 2:
        return head  # need at least one bigram edge to start the signal
    # Encode the prompt as a sequence of directed edges ("x->y") between nodes.
    start = [vocab(bpe_tokens[i] + '->' + bpe_tokens[i+1]) for i in range(len(bpe_tokens)-1)]
    ret = model.decode(start, vocab, max_token, do_sample)
    # Each decoded step is "x->y"; stitch the target tokens onto the first one.
    decode_tuple_list = [vocab.decode(p).split('->') for p in ret]
    decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
    return decode_sentence

with open("./vocab_wiki_4k_en.json") as f:
     node_dict = json.load(f)
vocab = Vocab.from_node_dict(node_dict)

model = BraLM(hidden_size=32)
model.prepare_network(vocab)

state_dict = torch.load("model_en.bin", weights_only=True)
model.load_state_dict(state_dict)
model.to_device("cuda:6")  # adjust to an available device

head = "In frogs, the hind legs are larger"
encoding = bpe_tokenizer.encode(head)
token_len = len(encoding.ids)
max_token = 32 - token_len  # generation budget within the 32-position window
decode_sentence = decode_en_sentence(head, max_token).replace("Ġ", " ")  # strip byte-level BPE space markers

print(decode_sentence)
```