# BriLLM: Brain-inspired Large Language Model
We release BriLLM-Chinese and BriLLM-English.
Our paper: https://arxiv.org/pdf/2503.11299
Our GitHub: https://github.com/brillm05/BriLLM0.5
Our Hugging Face: https://huggingface.co/BriLLM/BriLLM0.5
## Overview
BriLLM redefines the foundations of generative language modeling by departing from Transformer architectures, GPT frameworks, and traditional input-output constrained paradigms. Built on the Signal Fully-connected Flowing (SiFu) mechanism, a directed graph-based neural network design, BriLLM makes every node interpretable, in contrast to conventional models whose interpretability is limited to their inputs and outputs.
In this framework, tokens are represented as graph nodes, with signal flows—either randomly initialized or user-defined—propagating along paths following a "least resistance" principle. The next token to be generated emerges as the target of this signal flow. Theoretically, BriLLM supports infinitely long n-gram modeling, with model size decoupled from input and prediction length. Its signal propagation dynamics mimic human-like cognitive patterns, enabling recall activation and inherent multi-modal compatibility.

## SiFu Mechanism

The SiFu (Signal Fully-connected Flowing) mechanism addresses fundamental limitations of current machine learning frameworks. Unlike traditional models that process discrete input streams through opaque computations, SiFu operates on a fully connected directed graph where:
- Each node represents an interpretable unit (token, concept, etc.)
- Signal tensors propagate through the graph following energy dynamics
- The next token is determined by maximizing signal energy
- All nodes can serve as both input and output interfaces

Signal propagation follows the principle:
$v_{i+1} = \arg\max_{v'} \left\| r \oplus v_1 \otimes e_{1,2} \oplus v_2 \otimes e_{2,3} \oplus \cdots \oplus v_i \otimes e_{i,i+1} \oplus v' \right\|$
where $r$ is the initial signal tensor, $\oplus$ and $\otimes$ denote tensor operations for node and edge interactions, $e_{i,i+1}$ is the edge from node $v_i$ to the candidate $v'$, and $\|\cdot\|$ measures signal energy.
Overall, SiFu's design as a directed fully connected graph with signal propagation confers two key advantages:
1. **Inherent full interpretability**: User-defined entities (concepts, tokens, or interpretable units) map directly to specific graph nodes;
2. **Unbounded contextual capacity**: Prediction is framed as signal propagation through node activations. Because signals propagate freely across nodes, sequence prediction naturally supports arbitrarily long contexts without increasing model size.
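To make the selection rule above concrete, here is a minimal sketch of energy-maximizing next-node selection on a toy fully connected graph. It is illustrative only: the dimensions, random initialization, and the `next_node` helper are assumptions, not the released implementation.

```python
import torch

# Toy SiFu-style next-node selection (illustrative; not the released code).
d = 32           # signal dimensionality per node
vocab_size = 10  # toy vocabulary

# One weight matrix and bias per directed edge (u -> v).
W = torch.randn(vocab_size, vocab_size, d, d) * 0.1
b = torch.randn(vocab_size, vocab_size, d) * 0.1

def next_node(u: int, signal: torch.Tensor) -> int:
    """Propagate the signal from node u to every candidate v and pick the
    candidate whose outgoing signal carries the most energy (L2 norm)."""
    energies = []
    for v in range(vocab_size):
        e = torch.nn.functional.gelu(W[u, v] @ signal + b[u, v])
        energies.append(e.norm(p=2))
    return int(torch.stack(energies).argmax())

signal = torch.ones(d)       # all-ones initial signal
print(next_node(0, signal))  # index of the maximal-energy successor
```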
## Architecture

BriLLM implements the SiFu mechanism with each vocabulary token corresponding to a node, defined by a GeLU-activated neuron layer with bias $b \in \mathbb{R}^{d_{node}}$. Each ordered node pair $(u, v)$ is connected by a fully connected edge matrix $W_{u,v} \in \mathbb{R}^{d_{node} \times d_{node}}$, enabling bidirectional signaling.
Signal propagation begins with initial tensor $e_0 = [1, 1, \ldots, 1]^T \in \mathbb{R}^{d_{node}}$ and follows:
$e_{i+1} = \text{GeLU}(W_{u_i,u_{i+1}} e_i + b_{u_i,u_{i+1}} + PE_i)$
The final prediction selects the candidate node whose incoming signal carries the maximal energy, measured by the L2 norm: $v_{predict} = \arg\max_v \|E_{u,v}\|_2$, where $E_{u,v}$ is the signal propagated from the current node $u$ to candidate $v$.
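The recurrence and the argmax rule can be sketched together in a few lines of PyTorch. The `sinusoidal_pe` helper and the randomly initialized per-edge parameters below are hypothetical stand-ins; the released code keys $W$ and $b$ by actual node pairs.

```python
import torch

d = 32  # node dimensionality d_node

def sinusoidal_pe(pos: int, dim: int = d) -> torch.Tensor:
    """Standard sinusoidal positional encoding PE_i."""
    pe = torch.zeros(dim)
    freqs = 10000 ** (torch.arange(0, dim, 2) / dim)
    pe[0::2] = torch.sin(pos / freqs)
    pe[1::2] = torch.cos(pos / freqs)
    return pe

def propagate(W_path, b_path):
    """Run e_{i+1} = GeLU(W e_i + b + PE_i) along a fixed edge path."""
    e = torch.ones(d)  # e_0 = [1, ..., 1]^T
    for i, (W, b) in enumerate(zip(W_path, b_path)):
        e = torch.nn.functional.gelu(W @ e + b + sinusoidal_pe(i))
    return e

# Random stand-ins for the per-edge parameters W_{u_i,u_{i+1}}, b_{u_i,u_{i+1}}.
path_len = 5
W_path = [torch.randn(d, d) * 0.1 for _ in range(path_len)]
b_path = [torch.randn(d) * 0.1 for _ in range(path_len)]
print(propagate(W_path, b_path).norm(p=2))  # signal energy after the path
```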
## Training Network

Training BriLLM constructs a dedicated network for each sequence sample: the sample's nodes are connected in order, and all candidate paths feed into a final softmax layer, where cross-entropy loss trains the model to assign maximal energy to the correct path.
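Here is a sketch of what one position of such a training step might look like, under the assumption that the candidate-edge energies serve as softmax logits; the tensor layout and the `step_loss` helper are illustrative, not the repository's training loop.

```python
import torch
import torch.nn.functional as F

d, vocab_size = 32, 10  # toy sizes
# Trainable per-edge parameters, indexed [source, target].
W = torch.randn(vocab_size, vocab_size, d, d, requires_grad=True)
b = torch.randn(vocab_size, vocab_size, d, requires_grad=True)

def step_loss(u: int, target: int, signal: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over candidate next nodes at one position: the signal
    is propagated along every outgoing edge of u, and each candidate's
    L2 energy becomes its logit."""
    out = F.gelu(torch.einsum('vij,j->vi', W[u], signal) + b[u])
    logits = out.norm(p=2, dim=-1)  # one energy per candidate edge
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))

loss = step_loss(u=0, target=3, signal=torch.ones(d))
loss.backward()  # gradients flow into the edge parameters of node 0
```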
## Implementation Details
BriLLM is implemented in PyTorch. It uses sinusoidal positional encoding, GeLU as the activation function, cross-entropy loss for next-token prediction, and a node dimensionality of $d_{node} = 32$ (the `hidden_size` in the code below). We used the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. With a 4,000-token vocabulary, the model has $512 + 4000 \times 4000 \times (32 \times 32 + 32) \approx 16.9\text{B}$ parameters. We trained our models on one machine with 8 NVIDIA A800 GPUs for 1.5k steps.
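A quick check of the quoted parameter count, assuming a 4,000-token vocabulary with one $32 \times 32$ matrix plus a 32-dimensional bias per directed edge:

```python
# One 32x32 weight matrix plus a 32-dim bias per directed edge,
# over a 4,000-token vocabulary, plus a small constant term.
d, V = 32, 4000
params = 512 + V * V * (d * d + d)
print(f"{params:,}")  # 16,896,000,512 -> roughly 16.9B parameters
```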

BriLLM leverages the sparsity of token co-occurrence: most bigrams are low-frequency or absent altogether, so their edges can share parameters. Low-frequency bigrams use a single fixed, non-updated matrix, reducing model size to 2B (Chinese) and 1B (English), i.e. 13.0% and 5.7% of the original size, respectively. This cuts parameters by roughly 90% while accelerating training.
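The sharing scheme might look like the sketch below: bigrams with zero corpus frequency are routed to a single frozen matrix instead of owning trainable parameters. The `freq` table and `edge_matrix` helper are hypothetical names; the released code instead passes a `zero_freq_edges` map to the model constructor (see the inference examples below).

```python
import torch

d = 32
shared_W = torch.randn(d, d) * 0.1  # frozen: excluded from the optimizer
dedicated_W = {}                    # trainable matrices, one per active edge

def edge_matrix(u: str, v: str, freq: dict) -> torch.Tensor:
    """Return the weight matrix for the directed edge u -> v."""
    if freq.get(u, {}).get(v, 0) == 0:  # unseen / zero-frequency bigram
        return shared_W                 # shared and never updated
    if (u, v) not in dedicated_W:       # allocate a dedicated matrix lazily
        dedicated_W[(u, v)] = torch.randn(d, d, requires_grad=True)
    return dedicated_W[(u, v)]
```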

## Case Study
### Chinese Examples

### English Examples

## Comparison: Traditional LLMs vs BriLLM

## Installation
```bash
pip install torch tokenizers
```
## Model Checkpoints
[BriLLM0.5](https://huggingface.co/BriLLM/BriLLM0.5)
## Training
### BriLLM-Chinese
```bash
bash run_zh.sh
```
### BriLLM-English
```bash
bash run_en.sh
```
## Inference
### BriLLM-Chinese
```python
import json
import torch
from model import BraLM, Vocab

# Load the token-to-node vocabulary.
with open("vocab_wiki_4k.json") as f:
    node_dict = json.load(f)
vocab = Vocab.from_node_dict(node_dict)

# Collect zero-frequency bigram edges; these share a fixed matrix (see above).
with open('word_frequency.json', 'r') as f:
    freq_dict = json.load(f)
zero_freq_edges = {}
for s in freq_dict:
    zero_freq_edges[s] = []
    for t in freq_dict[s]:
        if freq_dict[s][t] == 0:
            zero_freq_edges[s].append(t)

# Build the network and load the released Chinese checkpoint.
model = BraLM(hidden_size=32, zero_freq_edges=zero_freq_edges, vocab=vocab)
model.prepare_network(vocab)
state_dict = torch.load("model_zh.bin", weights_only=True)
model.load_state_dict(state_dict)
model.to_device("cuda:6")

# Encode the prompt as character-bigram edges, then decode the continuation.
head = "《罗马》描述了"
max_token = 32 - len(head)
start = [vocab(head[i] + '->' + head[i + 1]) for i in range(len(head) - 1)]
ret = model.decode(start, vocab, max_token)
decode_tuple_list = [vocab.decode(p) for p in ret]
decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
print(decode_sentence)
```
### BriLLM-English
```python
import json
import torch
from model import BraLM, Vocab
from tokenizers import Tokenizer

# Byte-level BPE tokenizer that maps English text onto the 4k-node vocabulary.
bpe_tokenizer = Tokenizer.from_file("wiki_bpe_tokenizer_4000_bytelevel.json")

def decode_en_sentence(head, max_token=32, do_sample=False):
    """Encode the prompt as BPE-bigram edges, then decode a continuation."""
    bpe_tokens = bpe_tokenizer.encode(head).tokens
    if len(bpe_tokens) < 2:
        return head
    start = [vocab(bpe_tokens[i] + '->' + bpe_tokens[i + 1]) for i in range(len(bpe_tokens) - 1)]
    ret = model.decode(start, vocab, max_token, do_sample)
    decode_tuple_list = [vocab.decode(p).split('->') for p in ret]
    decode_sentence = decode_tuple_list[0][0] + "".join([p[-1] for p in decode_tuple_list])
    return decode_sentence

# Load the token-to-node vocabulary.
with open("./vocab_wiki_4k_en.json") as f:
    node_dict = json.load(f)
vocab = Vocab.from_node_dict(node_dict)

# Build the network and load the released English checkpoint.
model = BraLM(hidden_size=32)
model.prepare_network(vocab)
state_dict = torch.load("model_en.bin", weights_only=True)
model.load_state_dict(state_dict)
model.to_device("cuda:6")

head = "In frogs, the hind legs are larger"
encoding = bpe_tokenizer.encode(head)
token_len = len(encoding.ids)
max_token = 32 - token_len
# "Ġ" is the byte-level BPE word-boundary marker; map it back to spaces.
decode_sentence = decode_en_sentence(head, max_token).replace("Ġ", " ")
print(decode_sentence)
``` |