Commit c00e383 by shivendrra • 1 Parent(s): 84969ed
Update README.md
README.md CHANGED
@@ -18,8 +18,10 @@ tags:
|
|
18 |
|
19 |
|
20 |
## Model Details
|
21 |
-
|
22 |
-
It
|
23 |
### Model Description
|
24 |
|
25 |
- **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
|
@@ -35,7 +37,7 @@ It also has one more BERT based model that has 47million parameters, also capabl
|
|
35 |
Can be used to generate new sequences of DNA from a given input of tokens, or for further research. Anyway, it's very basic in nature. I'll add more functionality, including classification of DNA, masked token generation, etc. Maybe even implement the MoE technique in the future.
|
36 |
### Direct Use
|
37 |
|
38 |
-
Load the model and then can be used to generate new sequences, `max_length=512` for 2.5b model and `256` for
|
39 |
|
40 |
## Bias, Risks, and Limitations
|
41 |
|
@@ -55,34 +57,16 @@ model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")
|
|
55 |
from model import Transformer
|
56 |
model = Transformer(vocab_size=vocab_size)
|
57 |
|
58 |
-
|
59 |
-
|
60 |
-
|
61 |
-
|
62 |
-
|
63 |
-
|
64 |
-
|
65 |
-
|
66 |
-
|
67 |
-
|
68 |
-
idx_cond = idx[:, -self.block_size:]
|
69 |
-
logits, _ = self(idx_cond)
|
70 |
-
logits = logits[:, -1, :]
|
71 |
-
scaled_logits = logits / temperature
|
72 |
-
|
73 |
-
if top_k > 0:
|
74 |
-
scaled_logits = self._top_k_filtering(scaled_logits, top_k)
|
75 |
-
probs = F.softmax(scaled_logits, dim=-1)
|
76 |
-
sampled_idx = torch.multinomial(probs, num_samples=1)
|
77 |
-
generated_tokens.append(sampled_idx.item())
|
78 |
-
idx = torch.cat((idx, sampled_idx), dim=1)
|
79 |
-
return generated_tokens
|
80 |
-
|
81 |
-
def _top_k_filtering(self, logits, top_k):
|
82 |
-
values, indices = torch.topk(logits, top_k, dim=-1)
|
83 |
-
min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
|
84 |
-
filtered_logits = torch.where(logits < min_value, torch.ones_like(logits) * -float('inf'), logits)
|
85 |
-
return filtered_logits
|
86 |
```
|
87 |
|
88 |
## Training Details
|
@@ -98,60 +82,6 @@ These models were trained to 3k-4k iterations, each. on ~500million letters of D
|
|
98 |
Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or pre-train.
|
99 |
#### Functions:
|
100 |
This uses a basic training procedure. `get_batch()` generates batches of data, `estimate_loss()` estimates losses, and the `train()` function is a kind of master function that calls the other functions after each iteration or every set number of iterations.
|
101 |
-
|
102 |
-
```python
|
103 |
-
def get_batch(split):
|
104 |
-
# generate a small batch of data of inputs x and targets y
|
105 |
-
data = train_data if split == 'train' else val_data
|
106 |
-
ix = torch.randint(len(data) - block_size, (batch_size,))
|
107 |
-
x = torch.stack([data[i:i+block_size] for i in ix])
|
108 |
-
y = torch.stack([data[i+1:i+block_size+1] for i in ix])
|
109 |
-
x, y = x.to(device), y.to(device)
|
110 |
-
|
111 |
-
return x, y
|
112 |
-
|
113 |
-
@torch.no_grad()
|
114 |
-
def estimate_loss():
|
115 |
-
out = {}
|
116 |
-
model.eval()
|
117 |
-
for split in ['train', 'val']:
|
118 |
-
losses = torch.zeros(eval_iters)
|
119 |
-
for k in range(eval_iters):
|
120 |
-
X, Y = get_batch(split)
|
121 |
-
logits, loss = model(X, Y)
|
122 |
-
losses[k] = loss.item()
|
123 |
-
out[split] = losses.mean()
|
124 |
-
model.train()
|
125 |
-
return out
|
126 |
-
|
127 |
-
from model import Transformer
|
128 |
-
model = Transformer(vocab_size=vocab_size)
|
129 |
-
m = model.to(device)
|
130 |
-
|
131 |
-
n_param = sum(p.numel() for p in m.parameters())/1e6
|
132 |
-
print(f"{n_param:.2f} million")
|
133 |
-
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
|
134 |
-
steps = []
|
135 |
-
train_losses = []
|
136 |
-
val_losses = []
|
137 |
-
|
138 |
-
for iter in range(max_iters):
|
139 |
-
if iter % eval_interval == 0 or iter == max_iters - 1:
|
140 |
-
losses = estimate_loss()
|
141 |
-
print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
|
142 |
-
steps.append(iter)
|
143 |
-
train_losses.append(losses['train'])
|
144 |
-
val_losses.append(losses['val'])
|
145 |
-
|
146 |
-
xb, yb = get_batch('train')
|
147 |
-
logits, loss = model(xb, yb)
|
148 |
-
optimizer.zero_grad(set_to_none=True)
|
149 |
-
loss.backward()
|
150 |
-
optimizer.step()
|
151 |
-
|
152 |
-
torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
|
153 |
-
```
|
154 |
-
|
155 |
#### Training Hyperparameters
|
156 |
|
157 |
Configurations are saved in the `enigma/config-enigma.json` file. They are suited to the 2.5b model.
|
|
|
18 |
|
19 |
|
20 |
## Model Details
|
21 |
+
It's a 2.5b model trained on ~1 billion individual letters of DNA, much like training a text model at the per-character level instead of the sub-word level.
|
22 |
+
It has its own tokenizer, which sits somewhere between a char-level tokenizer and a BPE tokenizer.
|
23 |
+
|
24 |
+
EnBERT, i.e. the decoder-only model, is trained on lots of DNA sequences tokenized with a k-mer tokenizer trained specially for this purpose, which means it has a larger vocab size than enigma-2.5b. The same model architecture is used to train a 430m model that is per-char based, same as the 2.5b model, but it's better than that one.
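To make the vocab-size point concrete, here is a minimal sketch of per-character versus k-mer splitting. The function names, `k=4`, and the `stride` parameter are illustrative assumptions; this is not the project's actual tokenizer, which is trained rather than hard-coded.

```python
def per_char_tokens(seq):
    """Split a DNA string into single-letter tokens."""
    return list(seq)


def kmer_tokens(seq, k=4, stride=None):
    """Split a DNA string into k-mers (hypothetical helper).

    stride=k gives non-overlapping chunks; stride=1 gives a sliding window.
    A trailing remainder shorter than k is dropped.
    """
    stride = k if stride is None else stride
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "TGCCCTGGCTGC"
print(len(per_char_tokens(seq)))   # 12
print(kmer_tokens(seq, k=4))       # ['TGCC', 'CTGG', 'CTGC']
# With a 4-letter alphabet there are at most 4**k distinct k-mers.
print(4 ** 1, 4 ** 4)              # 4 256
```

Longer k-mers shorten the token sequence but grow the vocabulary exponentially, which is the trade-off behind the larger vocab size mentioned above.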
|
25 |
### Model Description
|
26 |
|
27 |
- **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
|
|
|
37 |
Can be used to generate new sequences of DNA from a given input of tokens, or for further research. Anyway, it's very basic in nature. I'll add more functionality, including classification of DNA, masked token generation, etc. Maybe even implement the MoE technique in the future.
|
38 |
### Direct Use
|
39 |
|
40 |
+
Load the model, then use it to generate new sequences: `max_length=512` for the 2.5b model and `256` for the enigma-430m model.
|
41 |
|
42 |
## Bias, Risks, and Limitations
|
43 |
|
|
|
57 |
from model import Transformer
|
58 |
model = Transformer(vocab_size=vocab_size)
|
59 |
|
60 |
+
from tokenizer import PerCharTokenizer
|
61 |
+
token = PerCharTokenizer()
|
62 |
+
|
63 |
+
input = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
|
64 |
+
token_input = token.encode(input)
|
65 |
+
context = torch.tensor([token_input], dtype=torch.long, device=device)
|
66 |
+
generated_output = token.decode(model.generate(context, max_new_tokens=500)[0].tolist())
|
67 |
+
print(generated_output)
|
68 |
+
|
69 |
+
model.generate()
|
70 |
```
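The `generate()` body removed in this commit scaled the last-step logits by a temperature, kept only the top-k logits, and sampled from the softmax of what remained. The sketch below restates that single sampling step in plain Python, without torch, so it can be run anywhere. The function names and example logits are illustrative assumptions, not the model's API.

```python
import math
import random


def top_k_filter(logits, top_k):
    """Set every logit below the k-th largest to -inf."""
    if top_k <= 0 or top_k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Sample one token index from temperature-scaled, top-k-filtered logits."""
    scaled = top_k_filter([x / temperature for x in logits], top_k)
    peak = max(scaled)
    # Unnormalized softmax weights; exp(-inf) is 0.0, so filtered tokens
    # can never be drawn.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.0, -1.0]
print(top_k_filter(logits, top_k=2))   # [2.0, -inf, 1.0, -inf]
print(sample_next(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

A lower temperature sharpens the distribution toward the highest logit, while a smaller `top_k` removes unlikely tokens from consideration entirely.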
|
71 |
|
72 |
## Training Details
|
|
|
82 |
Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or pre-train.
|
83 |
#### Functions:
|
84 |
This uses a basic training procedure. `get_batch()` generates batches of data, `estimate_loss()` estimates losses, and the `train()` function is a kind of master function that calls the other functions after each iteration or every set number of iterations.
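As a rough illustration of the batching step, the sketch below samples random windows from a token list and pairs each one with the same window shifted one position to the right, which is what gives the model a next-token target. It uses plain Python lists rather than tensors, and the signature, the `rng` parameter, and the example sizes are assumptions made for the illustration.

```python
import random


def get_batch(data, block_size, batch_size, rng=random):
    """Sample `batch_size` (input, target) windows from a token list.

    Each target is its input shifted one position to the right.
    """
    # Start indices stop early enough to leave room for the shifted target.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y


data = list(range(20))  # stand-in for encoded DNA tokens
x, y = get_batch(data, block_size=4, batch_size=2, rng=random.Random(0))
for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

In a real training loop the same function would be called once per iteration for the training split, and several times per evaluation interval for each split when estimating the loss.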
|
85 |
#### Training Hyperparameters
|
86 |
|
87 |
Configurations are saved in the `enigma/config-enigma.json` file. They are suited to the 2.5b model.
|