Model Card for Model ID

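import torch
from torch import nn
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the model and tokenizer
# (assumption: the repo id is the one listed in the benchmark table further down)
model_id = "crumb/92d52f-ame-full-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# make sure the model doesn't generate mask tokens:
# ids 32000..34047 are the 2048 extra tokens, so push their logits down by 100
bias = torch.zeros(34048)
bias[32000:] = -100
model.lm_head.bias = nn.Parameter(bias)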

# --------------------------------------------------------------------------------
# Generation without masking
input_ids = tokenizer("Once upon a time, in a land far far away...", return_tensors='pt').input_ids
print(input_ids)
# tensor([[    1,  5713,  3714,   264,   727, 28725,   297,   264,  2533,  2082,
#          2082,  1753,  1101]])
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
# '<s> Once upon a time, in a land far far away...\n\nThere was a magical place called Disneyland.\n\nIt was a place where dreams came true, where fairy tales became reality, and where magic was all around.\n\nBut one day, something terrible happened.\n\nThe magic began to fade.\n\nThe fairy tales became dull, the'

# --------------------------------------------------------------------------------
# replace both "far" tokens (id 2082) with a single mask token instead
# (any id from 32,000 up to 34,047 works; 32,001 is an arbitrary pick)
# the model should pick up that two repeating words after "Once upon a time, in a land-"
# and before "away" would probably be "far far"

input_ids[input_ids==2082] = 32_001
print(input_ids)
# tensor([[    1,  5713,  3714,   264,   727, 28725,   297,   264,  2533, 32001,
#         32001,  1753,  1101]])
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
# '<s> Once upon a time, in a land<ID-000001><ID-000001> away...\n\nOnce upon a time, in a land far, far away, there was a magical kingdom called Flanders. It was a peaceful land, where everyone lived happily ever after.\n\nBut one day, a terrible thing happened. A terrible, terrible thing.\n\nA terrible, terrible thing happened.'

# --------------------------------------------------------------------------------
# we can also get rid of everything except "<s>", "Once", "upon", "away", "..."
def create_masked_ids(input_ids, token_offset, ids_to_keep):
    # give every unique id in the input its own random mask slot (0..2047)
    unique_ids = torch.unique(input_ids).tolist()
    mask_slots = random.sample(range(2048), len(unique_ids))
    id_to_slot = dict(zip(unique_ids, mask_slots))

    # shift each slot by token_offset to get the actual mask token id
    # (note: apply_ only works on CPU tensors)
    shuffled_ids = input_ids.clone().apply_(lambda i: id_to_slot[i] + token_offset)

    # True wherever the original id should stay unmasked
    keep = torch.zeros_like(input_ids, dtype=torch.bool)
    for id_to_keep in ids_to_keep:
        keep |= (input_ids == id_to_keep)

    # keep the listed ids, replace everything else with its mask token
    return torch.where(keep, input_ids, shuffled_ids)

masked_ids = create_masked_ids(input_ids, 32_000, [1, 5713, 3714, 1753, 1101])
print(masked_ids)
# tensor([[    1,  5713,  3714, 33048, 34032, 32238, 32016, 33048, 33013, 33299,
#         33299,  1753,  1101]])

output = model.generate(masked_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
# '<s> Once upon<ID-000418><ID-0007F0><ID-0000EE><ID-000010><ID-000418><ID-0003F5><ID-000513><ID-000513> away...\n\nOnce upon a time, there was a young man named Alex. He was a very curious young man, and loved to explore the world around him. One day, he stumbled upon a magical book called "The Book of Secrets." This book contained all sorts of secrets about the world, and Alex was fasc'

this model isn't really made for benchmarks; it scores worse than mistralai/Mistral-7B-v0.1 on everything besides ARC-C and TruthfulQA

Model	ARC-C	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8k
mistralai/Mistral-7B-v0.1	59.98	83.31	64.16	42.15	78.37	37.83
crumb/92d52f-ame-full-7B	61.18	81.52	63.44	42.39	77.58	35.41
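
to make the comparison easier to read, here's a small sketch that computes the per-benchmark difference between the two rows. the numbers are copied straight from the table; the variable names are just for illustration.

```python
# scores copied from the table above, in column order
benchmarks = ["ARC-C", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8k"]
mistral = [59.98, 83.31, 64.16, 42.15, 78.37, 37.83]  # mistralai/Mistral-7B-v0.1
ame = [61.18, 81.52, 63.44, 42.39, 77.58, 35.41]      # crumb/92d52f-ame-full-7B

# positive delta = this model scores higher than Mistral-7B-v0.1
deltas = {name: round(a - m, 2) for name, m, a in zip(benchmarks, mistral, ame)}
improved = [name for name, d in deltas.items() if d > 0]

for name, d in deltas.items():
    print(f"{name:<11} {d:+.2f}")
print("improved on:", improved)
```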

it's got 2048 extra tokens ([f'<ID-{i:06X}>' for i in range(2048)], ids 32000 to 34047) which can all be used interchangeably as masks: you can replace every instance of one token in context with one of the extra tokens to give the model an extra hard time. it was trained with context length 2048 over ~0.5B tokens, using three separate replacement techniques on a schedule; near the end of training, 80% of all sequences were completely replaced with mask tokens.
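
a minimal sketch of that single-token replacement in plain python, no model needed. it assumes mask token `<ID-XXXXXX>` sits at id 32000 + i, which matches the decoded outputs in the examples above. `MASK_OFFSET`, `NUM_MASKS`, `mask_token_name` and `replace_token` are made-up names for illustration, not part of any library.

```python
MASK_OFFSET = 32_000  # assumed id of the first extra token (<ID-000000>)
NUM_MASKS = 2048      # number of extra tokens


def mask_token_name(i: int) -> str:
    """String form of the i-th extra token, e.g. 1 -> '<ID-000001>'."""
    return f"<ID-{i:06X}>"


def replace_token(ids, target_id, mask_index):
    """Replace every occurrence of target_id with the mask_index-th mask id."""
    if not 0 <= mask_index < NUM_MASKS:
        raise ValueError(f"mask_index must be in [0, {NUM_MASKS})")
    mask_id = MASK_OFFSET + mask_index
    return [mask_id if t == target_id else t for t in ids]


# token ids for "Once upon a time, in a land far far away..." from the example above
ids = [1, 5713, 3714, 264, 727, 28725, 297, 264, 2533, 2082, 2082, 1753, 1101]
masked = replace_token(ids, target_id=2082, mask_index=1)
print(masked)
print(mask_token_name(1))
```

every "far" (2082) ends up as the same mask id, so the model still sees that the two positions hold the same unknown word.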

what? how is that useful?

i'm hoping to finetune it further while swapping the entire tokenizer out for any number of other tokenizers, all using the unique mask ids, to build a causal model of any sufficiently long artifact from any domain, for example the Voynich manuscript or an alien artifact
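
to make the idea a bit more concrete, here's a rough sketch of what "using the unique mask ids" as a stand-in tokenizer could look like: every distinct symbol of some unknown script gets its own random mask id. this is only an illustration under assumptions, not the actual training setup; `encode_symbols` and the example glyph sequence are hypothetical, and the 32000 offset is taken from the examples above.

```python
import random

MASK_OFFSET = 32_000  # assumed id of the first extra token
NUM_MASKS = 2048      # number of extra tokens available


def encode_symbols(symbols, seed=None):
    """Map each distinct symbol to its own randomly chosen mask token id."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(symbols))  # distinct symbols, first-seen order
    if len(unique) > NUM_MASKS:
        raise ValueError("more distinct symbols than mask tokens")
    slots = rng.sample(range(NUM_MASKS), len(unique))
    table = {s: MASK_OFFSET + slot for s, slot in zip(unique, slots)}
    return [table[s] for s in symbols], table


# hypothetical transliterated glyph sequence, one symbol per element
glyphs = ["q", "o", "k", "e", "e", "d", "y", "q", "o", "k", "a", "i", "n"]
ids, table = encode_symbols(glyphs, seed=0)
print(ids)
```

the resulting ids could be passed to the model the same way as `masked_ids` above; with more than 2048 distinct symbols this simple scheme runs out of mask tokens.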
