Memorisation-Profiles
Artefacts for the paper "Causal Estimation of Memorisation Profiles" (Lesci et al., 2024)
- Paper: arXiv:2406.04327
pietrolesci/pythia-deduped-stats-raw
Note: This dataset contains the model evaluations (or "stats") for each model size included in the study. This is the "raw" version, with stats at the token level. We kept these token-level statistics as a precaution, since inference was expensive to run. The sequence-level statistics are available in the `pietrolesci/pythia-deduped-stats` dataset.
pietrolesci/pythia-deduped-stats
Note: This dataset contains the model evaluations (or "stats") for each model size included in the study, already aggregated at the sequence level. It is derived from the "raw" version, which has stats at the token level (`pietrolesci/pythia-deduped-stats-raw`).
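To make the relationship between the token-level and sequence-level stats concrete, here is a minimal sketch of one possible aggregation: averaging a per-token score within each sequence. The field names (`token_idx`, `token_logprob`, `mean_logprob`, `num_tokens`) and the toy values are hypothetical and may not match the columns or the aggregation used in the released datasets; only `seq_idx` appears in the notes on this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical token-level records: field names and values are illustrative
# and may not match the columns of the released datasets.
token_stats = [
    {"seq_idx": 0, "token_idx": 0, "token_logprob": -2.0},
    {"seq_idx": 0, "token_idx": 1, "token_logprob": -1.0},
    {"seq_idx": 7, "token_idx": 0, "token_logprob": -0.5},
    {"seq_idx": 7, "token_idx": 1, "token_logprob": -1.5},
    {"seq_idx": 7, "token_idx": 2, "token_logprob": -1.0},
]


def aggregate_by_sequence(rows):
    """Average token-level log-probabilities within each sequence."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["seq_idx"]].append(row["token_logprob"])
    return {
        seq_idx: {"num_tokens": len(values), "mean_logprob": mean(values)}
        for seq_idx, values in grouped.items()
    }


sequence_stats = aggregate_by_sequence(token_stats)
```

The result is keyed by `seq_idx`, so each sequence ends up with a single record regardless of how many tokens it contains.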
EleutherAI/pile-deduped-pythia-preshuffled
Note: This is the dataset we used in our study; it corresponds to the training set of the models listed below.
pietrolesci/pile-deduped-subset
Note: Sample from the Pile (`EleutherAI/pile-deduped-pythia-preshuffled`) used in the experiments. The unique sequence identifier (`seq_idx`) is simply the position of the sequence in the Pile. The dataset is already tokenized.
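Because `seq_idx` is the position of a sequence in the preshuffled training data, it can in principle be mapped to the training step at which the sequence was seen. The sketch below assumes 1,024 sequences per optimisation step, a checkpoint saved every 1,000 steps, and a checkpoint labelled with step K having completed K steps. None of these values is stated on this page, so they should be checked against the Pythia documentation, and the sketch ignores any additional early checkpoints. The function names are hypothetical.

```python
# Assumed training setup (not stated in this collection): 1,024 sequences
# per optimisation step and one saved checkpoint every 1,000 steps.
SEQUENCES_PER_STEP = 1024
CHECKPOINT_INTERVAL = 1000


def training_step(seq_idx: int) -> int:
    """Zero-based index of the batch (optimisation step) containing seq_idx."""
    return seq_idx // SEQUENCES_PER_STEP


def first_checkpoint_after(seq_idx: int) -> int:
    """Step number of the first saved checkpoint that has trained on seq_idx."""
    # Number of steps completed once the batch containing seq_idx is processed.
    steps_completed = training_step(seq_idx) + 1
    # Round up to the next multiple of the checkpoint interval.
    return -(-steps_completed // CHECKPOINT_INTERVAL) * CHECKPOINT_INTERVAL
```

Under these assumptions, `first_checkpoint_after(0)` is 1000 and `first_checkpoint_after(1_024_000)` is 2000. An off-by-one in this kind of mapping shifts which checkpoint is treated as the first one "after" a batch, so the indexing convention is worth verifying for each model size.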
pietrolesci/pile-validation
Note: The validation data used in our study. The Pythia suite does not have an official validation set. However, we confirmed with the authors that the Pile validation split (this one) was not seen during training. It remains unclear whether the Pile data can be released freely, so we will remove this dataset if required.
pietrolesci/pythia-deduped-memorisation-profiles
EleutherAI/pythia-70m-deduped
EleutherAI/pythia-160m-deduped
EleutherAI/pythia-410m-deduped
EleutherAI/pythia-1.4b-deduped
EleutherAI/pythia-2.8b-deduped
Note: This model size was ultimately excluded from the analysis because we found what appears to be a mismatch between checkpoints and batch indices. Specifically, no instantaneous memorisation was detected, which is puzzling since even the 70M-parameter model exhibits it.
EleutherAI/pythia-6.9b-deduped
EleutherAI/pythia-12b-deduped