Different values observed wrt HF PPL documentation

#3
by nferruz - opened

Dear all,

Thanks for your great work putting these tools together.

We're computing PPL for several sequences (biological ones, in this case) with a GPT2-like model.

For the example sequence:

ex = ['1.1.1.2<sep><start>MSIKLAYAAKQPMTFSRRMPTYEAGAIAPWTALPENVAFIGLGNMGRSQLETDEDAENTIAIAISRDWSEGNWIDPQKPMVIHGFSRSGGEAPGTWEDGDWSYDEWKAICDVGVIATSNGWFEDEVVPVSIKAGVDMATLPELMKPIVYAPEDLIPLVQKGVDNDEA<end>']

We obtain the following PPL value:

from evaluate import load

perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=ex, model_id='nferruz/ZymCTRL')
results
{'perplexities': [2262.685302734375], 'mean_perplexity': 2262.685302734375}
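
One thing that may be worth checking (a hedged suggestion, not a confirmed explanation): the evaluate perplexity metric re-tokenizes the raw string with the model's own tokenizer and, by default, prepends a start token (add_start_token=True), whereas the documentation code below scores pre-computed input ids. Re-running the metric with that option toggled shows how sensitive its number is to the tokenization set-up:

results_no_bos = perplexity.compute(
    predictions=ex,
    model_id='nferruz/ZymCTRL',
    add_start_token=False,  # skip prepending the tokenizer's BOS token before scoring
)
results_no_bos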

However, when using the code from the Hugging Face perplexity documentation (https://huggingface.co/docs/transformers/perplexity), adapted for a single sequence as seen below:

import torch
from tqdm import tqdm

# model, device and output (the pre-tokenized input ids, see the PS below) are defined earlier
max_length = model.config.n_positions
stride = 512
seq_len = len(output[0])

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = output[0][begin_loc:end_loc].to(device)  # 1-D slice of the pre-computed token ids
    target_ids = input_ids.clone()
    target_ids[:-trg_len] = -100  # mask context tokens so only the current window contributes to the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over input tokens.
        # Multiply it with trg_len to get the summation instead of average.
        # We will take average over all the tokens to get the true average
        # in the last step of this example.
        neg_log_likelihood = outputs.loss * trg_len

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)

We obtain a value more than two orders of magnitude lower (a factor of roughly 300):

ppl
tensor(6.9366, device='cuda:0')

This value is in line with what we'd expect from the global PPL of the model.
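
As a sanity check (a minimal sketch, assuming model and output are already loaded as in the snippets above and below), the sliding-window loop should reduce to a single forward pass here, since the whole sequence fits in one window; the model shifts the labels internally:

with torch.no_grad():
    ids = output[0].unsqueeze(0)       # shape (1, seq_len); same token ids as in the PS below
    out = model(ids, labels=ids)       # labels are shifted inside the model
ppl_single_pass = torch.exp(out.loss)  # exp of the mean per-token cross-entropy
ppl_single_pass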

We would appreciate any insights regarding these differences.
Thanks for your time.
Best,

PS: In case it helps, the input_ids for this sequence, as a PyTorch tensor, are:

output[0]
tensor([  7, 431,   7, 431,   7, 431,   8,   2,   3, 443, 449, 440, 441, 442,
        432, 455, 432, 432, 441, 447, 446, 443, 450, 437, 449, 448, 448, 443,
        446, 450, 455, 436, 432, 438, 432, 440, 432, 446, 453, 450, 432, 442,
        446, 436, 444, 452, 432, 437, 440, 438, 442, 438, 444, 443, 438, 448,
        449, 447, 442, 436, 450, 435, 436, 435, 432, 436, 444, 450, 440, 432,
        440, 432, 440, 449, 448, 435, 453, 449, 436, 438, 444, 453, 440, 435,
        446, 447, 441, 446, 443, 452, 440, 439, 438, 437, 449, 448, 449, 438,
        438, 436, 432, 446, 438, 450, 453, 436, 435, 438, 435, 453, 449, 455,
        435, 436, 453, 441, 432, 440, 434, 435, 452, 438, 452, 440, 432, 450,
        449, 444, 438, 453, 437, 436, 435, 436, 452, 452, 446, 452, 449, 440,
        441, 432, 438, 452, 435, 443, 432, 450, 442, 446, 436, 442, 443, 441,
        446, 440, 452, 455, 432, 446, 436, 435, 442, 440, 446, 442, 452, 447,
        441, 438, 452, 435, 444, 435, 436, 432,   4,  13, 431,   7, 431,  15,
          2,   3, 440, 448, 453, 442, 449, 432, 438, 438, 452, 435, 438, 452,
        446, 455, 441, 440, 440, 449, 452, 449, 442, 440, 438, 437, 432, 452,
        432, 449, 448, 453, 442, 446, 446, 449, 435, 453, 447, 452, 432, 437,
        432, 432, 442, 452, 437, 448, 432, 442, 446, 438, 435, 432, 449, 446,
        448, 440, 436, 438, 441, 446, 432, 452, 452, 452, 446, 452, 438, 436,
        432, 455, 443, 442], device='cuda:0')
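
For completeness, model, device and the output tensor used above were presumably set up roughly along these lines (a hedged sketch; the exact calls are not shown in the original script):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('nferruz/ZymCTRL')
model = AutoModelForCausalLM.from_pretrained('nferruz/ZymCTRL').to(device)
output = tokenizer(ex[0], return_tensors='pt').input_ids.to(device)  # output[0] is the 1-D tensor shown above
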
Evaluate Metric org

cc @Tristan, who implemented this metric.

I'm running into the same case. I think it may be that the lm_head of the LM model isn't initialized, because the 'gpt2' checkpoint only contains the parameters for GPT2Model.
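
A quick way to check that hypothesis (a hedged sketch; for GPT-2, the lm_head weight is tied to the input embedding, so it should be loaded from the checkpoint rather than freshly initialized):

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')
# True if lm_head shares its weight tensor with the token embedding (weight tying),
# i.e. it is not a randomly initialized head
print(model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr())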

Hi there,

I also have this question! How can this very page show an example on the wikitext dataset with a PPL of 576.76, while at the bottom of your own documentation an example on that same dataset says the expected PPL ought to be around 19.44 (or 16.44, in the other set-up)?!
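
For concreteness, the two numbers come from two different set-ups (a hedged sketch based on the respective examples, not a claim that this fully explains the gap): the metric page scores individual wikitext lines as independent inputs, while the transformers documentation joins the whole test split into one long sequence and evaluates it with a sliding window like the one earlier in this thread:

from datasets import load_dataset
from evaluate import load
from transformers import GPT2TokenizerFast

# Metric-page style: each non-empty line is scored as its own (often very short) input
texts = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')['text']
texts = [t for t in texts if t.strip()][:50]  # small subset to keep the run short
perplexity = load('perplexity', module_type='metric')
per_line = perplexity.compute(predictions=texts, model_id='gpt2')

# Transformers-docs style: the whole test split joined into one long sequence,
# then scored with the sliding-window loop shown earlier
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')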
