title: MAUVE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
MAUVE is a measure of the gap between two text distributions, e.g., how far
the text written by a model is from the distribution of human text, using
samples from both distributions.
MAUVE takes values between 0 (completely different distributions) and 1
(identical distributions).
MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the
two distributions in a quantized embedding space of a large language model.
It can quantify differences in the quality of generated text based on the
size of the model, the decoding algorithm, and the length of the generated
text. MAUVE was found to correlate more strongly with human evaluations than
baseline metrics for open-ended text generation.
For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (NeurIPS,
2021).
This metric is a wrapper around the official implementation of MAUVE:
https://github.com/krishnap25/mauve
Metric Card for MAUVE
Metric description
MAUVE is a measure of the gap between neural text and human text. It is computed using the Kullback–Leibler (KL) divergences between the two distributions of text in a quantized embedding space of a large language model. MAUVE can identify differences in quality arising from model sizes and decoding algorithms.
This metric is a wrapper around the official implementation of MAUVE.
For more details, consult the MAUVE paper.
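The computation described above can be sketched on toy histograms. This is a simplified, pure-Python illustration that assumes the divergence-frontier construction from the MAUVE paper: mix P and Q, measure each one's KL divergence to the mixture, rescale with an exponential, and integrate. The function names are made up for this card and are not the library's API, and the official implementation may differ in details such as how the mixture weights are spaced.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def area_under_curve(points):
    """Trapezoidal area under (x, y) points, sorted by x (ties: larger y first)."""
    pts = sorted(points, key=lambda t: (t[0], -t[1]))
    return sum(0.5 * (y0 + y1) * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def mauve_from_histograms(p_hist, q_hist, num_points=25, scaling_factor=5.0):
    """Toy MAUVE-style score for two histograms over the same buckets."""
    curve = [(1.0, 0.0), (0.0, 1.0)]  # the two extreme points of the curve
    for i in range(num_points):
        # Mixture weights strictly inside (0, 1), so the mixture covers both supports.
        w = (i + 1) / (num_points + 1)
        r = [w * pi + (1 - w) * qi for pi, qi in zip(p_hist, q_hist)]
        x = math.exp(-scaling_factor * kl_divergence(q_hist, r))
        y = math.exp(-scaling_factor * kl_divergence(p_hist, r))
        curve.append((x, y))
    # Integrate along both axes and average, treating P and Q symmetrically.
    flipped = [(y, x) for x, y in curve]
    return 0.5 * (area_under_curve(curve) + area_under_curve(flipped))

same = mauve_from_histograms([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
different = mauve_from_histograms([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
print(same, different)
```

Identical histograms score at (or within rounding error of) 1, and the score drops as the histograms move apart. The real metric builds its histograms from language-model features of the text, not from hand-written numbers.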
How to use
The metric takes two lists of strings of tokens separated by spaces: one representing predictions
(i.e. the text generated by the model) and the second representing references
(a reference text for each prediction):
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
It also has several optional arguments:
num_buckets: the size of the histogram used to quantize P and Q. Options: auto (default) or an integer.
pca_max_data: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. The default is -1.
kmeans_explained_var: the amount of variance of the data to keep in dimensionality reduction by PCA. The default is 0.9.
kmeans_num_redo: the number of times to redo k-means clustering (the best objective is kept). The default is 5.
kmeans_max_iter: the maximum number of k-means iterations. The default is 500.
featurize_model_name: the name of the model from which features are obtained, one of gpt2, gpt2-medium, gpt2-large, gpt2-xl. The default is gpt2-large.
device_id: the device for featurization. Supply a GPU id (e.g. 0 or 3) to use a GPU. If no GPU with this id is found, the metric will use the CPU.
max_text_length: the maximum number of tokens to consider. The default is 1024.
divergence_curve_discretization_size: the number of points to consider on the divergence curve. The default is 25.
mauve_scaling_factor: a hyperparameter for scaling. The default is 5.
verbose: if True (default), running the metric will print running time updates.
seed: the random seed used to initialize k-means cluster assignments, randomly assigned by default.
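As a rough illustration of how these optional arguments might be passed, the sketch below collects a few of them into a dictionary and forwards them to `compute`. The argument names come from the list above; `build_mauve_kwargs` and `run_mauve` are hypothetical helpers written for this card, and the chosen values (such as `max_text_length=256`) are arbitrary.

```python
# Options the card lists for `featurize_model_name`.
SUPPORTED_FEATURIZERS = ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl")

def build_mauve_kwargs(featurize_model_name="gpt2-large", **overrides):
    """Hypothetical helper: collect optional arguments, rejecting unknown model names."""
    if featurize_model_name not in SUPPORTED_FEATURIZERS:
        raise ValueError(f"unsupported featurizer: {featurize_model_name!r}")
    kwargs = {"featurize_model_name": featurize_model_name}
    kwargs.update(overrides)
    return kwargs

mauve_kwargs = build_mauve_kwargs(
    featurize_model_name="gpt2",  # smaller download than the default
    max_text_length=256,          # arbitrary value chosen for illustration
    num_buckets="auto",
    verbose=False,
    seed=42,                      # fix the k-means initialization
)

def run_mauve(predictions, references):
    """Call the metric; needs `evaluate` installed and downloads the featurizer model."""
    from evaluate import load  # imported lazily so the sketch loads without the package
    mauve = load("mauve")
    return mauve.compute(predictions=predictions, references=references, **mauve_kwargs)
```

Calling `run_mauve(predictions, references)` with the lists from the usage example would then run the metric with these settings.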
Output values
This metric outputs a dictionary with 5 key-value pairs:
mauve: the MAUVE score, which ranges between 0 and 1. Larger values indicate that P and Q are closer.
frontier_integral: the Frontier Integral, which ranges between 0 and 1. Smaller values indicate that P and Q are closer.
divergence_curve: a numpy.ndarray of shape (m, 2); plot it with matplotlib to view the divergence curve.
p_hist: a discrete distribution, which is a quantized version of the text distribution p_text.
q_hist: same as above, but with q_text.
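To make "quantized version of the text distribution" concrete, here is a minimal sketch that assumes each text sample has already been assigned to a cluster (in the metric itself, the assignments come from clustering language-model features). The labels below are made up, and `labels_to_histogram` is an illustrative name, not part of the library.

```python
from collections import Counter

def labels_to_histogram(labels, num_buckets):
    """Normalized bucket counts: entry k is the fraction of samples in bucket k."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(num_buckets)]

# Made-up cluster assignments for four samples from each distribution.
p_labels = [0, 0, 1, 2]
q_labels = [0, 1, 1, 1]

p_hist = labels_to_histogram(p_labels, num_buckets=3)
q_hist = labels_to_histogram(q_labels, num_buckets=3)
print(p_hist)  # [0.5, 0.25, 0.25]
print(q_hist)  # [0.25, 0.75, 0.0]
```

Each histogram sums to 1, and both share the same buckets, which is what allows the two distributions to be compared bucket by bucket.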
Values from popular papers
The original MAUVE paper reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain (computed using 5000 continuations, each 1024 tokens long, with default hyperparameters). The authors found that bigger models generally resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
Best practices
It is a good idea to use at least 500-1000 samples for each distribution to compute MAUVE.
MAUVE is unable to identify very small differences between different settings of generation (e.g., between top-p sampling with p=0.95 versus 0.96). It is important, therefore, to account for the randomness inside the generation (e.g., due to sampling) and within the MAUVE estimation procedure (see the seed parameter above).
In practice, obtain generations using multiple random seeds and/or rerun MAUVE with multiple values of the seed parameter.
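One way to follow this advice is to score the same data under several seeds and report the mean and spread. The sketch below is a hypothetical helper, not part of the metric; `fake_mauve` is a stand-in scorer with made-up values so that the example runs without downloading a model.

```python
import statistics

def score_over_seeds(score_fn, seeds):
    """Run `score_fn(seed)` for each seed and summarize the spread of the scores."""
    scores = [score_fn(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

def fake_mauve(seed):
    """Stand-in scorer with made-up values; replace with a real call to the metric."""
    return {0: 0.90, 1: 0.92, 2: 0.94}[seed]

summary = score_over_seeds(fake_mauve, seeds=[0, 1, 2])
print(summary["mean"], summary["stdev"])
```

With the real metric, `score_fn` could be something like `lambda seed: mauve.compute(predictions=predictions, references=references, seed=seed).mauve`. A difference between two generation settings that is smaller than the reported spread should not be read as meaningful.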
Examples
Perfect match between prediction and reference:
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
1.0
Partial match between prediction and reference:
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
0.27811372536724027
Limitations and bias
The original MAUVE paper did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see Kirk et al., 2021, Abid et al., 2021). The extent to which these biases can impact the MAUVE score has not been quantified.
Also, calculating the MAUVE metric involves downloading the model from which features are obtained. The default model, gpt2-large, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, gpt2 is 523MB.
Citation
@inproceedings{pillutla-etal:mauve:neurips2021,
title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle = {NeurIPS},
year = {2021}
}